Last Thursday, staging looked perfect at 6:10 PM. By 8:40 PM, production was “mostly fine” but memory climbed on one service, an old ConfigMap had quietly returned, and a hotfix applied by kubectl at noon had been overwritten by a late pipeline run. Nothing exploded. Everything just became uncertain.
That was the day we stopped treating GitOps as “auto-deploy from Git” and started treating it as an operational control system. This guide is the runbook we now use for GitOps drift detection on real teams: Argo CD auto-sync with safe pruning, pull-request environments for pre-merge proof, and Kubernetes policy guardrails to block risky config at admission time.
If you are tightening release safety in parallel, this pairs well with our earlier deep dives on OIDC and CI secret hygiene, deployment guardrails in GitHub Actions, and backend reliability practices.
Why drift happens even in “mature” Kubernetes setups
Most drift is not malicious. It is normal operational behavior:
- An engineer patches a Deployment during an incident and forgets to back-port the change to Git.
- A chart upgrade renames values, so defaults sneak in where overrides used to apply.
- An old object remains live because pruning is disabled “for safety,” and now two versions coexist.
- Preview environments are created quickly, but lifecycle cleanup is inconsistent.
The tradeoff is uncomfortable but real: strict reconciliation can feel scary during incidents, while loose reconciliation silently accumulates risk. The answer is not “more manual care.” The answer is explicit policy and predictable automation.
The control model that works in practice
We use a simple three-layer control model:
- Desired state in Git: every deployable object is declarative, versioned, and reviewable.
- Continuous reconciliation: Argo CD keeps cluster state aligned, including drift correction.
- Admission guardrails: cluster policy blocks unsafe changes before they become drift debt.
This aligns with OpenGitOps principles, especially declarative state, pull-based updates, and continuous reconciliation. It also removes a common failure mode: CI systems needing broad cluster credentials just to deploy.
Layer 1: Argo CD automated sync, pruning, and self-heal
Start with one service and configure Argo CD automated sync explicitly. Do not rely on defaults your team does not remember.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-prod
  namespace: argocd
spec:
  project: payments
  source:
    repoURL: https://github.com/acme/platform-config.git
    targetRevision: main
    path: services/payments/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      enabled: true
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
    - CreateNamespace=true
    - PruneLast=true
    - RespectIgnoreDifferences=true
    - ApplyOutOfSyncOnly=true
  revisionHistoryLimit: 10
Why each switch matters:
- prune: true removes ghost resources that survive forever otherwise.
- selfHeal: true corrects live-cluster edits that bypass Git.
- allowEmpty: false prevents accidental “delete everything” from a bad path or generator output.
Teams often fear pruning first. That fear is valid. Roll it out namespace by namespace, with dry-run checks and alerting around large deletions.
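As a sketch of what "alerting around large deletions" can look like, the check below is a hypothetical pre-sync guard (not part of Argo CD itself) that compares desired and live resource sets and refuses an automated prune when the deletion count crosses a threshold:

```python
# Hypothetical guard: block automated pruning when the diff looks like a mass delete.
# Resource identity is simplified to (kind, namespace, name) tuples.

def prune_candidates(live: set, desired: set) -> set:
    """Resources present in the cluster but absent from Git-desired state."""
    return live - desired

def safe_to_prune(live: set, desired: set, max_deletes: int = 5) -> bool:
    """Allow pruning only when the number of deletions stays below a threshold.

    An empty desired set (e.g. a bad path or generator output) is never safe,
    mirroring the allowEmpty: false setting above.
    """
    if not desired:
        return False
    return len(prune_candidates(live, desired)) <= max_deletes

live = {("Deployment", "payments", "payments-api"),
        ("Service", "payments", "payments-api"),
        ("ConfigMap", "payments", "payments-api-old")}
desired = {("Deployment", "payments", "payments-api"),
           ("Service", "payments", "payments-api")}

print(sorted(prune_candidates(live, desired)))  # one stale ConfigMap to prune
print(safe_to_prune(live, desired))             # True: a single deletion is within budget
print(safe_to_prune(live, set()))               # False: empty desired state, never prune
```

The same threshold logic works whether it runs as a CI check against rendered manifests or as an alert on Argo CD's reported prune operations.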
Layer 2: pull-request environments that prove changes before merge
Preview environments are where many drift problems can be caught early, if they are generated consistently and destroyed reliably. Argo CD ApplicationSet with a pull-request generator gives you that repeatable path.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: checkout-preview
  namespace: argocd
spec:
  goTemplate: true
  generators:
  - pullRequest:
      github:
        owner: acme-org
        repo: checkout-service
        labels:
        - preview
        tokenRef:
          secretName: github-token
          key: token
      requeueAfterSeconds: 900
  template:
    metadata:
      name: 'checkout-pr-{{.number}}'
      labels:
        env: preview
        pr: '{{.number}}'
    spec:
      project: preview
      source:
        repoURL: https://github.com/acme-org/platform-config.git
        targetRevision: main
        path: previews/checkout
        helm:
          parameters:
          - name: image.tag
            value: 'pr-{{.number}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'checkout-pr-{{.number}}'
      syncPolicy:
        automated:
          enabled: true
          prune: true
          selfHeal: true
This keeps preview lifecycle attached to PR lifecycle, reducing orphaned namespaces and stale test data. It also gives reviewers a stable URL and repeatable environment shape, which improves bug reproduction.
Layer 3: admission policy to stop unsafe config before it lands
Drift control is stronger when bad states are rejected, not merely repaired later. Kubernetes ValidatingAdmissionPolicy (CEL-based, evaluated in-process by the API server) is stable as of Kubernetes v1.30 and useful for high-signal rules.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-prod-guardrails
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.metadata.namespace.startsWith('prod-') ? has(object.spec.template.spec.securityContext) : true"
    message: "Prod deployments must define pod securityContext"
  - expression: "object.metadata.namespace.startsWith('prod-') ? (!has(object.spec.replicas) || object.spec.replicas <= 20) : true"
    message: "Replica count above 20 in prod requires approved override"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: bind-require-prod-guardrails
spec:
  policyName: require-prod-guardrails
  validationActions: [Deny, Audit]
ValidatingAdmissionPolicy is not a replacement for code review, but it is an excellent final safety net when pressure is high and humans are tired. Note the `has()` guard on replicas: `spec.replicas` is optional, and a CEL expression that dereferences a missing field errors out instead of evaluating to false.
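The intent of those CEL expressions can be unit-tested outside the cluster. A rough Python equivalent (an approximation of the rules for test purposes, not a CEL evaluator) keeps the logic reviewable:

```python
# Approximate Python translation of the two CEL guardrails above,
# useful for unit-testing rule intent before shipping the policy.

def violations(deployment: dict) -> list:
    """Return guardrail violation messages for a Deployment-like dict."""
    ns = deployment.get("metadata", {}).get("namespace", "")
    if not ns.startswith("prod-"):
        return []  # the rules only constrain prod-* namespaces
    msgs = []
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    if "securityContext" not in pod_spec:
        msgs.append("Prod deployments must define pod securityContext")
    replicas = deployment.get("spec", {}).get("replicas")
    if replicas is not None and replicas > 20:
        msgs.append("Replica count above 20 in prod requires approved override")
    return msgs

ok = {"metadata": {"namespace": "prod-payments"},
      "spec": {"replicas": 3, "template": {"spec": {"securityContext": {}}}}}
bad = {"metadata": {"namespace": "prod-payments"},
       "spec": {"replicas": 50, "template": {"spec": {}}}}

print(violations(ok))   # no violations
print(violations(bad))  # both rules fire
```

If the Python tests and the CEL expressions ever disagree, treat the CEL as the source of truth and fix the tests.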
Operational playbook: detecting and resolving drift fast
- Watch Argo CD out-of-sync count and reconciliation failures as first-line signals.
- If drift appears, identify source: Git change, live patch, or generator mismatch.
- For emergency live patches, create a "reconcile debt" ticket with SLA (for example: 24 hours).
- Back-port live fix into Git, then let reconcile close the loop.
- Review post-incident diff and add a policy/test so the same drift class is blocked next time.
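To make the "reconcile debt" SLA measurable, drift MTTR can be computed from timestamps your ticketing or audit system already records. A minimal sketch, where the record field names are assumptions rather than any specific tool's schema:

```python
from datetime import datetime, timedelta

# Hypothetical drift-debt records: when drift was detected, and when Git
# and the cluster were reconciled again. Field names are illustrative.
incidents = [
    {"detected": datetime(2024, 5, 2, 12, 0), "reconciled": datetime(2024, 5, 2, 15, 0)},
    {"detected": datetime(2024, 5, 3, 9, 0),  "reconciled": datetime(2024, 5, 4, 10, 0)},
]

def drift_mttr(records: list) -> timedelta:
    """Mean time from drift detection to reconciliation."""
    total = sum((r["reconciled"] - r["detected"] for r in records), timedelta())
    return total / len(records)

def sla_breaches(records: list, sla: timedelta = timedelta(hours=24)) -> int:
    """Count drift incidents whose reconcile debt outlived the SLA."""
    return sum(1 for r in records if r["reconciled"] - r["detected"] > sla)

print(drift_mttr(incidents))    # mean of 3h and 25h -> 14:00:00
print(sla_breaches(incidents))  # 1: the second incident blew the 24h SLA
```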
For teams running mixed web and async systems, use the same discipline across workloads. Our self-healing batch incident write-up shows why consistency matters beyond synchronous APIs: read the batch pipeline case study.
Troubleshooting
Argo CD keeps showing OutOfSync for fields we do not care about
Cause: mutable fields or controllers rewriting defaults. Fix: define ignoreDifferences deliberately and keep the list short, reviewed, and documented.
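A deliberate, documented exception looks like this in the Application spec (the replicas example assumes a HorizontalPodAutoscaler owns that field; adjust to whatever controller is rewriting yours):

```yaml
# Application excerpt: ignore a field that a controller (here, an assumed HPA)
# legitimately rewrites, so it stops flagging OutOfSync.
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas
  syncPolicy:
    syncOptions:
    - RespectIgnoreDifferences=true
```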
Auto-prune scares the team because of accidental mass deletes
Cause: generator/path mistakes can create empty desired state. Fix: keep allowEmpty: false, add pre-merge manifest validation, and alert on unusual prune volume.
Preview namespaces pile up after PRs are closed
Cause: PR generator polling lag, permissions, or failed cleanup hooks. Fix: enforce labels, run scheduled orphan cleanup, and monitor age of preview namespaces.
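The scheduled orphan cleanup can be as simple as flagging preview namespaces whose PR is no longer open or whose age exceeds a cutoff. A sketch of the selection logic only (in a real job, open_prs would come from the GitHub API and deletion would go through the Kubernetes API):

```python
from datetime import datetime, timedelta

# Sketch: decide which preview namespaces are orphans. Namespace names follow
# the 'checkout-pr-<number>' convention from the ApplicationSet above.

def orphaned(namespaces: dict, open_prs: set,
             now: datetime, max_age: timedelta = timedelta(days=2)) -> list:
    """Preview namespaces whose PR is closed or that exceed the age cutoff."""
    result = []
    for name, created in namespaces.items():
        pr_number = int(name.rsplit("-", 1)[1])
        if pr_number not in open_prs or now - created > max_age:
            result.append(name)
    return sorted(result)

now = datetime(2024, 5, 10, 12, 0)
namespaces = {
    "checkout-pr-101": datetime(2024, 5, 10, 9, 0),  # open and fresh: keep
    "checkout-pr-87": datetime(2024, 5, 1, 9, 0),    # open but stale: flag
    "checkout-pr-55": datetime(2024, 5, 9, 9, 0),    # PR closed: flag
}
print(orphaned(namespaces, open_prs={101, 87}, now=now))
# ['checkout-pr-55', 'checkout-pr-87']
```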
Policy blocks legitimate emergency fix
Cause: policy is strict but no exception path exists. Fix: add controlled break-glass process with audit trail and expiry, then reconcile to compliant state quickly.
FAQ
Should we enable self-heal in production immediately?
Enable it service by service. Start with low-risk apps, instrument drift metrics, then expand once teams trust the behavior and rollback routine.
Do we still need CI deployment jobs if Argo CD auto-sync is on?
Yes, but CI should build, test, scan, and update Git state, not push directly to the cluster. This reduces blast radius of CI credentials.
Is admission policy enough for platform security?
No. It is one layer. You still need identity controls, secret handling, image verification, and runtime observability. Policy helps enforce the minimum floor reliably.
Actionable takeaways
- Pick one production service this week and enable automated sync with prune + self-heal.
- Adopt PR-based preview generation so environment shape is reproducible before merge.
- Create 2-3 high-value admission rules for production namespaces first.
- Track drift MTTR as an SRE metric, not just deployment frequency.
- Document a break-glass path that is fast, audited, and time-bound.
Sources reviewed while preparing this guide
- OpenGitOps principles (v1.0)
- Argo CD documentation: Automated Sync Policy
- Argo CD documentation: ApplicationSet Pull Request Generator
- Kubernetes documentation: ValidatingAdmissionPolicy
Drift is rarely one dramatic mistake. It is usually ten small exceptions that become normal. GitOps works when you make desired state explicit, reconciliation continuous, and exceptions temporary. That combination is what turns "it deployed" into "it is trustworthy."