You defined your infrastructure in Terraform. You reviewed the pull request. You ran terraform plan, verified the output, and applied it. The resources came up exactly as declared. Your state file matched reality. Everything was clean.
Six months later, someone SSH'd into a production box during an incident and changed a config file. A teammate opened the AWS console and tweaked a security group rule to unblock a deploy. An IAM policy got widened to fix a permissions error at 2 AM. None of these changes made it back into code.
Your declared state and your actual state are now different. You have drift. And unless you go looking for it, you won't know until something breaks.
What Drift Looks Like
Drift rarely announces itself. It doesn't throw errors. It doesn't trigger alerts. It accumulates quietly, one small divergence at a time, until your infrastructure-as-code repository is a work of fiction.
In practice, drift takes familiar forms:
- Manual hotfixes applied during incidents that never get backported to Terraform, Pulumi, or CloudFormation
- Console changes made under pressure because the pipeline was too slow or too broken to use
- Permission escalations that were "temporary" and are now permanent fixtures nobody remembers creating
- Config files edited directly on production servers, bypassing the deployment process entirely
- Resource tags, scaling parameters, or networking rules adjusted by hand and never documented
Each of these individually is minor. Collectively, they erode the entire premise of infrastructure as code: that your repository represents the truth of what's running. Once that trust is broken, your IaC isn't a source of truth. It's a suggestion.
Why Drift Happens
The instinct is to blame the people who made manual changes. Don't. Drift is almost always a rational response to a broken or hostile workflow.
People go around the pipeline when:
- It's faster to fix in place. If applying a one-line change through your IaC pipeline takes 45 minutes of CI, review, and plan/apply cycles, engineers will SSH in and fix it in 30 seconds. The math isn't hard.
- Incident pressure overrides process. When the site is down and customers are affected, nobody is going to wait for a PR review. They're going to do whatever restores service, and they're right to do so.
- New team members don't know the workflow. If your IaC process isn't documented, discoverable, and obvious, people will default to the tools they know: the console, the CLI, the SSH session.
- The pipeline is too brittle. If terraform apply fails half the time due to state lock issues, provider bugs, or flaky CI runners, engineers stop trusting it. Distrust breeds workarounds.
- The feedback loop is too long. If you can't see the result of a change for 20 minutes, you lose the ability to iterate. People switch to direct manipulation because the feedback is immediate.
Drift isn't a discipline problem. It's a tooling problem, an incentive problem, and sometimes an organizational design problem. Treating it as anything else guarantees it will continue.
The Compound Effect
A single manual change is harmless. One security group rule added through the console. One environment variable set directly on a container. Trivial. You'll backport it tomorrow. (You won't.)
Now multiply that by every engineer on the team, across every service, over twelve months. A hundred manual changes across a hundred systems. Some of them conflict with each other. Some of them depend on other manual changes that were also never codified. Some of them were made by people who have since left the company.
You now have an environment that nobody fully understands. Your Terraform state says one thing. AWS says another. The running configuration says a third. When you try to run terraform plan, the diff is so large and so incomprehensible that nobody is willing to apply it. The IaC repo becomes read-only in practice -- everyone is afraid to touch it because nobody knows what will happen.
This is the compound effect of drift. It doesn't break your system on day one. It makes your system unmanageable on day three hundred.
Detection Before Enforcement
The worst response to drift is to immediately lock everything down. If you jump straight to enforcement without understanding the scope and causes of existing drift, you'll break things, frustrate your team, and undermine the credibility of the entire effort.
Start with detection. You can't fix what you can't see.
Drift detection should be continuous and automated. Run terraform plan on a schedule -- not just when someone opens a PR, but constantly. Compare declared state to actual state every hour, or at minimum every day. Pipe the results into your monitoring system. Alert on divergence the same way you alert on elevated error rates or resource exhaustion.
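Here's a minimal sketch of what that scheduled check can look like, written in Python around terraform plan -detailed-exitcode (exit code 2 means pending changes, i.e. drift). The TF_DIR and ALERT_WEBHOOK_URL environment variables and the webhook payload shape are placeholders for whatever your alerting system actually expects; treat this as a starting point, not a finished tool.

```python
#!/usr/bin/env python3
"""Scheduled drift check: run `terraform plan` and alert on divergence.

A sketch only. Assumes Terraform is installed, the directory is already
initialized, and ALERT_WEBHOOK_URL points at whatever receives your alerts.
"""
import json
import os
import subprocess
import urllib.request

WORKSPACE_DIR = os.environ.get("TF_DIR", ".")        # placeholder
ALERT_WEBHOOK_URL = os.environ["ALERT_WEBHOOK_URL"]  # placeholder


def check_drift(directory: str):
    """Return (drifted, plan_output). With -detailed-exitcode, terraform
    exits 0 for no changes, 2 for pending changes, 1 for errors."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=directory, capture_output=True, text=True,
    )
    if result.returncode in (0, 2):
        return result.returncode == 2, result.stdout
    raise RuntimeError(f"terraform plan failed:\n{result.stderr}")


def send_alert(message: str) -> None:
    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        ALERT_WEBHOOK_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    drifted, plan_output = check_drift(WORKSPACE_DIR)
    if drifted:
        # Keep the alert short; link to the full plan in CI instead of pasting it all.
        send_alert(f"Drift detected in {WORKSPACE_DIR}:\n{plan_output[-1500:]}")
```

Run something like this hourly from cron or a scheduled CI job, once per workspace, and drift stops being an invisible condition and starts being a metric you can alert on.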
The goal at this stage isn't to prevent drift. It's to make drift visible. When a team can see that their environment has drifted in fourteen places, and they can see exactly what changed and when, two things happen: they understand the scope of the problem, and they start self-correcting because nobody wants their name next to a growing list of unreconciled changes.
Visibility changes behavior before enforcement does.
Enforcement Strategies
Once you have reliable detection and your team understands the baseline level of drift, you can start enforcing. There are several strategies, and the right one depends on your architecture and organizational maturity.
Immutable Infrastructure
Don't fix servers. Replace them. If no one can SSH into a production instance because there's nothing to SSH into -- because instances are replaced wholesale on every deploy -- drift becomes structurally impossible at the compute layer. Containers, AMIs baked in CI, and serverless functions all push in this direction. The less mutable surface area you have, the less drift you get.
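As a concrete illustration of the replace-don't-patch move, here's a hedged boto3 sketch that points an Auto Scaling group at a new launch template version (say, a new AMI baked in CI) and then replaces every running instance via an instance refresh. The group name, template ID, and version are placeholders, and in a real setup this step would itself live in your pipeline or your Terraform code rather than run as an ad-hoc script.

```python
"""Roll out a fix by replacing instances, not patching them.

A sketch using boto3's Auto Scaling instance refresh. Assumes the change has
already been baked into a new launch template version; all names below are
placeholders.
"""
import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAME = "web-prod"                 # placeholder
LAUNCH_TEMPLATE_ID = "lt-0123456789"  # placeholder
NEW_VERSION = "42"                    # the version containing the fix

# Point the group at the new template version...
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    LaunchTemplate={"LaunchTemplateId": LAUNCH_TEMPLATE_ID, "Version": NEW_VERSION},
)

# ...then replace every running instance with one built from it.
autoscaling.start_instance_refresh(
    AutoScalingGroupName=ASG_NAME,
    Preferences={"MinHealthyPercentage": 90, "InstanceWarmup": 300},
)
```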
GitOps
The Git repository is the single source of truth. Always. Changes flow in one direction: from the repo to the environment, never the reverse. Tools like Flux and ArgoCD implement this pattern for Kubernetes. The principle applies everywhere: if it's not in the repo, it shouldn't be in production. If it's in production but not in the repo, it gets reverted automatically.
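Flux and ArgoCD are the production-grade implementations; the loop below is only a bare-bones Python sketch of the underlying mechanic for a repo of Kubernetes manifests (the repo path, directory layout, and sync interval are assumptions). Everything flows from the repo to the cluster; nothing flows back.

```python
"""Bare-bones GitOps loop: the repo is the source of truth, changes flow one way.

A sketch only -- Flux and ArgoCD do this properly (pruning, health checks,
drift reversal). Assumes `git` and `kubectl` are installed and that the clone
at REPO_DIR keeps plain Kubernetes manifests under manifests/.
"""
import subprocess
import time

REPO_DIR = "/srv/infra-repo"   # placeholder: local clone of the config repo
SYNC_INTERVAL_SECONDS = 60


def sync_once() -> None:
    # Fetch whatever is currently declared in git...
    subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)
    # ...and drive the cluster toward it. Fields the manifests declare get
    # reset on the next pass, even if someone changed them out-of-band.
    subprocess.run(["kubectl", "apply", "-f", f"{REPO_DIR}/manifests/"], check=True)


if __name__ == "__main__":
    while True:
        sync_once()
        time.sleep(SYNC_INTERVAL_SECONDS)
```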
Policy as Code
Use tools like Open Policy Agent (OPA) or HashiCorp Sentinel to define what valid infrastructure looks like and reject anything that doesn't match. This catches drift at the point of change -- whether that change comes through the pipeline or through the console. Policy as code turns governance from a document people ignore into a gate people can't bypass.
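OPA's Rego and Sentinel are the languages you'd actually write these rules in. To show the shape of the idea without introducing either, here's a toy Python check over a plan exported with terraform show -json. The resource type and field names follow the AWS provider's security group rule schema as it typically appears in plan JSON; verify them against your own output before relying on anything like this.

```python
"""Toy policy check over Terraform plan JSON.

A simplified illustration of policy as code -- real setups use OPA/Rego or
Sentinel as a required pipeline gate. Produce the input with:
    terraform plan -out=plan.out && terraform show -json plan.out > plan.json
"""
import json
import sys


def violations(plan: dict) -> list:
    problems = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group_rule":
            continue
        after = (rc.get("change") or {}).get("after") or {}
        cidrs = after.get("cidr_blocks") or []
        # Reject any change that would open SSH to the whole internet.
        if "0.0.0.0/0" in cidrs and after.get("to_port") == 22:
            problems.append(f"{rc['address']}: SSH open to the world")
    return problems


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        found = violations(json.load(f))
    for p in found:
        print(f"POLICY VIOLATION: {p}")
    sys.exit(1 if found else 0)  # non-zero exit fails the pipeline gate
```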
Continuous Reconciliation
Controllers that continuously compare actual state to desired state and converge the two. This is how Kubernetes works at its core: you declare a desired state, and the control plane relentlessly drives actual state toward it. The same pattern can be applied to cloud resources, configuration, and permissions. The system doesn't just detect drift -- it fixes it automatically.
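Stripped to its skeleton, the pattern is a loop: read actual state, compare to desired state, write back the difference. The sketch below fakes the provider with an in-memory dict (ACTUAL) so it runs as-is; in a real controller, fetch_actual and apply_change would call your cloud or cluster API, and the desired state would come from your repo, not a literal.

```python
"""Skeleton of a reconciliation controller: converge actual state toward desired.

A sketch of the pattern only. The in-memory ACTUAL dict stands in for whatever
API your real resources live behind.
"""
import time

DESIRED = {"instance_count": 3, "ingress_ports": [443]}

# Pretend someone changed this by hand in the console.
ACTUAL = {"instance_count": 3, "ingress_ports": [443, 22]}


def fetch_actual() -> dict:
    # Placeholder for a real read against the provider's API.
    return dict(ACTUAL)


def apply_change(key, want) -> None:
    # Placeholder for a real write; here we just mutate the fake state.
    ACTUAL[key] = want


def reconcile_once(desired: dict) -> None:
    actual = fetch_actual()
    for key, want in desired.items():
        if actual.get(key) != want:
            # Detection and correction in the same loop: this is the whole trick.
            print(f"drift on {key!r}: have {actual.get(key)!r}, want {want!r}")
            apply_change(key, want)


if __name__ == "__main__":
    for _ in range(2):   # a real controller loops forever or reacts to watch events
        reconcile_once(DESIRED)
        time.sleep(1)
```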
The Human Side
Technical controls are necessary but insufficient. If your engineers consistently go around the pipeline, the pipeline has a problem. Locking it down harder won't fix the underlying issue. It'll just create more creative workarounds.
Before you enforce, ask why people aren't using the prescribed workflow:
- Is the pipeline too slow? Speed it up. Parallelize plans. Cache providers. Use targeted applies instead of full-stack plans.
- Is the pipeline unreliable? Fix the flaky tests, the state lock contention, the provider version conflicts. Reliability is a prerequisite for trust.
- Is the pipeline inaccessible? Make sure every engineer can run a plan locally. Provide clear documentation. Lower the barrier to entry until using the pipeline is easier than not using it.
- Is the feedback loop too long? Give engineers immediate visibility into what their change will do. Fast plan previews in PR comments (see the sketch after this list). Dry-run modes. The closer the feedback, the more likely people are to use the tool.
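As one concrete version of that last point, here's a sketch that runs a plan in CI and posts the output back to the pull request through the GitHub comments API. It assumes CI exposes GITHUB_TOKEN, GITHUB_REPOSITORY ("org/repo"), and PR_NUMBER, and that the working directory is already initialized; adapt the names to whatever your CI system provides.

```python
"""Post `terraform plan` output as a PR comment so the author sees the effect
of the change without leaving the review.

A sketch. Assumes GITHUB_TOKEN, GITHUB_REPOSITORY ("org/repo"), and PR_NUMBER
come from CI and that `terraform init` has already run.
"""
import json
import os
import subprocess
import urllib.request

repo = os.environ["GITHUB_REPOSITORY"]
pr_number = os.environ["PR_NUMBER"]
token = os.environ["GITHUB_TOKEN"]

plan_output = subprocess.run(
    ["terraform", "plan", "-input=false", "-no-color"],
    capture_output=True, text=True, check=True,
).stdout

# Truncate from the front: the summary of changes sits at the end of the plan.
comment = "terraform plan preview:\n\n" + plan_output[-6000:]

req = urllib.request.Request(
    f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
    data=json.dumps({"body": comment}).encode(),
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "Content-Type": "application/json",
    },
)
urllib.request.urlopen(req)
```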
Good systems make the right thing the easy thing. If doing it the right way is slower, harder, or more confusing than doing it the wrong way, you have a systems design problem, not a people problem. Fix the system.
Starting Point
If drift is a known or suspected problem in your environment, here's where to begin:
- Run a drift audit today. Execute terraform plan across every workspace and environment. Compare declared state to actual state. Document every divergence.
- Identify why drift happened in each case. Was it an incident? A workflow gap? A missing resource in code? Understanding the cause matters more than counting the instances.
- Fix the tooling or workflow gaps that caused it. If the pipeline is slow, speed it up. If it's unreliable, stabilize it. If people don't know how to use it, teach them.
- Implement continuous drift detection. Scheduled plans, automated comparisons, alerts on divergence. Make drift visible as a permanent practice, not a one-time audit.
- Graduate to enforcement once detection is stable. Immutable infrastructure where possible. GitOps for deployment flow. Policy as code for guardrails. Continuous reconciliation for convergence.
Don't try to do all five at once. Each step builds on the previous one. Detection without understanding the causes is noise. Enforcement without reliable detection is reckless. Start at step one and move forward when each layer is solid.
Drift is a symptom, not the disease. The disease is systems that allow desired state and actual state to diverge silently, without detection, without accountability, and without correction.
Configuration drift doesn't happen because your engineers are careless or undisciplined. It happens because every system without active enforcement will diverge from its intended state over time. This is not a failure of people. It is a property of complex systems. Entropy is the default. Order requires energy.
The teams that maintain clean, trustworthy infrastructure aren't the ones with the strictest policies or the most obedient engineers. They're the ones who built pipelines worth using, detection that never sleeps, and enforcement that makes drift structurally difficult rather than merely prohibited. They treated the gap between desired state and actual state as a first-class operational concern -- and they automated the work of closing it.