Drift Is a Feature of Neglect

IaC Governance Operations

You defined your infrastructure in Terraform. You reviewed the pull request. You ran terraform plan, verified the output, and applied it. The resources came up exactly as declared. Your state file matched reality. Everything was clean.

Six months later, someone SSH'd into a production box during an incident and changed a config file. A teammate opened the AWS console and tweaked a security group rule to unblock a deploy. An IAM policy got widened to fix a permissions error at 2 AM. None of these changes made it back into code.

Your declared state and your actual state are now different. You have drift. And unless you go looking for it, you won't know until something breaks.

What Drift Looks Like

Drift rarely announces itself. It doesn't throw errors. It doesn't trigger alerts. It accumulates quietly, one small divergence at a time, until your infrastructure-as-code repository is a work of fiction.

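In practice, drift takes familiar forms: a config file edited over SSH during an incident, a security group rule tweaked in the console to unblock a deploy, an IAM policy widened to clear a permissions error, an environment variable set directly on a container.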

Each of these individually is minor. Collectively, they erode the entire premise of infrastructure as code: that your repository represents the truth of what's running. Once that trust is broken, your IaC isn't a source of truth. It's a suggestion.

Why Drift Happens

The instinct is to blame the people who made manual changes. Don't. Drift is almost always a rational response to a broken or hostile workflow.

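People go around the pipeline when it is too slow for an incident, when it is unreliable, when they don't know how to use it, or when the resource they need to change was never in code to begin with.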

Drift isn't a discipline problem. It's a tooling problem, an incentive problem, and sometimes an organizational design problem. Treating it as anything else guarantees it will continue.

The Compound Effect

A single manual change is harmless. One security group rule added through the console. One environment variable set directly on a container. Trivial. You'll backport it tomorrow. (You won't.)

Now multiply that by every engineer on the team, across every service, over twelve months. A hundred manual changes across a hundred systems. Some of them conflict with each other. Some of them depend on other manual changes that were also never codified. Some of them were made by people who have since left the company.

You now have an environment that nobody fully understands. Your Terraform state says one thing. AWS says another. The running configuration says a third. When you try to run terraform plan, the diff is so large and so incomprehensible that nobody is willing to apply it. The IaC repo becomes read-only in practice -- everyone is afraid to touch it because nobody knows what will happen.

This is the compound effect of drift. It doesn't break your system on day one. It makes your system unmanageable on day three hundred.

Detection Before Enforcement

The worst response to drift is to immediately lock everything down. If you jump straight to enforcement without understanding the scope and causes of existing drift, you'll break things, frustrate your team, and undermine the credibility of the entire effort.

Start with detection. You can't fix what you can't see.

Drift detection should be continuous and automated. Run terraform plan on a schedule against the configuration you have already applied -- not just when someone opens a PR, but on a fixed cadence. Compare declared state to actual state every hour, or at minimum every day. Pipe the results into your monitoring system. Alert on divergence the same way you alert on elevated error rates or resource exhaustion.
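A minimal sketch of what a scheduled check could look like, in Python. It relies on the -detailed-exitcode flag of terraform plan, which exits with 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error. The function names, the PlanResult enum, and the choice to pass -lock=false are assumptions for illustration. Note that a non-empty plan only means drift if the code being planned has already been applied.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    CLEAN = "clean"      # nothing to change: declared and actual state agree
    DRIFTED = "drifted"  # the plan contains changes
    ERROR = "error"      # the plan itself failed to run


def classify_exit_code(code: int) -> PlanResult:
    """Map the exit code of `terraform plan -detailed-exitcode` to a result."""
    if code == 0:
        return PlanResult.CLEAN
    if code == 2:
        return PlanResult.DRIFTED
    return PlanResult.ERROR


def check_drift(workdir: str) -> PlanResult:
    """Run a non-interactive plan in `workdir` and classify the outcome.

    Hypothetical helper: meant to be called from a scheduler (cron, CI).
    """
    proc = subprocess.run(
        [
            "terraform", "plan",
            "-detailed-exitcode",  # 0 = no changes, 1 = error, 2 = changes
            "-input=false",        # never prompt in an unattended run
            "-lock=false",         # assumption: a read-only check should not hold the state lock
            "-no-color",
        ],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(proc.returncode)
```

A scheduler would call check_drift on each working directory and emit a metric or alert whenever the result is not PlanResult.CLEAN.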

The goal at this stage isn't to prevent drift. It's to make drift visible. When a team can see that their environment has drifted in fourteen places, and they can see exactly what changed and when, two things happen: they understand the scope of the problem, and they start self-correcting because nobody wants their name next to a growing list of unreconciled changes.
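To turn "drifted in fourteen places" into an actual number, one option is to read the machine-readable plan. The sketch below assumes the JSON produced by terraform show -json on a saved plan file, and uses only its resource_changes list (each entry has an address and a change.actions list). The function name and the shape of the returned summary are hypothetical. It answers what changed; when a change happened has to come from somewhere else, such as cloud audit logs.

```python
import json
from collections import Counter


def summarize_changes(plan_json: str) -> dict:
    """Count the resources a plan wants to touch, grouped by action."""
    plan = json.loads(plan_json)
    addresses = []
    by_action = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "no-op" and "read" entries do not represent a divergence.
        if actions in (["no-op"], ["read"]):
            continue
        addresses.append(rc["address"])
        by_action["/".join(actions)] += 1
    return {
        "count": len(addresses),
        "addresses": sorted(addresses),
        "by_action": dict(by_action),
    }
```

The count is the number to graph over time; the address list is what goes on the team's dashboard.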

Visibility changes behavior before enforcement does.

Enforcement Strategies

Once you have reliable detection and your team understands the baseline level of drift, you can start enforcing. There are several strategies, and the right one depends on your architecture and organizational maturity.

Immutable Infrastructure

Don't fix servers. Replace them. If no one can SSH into a production instance because there's nothing to SSH into -- because instances are replaced wholesale on every deploy -- drift becomes structurally impossible at the compute layer. Containers, AMIs baked in CI, and serverless functions all push in this direction. The less mutable surface area you have, the less drift you get.

GitOps

The Git repository is the single source of truth. Always. Changes flow in one direction: from the repo to the environment, never the reverse. Tools like Flux and Argo CD implement this pattern for Kubernetes. The principle applies everywhere: if it's not in the repo, it shouldn't be in production. If it's in production but not in the repo, it gets reverted automatically.

Policy as Code

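Use tools like Open Policy Agent (OPA) or HashiCorp Sentinel to define what valid infrastructure looks like and reject anything that doesn't match. Where the policy runs determines what it catches. Evaluated against a plan in the pipeline, it stops a bad change before it is applied. Evaluated at the platform's admission layer (OPA as a Kubernetes admission controller, for example), it can also reject changes that never went through the pipeline. Policy as code turns governance from a document people ignore into a gate that is hard to bypass.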
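Real policies for these tools are written in Rego (OPA) or the Sentinel language. The following is a Python stand-in, not either of those, meant only to show the shape of a policy check: take a plan, return the resources that violate a rule. The rule itself (no SSH ingress open to 0.0.0.0/0) and the function name are hypothetical. The input is assumed to be the parsed output of terraform show -json, and the attribute names are those of the AWS provider's aws_security_group_rule resource.

```python
from typing import List


def find_open_ssh(plan: dict) -> List[str]:
    """Return addresses of security group rules that expose port 22 to the world."""
    violations = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group_rule":
            continue
        # "after" is None when the resource is being deleted.
        after = (rc.get("change") or {}).get("after") or {}
        if after.get("type") != "ingress":
            continue
        from_port = after.get("from_port")
        to_port = after.get("to_port")
        covers_ssh = (
            from_port is not None
            and to_port is not None
            and from_port <= 22 <= to_port
        )
        if covers_ssh and "0.0.0.0/0" in (after.get("cidr_blocks") or []):
            violations.append(rc["address"])
    return violations
```

A pipeline step would fail the run whenever the returned list is non-empty.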

Continuous Reconciliation

Run controllers that continuously compare actual state to desired state and converge the two. This is how Kubernetes works at its core: you declare a desired state, and the control plane relentlessly drives actual state toward it. The same pattern can be applied to cloud resources, configuration, and permissions. The system doesn't just detect drift -- it fixes it automatically.
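The core of a reconciler is small. Here is a toy sketch, assuming desired and actual state can each be represented as a flat dictionary of resource name to configuration. Real controllers deal with APIs, retries, and ordering, and none of that is modeled here. Both function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Action = Tuple[str, str, Any]  # (operation, resource name, desired value)


def diff_state(desired: Dict[str, Any], actual: Dict[str, Any]) -> List[Action]:
    """Compute the actions needed to move `actual` to `desired`."""
    actions: List[Action] = []
    for key in sorted(desired.keys() | actual.keys()):
        if key not in actual:
            actions.append(("create", key, desired[key]))
        elif key not in desired:
            actions.append(("delete", key, None))
        elif desired[key] != actual[key]:
            actions.append(("update", key, desired[key]))
    return actions


def reconcile(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the diff to a copy of `actual` and return the converged state."""
    converged = dict(actual)
    for op, key, value in diff_state(desired, actual):
        if op == "delete":
            del converged[key]
        else:
            converged[key] = value
    return converged
```

A controller runs this in a loop: observe actual state, diff, apply, wait, repeat. When nothing has drifted, the diff is empty and the loop does nothing.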

The Human Side

Technical controls are necessary but insufficient. If your engineers consistently go around the pipeline, the pipeline has a problem. Locking it down harder won't fix the underlying issue. It'll just create more creative workarounds.

Before you enforce, ask why people aren't using the prescribed workflow.

Good systems make the right thing the easy thing. If doing it the right way is slower, harder, or more confusing than doing it the wrong way, you have a systems design problem, not a people problem. Fix the system.

Starting Point

If drift is a known or suspected problem in your environment, here's where to begin:

  1. Run a drift audit today. Execute terraform plan across every workspace and environment. Compare declared state to actual state. Document every divergence.
  2. Identify why drift happened in each case. Was it an incident? A workflow gap? A missing resource in code? Understanding the cause matters more than counting the instances.
  3. Fix the tooling or workflow gaps that caused it. If the pipeline is slow, speed it up. If it's unreliable, stabilize it. If people don't know how to use it, teach them.
  4. Implement continuous drift detection. Scheduled plans, automated comparisons, alerts on divergence. Make drift visible as a permanent practice, not a one-time audit.
  5. Graduate to enforcement once detection is stable. Immutable infrastructure where possible. GitOps for deployment flow. Policy as code for guardrails. Continuous reconciliation for convergence.

Don't try to do all five at once. Each step builds on the previous one. Detection without understanding the causes is noise. Enforcement without reliable detection is reckless. Start at step one and move forward when each layer is solid.
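For step 1, a rough sketch of an audit across the workspaces of one configuration. It assumes the text output of terraform workspace list, where each line holds one workspace name and the current one is prefixed with an asterisk. How the plan is run for each workspace is left as a function you pass in, so both the function names and that interface are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def parse_workspace_list(output: str) -> List[str]:
    """Extract workspace names from `terraform workspace list` output."""
    names = []
    for line in output.splitlines():
        # The current workspace is marked with a leading "*".
        name = line.strip().lstrip("*").strip()
        if name:
            names.append(name)
    return names


def audit(workspaces: Iterable[str],
          plan_exit_code: Callable[[str], int]) -> Dict[str, str]:
    """Label each workspace from its `plan -detailed-exitcode` exit code."""
    labels = {0: "clean", 2: "drifted"}  # anything else is treated as an error
    return {ws: labels.get(plan_exit_code(ws), "error") for ws in workspaces}
```

The resulting dictionary is the first draft of the audit document: one row per workspace, with the drifted ones to investigate in step 2.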

Drift is a symptom, not the disease. The disease is systems that allow desired state and actual state to diverge silently, without detection, without accountability, and without correction.

Configuration drift doesn't happen because your engineers are careless or undisciplined. It happens because every system without active enforcement will diverge from its intended state over time. This is not a failure of people. It is a property of complex systems. Entropy is the default. Order requires energy.

The teams that maintain clean, trustworthy infrastructure aren't the ones with the strictest policies or the most obedient engineers. They're the ones who built pipelines worth using, detection that never sleeps, and enforcement that makes drift structurally difficult rather than merely prohibited. They treated the gap between desired state and actual state as a first-class operational concern -- and they automated the work of closing it.
