
Feature Flags Are Infrastructure

Most teams think of feature flags as a developer convenience. A way to merge code to main without exposing unfinished features to users. Wrap the new code path in an if statement, flip the flag when it's ready, move on. That's the starting point -- and for a lot of organizations, it's also the ending point. They never get past using flags as a glorified code comment.

This misses the real value entirely. Feature flags -- when treated as infrastructure rather than developer tooling -- become one of the most powerful operational mechanisms in your arsenal. They're a deployment safety net. An incident response kill switch. A progressive rollout engine. A way to decouple deployment from release, which is one of the most important separations in modern software delivery.

Deployment Is Not Release

The most consequential shift in how you think about feature flags starts here: deploying code and releasing functionality are two different operations. Deployment is getting new code onto production servers. Release is making that code active for users. Without feature flags, these are the same event. You deploy, and whatever is in the build is immediately live. If it's broken, you roll back the deployment. The blast radius is everyone.

Feature flags decouple these operations. You deploy code that's wrapped in a flag. The code is on the server but not active. You release the feature by enabling the flag -- to a subset of users, to a specific region, to internal testers first, then gradually to everyone. If something goes wrong, you disable the flag. No redeployment. No rollback. No waiting for CI/CD pipelines. The bad code path is disabled in seconds.
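In code, the decoupling is just a conditional around the new path. A minimal sketch, with a toy in-memory client standing in for whatever flag service you use (the class and flag name are illustrative, not a real API):

```python
class FlagClient:
    """Toy in-memory stand-in for a real flag service client."""

    def __init__(self):
        self._state = {}

    def set(self, name, enabled):
        self._state[name] = enabled

    def is_enabled(self, name, default=False):
        return self._state.get(name, default)


def checkout(cart_total, flags):
    # Both code paths are deployed; the flag decides which one is
    # *released*. Disabling the flag reverts to the old path in
    # seconds -- no rollback, no pipeline.
    if flags.is_enabled("new-checkout-flow"):
        return f"new:{cart_total}"
    return f"legacy:{cart_total}"
```

Flipping `new-checkout-flow` off restores the legacy path on the very next evaluation, which is what makes the toggle faster than any redeploy.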

This distinction matters enormously for incident response. When a deployment introduces a regression, the question is: how fast can you stop the bleeding? If you need to roll back a deployment, you're looking at minutes at best -- build the previous artifact, push it through your pipeline, wait for instances to drain and restart. With a feature flag, you're looking at seconds. Toggle the flag, the new code path is off, the old code path is active. The incident is contained while you figure out what went wrong.

Kill Switches: Your Fastest Incident Response Tool

Every critical feature in production should have a kill switch. Not a feature flag for gradual rollout -- a dedicated, permanent flag whose sole purpose is to turn off that feature if it becomes a problem. This is especially important for features that interact with external systems, process payments, or handle data that can't be easily recovered.

I've been in incidents where a third-party API started returning errors at a rate that overwhelmed our retry logic, which cascaded into queue backups, which increased latency across the entire platform. The fix was turning off the feature that called that API. With a kill switch, that's a 10-second operation. Without one, it's a deployment -- in the middle of an active incident, under pressure, with engineers who are already context-switching between diagnosis and mitigation.
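A kill switch around an external call might look like the following sketch. The flag name, the in-memory flag state, and the injected API function are all hypothetical; the point is the shape: check the switch first, and degrade gracefully when it's off or the call fails.

```python
FLAGS = {"recommendations-api-enabled": True}  # toy in-memory flag state


def enrich_order(order, fetch_recommendations):
    # Kill switch: a permanent flag that defaults to on, and exists
    # solely so on-call can turn this integration off in seconds
    # during an incident -- no deploy required.
    if not FLAGS.get("recommendations-api-enabled", True):
        return order  # degrade gracefully: skip the enrichment entirely
    try:
        order["recommendations"] = fetch_recommendations(order["id"])
    except Exception:
        order["recommendations"] = []  # fail soft on API errors
    return order
```

Note that the feature-off path does something sensible (the order still goes through, just without recommendations), which is what makes the switch safe to flip under pressure.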

Kill switches are infrastructure. They should be:

  - Permanent. They don't expire after a rollout; they exist for the life of the feature.
  - Tested. A kill switch that has never been flipped is a guess, not a control. Exercise it regularly.
  - Accessible to on-call. Responders should be able to find and flip the switch without hunting through code or paging the feature's author.
  - Audit-logged. Every change should record who flipped it and when, so flag history can be correlated with incident timelines.
  - Graceful. The feature-off path should degrade the experience, not break it.

Progressive Rollouts: Reducing Blast Radius

The default deployment model is binary. Code is either live for everyone or live for nobody. Feature flags make the rollout a gradient. You can release to 1% of users, observe the metrics, then go to 5%, then 25%, then 100%. At any point, you can halt or roll back.

This is not just a safety mechanism -- it's a fundamentally different way of thinking about risk. Binary deployments treat every release as an all-or-nothing bet. Progressive rollouts treat every release as an experiment with a tunable blast radius. If the new payment flow has a bug, it affects 1% of users instead of all of them. You catch it faster, you fix it with less damage, and you have real production data -- not just what your staging environment told you.

The mechanics of progressive rollout matter. Sticky assignment is critical -- a user who sees the new experience should continue seeing it on subsequent requests. Random assignment per request creates inconsistent experiences and makes debugging impossible. Hash the user ID against the flag to determine assignment, and the experience is consistent for each user while still being randomly distributed across the population.

Metric gates are equally important. Define the metrics that determine whether the rollout proceeds: error rate, latency, conversion rate, whatever matters for that feature. If the metrics degrade beyond your threshold at any rollout percentage, halt automatically. Don't rely on a human noticing a dashboard. Wire the metrics into the flag system so the rollout pauses itself.
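A metric gate can be as simple as comparing current metrics against a pre-rollout baseline. The thresholds, metric names, and where the numbers come from are all assumptions here; the shape is what matters -- the check runs automatically and halts the rollout, rather than waiting for a human:

```python
# Max tolerated relative degradation per guard metric (illustrative).
GUARDRAILS = {
    "error_rate": 0.10,      # no more than +10% over baseline
    "p99_latency_ms": 0.20,  # no more than +20% over baseline
}


def rollout_may_proceed(current: dict, baseline: dict) -> bool:
    """Return False if any guard metric degraded past its threshold.

    The flag system should call this between rollout steps and pause
    the ramp automatically when it returns False.
    """
    for metric, max_increase in GUARDRAILS.items():
        if baseline[metric] == 0:
            continue  # can't compute a relative change against zero
        change = (current[metric] - baseline[metric]) / baseline[metric]
        if change > max_increase:
            return False  # halt the rollout at the current percentage
    return True
```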

The Danger of Permanent Flags

Here's where feature flags become dangerous: when they stay forever. A flag is created for a rollout. The rollout completes. The feature is live for 100% of users. And the flag -- the if/else branch in the code -- stays. Nobody removes it. The code now has two paths: the feature-on path that runs, and the feature-off path that doesn't but is still maintained, still compiled, still tested (maybe), and still occupying mental space in every developer who reads that code.

This is flag debt, and it accumulates exactly like technical debt. One stale flag is harmless. Twenty stale flags are a maintenance burden. Two hundred stale flags are a codebase where nobody can confidently reason about what's actually running. I've seen codebases where flags referenced features that were launched two years ago. The "off" path was dead code that had drifted so far from the current architecture that it wouldn't even work if someone accidentally disabled the flag. That's not a safety net. That's a trap.
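Detecting flag debt can be automated. A sketch of a stale-flag audit, assuming you can pull each flag's classification and creation date from your flag service (the registry format and age limits here are made up for illustration):

```python
from datetime import date, timedelta

# Toy flag registry; a real system would pull this from the flag
# service's API. Fields and values are illustrative.
FLAGS = [
    {"name": "new-checkout", "kind": "release", "created": date(2024, 1, 10)},
    {"name": "payments-kill-switch", "kind": "ops", "created": date(2023, 6, 1)},
]

# Expected maximum lifetime by classification (example values).
MAX_AGE = {"release": timedelta(days=90), "experiment": timedelta(days=30)}


def stale_flags(flags, today):
    """Release/experiment flags past their expected lifetime.

    Ops flags (kill switches) are intentionally permanent and exempt.
    """
    return [
        f["name"]
        for f in flags
        if f["kind"] in MAX_AGE and today - f["created"] > MAX_AGE[f["kind"]]
    ]
```

Run something like this on a schedule and alert the flag's owner; the goal is that a flag past its lifetime is a ticket, not a surprise two years later.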

Flag hygiene is non-negotiable. Every flag should have:

  - An owner. A named person or team responsible for the flag's lifecycle.
  - A classification. Release, ops, experiment, or permission -- the category determines the lifecycle rules.
  - An expiration date. Release and experiment flags get a date by which they should be removed, with an alert when it passes.
  - A removal plan. Deleting the flag and the dead code path is part of the feature's definition of done, not an afterthought.

Centralized Flag Management

If your flags are defined in config files scattered across repositories, you don't have a flag system. You have scattered conditionals. A proper flag system is centralized, runtime-evaluated, and observable.

Centralized means there's one place to see all flags, their current state, who changed them, and when. Runtime-evaluated means changing a flag doesn't require a deployment -- the application reads the flag state from the central system on each evaluation (with appropriate caching). Observable means you can see which flags are being evaluated, how often each path is taken, and what the performance impact is.
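The "runtime-evaluated, with appropriate caching" part usually means a short TTL on a local copy of flag state. A minimal sketch, where `fetch_all` stands in for a call to the central flag service and the injected clock just makes the behavior testable:

```python
import time


class CachedFlagClient:
    """Runtime flag evaluation with a short-lived local cache.

    The TTL bounds how stale a toggle can be, without hitting the
    central service on every single evaluation.
    """

    def __init__(self, fetch_all, ttl_seconds=10.0, clock=time.monotonic):
        self._fetch_all = fetch_all  # callable returning {name: bool}
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}
        self._fetched_at = float("-inf")

    def is_enabled(self, name, default=False):
        now = self._clock()
        if now - self._fetched_at > self._ttl:
            self._cache = self._fetch_all()  # refresh from central system
            self._fetched_at = now
        return self._cache.get(name, default)
```

The TTL is the operational trade-off: a 10-second cache means a kill switch takes effect within roughly 10 seconds everywhere, which is still orders of magnitude faster than a rollback.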

There are solid options here. LaunchDarkly is the most established commercial platform. Unleash and Flagsmith are strong open-source alternatives. OpenFeature is working on a standard API so you can switch providers without rewriting your flag evaluations. The specific tool matters less than the principles: centralized management, runtime evaluation, audit logging, and a clear lifecycle model.

What you should not do is build your own flag system from scratch. I've seen this repeatedly -- a team decides that flags are "just config" and builds a key-value store backed by a database table with a simple API. Six months later, they've reinvented targeting rules, percentage rollouts, audit logging, multi-environment support, and they're maintaining a homegrown system that has none of the reliability or features of the off-the-shelf options. This is a category where buying (or adopting open-source) beats building almost every time.

Flags for Experimentation

Feature flags and A/B testing are natural partners. A flag determines which code path runs. An experiment framework determines which users get which code path and measures the outcome. The flag is the mechanism; the experiment is the methodology.

The key difference between a rollout flag and an experiment flag is intent. A rollout flag exists to gradually expose a feature you've already decided to launch. An experiment flag exists to determine whether you should launch it at all. The metrics are different -- a rollout flag watches for regressions; an experiment flag measures improvements.

Experiment flags need statistical rigor that rollout flags don't. Sample size calculations. Proper control groups. Statistical significance thresholds. Duration requirements to avoid novelty effects. This is where flag management and data science intersect, and doing it well requires both engineering and analytical discipline.

The operational implication is that experiment flags tend to live longer than rollout flags, which makes their cleanup even more important. An experiment that concluded three months ago but whose flag is still in the code is worse than a stale rollout flag -- it's dead code that was specifically designed to create divergent behavior. Clean it up immediately after the experiment concludes and the decision is made.

The Operational Maturity Model

You can gauge a team's operational maturity by how they use feature flags. Here's the progression I typically see:

Level 1: Code Hiding

Flags are used to hide unfinished features in main. They're hardcoded booleans or environment variables. There's no central management. Flag cleanup is ad hoc. This is where most teams start and many stay.

Level 2: Controlled Rollout

Flags support percentage-based rollouts and user targeting. There's a central system to manage flag state. The team uses flags for every production release and has a process for cleaning up after rollout completes.

Level 3: Operational Tooling

Flags are integrated with incident response. Kill switches exist for critical features. On-call engineers know how to use the flag system. Flag changes are audit-logged and tied to incident timelines. Flag state is part of the system's operational runbook.

Level 4: Automated Safety

Flag rollouts are automated with metric gates. The system monitors error rates, latency, and business metrics during rollout and automatically halts or rolls back if thresholds are exceeded. Stale flags trigger automated alerts. Flag lifecycle is enforced by policy.

Most teams are at Level 1 or 2. The jump from Level 2 to Level 3 -- treating flags as operational infrastructure rather than development tooling -- is where the biggest return on investment lives. It's the difference between flags being a convenience and flags being a core part of how you operate production systems safely.

Getting This Right

If you're starting from scratch or cleaning up a messy flag system, here's the practical path:

  1. Adopt a centralized flag system. Stop managing flags in config files and environment variables. Use a purpose-built tool -- commercial or open-source -- that gives you runtime evaluation, targeting, audit logging, and a management UI.
  2. Classify your flags. Every flag should be tagged as release, ops, experiment, or permission. The classification determines the lifecycle rules. Release flags expire. Ops flags are permanent. Experiment flags are bounded by the experiment duration.
  3. Implement kill switches for critical features. Identify the features whose failure would cause the most damage. Add kill switches. Test them. Make them accessible to on-call.
  4. Enforce flag hygiene. Set expiration dates. Alert when flags are past their expected lifecycle. Track flag count as a metric -- if the number is growing monotonically, you have a cleanup problem.
  5. Wire flags into your incident response. When an incident occurs, the first question should be: "Did any flags change recently?" Flag change history should be immediately accessible to responders, correlated with deployment history and metric changes.
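The last step above is mostly a query against the flag system's audit log. A sketch, assuming the log exposes who changed which flag and when (the records and window are illustrative):

```python
from datetime import datetime, timedelta

# Toy audit log; a real flag service exposes this via its API.
AUDIT_LOG = [
    {"flag": "new-checkout", "actor": "alice", "at": datetime(2024, 5, 1, 14, 2)},
    {"flag": "search-rank-v2", "actor": "bob", "at": datetime(2024, 5, 1, 9, 30)},
]


def changes_near(log, incident_start, window=timedelta(hours=2)):
    """Flag changes in the window before the incident began.

    This answers the responder's first question -- "did any flags
    change recently?" -- and should sit alongside deployment history
    in the incident tooling.
    """
    return [
        entry for entry in log
        if timedelta(0) <= incident_start - entry["at"] <= window
    ]
```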

A feature flag is the smallest unit of operational control you can add to your system. It costs almost nothing to implement and provides enormous leverage during the moments that matter most -- deployments, incidents, and experiments.

Feature flags are infrastructure in the same way that load balancers, monitoring, and deployment pipelines are infrastructure. They're not optional developer tooling. They're a core mechanism for operating production systems safely and confidently. The teams that treat them accordingly -- with lifecycle management, operational integration, and disciplined hygiene -- ship faster, respond to incidents faster, and break things less. The teams that treat them as an afterthought accumulate flag debt, lose the ability to reason about their code, and miss the most powerful lever they have for managing risk in production.
