
Chaos Engineering Is Not Breaking Things


Mention chaos engineering in a leadership meeting and you'll get one of two reactions. Either someone's eyes light up because they read about Netflix's Chaos Monkey in a blog post five years ago, or someone's face goes pale because they think you're proposing to randomly break production. Both reactions are wrong, and they stem from the same misunderstanding: conflating chaos engineering with destruction.

Chaos engineering is not about breaking things. It's about learning things. Specifically, it's about learning how your system behaves under stress, failure, and unexpected conditions -- before those conditions find you on a Saturday night. It's the scientific method applied to distributed systems. You form a hypothesis, design an experiment, control the variables, limit the blast radius, run the experiment, and analyze the results. That's not sabotage. That's engineering.

The Scientific Method for Systems

Real chaos engineering follows a disciplined process. It starts with defining steady state -- the normal, healthy behavior of your system expressed in measurable terms. Request latency is under 200ms at the 99th percentile. Error rate is below 0.1%. Orders are processing successfully. These are your baselines.
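Steady state is easiest to hold a team to when it's written down as code rather than kept as tribal knowledge. A minimal sketch using the thresholds above; the actual metric values would come from whatever observability system you already run:

```python
from dataclasses import dataclass

@dataclass
class SteadyState:
    """Measurable definition of 'normal' for one service."""
    max_p99_latency_ms: float = 200.0   # 99th percentile request latency
    max_error_rate: float = 0.001       # 0.1% of requests

def within_steady_state(state: SteadyState,
                        p99_latency_ms: float,
                        error_rate: float) -> bool:
    # Both baselines must hold; a violation means the system has drifted
    # from steady state, whether or not an experiment is running.
    return (p99_latency_ms <= state.max_p99_latency_ms
            and error_rate <= state.max_error_rate)
```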

Next, you form a hypothesis. "If we lose one of three database replicas, the system will failover to a healthy replica and request latency will increase by no more than 50ms during the failover window." This is a specific, testable, falsifiable prediction. It's what separates chaos engineering from "let's see what happens when we unplug this."

Then you design the experiment. What's the injection? How long does it run? What are you measuring? What constitutes a pass or fail? What are the abort criteria -- the conditions under which you stop the experiment because the impact exceeds what you intended? Every chaos experiment should have a clearly defined abort button before it starts.
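One way to enforce that discipline is to write the experiment down as data before anything runs. A minimal sketch with hypothetical field names; the point is that the hypothesis, duration, metrics, and abort threshold all exist before the injection does:

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    hypothesis: str          # falsifiable prediction, written up front
    injection: str           # what failure is introduced, and where
    duration_s: int          # how long the injection runs
    metrics: list[str]       # what is measured to decide pass/fail
    abort_error_rate: float  # stop immediately past this threshold

replica_loss = ChaosExperiment(
    hypothesis=("Losing one of three DB replicas adds no more than "
                "50ms to p99 latency during the failover window"),
    injection="terminate replica db-2 in staging",
    duration_s=300,
    metrics=["p99_latency_ms", "error_rate", "failover_time_s"],
    abort_error_rate=0.01,   # abort well before customer-visible impact
)
```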

You run the experiment, observe the results, and compare them to your hypothesis. Did the system behave as expected? If yes, you've increased confidence in your resilience. If no, you've discovered a weakness before your customers did. Either outcome is valuable. Both outcomes are the result of deliberate, controlled experimentation -- not random destruction.

Chaos Monkey Was the Beginning, Not the Whole Story

Netflix's Chaos Monkey is the most famous chaos engineering tool, and it's also the most misunderstood. Chaos Monkey randomly terminates instances in production. This sounds reckless until you understand the context: Netflix's architecture was specifically designed for instance failure. Every service was built to be stateless, horizontally scalable, and tolerant of individual instance loss. Chaos Monkey wasn't testing whether the system could handle failure. It was continuously validating that the system still handled failure, because architectures drift and assumptions erode.

But most organizations aren't Netflix. Most systems aren't designed with the same degree of failure tolerance. Copying Chaos Monkey without copying the architectural foundations that make it safe is a recipe for self-inflicted outages. The lesson from Netflix isn't "randomly kill things in production." It's "build systems that tolerate failure, then continuously verify that they do."

The chaos engineering discipline has matured significantly since Chaos Monkey. Modern practice emphasizes controlled experiments with graduated blast radius, not random destruction. You start small, validate your hypotheses, build confidence, and gradually expand the scope. The trajectory looks like careful science, not reckless experimentation.

Blast Radius Is Everything

The most important concept in chaos engineering -- the one that separates it from "just turning stuff off" -- is blast radius control. Every experiment should have a defined, limited blast radius. You're not injecting failure into your entire production environment. You're injecting a specific failure into a specific component while monitoring the impact on a specific set of metrics.

Start with the smallest possible blast radius. Inject 100ms of latency into network calls between two services in a single availability zone. Terminate a single instance in a stateless service that has ten instances running. Simulate a DNS resolution failure for one downstream dependency. These are contained experiments that test specific resilience properties without risking customer-visible impact.
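On a single Linux host, the 100ms latency injection can be as simple as a traffic-control rule. A sketch assuming root access, the `tc` utility, and that `eth0` is the relevant interface; real tooling would scope the delay to specific destinations, which is omitted here:

```python
import subprocess
import time

def inject_latency(interface: str = "eth0",
                   delay_ms: int = 100,
                   duration_s: int = 60) -> None:
    """Add fixed latency to all egress traffic on one host, then remove it."""
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", f"{delay_ms}ms"]
    remove = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    subprocess.run(add, check=True)
    try:
        time.sleep(duration_s)              # observe your metrics during this window
    finally:
        subprocess.run(remove, check=True)  # always clean up the injection
```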

As you build confidence, you can expand. Move from single instance to single AZ. From network latency to network partition. From one dependency to multiple. But each expansion is deliberate, based on evidence from smaller experiments, and always with clearly defined abort criteria.

I've seen teams skip this progression. They read about chaos engineering, get excited, and their first experiment is killing an entire availability zone in production during peak traffic. The system -- predictably -- doesn't handle it well. Users are impacted. Leadership gets upset. The chaos engineering program gets shut down before it ever really started. The problem wasn't chaos engineering. The problem was skipping the fundamentals.

The Difference Between Chaos and Destruction

There's a clear line between chaos engineering and just breaking things. It comes down to five properties:

1. A falsifiable hypothesis about steady-state behavior, written before anything runs
2. A defined, limited blast radius
3. Metrics measured before, during, and after the injection
4. Abort criteria agreed on in advance, with a way to stop immediately
5. Analysis of the results that feeds back into the system and the next experiment

If your "chaos engineering" doesn't have all five of these properties, you're not doing chaos engineering. You're just causing incidents with extra steps.

Game Days: Where Teams and Systems Get Tested

Game days are structured chaos engineering events where teams simulate real-world failure scenarios in a controlled setting. They combine technical experimentation with human response validation. It's not just about whether the system handles the failure. It's about whether the team handles the failure.

A well-run game day has a facilitator who introduces failure scenarios, observers who monitor system behavior and team response, and participants who respond to the failures as they would in a real incident. The facilitator controls the pace and can escalate or de-escalate based on how things are going.

Game days test things that automated chaos can't: communication patterns, escalation procedures, decision-making under pressure, knowledge gaps. Does the on-call engineer know how to failover the database? Does the team know who to contact when the payment processor is down? Can the incident commander make prioritization decisions when multiple things are failing simultaneously?

I've facilitated game days that revealed more about organizational resilience than any automated experiment ever could. In one case, the system handled a simulated region failover perfectly -- auto-scaling kicked in, traffic shifted, health checks passed. But the team froze. Nobody knew the failover had happened automatically because the alerting was configured to page on the symptom that had already been remediated. The technology worked. The human layer didn't. That's a finding you only get from game days.

Automated Chaos: Continuous Validation

Game days are valuable but expensive. They require coordination, preparation, and dedicated time from multiple teams. You can't run them daily. Automated chaos fills the gap -- continuously running small, controlled experiments in production to validate that resilience properties hold over time.

Tools like Gremlin, Litmus, and Chaos Mesh allow you to define experiments as code, schedule them to run continuously, and integrate the results into your observability pipeline. A well-implemented automated chaos program might run dozens of small experiments per day: terminating instances, injecting latency, simulating dependency failures. Each experiment validates a specific hypothesis and alerts if the system's behavior deviates from expectations.

The key word is "well-implemented." Automated chaos without proper guardrails is automated destruction. Every automated experiment needs the same controls as a manual one: defined hypothesis, measured metrics, blast radius limits, and automatic abort. The automation makes it scalable. The controls make it safe.
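The specific APIs of Gremlin, Litmus, and Chaos Mesh all differ, so here is only a generic sketch of what "the controls make it safe" means in code: every injection is wrapped in a steady-state check and an unconditional cleanup. The check_steady_state, start_injection, and stop_injection hooks are hypothetical stand-ins for your own tooling, and the experiment object follows the earlier sketch:

```python
import time

def run_guarded(experiment, check_steady_state, start_injection, stop_injection,
                poll_interval_s: float = 5.0) -> bool:
    """Run one automated experiment; abort the moment steady state is violated."""
    start_injection(experiment)
    deadline = time.monotonic() + experiment.duration_s
    try:
        while time.monotonic() < deadline:
            if not check_steady_state():
                return False   # abort criteria hit: stop and flag the regression
            time.sleep(poll_interval_s)
    finally:
        stop_injection(experiment)  # injection is always cleaned up, pass or fail
    return True  # hypothesis held for the full window
```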

Getting Organizational Buy-In

The hardest part of chaos engineering isn't the technology. It's the politics. Telling leadership you want to intentionally inject failures into production sounds insane if you don't frame it correctly. And honestly, it should sound concerning -- a healthy skepticism protects against reckless experimentation.

The framing that works is risk reduction. Every system has failure modes. Those failures will happen eventually -- hardware fails, networks partition, dependencies go down. The question isn't whether these failures will occur, but whether you'll discover your system's response to them on your schedule or your customers' schedule. Chaos engineering is choosing to learn on your terms.

Start with the economic argument. What does an hour of downtime cost? What does a major incident cost in engineering time, customer trust, and revenue? If a controlled experiment that takes an hour of engineering time reveals a failure mode that would have caused four hours of downtime, the ROI is immediate and obvious.
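That back-of-envelope math is worth writing out explicitly for stakeholders. The numbers below are purely illustrative assumptions; substitute your own downtime cost and engineering rates:

```python
# Illustrative numbers only -- replace with your own.
downtime_cost_per_hour = 50_000   # revenue, recovery effort, customer trust, in dollars
experiment_cost = 2 * 150         # two engineers for an hour at $150/hr
avoided_outage_hours = 4

savings = avoided_outage_hours * downtime_cost_per_hour - experiment_cost
print(f"Net benefit of one successful experiment: ${savings:,}")
# -> Net benefit of one successful experiment: $199,700
```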

Then demonstrate safety with small wins. Run your first experiments in staging or pre-production. Document the findings. Show that the process is controlled and disciplined. Gradually introduce production experiments with the smallest possible blast radius. Build trust through transparency -- share your experimental design, your abort criteria, and your results with stakeholders before and after each experiment.

The Maturity Model

Chaos engineering maturity develops over time. Trying to jump to advanced practices without building foundations leads to the "we tried chaos engineering and broke prod" outcome that kills programs.

Level 1: Knowledge Building

The team understands the principles of chaos engineering. They've identified critical failure modes and have hypotheses about how the system handles them. No experiments have been run yet, but the intellectual framework is in place.

Level 2: Controlled Experiments in Non-Production

Experiments run in staging or pre-production environments. The team is learning the tools, refining their experimental methodology, and building confidence. Findings are documented and drive architectural improvements.

Level 3: Production Experiments with Limited Blast Radius

Experiments run in production targeting a small percentage of traffic or a single instance. Abort criteria are well-defined. The team has demonstrated the discipline to run experiments safely. Results are shared broadly.

Level 4: Automated Continuous Chaos

Small experiments run continuously and automatically in production. Results are integrated into the observability pipeline. Regressions in resilience are detected and alerted on. Chaos experiments are part of the CI/CD pipeline -- new deployments are validated against known failure modes.

Level 5: Chaos as Culture

Resilience thinking is embedded in the engineering culture. Teams design for failure from the start and use chaos experiments to validate their designs. Game days are regular events. Chaos experiments inform architecture reviews and capacity planning. The question isn't "should we test this failure mode?" but "have we tested this failure mode yet?"

Starting Small

If you haven't started with chaos engineering, here's the simplest possible first experiment: pick a stateless service with multiple instances. Form a hypothesis: "If we terminate one instance, the load balancer will route traffic to remaining instances and error rate will not increase." Terminate one instance. Watch your metrics. Did the hypothesis hold?
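In a Kubernetes environment, that first experiment can be a single pod deletion plus a metrics check. A sketch assuming the official kubernetes Python client and a hypothetical get_error_rate() hook wired to your metrics backend; the pod name, namespace, settle window, and noise tolerance are all assumptions:

```python
import time
from kubernetes import client, config  # pip install kubernetes

def terminate_one_instance(pod_name: str, namespace: str, get_error_rate):
    """Delete one pod of a stateless service and check the error-rate hypothesis."""
    config.load_kube_config()
    baseline = get_error_rate()                         # hypothetical metrics hook
    client.CoreV1Api().delete_namespaced_pod(pod_name, namespace)
    time.sleep(120)                                     # let traffic re-route and settle
    after = get_error_rate()
    hypothesis_held = after <= baseline * 1.05          # allow small measurement noise
    return baseline, after, hypothesis_held
```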

If it did, congratulations -- you've validated a resilience property. If it didn't, you've found a problem worth fixing before it finds you. Either outcome justifies the experiment. Either outcome is the product of careful, controlled, hypothesis-driven engineering.

The goal of chaos engineering isn't to cause outages. It's to prevent them. You're trading a small, controlled disruption now for the avoidance of a large, uncontrolled disruption later. That's not reckless. That's responsible.

Stop calling it "breaking things." Start calling it what it is: the scientific method applied to production systems. Form hypotheses. Design experiments. Control variables. Analyze results. Iterate. The systems that survive real-world failures aren't the ones that avoided chaos -- they're the ones that practiced for it.
