
Kubernetes Won't Save You


The pitch is seductive. Container orchestration that handles scheduling, scaling, service discovery, rolling deployments, and self-healing -- all declaratively configured and infinitely extensible. Kubernetes promises to solve the hard problems of running software in production. And it does solve some of them. But the organizations that adopt Kubernetes expecting it to fix their operational problems are in for a rude awakening.

Kubernetes is a powerful tool. It's also one of the most complex pieces of infrastructure you can adopt. And complexity has a cost -- a cost that many teams dramatically underestimate when they're evaluating the technology on a whiteboard. The gap between "we set up a demo cluster" and "we're running production traffic reliably on Kubernetes" is measured in months of engineering effort, and that's if you know what you're doing.

I'm not here to tell you Kubernetes is bad. It's not. But I've seen too many teams adopt it for the wrong reasons, underestimate the investment, and end up with orchestrated chaos instead of orchestrated containers.

The Complexity Tax

Let's start with what you're actually signing up for. A production-grade Kubernetes deployment involves: the control plane (etcd, API server, scheduler, controller manager), a container runtime, a CNI plugin for networking, an ingress controller, a service mesh (maybe), persistent storage provisioning, RBAC policies, network policies, Pod Security admission (the replacement for the now-removed PodSecurityPolicies), monitoring and logging infrastructure, secret management, certificate management, and a CI/CD pipeline that knows how to build images, push them to a registry, and deploy manifests.

Each of these components has its own configuration surface, its own failure modes, its own upgrade lifecycle, and its own learning curve. The CNI plugin alone -- the thing that makes pod networking work -- has a dozen serious options, each with different performance characteristics, feature sets, and operational trade-offs. Choosing between Calico, Cilium, Flannel, and Weave is a non-trivial decision with long-term implications, and it's just one of dozens of similar decisions you need to make.

This is what I call the complexity tax. It's the ongoing cost of operating a system that has more moving parts than many teams have engineers. Every component needs monitoring. Every component can fail. Every component needs upgrades. The tax isn't paid once at adoption -- it's paid continuously, in engineering hours that could be spent on your actual product.

Before Kubernetes, you had servers running applications. The operational model was straightforward: deploy code to servers, monitor the servers, fix the servers when they break. With Kubernetes, you have a distributed system managing other distributed systems. The abstraction layer is powerful, but it doesn't eliminate the underlying complexity -- it relocates it. And relocated complexity has a way of surprising you at the worst possible times.

What Kubernetes Actually Solves

To be fair, Kubernetes solves real problems. It solves container scheduling -- deciding which node runs which container based on resource requirements and constraints. It solves service discovery -- containers can find each other without hard-coded addresses. It solves scaling -- horizontal pod autoscaling responds to demand. It solves rolling deployments -- new versions are deployed incrementally with health checks. And it solves self-healing -- failed containers are restarted, and workloads on failed nodes are rescheduled onto healthy ones.

These are genuine operational capabilities. If you're running dozens or hundreds of microservices, managing their placement across a fleet of servers, handling traffic routing between them, and coordinating deployments across all of them -- Kubernetes provides a coherent framework for all of this. That's its sweet spot.

But notice what's not on that list. Kubernetes doesn't solve your architecture. If your monolith is poorly structured, breaking it into poorly structured microservices and deploying them on Kubernetes gives you a distributed monolith -- all the complexity of microservices with none of the benefits. Kubernetes doesn't solve your deployment strategy. If you don't have health checks, readiness probes, and rollback criteria defined, the rolling deployment mechanism has nothing to work with. Kubernetes doesn't solve your observability. It generates an enormous volume of metrics and events, but making sense of them requires tooling, dashboards, alerts, and -- most importantly -- people who understand what they're looking at.
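To make the deployment-strategy point concrete: the probes and rollout criteria that the rolling-deployment mechanism depends on live in the Deployment spec itself. Here's a minimal sketch -- the image, port, and health endpoint are placeholders, not a recommendation. Without probes like these, a rolling update will cheerfully replace healthy pods with broken ones.

```yaml
# Illustrative Deployment; image, port, and probe path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below desired capacity during a rollout
      maxSurge: 1         # add at most one extra pod while rolling
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3
          ports:
            - containerPort: 8080
          readinessProbe:   # gates traffic until the pod reports ready
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:    # restarts the container if it wedges
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 15
            failureThreshold: 3
```

The orchestrator only enforces what you declare. If `/healthz` doesn't exist, or returns 200 while the application is broken, the rollout machinery works exactly as configured -- and still ships the bad version.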

When Kubernetes Is Overkill

There's a question that gets asked far too rarely in technology selection: "Do we actually need this?" For Kubernetes specifically, the answer is often no.

If you have a small engineering team -- say, under twenty engineers -- running a handful of services, Kubernetes is almost certainly overkill. The operational overhead will consume a disproportionate share of your engineering capacity. A simple deployment pipeline pushing containers to a managed service like ECS, Cloud Run, or even a handful of well-configured VMs will get you 90% of the operational capability at 10% of the complexity.

If your workloads are relatively homogeneous -- similar tech stack, similar resource profiles, similar scaling patterns -- you don't need a general-purpose orchestrator. Simpler tools will serve you better and let you focus engineering effort on your product instead of your platform.

If your team doesn't have deep operational expertise -- if you don't have people who understand Linux networking, container runtimes, distributed consensus, and systems debugging -- Kubernetes will be a source of endless operational incidents that you lack the skills to diagnose. The abstractions leak. When they do, you need people who can reason about what's happening underneath.

Kubernetes makes sense when you have a large, heterogeneous workload portfolio that needs consistent orchestration. When you have the team size and expertise to operate it. When the complexity tax is justified by the operational leverage it provides. For a lot of organizations, that threshold is higher than they think.

The Skills Gap

The skills gap is the most underestimated cost of Kubernetes adoption. Kubernetes doesn't just require operators who know Kubernetes. It requires operators who understand distributed systems, networking, storage, security, and the intersection of all of these in a containerized environment.

When a pod isn't getting scheduled, someone needs to understand resource requests, node affinity, taints and tolerations, and scheduler behavior. When network traffic isn't reaching a service, someone needs to understand kube-proxy, iptables rules, CNI plugin behavior, and DNS resolution. When a persistent volume claim is stuck in pending, someone needs to understand storage classes, provisioners, and cloud provider APIs. These aren't beginner-level problems, and they're the routine reality of operating Kubernetes.

I've consulted with teams that adopted Kubernetes and then spent six months in a state of constant firefighting because they didn't have the expertise to debug the platform issues they encountered daily. The engineers were smart and capable, but their experience was in application development, not systems engineering. Kubernetes didn't make their applications easier to run. It added a layer of systems problems on top of their application problems.

Hiring for Kubernetes expertise is expensive and competitive. Training existing engineers takes months of dedicated investment. Neither option is cheap, and both need to happen before your production workloads are running on the platform -- not after. Adopting Kubernetes and then hiring someone who knows how to run it is like moving into a house and then looking for an electrician. By the time you find one, things are already on fire.

Managed vs. Self-Hosted

Managed Kubernetes services -- EKS, GKE, AKS -- eliminate some of the operational burden. The cloud provider manages the control plane. They handle etcd backups, API server availability, and control plane upgrades. This is genuinely helpful. The control plane is the hardest part of Kubernetes to operate reliably, and offloading it to your cloud provider is usually the right call.

But managed Kubernetes is still Kubernetes. You still own the worker nodes, the networking, the ingress, the storage, the security policies, the monitoring, and the application lifecycle. A managed Kubernetes service takes maybe 30% of the operational burden off your plate. The other 70% -- the part that involves your workloads, your configuration, your debugging -- is still yours.

I've seen teams adopt a managed Kubernetes service and assume that "managed" means "taken care of." It doesn't. It means the cloud provider handles the API server and you handle everything else. Node upgrades, cluster autoscaling configuration, ingress controller management, pod disruption budgets, resource quota policies, namespace management, RBAC configuration -- all of this is your responsibility regardless of whether the control plane is managed.
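Two small examples of what "everything else" looks like in practice -- a disruption budget and a namespace quota, both firmly on your side of the managed/provider line. Names, namespaces, and numbers here are illustrative:

```yaml
# Objects that remain your responsibility on managed Kubernetes.
# Names, namespaces, and figures are assumptions for illustration.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: prod
spec:
  minAvailable: 2          # voluntary drains (e.g. node upgrades) keep >= 2 pods
  selector:
    matchLabels:
      app: web
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: prod
spec:
  hard:
    requests.cpu: "20"     # caps total CPU requested in the namespace
    requests.memory: 40Gi
    pods: "50"
```

Neither object exists until you write it. Skip the disruption budget and a routine managed node upgrade can drain every replica of a service at once; skip the quota and one team's runaway deployment can starve the rest of the cluster.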

Self-hosted Kubernetes -- running your own control plane -- is an even bigger commitment. Unless you have a specific technical or compliance reason to self-host, don't. The operational cost of running and upgrading etcd clusters, managing API server certificates, and ensuring control plane high availability is significant, and managed services handle it better than most teams can in-house.

The Ecosystem Trap

The Kubernetes ecosystem is vast and fragmented. For every operational concern, there are multiple competing solutions. Service mesh? Choose between Istio, Linkerd, and Consul Connect. Monitoring? Prometheus is the standard, but then you need Thanos or Cortex for long-term storage, and Grafana for visualization, and Alertmanager for routing. Secrets management? Sealed Secrets, External Secrets, Vault integration, or native Kubernetes Secrets with encryption at rest. GitOps? Flux or ArgoCD.

Each choice adds another component to operate, another configuration surface to manage, and another potential failure point. The ecosystem encourages a sprawl of tools, each solving a narrow problem, that collectively create an operations burden greater than the sum of its parts. I've seen Kubernetes clusters where the operational tooling -- monitoring, logging, service mesh, policy enforcement -- consumes more resources and attention than the application workloads it's supposed to support.

The antidote is restraint. Don't adopt every tool that solves a theoretical problem. Start with the minimum viable platform: container runtime, basic networking, monitoring with Prometheus, logging to a central sink, and a CI/CD pipeline. Add complexity only when a concrete problem demands it. If you don't have a service mesh problem, you don't need a service mesh. If you're not doing GitOps, you don't need Flux or ArgoCD. Every tool you add should solve a problem you're actually experiencing, not a problem you might experience someday.

What You Actually Need First

Before adopting Kubernetes, make sure the foundations are in place: a CI/CD pipeline that can build container images and push them to a registry, health checks and rollback criteria defined for every service, centralized monitoring and logging that your team actually reads, an architecture with sensible service boundaries, and engineers with the systems expertise to debug the platform when its abstractions leak. Without them, Kubernetes amplifies your problems rather than solving them.

The Right Reasons

Despite everything I've said, there are right reasons to adopt Kubernetes. You're running dozens or hundreds of services and need consistent orchestration. You have heterogeneous workloads that benefit from a unified scheduling and deployment model. You're operating at a scale where the efficiency gains from bin-packing and auto-scaling justify the platform overhead. You have the team size and expertise to support the operational commitment.

When these conditions are met, Kubernetes is excellent. It provides a powerful, extensible, well-supported platform for running containerized workloads at scale. The ecosystem, despite its complexity, offers solutions for nearly every operational challenge. The community is enormous and active. The portability across cloud providers is real and valuable.

But "Kubernetes is excellent when the conditions are right" is a very different statement than "Kubernetes will solve your problems." The first is true. The second is the kind of thinking that leads to twelve-month migration projects that leave teams worse off than when they started.

Kubernetes is an answer. Make sure you're asking the right question. If the question is "how do we orchestrate containers at scale?" then Kubernetes is probably the answer. If the question is "how do we fix our deployment problems?" or "how do we make our operations more reliable?" -- Kubernetes might be the answer, but only after you've addressed the foundational issues that no orchestrator can solve for you.

Don't let the technology lead the strategy. Define what you need. Evaluate whether Kubernetes is the right tool to provide it. And if you do adopt it, go in with clear eyes about the investment required. The organizations that succeed with Kubernetes are the ones that treated it as a serious infrastructure commitment, invested in the team and the tooling, and built the operational foundations before migrating their production workloads. The organizations that struggle are the ones that expected the technology to do the hard work for them. Kubernetes is powerful. But it won't save you from yourself.
