Do I Need Kubernetes?

A question I often hear from teams - both new and established - is “should we host our stack on Kubernetes?”. Given all the buzz it gets in the tech world, a lot of people assume the answer is yes.

I’ve been working with k8s for several years - often with very powerful and complex platforms - and I think the truth is more nuanced.

Here’s my attempt at untangling that decision. It’s geared toward startups and self-sufficient teams in wider organisations with responsibility for hosting their own products. It might also have value to people in more traditional IT departments at larger organisations.

How Can It Help Me?

Kubernetes isn’t just a 2018-era buzzword. It’s a robust, highly scalable system that allows you to construct application deployments out of well-considered primitives (Pod, Service, Ingress, etc.) and then does its damnedest to make reality match what you declared. When applications crash it restarts them. When whole swathes of underlying machines disappear it tries to replace them. If you’re running numerous services (probably developed as a microservice architecture) and you’re looking for efficiency, resilience and a good deployment story, it can do a lot for you.

Want a basic degree of resilience in your service? (Yes, you do.) Run multiple replicas in your Deployment and balance traffic between them.
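
As a rough sketch of that pattern (the name my-api, the image and the port are all placeholders), a Deployment running several replicas with a Service load-balancing across them might look like this:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-api                  # hypothetical application name
    spec:
      replicas: 3                   # three copies, so losing one pod doesn't take you down
      selector:
        matchLabels:
          app: my-api
      template:
        metadata:
          labels:
            app: my-api
        spec:
          containers:
            - name: my-api
              image: registry.example.com/my-api:1.0.0   # placeholder image reference
              ports:
                - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: my-api
    spec:
      selector:
        app: my-api                 # the Service spreads traffic across every matching pod
      ports:
        - port: 80
          targetPort: 8080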

If your workload is ‘bursty’ (e.g. a lot of API traffic) you could build on this by rigging up autoscaling to grow capacity when it’s needed. This can save you a lot of money - instead of paying for peak capacity all of the time, provision a baseload amount to keep the platform ticking over and bring up more replicas when they’re needed. And if you can feed the queue length into your autoscaling metrics, this works for queue-based workloads too.
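
A minimal sketch of this, assuming the metrics-server add-on is installed and the hypothetical my-api Deployment from earlier exists (on older clusters the API version may be autoscaling/v2beta2 instead):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-api
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-api              # hypothetical Deployment to scale
      minReplicas: 3              # the baseload that keeps the platform ticking over
      maxReplicas: 20             # the most you're willing to pay for at peak
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # add replicas once average CPU passes 70%

For queue-based workloads the idea is the same, but you scale on an external or custom metric (the queue length) rather than CPU, which needs an extra metrics adapter in the cluster.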

Worried about your code breaking? Configure liveness and readiness probes and Kubernetes will restart containers that fail and stop routing traffic to pods that aren’t ready. Likewise for hardware failure - I’ve seen clusters that lost half their nodes keep on trucking like nothing happened. A well-configured and maintained Kubernetes cluster can be incredibly robust. If you have the resources you might even try running a chaos-monkey-like tool to make certain your stack will tolerate regular failures.
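
Here’s a hedged sketch of what that looks like on a container, assuming the application exposes health endpoints at /healthz and /ready on port 8080 (all hypothetical):

    # inside the Deployment's pod template (spec.template.spec)
    containers:
      - name: my-api
        image: registry.example.com/my-api:1.0.0
        livenessProbe:              # restart the container when this check starts failing
          httpGet:
            path: /healthz          # hypothetical health-check endpoint
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 15
        readinessProbe:             # take the pod out of Service endpoints until this succeeds
          httpGet:
            path: /ready            # hypothetical readiness endpoint
            port: 8080
          periodSeconds: 5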

Kubernetes has excellent features for integrating with your CI workflow. The most common pattern I see is a build (ideally standardised across all your projects) that pushes an image to a Docker registry then kicks your k8s cluster to roll it out. Depending on taste that can be done by updating the Deployment to reference the new tag, or by re-pointing an existing tag at the new image and triggering k8s to recreate the pods. In most cases (migrations ruin everything) deployment can be fully automated: if you trust your tests and feature flags it can be 100% automatic (aka Continuous Delivery); for the less brave a manual approval step after the build eases the pain. Either way your developers should be able to release most builds to the cluster without any help from Kubernetes specialists.
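
The deploy step itself can be tiny. As an illustrative sketch of the first variant (the deployment name, container name and image are placeholders), the pipeline’s final stage boils down to something like:

    # point the Deployment at the freshly pushed image tag
    kubectl set image deployment/my-api my-api=registry.example.com/my-api:${GIT_SHA}
    # wait for the rollout to finish so the pipeline fails if the new pods don't come up
    kubectl rollout status deployment/my-api --timeout=120s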

In a similar vein to CI, it can help you standardise logging and monitoring for your applications. Doing this is by no means unique to Kubernetes, but having a single cluster-wide system that gathers data about all your services in one place can greatly ease the pain of debugging. I’ve seen particularly good results from using fluentd to pipe JSON-structured logging output from applications into AWS CloudWatch Logs and querying it through CloudWatch Logs Insights.

Finally it can do a great deal to increase efficiency - both at the hosting layer (i.e. jamming as many applications onto those expensive EC2 instances as possible) and by reducing the time your developers need to spend deploying software. Humans cost more than computers, so for most organisations the second of these will be the bigger win. But Kubernetes isn’t magic - I’ve seen some beautiful, efficient clusters and some total gas guzzlers. You’ll only save money with Kubernetes if you invest in using it correctly.

Where Does Kubernetes Shine?

First of all, think about your workload. What kind of applications do you need to run? How do they talk to each other and to the outside world? From experience I’d say the following attributes make your stack a good fit:

  • Broadly speaking, have you followed a microservice architecture? There’s little value in Kubernetes if your world only has one application. You’ll need to Dockerize your applications to deploy them to k8s; doing this from day 1 of any project is a good way to make yourself think about the boundaries between services.
  • Are your services exposed to each other and the outside world via HTTP(S) (probably yes, it’s 2020)? This fits k8s’ model nicely and you can use one of the standard ingress controllers to front them.
  • Are your applications suitable for load balancing? No local state (use PostgreSQL / Redis / whatever for that), communication over known endpoints, fast startup/shutdown. That isn’t to say you can’t keep short-lived state like a Redis cache in your cluster, but in many cases you’d do better to use the boxed services a cloud provider offers.
  • Kubernetes is also well suited to headless applications like batch processing (through its Job controller) and long-lived queue-consumers.
  • Is memory (and to a lesser extent CPU) usage predictable? Kubernetes will try to pack your applications onto the same physical machines, so if one of them goes haywire and consumes all the RAM, other workloads may be randomly killed. In my experience this is the single biggest source of instability on Kubernetes clusters. If you understand the resource usage of your applications you can declare resources.requests and resources.limits to guarantee they’ll always get the memory they need and won’t be bad neighbours (see the sketch after this list).
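
As promised, here’s a sketch of that resources block as it appears on a container spec; the numbers are purely illustrative and need to come from measuring your own applications:

    # on each container in the pod spec
    resources:
      requests:
        memory: "256Mi"     # the scheduler reserves this much when placing the pod
        cpu: "250m"
      limits:
        memory: "512Mi"     # exceed this and the container is OOM-killed
        cpu: "500m"         # CPU is throttled at the limit rather than killed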

Conversely, here are a few kinds of workload I don’t think you should use it for:

  • Static websites. Typically you’d do this by baking your content into an Nginx-derived image and exposing it via your cluster’s ingress controller. That’s a terrible way to host static content: all those little copies of Nginx need to be maintained, and it’s inefficient and far less resilient and performant at the network level than it could be. Sure, you could stick it behind a CDN, but if you’re doing that, why not have a cloud service host the content as well?
  • Hosting untrusted code. That could mean applications supplied by your customers or third-party code with a history of security problems, e.g. Wordpress or a dodgy library from NPM. By default Kubernetes’ features for isolating workloads aren’t great. You can add things like Calico to control network access (see the sketch after this list) but it’s easy to screw up, and your security model is always going to be 100% dependent on the container runtime. The default for this (Docker, built on Linux namespaces and cgroups) offers a huge attack surface to compromised applications: if a codebase your cluster is running gets hacked, an attacker will not find it hard to escalate their access to the rest of the cluster. Interesting work is being done on alternatives to this model (e.g. Kata Containers) but it’s not yet mainstream enough to recommend to the average user.
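
For completeness, here’s a hedged sketch of one small piece of that network control: a default-deny NetworkPolicy, which only has any effect if your network plugin (Calico or similar) enforces it. The namespace name is hypothetical:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-ingress
      namespace: untrusted          # hypothetical namespace holding the risky workload
    spec:
      podSelector: {}               # an empty selector matches every pod in the namespace
      policyTypes:
        - Ingress                   # block all inbound traffic unless another policy allows it

You’d then add explicit allow policies for the traffic you actually want, which is exactly where the “easy to screw up” part comes in.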

If your workload isn’t a good fit you may still have options for shoehorning it into Kubernetes (e.g. volumes can be used for long-lived state), but expect to spend a lot of engineering time working around the problems. A lot of these patterns are bad practices anyhow, so your time might be better spent designing them out of the stack altogether - they’re only going to hurt you further down the line.

No Magic Bullet

Kubernetes isn’t a magic bullet. It can help to move the complexities of hosting applications into a well-designed layer of their own but it won’t make them go away. You’re always going to have to secure and maintain your platform.

To make a cluster useful for the average workload a menagerie of add-ons will be required. Some of them almost everyone uses, others are somewhat niche. Examples include the Nginx ingress controller, cert-manager and the cluster-autoscaler (which adds extra nodes when you don’t have enough capacity). Having a bespoke set of software operating key features of the environment makes your cluster a unique snowflake and it needs to be managed as such. These add-ons also need to be updated regularly and can sometimes be of questionable quality. Configuration management tools like Helm or Terraform are nigh-on mandatory: tinker with a cluster by hand at your peril; absent a declarative setup you’ll never be able to spin up another one in exactly the same way. I’ve seen this lead to no end of problems when operating or replacing more mature clusters.

When running any nontrivial stack on Kubernetes a degree of curation is always going to be required. Letting anybody deploy whatever they like, however they like, to your cluster can only lead to chaos. You’ll end up with a mess of inconsistently named applications scattered across dozens of namespaces (or worse, crammed into one) with nobody knowing how they fit together. Congratulations: you took your old spaghetti infrastructure and replaced it with a new flavour of spaghetti infrastructure, just served on a cooler kind of plate.

The most successful Kubernetes implementations I’ve seen are those where infrastructure specialists work with developers to make sure workloads are well-configured, standardised, protected from one another and have defined patterns of communication. The specialists handle the initial setup of an application on the infrastructure, hopefully wired up to a build/release system so dev teams can ship new versions of the code without help. In a lot of ways this reflects the wider culture of your organisation - if engineering is a chaotic free-for-all of bad communication and ill-defined responsibilities, your hosting environment will mirror it. At best this leads to unreliability; at worst, an unreliable, unmaintainable, expensive mess.

Cost of Ownership

If you’re running applications on a large scale Kubernetes can save you a lot of money. Look into features like autoscaling (both for the cluster and replica sets) and pools of spot instances (EC2) or preemptible VMs (Google). In a large environment this alone might carry the decision.

An ecosystem of excellent tools exists to help any engineer bring up a toy cluster for testing their applications. In a way that’s great - the learning curve starts gently - but it’s all too easy for Kubernetes to creep into production and become a crucial part of your business before you realise the size of the commitment. It has complex failure modes and getting the best out of it takes a lot of specialist skills. Letting an inexperienced dev team throw together a cluster in a hurry (the use of Kops is a common antipattern) is a recipe for disaster: it’ll work great for a few months, but if you need to make major changes, reprovision the cluster or troubleshoot a consensus failure, you’re in for a bad time.

Building a k8s cluster from scratch is like compiling your own kernels: a great rainy-day activity to learn how things work but a downright awful way to run production applications. Instead you should run one of the boxed solutions like AWS EKS or Google’s GKE. People far smarter than me invest huge amounts of time getting these right and it makes sense to leverage that effort, even if you need to pay a few hundred bucks a month for it.

Even using a boxed Kubernetes distribution you’ll need specialist skills. The control plane is Amazon’s problem, but sooner or later you’ll manage to trigger some obscure bug on the nodes, typically right at the point when your business is at its busiest. You must be prepared to commit resources to running the system and be okay with paying for them. Kubernetes has a short release cycle so you’ll need to upgrade your cluster at least once a year, and with regular changes to its APIs this can be a nontrivial piece of work. Whatever add-ons you run will need maintaining too. If you’re a small shop with lightweight requirements a part-time resource will do, but trust me on this: when all of your containers are blowing up with an obscure threading error at 4am, you’re going to need somebody to turn to.

All of this leads me to believe there’s a threshold level of size/complexity for effective Kubernetes usage. If you’re running a small number (say <5) of simple, undemanding services, it may not be worth the bother. Where it really shines is environments with a lot of deployment complexity to manage, dynamic workloads, or where a large complexity/cost saving can come from standardising your tooling.

So… Should I?

As you might expect for the conclusion of a 2,000 word essay, the answer is “it depends”. If you don’t already have one, drawing a diagram of your architecture will help.

If you have only a handful of services that you don’t expect to multiply there are probably easier, cheaper ways to host your tech stack. Look at AWS ECS (especially in conjunction with Fargate), rewriting your APIs or batch jobs as Lambdas / Cloud Functions or even hosting your applications with a simple PaaS provider like Heroku. And retro as it sounds, don’t overlook the value and robustness that can come from running simple low-traffic applications on a couple of well-maintained Linux boxes.

Security and compliance requirements may affect your decision. If it’s mandatory to host the workload on-premises you’re likely to see large operational overheads, and while that doesn’t preclude the use of Kubernetes, a more traditional solution might fit you better. If you need a set of add-ons but compliance requires that you vet every piece of software you run, the effort might not be realistic.

I’ve seen a lot of startups who thought they needed Kubernetes when they didn’t and ended up pouring a lot of resources into it. Think carefully about whether you need all that power and whether you can afford to implement it well. If your requirements justify the commitment then go for it. If not, keep your options open by considering the path you might take to Kubernetes in future and factoring that into your technical decisions. Run your applications in Docker from day 1 (with docker-compose it’s as valuable for dev as it is for production) and think carefully before letting them store local state.
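
As a minimal sketch of that day-1 habit (service names, ports and credentials are all placeholders), a docker-compose file that keeps state out of the application container might look like:

    version: "3.8"
    services:
      app:
        build: .                    # build the application image from the local Dockerfile
        ports:
          - "8080:8080"
        environment:
          DATABASE_URL: postgres://app:app@db:5432/app   # hypothetical connection string
        depends_on:
          - db
      db:
        image: postgres:12          # state lives in a proper database, not in the app container
        environment:
          POSTGRES_USER: app
          POSTGRES_PASSWORD: app
          POSTGRES_DB: app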

On the other hand it’s important to gauge the scope for future growth. If you have only a couple of simple services today you probably don’t need k8s. But are they about to mutate into dozens, and if so should your organisation start to gain the skills for managing that complexity now? You don’t want to build a 747 when a biplane would do, but on the other hand that Sopwith Camel isn’t going to be much use when 300 passengers turn up at the gate.

To sum up - infrastructure decisions are generally a function of the choices you make for software architecture. Don’t let your infrastructure be an afterthought and don’t forget that the more of it you need, the higher the cost-of-ownership. It’s okay to invest in complex systems if you need them but think carefully about the commitment before you do.