Following a lean methodology can help us significantly increase the ROI of Kubernetes, improve workload performance, and save time on maintenance and troubleshooting.
Translated from "What Does It Mean to Keep Clusters Lean?" by Ant Weiss.
As Kubernetes becomes the de facto standard for building modern software delivery platforms, managing clusters effectively is becoming increasingly important. This recognition gave rise to the popular slogan "Keep the cluster lean".
Are you practicing DevOps? Then you're already practicing lean! Lean principles are at the heart of the DevOps methodology. It's no coincidence that the "L" in the CALMS model (developed by DevOps pioneers John Willis and Damon Edwards and later extended by Jez Humble) stands for Lean.
Modern DevOps practices are rooted in lean management principles. These principles, including just-in-time parts delivery and eliminating waste, were originally defined by the Toyota Production System (TPS) in post-World War II Japan. Later, in the 1990s, the West coined the term "lean" to describe these philosophies. Books such as "The Machine That Changed the World" explain TPS and how lean management has spread globally to various industries, including software development.
The book "Lean Software Development", published in 2003 by Mary and Tom Poppendieck, defines the types of waste encountered in software delivery and describes how this waste can be eliminated through a structured continuous improvement process.
Lean principles in a cloud-native world
In a nutshell, the principles of lean management are as follows:
- Focus on value
- Eliminate waste
- Pursue continuous improvement
By following these principles, businesses can deliver value faster, reduce waste, and achieve higher production quality.
Cloud-native infrastructure is designed to deliver high-quality software quickly and optimize value streams by providing cloud resources without lengthy planning processes and underutilized server farms. Autonomous engineering teams deploy autonomous microservices that interact asynchronously and efficiently with other services.
Of course, this is the ideal situation. But in real life, most Kubernetes clusters are wasteful and unreliable. There is so much going on in them that it is impossible to manage them effectively without applying a well-defined approach.
As Kubernetes becomes the de facto standard way to run production-grade software, whether web services, batch processing, or data pipelines, there is a growing recognition that the only way to achieve a satisfying cloud ROI and stay competitive in the challenging global IT market is to consciously keep your clusters lean.
Keep your cluster lean
Here are some practices to help keep your Kubernetes cluster lean.
1. Focus on value
Once we standardize our delivery platform to use Kubernetes, it makes sense to run most of our workloads there. However, our workloads and the type of value they provide can vary greatly. The reliability criteria for a long-running web service are different from those for ML model training or periodic batch jobs. In addition, environmental maturity needs to be considered. Development experiments, performance testing, CI jobs, and one-time maintenance procedures have different availability requirements and reasonable operating costs.
Node selection
Based on the level of availability you need and the engineering time you can afford to spend, define the VM instance types to use for your cluster (for example, choosing preemptible instances instead of reserved capacity).
You can use the LearnK8S Kubernetes Instance Calculator for initial planning.
All cloud providers now offer optimized instances, whether running purpose-built operating systems such as Bottlerocket OS or built on ARM processors. Using such instances can make our clusters leaner and cheaper, but we need to validate beforehand that they suit our specific workloads.
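As a minimal sketch of targeting discounted capacity, the manifest below schedules a fault-tolerant batch job onto Spot nodes only. It assumes an EKS-style capacity label (`eks.amazonaws.com/capacityType`); other providers use different labels (e.g., `cloud.google.com/gke-spot` on GKE), and the job name and image are hypothetical.

```yaml
# Sketch: run an interruption-tolerant batch job on Spot capacity only.
# The capacity label is EKS-specific; adapt it to your provider.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT
      restartPolicy: OnFailure   # tolerate node preemption via retries
      containers:
        - name: report
          image: registry.example.com/report:latest   # hypothetical image
```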
Network topology limitations
Carefully choosing a cluster's network topology can have a significant impact on the cloud bill, especially when running data-intensive workloads. By default, most clusters run their data plane across three availability zones to improve availability, and every byte transferred between AZs inside the cluster costs a few extra cents. So when availability isn't part of the value we need to deliver (e.g., for spooling), it makes sense to override the defaults and run all nodes in the same AZ.
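One way to avoid cross-AZ transfer charges is to pin a data-intensive workload to a single zone using the well-known `topology.kubernetes.io/zone` node label. A sketch, where the deployment name, image, and zone name are assumptions:

```yaml
# Sketch: keep all pods of a chatty, data-intensive workload in one AZ
# so their traffic never crosses zone boundaries.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etl-worker
spec:
  replicas: 3
  selector:
    matchLabels: { app: etl-worker }
  template:
    metadata:
      labels: { app: etl-worker }
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # assumed zone
      containers:
        - name: worker
          image: registry.example.com/etl:latest   # hypothetical image
```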
Storage configuration
Kubernetes offers a variety of storage options that differ in durability, availability, and performance. Focusing on the value we provide, we can choose the most appropriate storage type for our workload.
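For example, on AWS the choice might come down to a cheaper general-purpose volume type versus high-end provisioned IOPS. A sketch of a StorageClass for workloads that don't need premium storage, assuming the AWS EBS CSI driver is installed:

```yaml
# Sketch: a general-purpose gp3-backed StorageClass for workloads that
# don't need the durability/IOPS guarantees of premium volume types.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer   # bind in the pod's own AZ
```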
2. Eliminate risk
The original Toyota Production System pioneered the andon cord: a cord running through the entire factory that any worker could pull to stop the production line. The idea is that management trusts workers as the experts in their field, so they should pull the cord whenever they spot an issue they feel poses a risk to production quality. Any lurking risk can lead to significant waste down the road. Kubernetes works the same way: workloads with potential security and reliability risks should be addressed before we even start thinking about waste.
Here are some common Kubernetes workload risks and how to mitigate them:
Undefined resource requests and limits
When requests and limits are not defined, the Kubernetes scheduler treats the pod as BestEffort, which means it gets no reliability guarantees. A LimitRange object can prevent this to some extent, but continuous pod rightsizing (described in the next section) is still needed to fully mitigate it.
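A minimal LimitRange sketch that applies default requests and limits to any container in a namespace that doesn't declare its own, so no pod silently lands in the BestEffort class (the namespace and values are assumptions to adapt):

```yaml
# Sketch: namespace-wide defaults so undeclared containers still get
# requests/limits instead of falling into BestEffort.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: dev            # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:              # applied when a container omits limits
        cpu: 500m
        memory: 256Mi
```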
Insufficient resource requests and limits
This means that our pods are not getting the resources they need, which leads to unexpected failures and increased latency that affects the reliability of the application.
Container restarts
Containers are ephemeral and can be restarted seamlessly in the event of a failure. This is the default mode of operation for the most common Kubernetes workload types, such as Deployment and DaemonSet. However, frequent restarts indicate a problem. Whether it's an application error, a permissions issue, or a misconfigured liveness probe, we want to troubleshoot and fix it as quickly as possible to keep the cluster lean.
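Misconfigured liveness probes are a frequent source of restart loops. A sketch of a probe tuned to avoid spurious restarts, reusing the article's example pod; the `/healthz` endpoint, port, and timings are assumptions to adapt to your service:

```yaml
# Sketch: a liveness probe that gives the app time to boot and only
# restarts the container after several consecutive failures.
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: images.my-company.example/app:v4
      livenessProbe:
        httpGet:
          path: /healthz          # assumed health endpoint
          port: 8080
        initialDelaySeconds: 15   # avoid probing before startup completes
        periodSeconds: 10
        failureThreshold: 3       # restart only after 3 straight failures
```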
There are other types of risks here as well. We need a targeted observability system to identify and alert us to reliability risks in our clusters, as well as intelligent autonomous automation to fix the most common issues.
3. Eliminate waste
As mentioned earlier, most clusters are not utilized efficiently: they consume far more cloud resources than their workloads actually need, which translates directly into excessive spending.
We want to give our workloads as many resources as possible, and that's understandable: no engineer wants their application to crawl like a turtle because of CPU throttling, or die tragically from an OOM kill. Unfortunately, even giving a container three to four times the resources it needs doesn't guarantee reliability! Other misconfigured containers on the same node, with insufficient requests and excessive actual consumption, can starve the node so that our containers can't access resources despite our generous allocation.
The only way to achieve the required reliability and cost targets is to understand where the waste is coming from in the cluster and establish a process to eliminate it on an ongoing basis.
Here are some practices that every Kubernetes administrator should adopt to keep their clusters waste-free:
Pod resizing
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: app
      image: images.my-company.example/app:v4
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
```
Properly defining resource requests and limits for the containers in our pods affects everything from basic scheduling and eviction to HPA behavior and node autoscaling. The problem is that determining the right values takes a lot of work. Performance testing can help with the initial definition, but the dynamic nature of Kubernetes environments requires us to continuously monitor runtime resource consumption and update the configuration accordingly, preferably automatically. This is the only way to ensure that our containers get the resources they need, when they need them.
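One automated approach to such continuous rightsizing is the Vertical Pod Autoscaler. A sketch targeting a hypothetical `myapp` Deployment; it assumes the VPA add-on is installed in the cluster:

```yaml
# Sketch: let VPA keep myapp's requests aligned with observed usage.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp          # hypothetical Deployment
  updatePolicy:
    updateMode: Auto     # evict and resize pods automatically
```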
Node utilization monitoring
Even if our container resources are optimized, we still experience additional waste because our node selection is not optimal. This may be due to not using discounted Spot Instances and Reserved Instances at the right time, or configuring instance types whose resources cannot be effectively utilized by active workloads. Lean clusters must be managed by an automated system that provides granular visibility into node utilization at the node group and type level, as well as intelligent recommendations to reconfigure node pools for further optimization.
4. Just-in-time supply
Unused infrastructure wastes our money and incurs unnecessary maintenance overhead without adding any additional value. At its core, the Lean methodology is about providing resources when they are needed and freeing them up when they are no longer needed.
Here are some ways to keep your cluster lean:
Auto-scaling
Autoscaling is what makes Kubernetes truly cloud-native. However, it is optional! Enabling it requires additional configuration at the pod level (HPA, VPA, KEDA) or the node level (cluster-autoscaler, Karpenter, etc.). Keeping your cluster lean means investing in this configuration, continuously validating the efficiency of the autoscaling algorithms, and tuning them to the changing needs of the system.
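A sketch of pod-level autoscaling with an HPA on average CPU utilization; the `myapp` target, the 2-10 replica range, and the 70% target are all assumptions:

```yaml
# Sketch: scale the myapp Deployment between 2 and 10 replicas,
# targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp          # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```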
Just-in-time node provisioning
Not all node autoscalers are equal. The efficiency of the traditional cluster-autoscaler, for example, is limited by static ASG configuration. Lean means provisioning exactly what we need, when we need it, so consider moving lean clusters to Karpenter (on AWS) or node auto-provisioning (on GCP and Azure).
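A rough sketch of what this looks like with a Karpenter NodePool (v1 API), which provisions right-sized Spot or On-Demand nodes on demand rather than scaling a fixed ASG; the pool name, limits, and referenced EC2NodeClass are assumptions:

```yaml
# Sketch: a Karpenter NodePool that picks capacity just-in-time
# instead of relying on a static ASG.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefer spot when available
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default        # assumes a matching EC2NodeClass exists
  limits:
    cpu: "100"               # cap total provisioned capacity
```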
Dynamic environment management
A well-established Kubernetes automation setup allows us to quickly spin up a new environment by creating a namespace in an existing cluster or launching a new one. This ease of provisioning leaves many resources underutilized. Staying lean requires an operational strategy for managing these environments and decommissioning them when they are no longer needed. One example is putting Kubernetes resources into hibernation during non-business hours.
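As one way to implement such hibernation, the kube-downscaler tool scales workloads down outside an annotated uptime window. A sketch, where the deployment name, image, and schedule are assumptions:

```yaml
# Sketch: kube-downscaler keeps this dev environment running only
# during working hours and scales it to zero otherwise.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-preview
  annotations:
    downscaler/uptime: "Mon-Fri 08:00-18:00 Europe/London"  # assumed schedule
spec:
  replicas: 1
  selector:
    matchLabels: { app: dev-preview }
  template:
    metadata:
      labels: { app: dev-preview }
    spec:
      containers:
        - name: app
          image: registry.example.com/preview:latest   # hypothetical image
```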
5. Continuous optimization
The Lean methodology is based on the philosophy of continuous improvement – that is, always looking for additional ways to make production processes more efficient, improve quality and reduce waste.
Similarly, we want to continuously optimize our Kubernetes clusters by automating resource allocation, analyzing cost and performance trends, redefining SLOs, and relentlessly eliminating risk.
Summary
Following a lean methodology can help us significantly increase the ROI of Kubernetes, improve workload performance, and save time on maintenance and troubleshooting.
To keep your cluster lean – follow these guidelines:
- Focus on value
- Eliminate risk
- Eliminate waste
- Just-in-time supply
- Continuous optimization
May your cluster stay lean!