
Rancher and Zhihu: A Joint Practice in Hyperscale Multi-Cluster Management


Origin

Zhihu is a high-quality Q&A community on the Chinese Internet, where tens of millions of users share their knowledge, experience, and insights every day and find the answers they are looking for. To meet the needs of the business at different stages of development, Zhihu's container platform has kept evolving and improving, and almost all of our services now run on containers.

Over the past two years, Zhihu has been using Rancher to manage its Kubernetes clusters, and the fleet has gradually grown to nearly 10,000 nodes. In this article, we'll show how we tuned Rancher's performance at this scale, taking pages that had become effectively unusable back to normal and making access up to 75% faster.

Here are a few reasons why we chose Rancher as our container management platform:

Our services are deployed on multiple public cloud Kubernetes offerings in China, and we need a unified platform to manage these clusters; Rancher is very compatible with the Chinese public cloud Kubernetes platforms.

Rancher lowers the barrier to deploying and using Kubernetes clusters, making it easy to onboard and operate individual clusters through the Rancher UI. Engineers who are not very familiar with Kubernetes can easily create resources such as Deployments, Pods, and PVs.

With Rancher Pipeline, we can run CI/CD internally, letting engineers focus on developing business applications. Even though the Rancher team told us that Pipeline has been discontinued, its simplicity still matches our internal vision of CI/CD.

Rancher keeps innovating, with a series of ecosystem extensions built around Kubernetes: k3s, a lightweight Kubernetes distribution; Longhorn, a lightweight distributed block storage system for Kubernetes; Harvester, an open-source hyperconverged infrastructure (HCI) product built on Kubernetes; and RKE2, Rancher's next-generation Kubernetes distribution. Following a leading innovator also helps the team keep improving.

As an international container management vendor, Rancher has a very professional R&D team in China, so communication is convenient. Many questions get answered in the Rancher Chinese community, which is remarkably conscientious for an open-source, free platform.

Puzzle

When we first started using Rancher to manage small and medium-sized clusters, the Rancher UI provided almost all of the features we needed, and the UI was very smooth.

However, as business volume grew, the cluster scale kept expanding. Once we were using a single Rancher to manage nearly ten clusters, close to 10,000 nodes, and hundreds of thousands of Pods (the largest single cluster has several thousand nodes and hundreds of thousands of Pods), the Rancher UI frequently lagged and took too long to load, with individual pages needing more than 20 seconds to finish. In severe cases we could not connect to downstream clusters at all, and the UI showed a prompt along the lines of "This cluster is currently unavailable; functionality that interacts directly with the API will not be available until the API is ready."


That's right, we were having performance issues with Rancher. Super administrator accounts in particular had to load a huge amount of data at login, so the UI was essentially unusable, and downstream clusters were frequently disconnected.

Dawn

From the symptoms above, managing hyperscale clusters with Rancher was clearly running into performance problems that hurt the user experience. We enlisted the help of Rancher's local technical team, and after several efficient online meetings we had basically identified the root causes:

  1. Although we have a small number of clusters, the total number of nodes across all clusters is large. When a page loads, it relies on the node list interface (a Rancher-specific CRD whose object count tracks the actual number of nodes in the downstream clusters) to fetch data, and this list takes a long time to process, which makes pages load slowly (see the sketch after this list).
  2. Our downstream clusters are mainly public cloud Kubernetes, imported into Rancher. In this imported mode, Rancher Server accesses the downstream cluster's Kube API through a tunnel established with the cluster-agent. Rather than reaching the API servers directly, the tunnel targets the Kubernetes Service IP (e.g., 10.43.0.1:443), which then load-balances across the multiple kube-apiserver instances. Normally this works fine, but with our volume of requests and data, the Service IP forwarding could not keep up, which seriously hurt communication efficiency.
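
To get a feel for the first point, the following rough sketch (our own illustration, not part of Rancher; the kubeconfig path is a placeholder) uses client-go to time a full, uncached List of all nodes in one downstream cluster. This is the kind of list-all call that starts to dominate page load time once a cluster has thousands of nodes.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig for one downstream cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/downstream-kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Time a full, uncached List of all nodes: roughly the cost a
	// list-all style interface pays on every page load.
	start := time.Now()
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("listed %d nodes in %s\n", len(nodes.Items), time.Since(start))
}
```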

We learned that solving this on the Community Edition would be fairly complicated, whereas the Enterprise Edition of Rancher already ships with a well-established set of performance optimization strategies. Rancher engineers summarized some of the differences between the Enterprise and Community editions as follows:

The main difference for resource queries is that, for some slow-query APIs, the Community Edition reads data through the Kubernetes API while the Enterprise Edition reads it from a cache. The Enterprise Edition also supports multiple connection policies for downstream clusters, which improves how efficiently they can be managed. In addition, it has enhancements to infrastructure capabilities such as monitoring, logging, GPU, and networking, and it responds faster to bugs reported by local enterprise customers.

Owing to local market conditions, the Enterprise Edition is a somewhat special arrangement: overseas customers can only subscribe to the open source version, while customers in China can use the Enterprise Edition, which is developed by a fully local R&D team. In keeping with SUSE Rancher's open philosophy, users are free to move back and forth between the Enterprise and open source editions.

Based on the above analysis, we decided to use the Enterprise Edition to optimize our clusters for the time being.

Choice

Switch to the Enterprise Edition

First, we switched from the Community Edition of Rancher to the Enterprise Edition. The Enterprise Edition is somewhat more conservative, and its release cadence does not strictly follow the open source version. Fortunately, the Community Edition version we were running has a corresponding Enterprise Edition, and a smooth transition from Community to Enterprise is supported. The switch is essentially lossless: you simply replace the image, which is quite convenient.

Optimize the parameters of the downstream cluster

Rancher engineers recommended some parameter optimizations for downstream Kubernetes clusters. We do not run many custom RKE clusters and mainly use public cloud Kubernetes, and tuning downstream component parameters depends on the actual environment, so only some of the more commonly used kube-apiserver parameters are listed here for reference (a small verification sketch follows the list):

--max-requests-inflight: default 400; limits the number of in-flight read (non-mutating) requests; 1600 or higher is recommended.

--max-mutating-requests-inflight: default 200; limits the number of in-flight write (mutating) requests; 800 or higher is recommended.

--event-ttl: default 1h0m0s; controls how long Events are retained; when the cluster generates many events, 30m is recommended to avoid rapid etcd growth.

--watch-cache-sizes: set heuristically by the system based on the environment; applies to core resources such as pods/nodes/endpoints, while other resources follow --default-watch-cache-size; starting with Kubernetes v1.19 this parameter is set dynamically, and using that version or later is recommended.

--default-watch-cache-size: default 100; the cache size used for List-Watch; 1000 or more is recommended.

--delete-collection-workers: default 1; speeds up namespace cleanup, which helps in multi-tenant scenarios; 10 is recommended.
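
Before raising the inflight limits above, it is worth checking how close the apiserver actually gets to them. One way to do that (a minimal sketch under our own assumptions; it only needs a kubeconfig that is allowed to read the apiserver's /metrics endpoint) is to scrape the metrics and look at the apiserver_current_inflight_requests gauge, which reports the current number of in-flight read-only and mutating requests:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig for the downstream cluster being tuned.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/downstream-kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the raw Prometheus metrics exposed by kube-apiserver.
	raw, err := client.CoreV1().RESTClient().Get().AbsPath("/metrics").DoRaw(context.TODO())
	if err != nil {
		log.Fatal(err)
	}

	// Keep only the in-flight request gauges. If these sit close to
	// --max-requests-inflight / --max-mutating-requests-inflight under
	// normal load, raising the limits is worth considering.
	for _, line := range strings.Split(string(raw), "\n") {
		if strings.HasPrefix(line, "apiserver_current_inflight_requests") {
			fmt.Println(line)
		}
	}
}
```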

Kernel tuning

The Rancher team also gave us some relatively mature tuning parameters from the open source community: kops/sysctls.go at master · kubernetes/kops

Enable resource caching

After resource caching is enabled, some interfaces that read data from the local cluster are served from the cache, which greatly improves the performance of list-all APIs; the effect is obvious in an environment like ours with a particularly large number of nodes (a sketch of the underlying caching pattern follows the list below).

Currently, the following resources have been added to the cache:

projectRoleTemplateBindings: project permission bindings
clusterRoleTemplateBindings: cluster permission bindings
users: users
nodes: nodes (the one most relevant to our usage)
projects: a project concept unique to Rancher
templates: Catalog templates
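
Rancher's Enterprise Edition cache is internal to the product, but the underlying idea is the same as client-go's informer/lister pattern: pay for one List-Watch up front and then serve repeated reads from an in-memory copy instead of hitting kube-apiserver each time. The sketch below is our own illustration of that pattern for nodes, not Rancher's implementation:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig for one downstream cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/downstream-kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// A shared informer factory keeps a watched, in-memory copy of nodes.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeLister := factory.Core().V1().Nodes().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Every subsequent "list all nodes" is answered from the local cache,
	// so it stays fast no matter how often the UI asks.
	nodes, err := nodeLister.List(labels.Everything())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cache holds %d nodes\n", len(nodes))
}
```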

Cluster connection mode

The Enterprise Edition optimizes how Rancher connects to downstream clusters and supports multiple connection modes. The following policies are commonly used by Enterprise Edition users:

Policy 1: Default configuration

The default policy keeps the Community Edition's way of connecting to downstream clusters, without modifying the connection mode.

In the Enterprise Edition, the Kubernetes REST client used inside the tunnel has a default timeout of 60s, which effectively reduces the failure rate of downstream cluster data queries under heavy load.

Policy 2: Optimize the tunnel link

By default, communication goes through the tunnel to the downstream cluster's Kubernetes API Service, such as 10.43.0.1. If a self-built cluster has a higher-performance Kubernetes API load balancer (LB) in front of it, you can change the API endpoint used for the downstream cluster to the LB address, letting the LB share the load and speeding up connections to the downstream cluster.

Policy 3: Direct connection and tunnel dual-link

Before enabling the direct-plus-tunnel dual-link mode, you need to make sure the downstream cluster exposes an apiEndpoint that is directly reachable from the Rancher Server network (see Policy 2). With this optimization, requests to the downstream cluster through Rancher's v3 API use the direct connection while the rest still go through the tunnel, which effectively spreads the traffic and avoids pushing large amounts of data through the tunnel. Compared with Policy 2 the performance is further improved, but it depends more heavily on the underlying network planning.

In the end, we chose Policy 2 to connect to downstream clusters, because we already have a powerful Kubernetes API LB (the idea is sketched below). When you switch the link used to connect to a downstream cluster, that cluster becomes unreachable from Rancher for a short time, but the workloads running in it are not affected.
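
Conceptually, Policy 2 just means pointing the client at a better endpoint. The sketch below is our own illustration (the LB address is hypothetical, and this is not Rancher's internal code path): it builds a client-go rest.Config whose Host is an external kube-apiserver LB and whose timeout matches the 60s default mentioned for the tunnel client above.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig for the downstream cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/downstream-kubeconfig")
	if err != nil {
		log.Fatal(err)
	}

	// Point the client at the external kube-apiserver LB instead of the
	// in-cluster Service IP (10.43.0.1). The LB address must be covered by
	// the apiserver's TLS certificate for this to work.
	cfg.Host = "https://kube-api-lb.example.internal:6443" // hypothetical LB address
	cfg.Timeout = 60 * time.Second                         // mirror the 60s tunnel client timeout

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns, err := client.CoreV1().Namespaces().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reached the cluster via the LB endpoint; %d namespaces visible\n", len(ns.Items))
}
```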

Fruits

First of all, the Rancher Enterprise UI is very convenient for our team: the left navigation bar makes the various functions easy to find and fits the habits of Chinese users well. Perhaps because we work in infrastructure management, we really like this minimalist UI style.


Second, after the policy optimizations above, timeouts when dialing the Kubernetes API improved and request failures caused by timeouts became almost nonexistent. By using the downstream cluster's kube-apiserver LB endpoint as the request target, downstream cluster disconnections disappeared entirely; it was like swapping a village road for a highway. In addition, some interfaces are now served quickly from the cache. For Node resources in particular, response time dropped from more than 20 seconds to under 5 seconds, and comparing other important interfaces, the average speed improved by more than 75%.

With Rancher Enterprise Edition, users can comfortably manage large-scale clusters by tuning downstream cluster parameters, choosing downstream connection policies, enabling caching, and more. The practice above shows that with reasonable tuning and planning, even a hyperscale deployment like Zhihu's can offer the same user experience as a small cluster.

At the time of writing, Rancher's local team is doing a second round of performance tuning on the Enterprise Edition; it is said that the UI will be able to handle on the order of 10,000 workloads within a single Project/Namespace, which basically covers our upper bound. We look forward to working with Rancher again and sharing new performance practices.

If you also need to manage large-scale clusters, you can reach SUSE Rancher through the Chinese official website (https://www.rancher.cn) to get support for your clusters.

Author: There is no wind or rain or sunshine

Source: https://zhuanlan.zhihu.com/p/453463882
