
A 10,000-word long article on the construction of system stability

Author: JD Cloud developer

1. Background

JD.com's midterm exam, 618, is approaching, and every team is running its own mock exams beforehand: stress-test drills, fault drills, and system reviews to verify that the system stays stable under the demands of high availability, high performance, and high concurrency. System stability runs through the entire R&D lifecycle: the requirements stage, the development stage, the testing stage, the go-live stage, and the operations stage. Every participant in that process, whether product, development, testing, or operations, should care about stability. In business development and system construction, stability is the 1, and everything else is the 0s that follow it. This article discusses stability construction from a back-end developer's perspective, focusing on the development stage and the go-live stage, and is intended as a starting point for further discussion.

2. R&D stage

The main participants in the development stage are the developers, and the main outputs are the technical design document and the code: the former marks the beginning of the development stage and the latter marks its end.

2.1 Technical Solution

2.1.1 Review of technical solutions

A technical design review should involve the team's architects as well as the developers, testers, product managers, and owners of the related upstream and downstream systems, so that the technical solution stays aligned with the product requirements as much as possible. What we need to do is stay open-minded, take everyone's opinions seriously, and strictly control the quality of the technical document.

A technical design review can be run in a question-driven format. Before the meeting starts, share the document with the attendees and give them ten minutes to read it, then let everyone ask questions. The author does not need to walk through the document line by line; answering the questions, genuinely considering the suggestions, and adopting the reasonable ones already gives the quality of the document a strong guarantee. Some engineers resent being questioned during a review and feel they are being challenged, especially when they cannot answer a question on the spot. It helps to look at it differently: not being able to answer every question is normal, and you can note the suggestion first and think through its reasonableness after the meeting. Everyone's questions and suggestions exist to ensure the quality of your technical solution and to eliminate major online risks while the design is still on paper.

2.1.2 Technical Solution Concerns

When we encounter a problem, the first thing to ask is whether it is a new problem or an old one. 99.99% of the problems we meet are old problems, because we are doing engineering, not scientific exploration. What we should do is look at how peers at home and abroad have solved the same problem and learn from their best practices. The first step of a technical solution, therefore, is to benchmark against and learn from best practices, which saves us from taking detours.

At the same time, following Occam's razor, we strive to keep the technical solution simple and avoid over-design: a complex problem deserves a relatively complex solution, and a simple problem deserves a relatively simple one. A technical document should also cover more than functional implementation; architecture, performance, quality, and security matter even more. In other words, it should answer how to build a highly available system. High availability is the precondition of stability: if a system cannot guarantee high availability, there is no point in talking about stability construction. The following are the common techniques and focal points for stability that appear in our technical solutions.

2.1.2.1 Rate limiting

Rate limiting is a self-protection capability offered from the service provider's point of view: when the incoming load exceeds the system's processing capacity, a rate limiting policy keeps the surge of traffic from overwhelming the system. JD.com's internal frameworks provide rate limiting for both synchronous interaction (JSF) and asynchronous interaction (JMQ), and you can configure it according to your own system's situation. The common rate limiting algorithms are the fixed-window counter, the sliding time window, the leaky bucket, and the token bucket; the details are easy to look up, and the figure below compares their advantages and disadvantages.

[Figure: comparison of common rate limiting algorithms]
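To make the token bucket idea concrete, here is a minimal, self-contained sketch in Java. It is an illustration only, not JSF's or JMQ's built-in limiter; the class name and lazy refill policy are assumptions.

```java
/**
 * A minimal token-bucket rate limiter (illustrative sketch, not JSF's built-in limiter).
 * Tokens are refilled lazily based on elapsed time; a request passes only if a token is available.
 */
public class TokenBucketLimiter {
    private final long capacity;        // maximum number of tokens the bucket can hold
    private final double refillPerMs;   // tokens added per millisecond
    private double tokens;              // current token count
    private long lastRefillTs;          // last refill timestamp in ms

    public TokenBucketLimiter(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefillTs = System.currentTimeMillis();
    }

    /** Returns true if the request may proceed, false if it should be rejected or queued. */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Lazily refill tokens for the time elapsed since the last call.
        tokens = Math.min(capacity, tokens + (now - lastRefillTs) * refillPerMs);
        lastRefillTs = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

A caller would invoke tryAcquire() at the entry of an interface and return a degraded response or error code when it returns false.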



2.1.2.2 Fuse degradation

Circuit breaking and downgrading are two different things, but they are generally used together. A circuit breaker keeps our system from being dragged down by a downstream system, for example when a downstream interface's performance deteriorates badly or the downstream system goes down entirely. In that situation threads pile up, and the CPU, memory, and other resources they hold cannot be released, which degrades not only the affected interface but other interfaces as well. Downgrading is a lossy operation, and as service providers we need to minimize that loss, whether by returning a friendly prompt or by returning acceptable degraded data. Downgrading further divides into manual downgrading and automatic downgrading.

Manual downgrading: manual downgrading is generally controlled by a downgrade switch, usually implemented with the company's internal configuration center, Ducc. Flipping the switch is itself an online change, so it also needs to be monitored.

Automatic downgrading: automatic downgrading relies on middleware such as Hystrix or the company's internal circuit-breaking component ("small shield dragon"). If automatic downgrading is used, we must be very clear about the trigger conditions, such as the number or ratio of failed calls.
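As a minimal sketch of combining a manual switch with a naive automatic breaker (the switch source, class name, and threshold are assumptions for illustration, not Ducc's or Hystrix's actual API):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Illustrative downgrade wrapper: a manual switch (e.g. fed from a config-center listener)
 * plus a naive consecutive-failure circuit breaker. Names and thresholds are assumptions.
 */
public class DowngradeGuard<T> {
    private volatile boolean manualSwitchOn = false;          // updated when the config value changes
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final int failureThreshold;

    public DowngradeGuard(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    public void setManualSwitch(boolean on) {
        this.manualSwitchOn = on;
    }

    /** Runs the normal call unless the switch is on or the breaker is open; otherwise returns the fallback. */
    public T call(Supplier<T> normalCall, Supplier<T> fallback) {
        if (manualSwitchOn || consecutiveFailures.get() >= failureThreshold) {
            return fallback.get();                            // degraded but acceptable result
        }
        try {
            T result = normalCall.get();
            consecutiveFailures.set(0);                       // success resets the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures.incrementAndGet();            // failures accumulate toward opening the breaker
            return fallback.get();
        }
    }
}
```

A production-grade breaker also needs a half-open state and a recovery window; components such as Hystrix handle that for us.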

2.1.2.3 Timeout

One of the hard parts of distributed systems is the unreliable network. Under JD Logistics' current microservice architecture, services interact synchronously over JSF, and the fastest way to detect whether a downstream dependency is available is to set a timeout. A timeout lets the system fail fast and protects it from waiting indefinitely on a downstream dependency, which would exhaust its threads and bring it down.

Choosing the timeout value is a discipline in itself, and arriving at a reasonable value is an iterative process. For a newly developed downstream interface, the provider will generally give a TP99 based on stress testing, and we configure the timeout from that; for an existing interface, we configure it based on the TP99 latency observed in production.

Timeouts should follow the funnel principle: the timeout configured at each hop should decrease as you go from upstream to downstream, as shown in the following figure. Why does this matter? Suppose the funnel principle is violated: service A calls service B with a 500ms timeout, while service B calls service C with an 800ms timeout. Service A will then see a large number of timeouts when calling B, and A's measured availability drops, even though B considers itself perfectly available.

[Figure: the timeout funnel, with timeouts decreasing from upstream to downstream services]
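In practice the timeout is usually configured on the RPC framework (such as JSF) itself. Purely as an illustrative sketch of the fail-fast idea with a hard deadline, using only the JDK (the 400ms budget and method names are assumptions):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutExample {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    /** Calls the downstream dependency but gives up after 400ms and falls back to a default. */
    static String queryWithDeadline() {
        CompletableFuture<String> call =
                CompletableFuture.supplyAsync(TimeoutExample::callDownstream, POOL);
        try {
            return call.get(400, TimeUnit.MILLISECONDS);   // fail fast instead of waiting indefinitely
        } catch (TimeoutException e) {
            call.cancel(true);                             // stop waiting and release the caller thread
            return "degraded-default";                     // or throw a business exception
        } catch (Exception e) {
            return "degraded-default";
        }
    }

    // Placeholder for a real downstream call (e.g. a JSF RPC invocation).
    private static String callDownstream() {
        return "real-result";
    }
}
```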



2.1.2.4 Retry

In a distributed system the dominant cost is communication, and that is just as true of teams as of systems. We all know that more than half of the time it takes to deliver a requirement is spent on cross-team communication, and only a small fraction on actually writing code. Likewise, if we look at a call chain in a distributed system, the computation done by our own system is a small part of the total; most of the time goes to network interaction with external systems, whether downstream business systems or middleware such as MySQL, Redis, and ES.

So in every interaction with an external system our system tries its best to get the desired result, but the unreliable network often gets in the way. That is why we configure timeout retries when calling downstream systems, hoping to obtain a result within the acceptable SLA. Retries must not be unlimited; we always configure an upper bound on the retry count. Retrying the occasional network jitter improves availability, but if the downstream service has actually failed, retries only add load to it and make the failure worse. For a single request we also need to know how many services sit behind the externally exposed API: if the call chain is long and every RPC hop is configured with retries, we must be wary of a retry storm. As shown in the following figure, if service D has a problem, a retry storm will amplify the severity of D's failure. For API retries we also need to distinguish read interfaces from write interfaces: retrying a read interface is generally harmless, while retrying a write interface requires the interface to be idempotent.

[Figure: a retry storm amplifying the failure of service D along the call chain]
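A minimal sketch of a bounded retry with exponential backoff (the attempt count, base delay, and helper names are illustrative assumptions; in practice the retry count is usually configured on the RPC framework):

```java
import java.util.concurrent.Callable;

public class BoundedRetry {

    /**
     * Retries a call at most maxAttempts times with exponential backoff.
     * Only safe for idempotent operations (reads, or writes guarded by an idempotency key).
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Exponential backoff spreads retries out and avoids hammering a failing dependency.
                    Thread.sleep(baseDelayMs * (1L << (attempt - 1)));
                }
            }
        }
        throw last;   // give up after the bounded number of attempts
    }

    public static void main(String[] args) throws Exception {
        String result = callWithRetry(() -> "downstream-result", 3, 100);
        System.out.println(result);
    }
}
```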



2.1.2.5 Compatibility

When we refactor or iterate on an old system or an old function, we must take care of compatibility; otherwise major online problems will appear after release, and there are plenty of cases inside and outside the company where a lack of compatibility led to loss of funds. Compatibility divides into forward compatibility and backward compatibility, and it is necessary to distinguish between them, as follows:

Forward compatibility: forward compatibility means that an older version of the software or hardware can work with a newer version of the software or system released later; in short, the old version of the system is compatible with new data and new traffic.

Backward compatibility: backward compatibility means that a newer version of the software or hardware can work with previous versions of the system or its components; in short, the new version of the system is compatible with old data and old traffic.

If we combine new vs. old system with new vs. old data, we get four quadrants. The first quadrant, new system with new data, is the state of the system after the change goes live; the third quadrant, old system with old data, is the state before the change. Problems in the first and third quadrants can generally be found and eliminated during development and testing. Online failures concentrate in the second and fourth quadrants. The second quadrant fails because forward compatibility is missing: for example, a problem is found during the release and the code is rolled back, but new data was produced while the release was in progress, and the old system cannot process it, causing an online failure. The fourth quadrant fails because backward compatibility is missing: the new system breaks old flows after going live. For second-quadrant problems we can construct new data in advance to verify the old system; for fourth-quadrant problems we can record production traffic and replay it against the new function to verify it.

[Figure: the four compatibility quadrants formed by new/old system and new/old data]
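As one concrete, assumed example of forward compatibility at the data level, an old consumer can be written as a tolerant reader that ignores fields it does not know and supplies defaults for fields that are absent, so that new data produced during a release does not break it. The sketch below uses Jackson purely as an illustration; the payload and field names are made up.

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TolerantReaderExample {

    /** Old consumer's view of an order; it knows nothing about fields added by newer producers. */
    public static class OrderView {
        public String orderId;
        public String status = "UNKNOWN";   // default for data written before this field existed
    }

    public static void main(String[] args) throws Exception {
        // Message produced by a newer version of the system, carrying an extra field.
        String newPayload = "{\"orderId\":\"1001\",\"status\":\"PAID\",\"promotionTag\":\"618\"}";

        ObjectMapper mapper = new ObjectMapper()
                // Tolerant reader: unknown fields from newer producers are ignored instead of failing.
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

        OrderView view = mapper.readValue(newPayload, OrderView.class);
        System.out.println(view.orderId + " " + view.status);   // the old code keeps working
    }
}
```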



2.1.2.6 Isolation

Isolation is an effective way to minimize the blast radius of a fault. In technical solution design, we control the scope of impact through isolation at different levels:

2.1.2.6.1 System-level isolation

Systems can be classified into online systems, offline systems (batch systems), and near-real-time systems (stream processing systems), defined as follows:

Online system: the server waits for a request to arrive, processes it as quickly as possible once it is received, and returns a response to the client; response time is usually the main measure of an online service's performance. Most of the apps we use on our phones are backed by online systems.

Offline system: also called a batch system, it takes a large amount of input data, runs a job to process it, and produces output data. Jobs are usually scheduled and run for anywhere from a few minutes to a few days, so users do not wait for them to complete; throughput is the main measure of an offline system. The report numbers we look at, such as daily and monthly order volumes or daily and monthly active users, are produced by batch systems over a period of time.

Near-real-time system: also called a stream processing system, it sits between the online and offline systems. A stream processing system generally has a trigger source, such as user actions, database write operations, or sensors; the trigger is delivered as a message through broker middleware such as JMQ or Kafka, and consumers process the message to do further work such as building caches or indexes, or notifying users.

For example, our group deploys the online system as a separate service, jdl-uep-main, and the offline and near-real-time systems as another service, jdl-uep-worker.

2.1.2.6.2 Environment isolation

From development to launch we use different environments; the common ones in the industry are development, testing, pre-release, and production. Developers develop and do joint debugging in the development environment, testers test in the test environment, operations and product staff perform UAT in the pre-release environment, and the finished product is finally deployed to the production environment for users. Throughout the process, each environment must be self-contained from the application layer through the middleware layer to the storage layer, and cross-environment calls are strictly forbidden, for example the test environment calling production, or the pre-release environment calling production.

[Figure: environment isolation across development, testing, pre-release, and production]



2.1.2.6.3 Data isolation

As the business develops, the services we expose tend to serve multiple business lines and multiple tenants, so we isolate data by business. For example, the logistics order data produced by our group comes from JD Retail, other e-commerce platforms, ISVs, and so on. To keep them from affecting one another, we isolate the data at the storage layer, and this can be done at different granularities: the simplest is to distinguish records by a tenant id field while storing everything in one table; the other is isolation at the database granularity, where each tenant is assigned its own database.

[Figure: tenant data isolation at table granularity versus database granularity]



Data is also isolated by environment and by temperature. For example, our databases are split into test, pre-release, production, and full-link stress-test databases. We treat frequently accessed data as hot data and infrequently accessed data as cold data: hot data is cached to improve system performance, cold data stays persisted in the database, and data that is no longer used is rolled over and archived to archive storage.
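A minimal cache-aside sketch for the hot-data path (the in-memory map stands in for a real cache such as Redis; names are illustrative assumptions):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Cache-aside pattern for hot data: read the cache first, fall back to the database
 * on a miss, then populate the cache for subsequent reads.
 */
public class HotDataCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbLoader;   // loads data from the database on a miss

    public HotDataCache(Function<String, String> dbLoader) {
        this.dbLoader = dbLoader;
    }

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                   // hot path: served from cache
        }
        String fromDb = dbLoader.apply(key); // cold path: load from the database
        if (fromDb != null) {
            cache.put(key, fromDb);          // populate the cache for subsequent reads
        }
        return fromDb;
    }
}
```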

2.1.2.6.4 Core, non-core isolation

We know that applications are tiered: JD.com classifies applications into level 0, 1, 2, and 3 according to their importance, and business processes are likewise divided into golden flows and non-golden flows. Within a business process, interactions between applications of different levels should keep the core flow separate from the non-core flows. Take a trading flow as an example: it involves the order system, the payment system, and the notification system. The core systems are the order system and the payment system, while notification is comparatively less important, so we invest more resources in the order and payment systems and prioritize their stability, and we decouple the notification system from the other two asynchronously so that it cannot affect them.

[Figure: core and non-core isolation, with the notification system decoupled asynchronously from the order and payment systems]



2.1.2.6.5 Read and write isolation

At the application level, CQRS (Command Query Responsibility Segregation), best known from Domain-Driven Design (DDD), separates the write service from the read service: the write side handles command requests from clients and the read side handles query requests. Separating reads and writes at the application level improves both the scalability and the maintainability of the system. At the storage level, the database typically uses a one-master, many-slaves architecture, and read requests can be routed to the slave databases to take pressure off the master and improve performance and throughput. In short, read/write isolation at the application layer mainly addresses scalability, while at the storage layer it mainly addresses performance and throughput.

[Figure: CQRS read/write isolation at the application layer and master/slave separation at the storage layer]
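A minimal sketch of routing writes to the master and reads to a replica (the routing rule and names are illustrative assumptions; in practice this is usually handled by database-routing middleware rather than by hand):

```java
import javax.sql.DataSource;

/**
 * Illustrative read/write routing: commands go to the master, queries go to a replica.
 */
public class ReadWriteRouter {
    private final DataSource master;
    private final DataSource replica;

    public ReadWriteRouter(DataSource master, DataSource replica) {
        this.master = master;
        this.replica = replica;
    }

    /** Writes (commands) must see the authoritative state, so they always hit the master. */
    public DataSource forWrite() {
        return master;
    }

    /**
     * Reads (queries) are routed to the replica to offload the master.
     * Reads that must be strongly consistent should still use forWrite().
     */
    public DataSource forRead() {
        return replica;
    }
}
```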



2.1.2.6.6 Thread pool isolation

Threads are expensive resources. To use them efficiently and avoid the cost of repeatedly creating and destroying them, we use pooling and reuse threads through thread pools. When using thread pools, however, we should also isolate them so that multiple API interfaces do not share the same pool; otherwise one slow interface that exhausts the shared pool starves every other interface.

[Figure: thread pool isolation, with each API served by its own pool]
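A minimal sketch of thread pool isolation, giving each interface its own bounded executor so that one slow dependency cannot exhaust the threads used by another (pool sizes and names are illustrative assumptions):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Each API gets its own bounded pool; when one pool is saturated, callers of that API
 * are rejected (and can be degraded) without starving the other API.
 */
public class IsolatedPools {

    static ThreadPoolExecutor newBoundedPool(String name, int size) {
        return new ThreadPoolExecutor(
                size, size,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),                 // bounded queue: fail fast under overload
                r -> new Thread(r, name + "-worker"),          // named threads help troubleshooting
                new ThreadPoolExecutor.AbortPolicy());         // rejection triggers the caller's fallback
    }

    // One pool per interface instead of one shared pool for everything.
    static final ThreadPoolExecutor ORDER_QUERY_POOL = newBoundedPool("order-query", 16);
    static final ThreadPoolExecutor WAYBILL_QUERY_POOL = newBoundedPool("waybill-query", 8);

    public static void main(String[] args) {
        ORDER_QUERY_POOL.execute(() -> System.out.println("handle order query"));
        WAYBILL_QUERY_POOL.execute(() -> System.out.println("handle waybill query"));
        ORDER_QUERY_POOL.shutdown();
        WAYBILL_QUERY_POOL.shutdown();
    }
}
```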



2.2 Code Review

Code review is the last step of the development stage, and it has a large influence on the pre-release bug rate and on online quality and stability.

•Form a team code style: first of all, a team's code should converge on a shared style, which improves the efficiency of both code review and collaboration.

•Review focus: do not get bogged down in details during review. Review the code style first; once a team has a unified style, reviewing against it surfaces most problems. Then focus on functionality, and after that on performance and security.

•Pair programming: cultivate the habit of pairing while writing code, so that for a given requirement, at review time a colleague familiar with the module checks the details while the architect checks the overall style.

•Control the size of each review: do not submit a large amount of code at once; break the review down into smaller units, such as the implementation of a single method or class.

•Stay open-minded: review is a process of learning and improvement. Accept others' opinions humbly, learn how to write elegant code, and raise your own coding standard.

3. Go-live phase

Looking at the faults recorded on the company's fault management platform, Baihu, system failures generally come from changes applied to the system, and those changes mostly happen during the go-live stage: code deployment, database changes, configuration center changes, and so on. The go-live stage is therefore a high-incidence period for failures. No system can be entirely free of online problems; what we pursue is reducing how often failures occur and shortening the time to recover from them. For problems in the release process, the industry has three well-known go-to safeguards: monitorability, grayscale release, and rollback.

3.1 The three go-to safeguards for going live

3.1.1 Monitorability

During a release our system must be monitorable; without monitoring we know nothing about the state of the system while the change rolls out, which is a frightening position to be in. What monitoring watches is, in essence, metrics. Metrics divide into business metrics and technical metrics, and technical metrics further divide into software and hardware metrics. Business metrics are measures we define to observe changes in the business, such as order volume or payment volume. Technical software metrics include availability, TP99, and call volume; technical hardware metrics include CPU, memory, disk, and network I/O. Our second-level department currently runs an OpsReview that focuses on availability, TP99, and call volume, which correspond to the system's availability, performance, and concurrency.

Once these metrics are monitored, the next step is alerting on them: if a metric crosses its threshold, an alert must reach us. For alert thresholds, the recommendation is to start strict and loosen later. In the early phase of a system, set them tighter to avoid missed alerts and missed online problems; as the system matures, tune them to more reasonable values to avoid a flood of alerts that ends in a cry-wolf effect. In short, the window around a release is the peak period for incidents, so metric monitoring and log monitoring must be in place and alerts must be treated with urgency.

[Figure: monitoring of business metrics and technical metrics with corresponding alerts]
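As an illustrative sketch only (the text above refers to JD's internal UMP platform; here Micrometer is used as a generic stand-in, and the metric names are assumptions), recording call volume, failures, and latency for an interface might look like this:

```java
import java.util.concurrent.Callable;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class InterfaceMetrics {
    private final Counter calls;
    private final Counter failures;
    private final Timer latency;

    public InterfaceMetrics(MeterRegistry registry, String api) {
        this.calls = registry.counter("api.calls", "api", api);          // call volume
        this.failures = registry.counter("api.failures", "api", api);    // used to derive availability
        this.latency = registry.timer("api.latency", "api", api);        // TP99 comes from this distribution
    }

    public <T> T record(Callable<T> body) throws Exception {
        calls.increment();
        try {
            return latency.recordCallable(body);    // measures the latency of the call
        } catch (Exception e) {
            failures.increment();                   // an alert rule fires when the failure ratio crosses a threshold
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        InterfaceMetrics metrics = new InterfaceMetrics(new SimpleMeterRegistry(), "queryOrder");
        System.out.println(metrics.record(() -> "ok"));
    }
}
```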



3.1.2 Grayscale

During a release we need grayscale: roll the change out gradually to limit the blast radius and reduce the scope of impact, and keep the change compatible throughout the grayscale window. Grayscale can be done along different dimensions: machine, data center, region, and business dimensions such as user, merchant, warehouse, or carrier.

Machine dimension: when deploying with Xingyun, we can first deploy to a subset of the machines in each group as the grayscale batch, and only deploy the remaining machines after the grayscale period, for example 24 hours, passes without problems.

Data center dimension: under a microservice architecture our applications are deployed across multiple data centers, so a release can also be grayscaled by data center: deploy the new code to one data center group first, observe it for a while, and then widen the grayscale scope data center by data center until the rollout is complete. For example, deploy to the Zhongyunxin data center first and gradually extend to the other data centers after a grayscale period.

Region dimension: Meituan's food delivery business, for example, is well suited to multi-region active-active and unitized deployment, because the merchants, users, and riders of a delivery order are naturally clustered together: a user in Beijing will almost never order delivery in Shanghai. Given that business property, the system can be deployed, from the application layer through the middleware layer to the storage layer, in both a Shanghai-region data center and a Beijing-region data center. When a feature is released, it can be grayscaled to one region first, which also provides region-level disaster recovery.

Business dimension: during a release we can also grayscale by business attribute; for example, when a feature or product launches, only some users or some merchants are allowed to use it, grayscaling along the user or merchant dimension.
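A minimal sketch of a business-dimension grayscale gate, where a feature is enabled for an explicit allowlist plus a percentage of users chosen by a stable hash (the names and the percentage source are illustrative assumptions; in practice the values would come from a configuration center such as Ducc):

```java
import java.util.Set;

/**
 * Illustrative user-dimension grayscale gate: a feature is on for allowlisted users
 * and for a stable percentage bucket of all users.
 */
public class GrayscaleGate {
    private volatile int percentage;           // 0..100, typically pushed from a config center
    private volatile Set<String> allowlist;    // pilot users or merchants

    public GrayscaleGate(int percentage, Set<String> allowlist) {
        this.percentage = percentage;
        this.allowlist = allowlist;
    }

    /** Stable per-user decision: the same user stays in the same bucket as the percentage grows. */
    public boolean isEnabled(String userId) {
        if (allowlist.contains(userId)) {
            return true;
        }
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percentage;
    }

    public static void main(String[] args) {
        GrayscaleGate gate = new GrayscaleGate(10, Set.of("pilot-merchant-1"));
        System.out.println(gate.isEnabled("user-42"));   // roughly 10% of users see the new feature
    }
}
```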

3.1.3 Rollback

When a problem appears online, stopping the loss comes first and root-cause analysis second. The fastest way to stop the loss is to roll back, and rollback divides into code rollback and data rollback. Code rollback restores the original logic, and there are two ways to do it: switch control and deployment rollback. Switch control is the fastest: flipping one switch returns the system to the original logic, with the lowest operational cost and the quickest stop-loss. The second way is a deployment rollback, rolling the code back to the last stable version through the release platform, such as Xingyun. Sometimes, if forward compatibility was not handled well, the application still misbehaves after the code is rolled back, for example because new data was produced during the release and the rolled-back code cannot process it. That is when data rollback comes in, which means repairing the data: invalidating the newly produced records or correcting them. When the data volume is large this is usually slow and laborious, so the recommendation is to get forward compatibility right and rely on a straightforward code rollback.

3.2 Online Problem Response

3.2.1 Classification of common problems

For an online problem, the first step is to identify what kind of problem it is; only then can we solve it. The many kinds of online problems can be aggregated, merged, and classified, and for each class we can refer to industry practice and the team's internal emergency plan, so that we stay organized in the middle of the chaos.

[Figure: classification of common online problems and the corresponding responses]



3.2.2 Issue Lifecycle

When a problem occurs we also need to be clear about its life cycle: from the moment the problem occurs, to its discovery, to the response, to observing whether the fix holds and the service returns to normal, and finally to the post-incident review. The earlier a system problem is discovered, the smaller its impact on the business. The whole process is shown in the following figure.

[Figure: the life cycle of an online problem, from occurrence and discovery through response, recovery, and review]



3.2.3 How to prevent problems

Just as with illness, treating a problem after it appears is already late; we should invest more time and energy in prevention, like the eldest brother of Bian Que, who treated diseases before they developed. By the broken-windows principle, a problem left unattended only grows in severity until it becomes irreparable. We can prevent problems through R&D standards, the R&D process, and the change process.

[Figure: preventing problems through R&D standards, the R&D process, and the change process]



3.2.4 How to find problems

For a system, if no external work is done on it, the principle of entropy increase says it will grow more and more disordered until something breaks. Doing work on it means change, and because changes are carried out by people, with all the uncontrollable factors that implies, changes also lead to online problems. Either way, a system cannot go through releases without ever having problems, so when a problem does occur, the first question is how to find it quickly. In practice, problems reach us through three channels: self-awareness, monitoring and alerting, and business feedback.

Self-awareness: our C2 department holds an important weekly OpsReview in which each C3 team reviews anomalies and latency spikes in the availability, performance, and call volume of its core interfaces, and this proactive, self-aware practice surfaces potential online problems. Likewise, an important part of our group's daily morning meeting is the UMP monitoring dashboard review: we analyze yesterday's availability, TP99, and call volume for the core interfaces, dig into any drop in availability, TP99 spikes, or non-standard traffic to find problems as early as possible, and also review machine CPU and memory usage and the state of MySQL, Redis, ES, and other storage.

Monitoring and alerting: this is the channel we use most, finding problems by actively watching metrics and passively receiving alerts. Alert metrics divide into business metrics and technical metrics; see section 3.1.1 on monitorability for the classification.

Business feedback: this is the channel we least want to rely on. If we wait for the business to report a problem, it means the problem has already affected users, and we usually lag behind the business only because monitoring and alerting are missing or produce false negatives. We hope every individual and every team develops the self-awareness to detect online problems early and to prevent them before they occur.

3.2.5 How to respond to problems

After an online problem occurs, any one person's understanding of it is limited, and everyone is under great tension at that moment. So the first thing to do is to inform your leader in the group, describe the situation truthfully, neither exaggerating nor downplaying the scope and impact, and broadcast the problem at the same time. The full response consists of the following steps:

1. Preserve the scene: the scene of the problem is the basis for troubleshooting, so preserve logs, data, and other evidence, such as a memory dump and a thread dump, and avoid losing that information when a machine is restarted.

2. Provide information: provide what you know and assist in troubleshooting, and neither exaggerate nor downplay the problem.

3. Restore service: when a problem is live, the goal is to restore service as fast as possible and stop the loss quickly. The industry's standard moves for stopping the bleeding are rollback (service rollback and data rollback), restart, scaling out, disabling nodes, and feature degradation.

4. Double-check: after the service is restored, confirm that it really has recovered by observing whether the business metrics, technical metrics, data, and logs are all back to normal.

5. Fault notification: once the recovery is confirmed, notify everyone in the emergency group: business staff, product managers, the system's upstream and downstream teams, testers, SRE, and so on. Have product and business confirm as well, and then inform the users.

3.2.6 How to locate the problem

After service is restored, we can go back and analyze in detail what caused the problem. Locating a problem also requires method, and it rests on three elements: knowledge, tools, and method.

Knowledge: compared with other industries, computing is probably the one where knowledge is refreshed and iterated fastest, so we need to keep learning, keep updating our knowledge base, and avoid setting limits for ourselves. For example, to solve a Full GC problem you must study the JVM systematically, and to solve slow SQL you must study MySQL systematically. With the knowledge in place, when we hit a problem we know both what is happening and why.

Tools: to do good work, one must first sharpen one's tools. Engineers should make good use of the company's tools to solve problems efficiently and be proficient with its middleware. If the company already offers a piece of middleware, use it first: middleware maintained by a dedicated middleware team is better supported than middleware maintained inside a business R&D team, so do not reinvent the wheel within a group or a team, where personnel turnover can easily leave a home-grown component unmaintained. The following diagram lists the middleware tools commonly used inside the company:

[Figure: middleware tools commonly used inside the company]



Method: solving a problem requires the right method; choosing well yields twice the result for half the effort and makes both locating and fixing the problem more efficient. The following figure shows the methods our developers commonly use to troubleshoot problems:

[Figure: common troubleshooting methods used by developers]



3.2.7 How to fix the problem

With the knowledge, tools, and method in hand, we can usually locate the problem quickly. Once it is located, we have to work out how to fix it; the following figure shows the problem-fixing process:

[Figure: the problem-fixing process]



3.2.8 How to review the problem

After a problem occurs we need to analyze its root cause and draw lessons from it, so that the same mistake is not made again. That is the problem review, which generally goes through these steps: restate the goal, evaluate the result, analyze the causes, and summarize the lessons. For example, the weekly OpsReview of our C2 department includes a review of online problems (COE); here are some thoughts on how to run a COE review.

•Refer to the industry's 5WHY analysis method to analyze the root cause of the problem

• In 5WHY analysis, the 5 stands for the depth of questioning, not a fixed number of questions

• Keep asking the next question based on the previous answer; the questions are connected and progressive, leading to the root cause of the problem

[Figure: 5WHY root-cause analysis]



4. References

•https://itrevolution.com/articles/20-years-of-google-sre-10-key-lessons-for-reliability/

•https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj591573(v=pandp.10)?redirectedfrom=MSDN

•https://sre.google/books/