
There's a Trick to Lowering MTTR! Building Qunar's Observability System

Author: DBAplus Community

This article is based on a talk given in the online session "Deeplus Live: Qunar Observability Practice". (A link to the replay is at the end of the article.)


Xiao Shuang

Technical Team Lead, Qunar Infrastructure Department

  • He joined Qunar in 2018 and is currently responsible for Qunar's CI/CD, monitoring platform, and cloud-native platforms.
  • He led the construction of Qunar's containerization platform, helped business lines migrate large-scale applications onto it, and completed the Watcher 2.0 upgrade of the monitoring system as well as the rollout of the root cause analysis system. He has deep knowledge of and hands-on experience with monitoring and alerting, CI/CD, and DevOps.

Outline

1. Measuring the maturity of the monitoring system with fault metrics

2. Detecting order faults in seconds

3. Locating the cause of complex faults in minutes

4. Summary

1. Measuring the maturity of the monitoring system with fault metrics

In the early days of our internal monitoring system, we focused on core capabilities such as metric storage and alerting. As the monitoring system iterated, we began to ask: how much do these monitoring metrics actually help the business?

1. Fault data analysis

Last year, we analyzed the company's failure data and found that:

  • The average time to discover a fault was about 4 minutes
  • Only 20% of order faults were detected within 1 minute
  • 48% of faults took longer than 30 minutes to handle

These numbers left plenty of room for improvement, so based on this data we used MTTR as the yardstick for optimization.

MTTR breaks a fault down into three phases: discovery, diagnosis, and repair. We can therefore build tooling for each phase to shorten the overall MTTR.


2. Detecting order faults in seconds

Last year, we set a clear goal: order faults must be detected within seconds.

1. Status quo

Before implementing second-level fault detection, we reviewed the existing monitoring system and identified the key constraints for building second-level monitoring (a minimal protocol sketch follows this list):

  • Collection uses the Graphite protocol: metric collection and formatting follow the Graphite protocol, and compatibility must be preserved when building second-level metrics.
  • High storage I/O and large disk footprint: the previous TSDB, Graphite's Carbon + Whisper, suffers from high disk I/O and excessive disk usage due to its space pre-allocation policy and write amplification.
  • Minute-level design: data collection, reporting, and alerting were all designed at minute granularity, with collection done in pull mode once per minute.
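
For reference, the Graphite plaintext protocol that all of this must stay compatible with is simply lines of "metric.path value timestamp" sent over TCP. The sketch below is only an illustration; the carbon endpoint and metric name are made up.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// sendGraphite writes one metric in the Graphite plaintext format:
// "<metric.path> <value> <unix-timestamp>\n"
func sendGraphite(addr, metric string, value float64) error {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "%s %f %d\n", metric, value, time.Now().Unix())
	return err
}

func main() {
	// Hypothetical carbon/relay endpoint and metric name, for illustration only.
	if err := sendGraphite("127.0.0.1:2003", "app.order.create.qps", 42); err != nil {
		fmt.Println("send failed:", err)
	}
}
```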

2. Storage solution selection

At the time, we compared two TSDBs: M3DB and VictoriaMetrics (VM).

1) M3DB

Pros:

  • High compression ratio, high performance
  • Scalable
  • Supports the Graphite protocol and some aggregation functions: our business scenarios use aggregation functions heavily and in complex ways

Cons:

  • Complex deployment and maintenance
  • Low community activity
  • Slow update iteration

2) VictoriaMetrics

Pros:

  • High performance: a single node can read and write up to 10 million metrics
  • Every component can be scaled out independently
  • Native support for the Graphite protocol
  • Simple to deploy
  • Highly active community with fast iteration

Cons:

  • Performance degrades severely in Graphite aggregate-read scenarios

3. Stress test data


We stress tested VictoriaMetrics:

The stress test showed low disk usage (about 40 GB) and low overall resource consumption. However, with large numbers of aggregated metrics or heavy use of aggregation functions in queries, performance degraded severely and queries even timed out.

4. Separation of storage and computing


Based on these characteristics, we separated storage and compute for VictoriaMetrics.

Because VM excels at single-metric query performance, we use VM only for single-metric reads and let an aggregation layer (CarbonAPI) above it compute all aggregation functions.

The advantage of the aggregation layer (CarbonAPI) is that it is stateless and can be scaled out arbitrarily, and we can add policies, data-monitoring features, aggregation logic, throttling, and other protective capabilities on top of it.

In this architecture, an additional metadata DB is introduced to record where each metric lives, i.e. which VM cluster stores it.

At query time, the user queries CarbonAPI first; CarbonAPI looks up where the metric is stored in the metadata DB, fetches the corresponding data from VM, performs the aggregation in the aggregation layer, and returns the result.
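
A rough sketch of that query path is shown below. It is only an illustration of the idea, not Qunar's CarbonAPI code: the type names, the metadata lookup, and the sum-only aggregation are all assumptions.

```go
package main

import "fmt"

// fakeVM stands in for a VictoriaMetrics cluster answering single-metric queries.
type fakeVM map[string][]float64

func (c fakeVM) Query(metric string) []float64 { return c[metric] }

// Aggregator plays the CarbonAPI role: stateless, resolves metrics via a
// metadata DB (metric -> cluster name) and does the aggregation itself.
type Aggregator struct {
	Meta     map[string]string // metric name -> cluster name
	Clusters map[string]fakeVM // cluster name -> cluster client
}

// SumSeries fetches each metric from the cluster that owns it and sums them
// point by point, so VM only ever serves cheap single-metric reads.
func (a Aggregator) SumSeries(metrics []string) []float64 {
	var out []float64
	for _, m := range metrics {
		series := a.Clusters[a.Meta[m]].Query(m)
		if out == nil {
			out = make([]float64, len(series))
		}
		for i, v := range series {
			if i < len(out) {
				out[i] += v
			}
		}
	}
	return out
}

func main() {
	vm := fakeVM{"a.qps": {1, 2, 3}, "b.qps": {4, 5, 6}}
	agg := Aggregator{
		Meta:     map[string]string{"a.qps": "vm-1", "b.qps": "vm-1"},
		Clusters: map[string]fakeVM{"vm-1": vm},
	}
	fmt.Println(agg.SumSeries([]string{"a.qps", "b.qps"})) // [5 7 9]
}
```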

We also built some custom features into the aggregation layer (CarbonAPI).

5. Metric collection

Separating storage and compute addressed VM's weakness; the second task was to make metric collection and alerting work at second-level granularity.

1) Client status quo

(Figure: self-developed client architecture)

Originally, we used a self-developed client; its architecture is shown in the figure above.

In the early implementation, calling Counter.Incr() to increment a metric put the data into a local metrics repository, which lives inside the application's client, i.e. in memory.

At the same time, an asynchronous scheduler generates a data snapshot at a fixed interval (minute-level), so the snapshot the server pulls each minute is fixed: pulling multiple times within the same minute returns the same snapshot.

The advantage of this approach is that if the server side has a problem, the data does not change, so accuracy is high. The disadvantage is that the scheduler only snapshots once per minute, so the metrics repository can only provide minute-level data.
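
A minimal sketch of that client model follows; the names (Counter, Repo) and the one-minute snapshot interval are illustrative, not the actual SDK.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Counter is the in-memory metric the application increments.
type Counter struct{ v int64 }

func (c *Counter) Incr() { atomic.AddInt64(&c.v, 1) }

// Repo holds live counters plus the last frozen snapshot the server pulls.
type Repo struct {
	mu       sync.RWMutex
	counters map[string]*Counter
	snapshot map[string]int64 // frozen copy; every pull in the window sees the same values
}

func NewRepo() *Repo {
	return &Repo{counters: map[string]*Counter{}, snapshot: map[string]int64{}}
}

func (r *Repo) Counter(name string) *Counter {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.counters[name]
	if !ok {
		c = &Counter{}
		r.counters[name] = c
	}
	return c
}

// Snapshot freezes current values; the scheduler calls this once per window
// (once a minute in the old design), so repeated pulls return identical data.
func (r *Repo) Snapshot() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for name, c := range r.counters {
		r.snapshot[name] = atomic.LoadInt64(&c.v)
	}
}

func (r *Repo) Pull() map[string]int64 {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make(map[string]int64, len(r.snapshot))
	for k, v := range r.snapshot {
		out[k] = v
	}
	return out
}

func main() {
	repo := NewRepo()
	go func() { // scheduler: one snapshot per minute in the legacy design
		for range time.Tick(time.Minute) {
			repo.Snapshot()
		}
	}()
	repo.Counter("order.create").Incr()
	repo.Snapshot() // force one snapshot for the demo
	fmt.Println(repo.Pull())
}
```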

2) Client-side transformation


Metrics repository scheme 1: follow the Prometheus model; the client takes no snapshots and only accumulates or records raw values, and the server computes the deltas when it pulls.

  • Advantages: the client saves memory and does not need to keep multiple copies of the data
  • Disadvantages: the server has to compute deltas, which adds load; the collection architecture changes a lot, and there may be data accuracy issues

Metrics repository scheme 2: compute and keep multiple copies of the data on the client and generate multiple snapshots (a minimal sketch follows this list).

  • Advantages: minimal change to the collection architecture, no data accuracy issues, low server load
  • Disadvantages: occupies some extra memory
  • Optimization: use t-digest to sample the data, which keeps memory usage acceptable
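
Below is a minimal sketch of scheme 2 under stated assumptions: the client freezes one snapshot per 10-second window into a small ring, so second-level pulls see stable data while memory stays bounded. Percentile metrics would additionally keep a t-digest per window, which is omitted here.

```go
package main

import (
	"fmt"
	"time"
)

const windows = 6 // six 10-second snapshots per minute (assumed granularity)

// Ring keeps one frozen snapshot per 10-second window, so the server can pull
// second-level data while repeated pulls within a window stay consistent.
type Ring struct {
	slots [windows]map[string]int64
}

// Freeze stores a copy of the current counter values into the slot that
// corresponds to the wall-clock 10-second window.
func (r *Ring) Freeze(now time.Time, counters map[string]int64) {
	idx := (now.Second() / 10) % windows
	snap := make(map[string]int64, len(counters))
	for k, v := range counters {
		snap[k] = v
	}
	r.slots[idx] = snap
}

// Pull returns the snapshot for the window containing t (nil if not yet frozen).
func (r *Ring) Pull(t time.Time) map[string]int64 {
	return r.slots[(t.Second()/10)%windows]
}

func main() {
	var r Ring
	now := time.Now()
	r.Freeze(now, map[string]int64{"order.create.qps": 120})
	fmt.Println(r.Pull(now))
}
```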

3) Server status quo


In the early days, the server was written in Python with a multi-process model: every minute the master node sent the full task list to the worker nodes. This is a typical producer/worker architecture, whose advantage is that worker nodes can be scaled out at will.

Problems:

When hundreds of thousands of collection tasks have to be distributed every 10 seconds, pushing them through MQ and having workers consume them can itself take more than 10 seconds, which is unacceptable in this scenario. In addition, Python consumes a lot of CPU when doing heavy aggregation calculations.

4) Server-side transformation

There's a trick to lowering MTTR! Construction of Qunar network observability system

It is still a master/worker architecture, but with the MQ removed. Workers are now stateful nodes: once a worker starts, the master schedules tasks to it directly, and the master detects dead workers through etcd and triggers a rebalance, eliminating the per-cycle task distribution.
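
A rough sketch of that liveness/rebalance mechanism using the etcd Go client is shown below; the key layout, TTL, and rebalance callback are assumptions for illustration, not the actual scheduler.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// registerWorker announces a worker under a leased key; if the worker dies,
// the lease expires and the key disappears, which the master observes.
func registerWorker(ctx context.Context, cli *clientv3.Client, id string) error {
	lease, err := cli.Grant(ctx, 10) // 10s TTL
	if err != nil {
		return err
	}
	if _, err := cli.Put(ctx, "/watcher/workers/"+id, "alive", clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	// Keep the lease alive for as long as the worker process is healthy.
	_, err = cli.KeepAlive(ctx, lease.ID)
	return err
}

// watchWorkers lets the master notice dead workers and rebalance their tasks.
func watchWorkers(ctx context.Context, cli *clientv3.Client, rebalance func(lostWorker string)) {
	for resp := range cli.Watch(ctx, "/watcher/workers/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			if ev.Type == clientv3.EventTypeDelete { // lease expired -> worker is gone
				rebalance(string(ev.Kv.Key))
			}
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // hypothetical etcd endpoint
		DialTimeout: 3 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	go watchWorkers(ctx, cli, func(lost string) {
		log.Println("worker lost, rebalancing tasks from", lost)
	})
	if err := registerWorker(ctx, cli, "worker-1"); err != nil {
		log.Fatal(err)
	}
	select {} // block forever in this sketch
}
```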

6. Final Architecture


The final second-level monitoring architecture has three core parts: data collection, data query, and alerting.

In practice, it helped business lines reduce order fault detection time from three minutes to one minute.

3. Locating the cause of complex faults in minutes


After moving to microservices, the problems that make faults hard to locate boil down to:

  • Complex call chains: far more complex than in a monolithic architecture
  • Complex dependencies: an application depends on many kinds of resources, and on more and more external services

1. Trace and metric correlation - precise localization


Internally, the tracing component is called Qtracer and the monitoring system is called Qmonitor.

Why correlate the two?

We found that after an alarm (usually a metric alarm) fires, if the application's QPS is high, a huge number of traces can be found in that window. Since it is unclear which traces are related to the alarming metric, this interferes with fast localization.

For example, if there are three entry points, the incoming traffic forms three traces, and their code paths may not pass through the alarming metric at all, in which case they tell us nothing. Or if only one of the traces passes through the alarming metric, using the other two traces to locate the fault is pointless.

So we embedded metrics into Qtracer: when Qmonitor counts a metric, it checks whether there is an active trace context, and if there is a trace or span, it records the current metric on that span.

When a metric fires an alarm, we can then directly pull up the traffic that actually passed through that metric and use those traces to locate the fault, which significantly improves efficiency.
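
Qtracer and Qmonitor are internal systems, but the same association can be sketched with OpenTelemetry's Go API: when a metric is counted, check whether the context carries a live span and, if so, record the metric name on it. This is an analogy, not Qunar's implementation.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// IncrWithTrace increments a counter (elided here) and, if the request is
// inside a trace, attaches the metric name to the current span so that an
// alarm on this metric can be joined back to the traces that really hit it.
func IncrWithTrace(ctx context.Context, metric string) {
	// ... increment the metric in the local metrics repository ...

	span := trace.SpanFromContext(ctx)
	if span.SpanContext().IsValid() { // there is a live trace/span
		span.AddEvent("metric", trace.WithAttributes(
			attribute.String("metric.name", metric),
		))
	}
}

func main() {
	tracer := otel.Tracer("demo")
	ctx, span := tracer.Start(context.Background(), "create-order")
	defer span.End()

	IncrWithTrace(ctx, "order.create.error") // metric name is illustrative
}
```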

(Figure: alarm panel)

After traces are found, we first run anomaly detection on the applications along the trace: which applications have abnormal logs or active alarms, whether the resources the application depends on are healthy, whether the container runtime environment is normal, and so on. These anomalies are shown on the panel to help developers locate the problem.

2. Quick check of self-dependencies - Application overview

We aggregate each application's core metrics, status, and related information into an application overview. When something goes wrong, you can skim the overview to quickly judge whether the application itself is healthy.


The application overview provides two lists: ingress requests and egress requests.

The ingress request list covers the services the application provides, i.e. requests made by others to this application. The panel shows which interfaces are being requested, average QPS, error rate, average response time, and trend charts.

The egress request list shows the downstream target applications and, for each, the average QPS, error rate, and response time of the outgoing requests.

In practice, developers can judge from this information whether the application is in a normal state.

3. Quick check of self-dependencies - Dependency details

(Figure: dependency details)

4. Automatic analysis platform

The automatic analysis platform works by collecting Qtracer data, inspecting the call chains for anomalies based on the traces, and then automatically analyzing and recommending the anomalies with the highest weights on the platform.

(Figure: simplified model of the automatic analysis platform)

The above is a simple model diagram of the automated analysis platform.

At the bottom is knowledge graph mining, which is essential for automatic analysis: application dependencies, the containers applications run in, data center information, and so on all need to be recorded in the knowledge graph to form the basis for the analysis behaviors.

Anomaly information is then analyzed by these behaviors, and the weight system scores and ranks the candidates to recommend the most likely anomalies.

1) Knowledge Graph

  • Basic data: a unified event center, logs, traces, monitoring alarms, and application profiles (the basic metadata and configuration of each application).
  • Application relationships: service call chains plus strong/weak dependency information.
  • Resource relationships: awareness of the resources and physical topology behind an application, e.g. which host and network environment a container or KVM instance runs on.
  • Relationships between anomalies: starting from an abnormal metric, quickly and accurately find the corresponding traces and logs, and mine correlations between abnormal alarms (a plain sketch of such a graph model follows this list).
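
A very plain sketch of such a graph model is shown below; the node kinds, edge relations, and helper are illustrative stand-ins for the real graph schema.

```go
package main

import "fmt"

// NodeKind enumerates the entities the knowledge graph records.
type NodeKind string

const (
	App       NodeKind = "app"       // AppCode
	Container NodeKind = "container" // runtime container
	Host      NodeKind = "host"      // KVM / physical host
	Resource  NodeKind = "resource"  // DB, cache, MQ, ...
)

// Edge is a typed relation, e.g. app -> app "calls" (with strong/weak marked),
// app -> container "runs_on", container -> host "scheduled_on".
type Edge struct {
	From, To string
	Relation string
	Strong   bool // only meaningful for dependency edges
}

type Graph struct {
	Nodes map[string]NodeKind
	Edges []Edge
}

// Neighbors returns nodes directly connected to id, used later to walk the
// connected subgraph around an alarming AppCode.
func (g *Graph) Neighbors(id string) []string {
	var out []string
	for _, e := range g.Edges {
		switch id {
		case e.From:
			out = append(out, e.To)
		case e.To:
			out = append(out, e.From)
		}
	}
	return out
}

func main() {
	g := &Graph{
		Nodes: map[string]NodeKind{"order-api": App, "pay-api": App, "mysql-01": Resource},
		Edges: []Edge{
			{From: "order-api", To: "pay-api", Relation: "calls", Strong: true},
			{From: "pay-api", To: "mysql-01", Relation: "depends_on", Strong: true},
		},
	}
	fmt.Println(g.Neighbors("pay-api"))
}
```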

2) Application analysis

(Figure: application analysis)

3) Link analysis


When QPS is high, even with accurate correlation there are still many traces found within a single minute, so they need to be filtered or converged into a smaller set.

Filter categories:

  • Exception trace filtering: focus on the abnormal traces first and classify them
  • Entry (T-value) classification filtering: classify traces vertically by service entry point
  • Topological similarity filtering: compare the topological similarity of traces between entry A and entry B (a sketch of one way to compute this follows the list)
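
One simple way to express topological similarity is Jaccard similarity over the call edges of two traces, as sketched below; this formulation is an assumption, the talk does not specify the exact measure.

```go
package main

import "fmt"

// edgeSet turns a trace's call edges ("A->B") into a set.
func edgeSet(edges []string) map[string]bool {
	s := make(map[string]bool, len(edges))
	for _, e := range edges {
		s[e] = true
	}
	return s
}

// Similarity is |A ∩ B| / |A ∪ B| over call edges: traces with almost the
// same topology can be collapsed into one representative before analysis.
func Similarity(a, b []string) float64 {
	sa, sb := edgeSet(a), edgeSet(b)
	inter := 0
	for e := range sa {
		if sb[e] {
			inter++
		}
	}
	union := len(sa) + len(sb) - inter
	if union == 0 {
		return 1
	}
	return float64(inter) / float64(union)
}

func main() {
	t1 := []string{"A->B", "B->C", "B->D"}
	t2 := []string{"A->B", "B->C"}
	fmt.Printf("similarity: %.2f\n", Similarity(t1, t2)) // 0.67
}
```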

How do we locate the AppCode that may be the root cause?


Using the alarming AppCode as the starting vertex, find the connected subgraph.

We then traverse the subgraph and mark suspicious AppCodes (a simplified sketch follows this list):

  • AppCodes that are marked as abnormal on the trace links
  • AppCodes whose alarm concentration, once calculated, is higher than a certain threshold
  • AppCodes that currently have L1/L2-level alarms
  • AppCodes whose interface calls look abnormal: analyze the call relationships between the applications in the trace links, check whether each interface's error rate and latency are normal, and mark the abnormal ones
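
A simplified version of that traversal might look like the following; the signal checks are stubbed out and the concentration threshold is illustrative.

```go
package main

import "fmt"

// Signals marks an AppCode as suspicious; each field stands in for a real
// signal (trace-level error marks, alarm concentration, L1/L2 alarms).
type Signals struct {
	TraceMarkedAbnormal map[string]bool
	AlarmConcentration  map[string]float64
	HasL1L2Alarm        map[string]bool
}

// SuspectAppCodes BFS-walks the dependency graph from the alarming AppCode
// and collects every AppCode that trips at least one signal.
func SuspectAppCodes(adj map[string][]string, start string, s Signals) []string {
	const concentrationThreshold = 0.3 // illustrative
	visited := map[string]bool{start: true}
	queue := []string{start}
	var suspects []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if s.TraceMarkedAbnormal[cur] ||
			s.AlarmConcentration[cur] > concentrationThreshold ||
			s.HasL1L2Alarm[cur] {
			suspects = append(suspects, cur)
		}
		for _, next := range adj[cur] {
			if !visited[next] {
				visited[next] = true
				queue = append(queue, next)
			}
		}
	}
	return suspects
}

func main() {
	adj := map[string][]string{"A": {"B", "C"}, "B": {"D"}, "C": {}, "D": {}}
	s := Signals{
		TraceMarkedAbnormal: map[string]bool{"B": true},
		AlarmConcentration:  map[string]float64{"D": 0.5},
		HasL1L2Alarm:        map[string]bool{},
	}
	fmt.Println(SuspectAppCodes(adj, "A", s)) // [B D]
}
```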

For these candidate AppCodes, we then perform more detailed anomaly detection.

4) Pruning and ranking - the weight system


There are four types of weights:

  • Static weight: an empirical weight, adjusted periodically (for example annually) based on the observed causes of failures.
  • Dynamic weight: the root-cause weight, which escalates itself according to the severity of each root cause so that the real root cause is not drowned out.
  • Application weight: the weight of an application's anomalies within the current fault, indicating the probability that this application affected the faulty application.
  • Strong/weak dependency weight: strong/weak dependency data states clearly whether interface m1 of application A depends strongly or weakly on interface m2 of application B, and the impact probability can be confirmed based on that.

(Figure: application weight calculation example)

How the application weight is calculated:

Trace convergence: while converging traces, each abnormal AppCode found adds to that AppCode's weight, scaled by its distance from the alarming AppCode; the closer an AppCode is to the alarming AppCode, the higher the weight.

As shown in the figure above, all three traces enter at A. The first time C is found to be abnormal, C's weight is increased by one; the second time C is found abnormal, the weight keeps accumulating. This is how application weights build up.

In addition, the application weight is affected by application distance. In the hotel business scenario, for example, the closer A is to B, the more likely we believe A caused B's anomaly, and the higher A's weight.
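
Putting the two ideas together (accumulation per abnormal occurrence, higher weight for closer applications), the application weight could be computed roughly as below; the 1/(1+distance) decay is an assumed form, the talk only states the direction.

```go
package main

import "fmt"

// Occurrence is one abnormal AppCode found while converging a trace, together
// with its hop distance from the alarming AppCode.
type Occurrence struct {
	AppCode  string
	Distance int // hops from the alarming application
}

// AppWeights accumulates weight per abnormal AppCode: each occurrence adds
// weight, and nearer applications contribute more (1/(1+distance) is an
// illustrative decay, not the production formula).
func AppWeights(occurrences []Occurrence) map[string]float64 {
	w := map[string]float64{}
	for _, o := range occurrences {
		w[o.AppCode] += 1.0 / float64(1+o.Distance)
	}
	return w
}

func main() {
	// Three traces entering at A; C shows up abnormal twice, B once.
	occ := []Occurrence{
		{AppCode: "C", Distance: 1},
		{AppCode: "C", Distance: 1},
		{AppCode: "B", Distance: 2},
	}
	fmt.Println(AppWeights(occ)) // C accumulates 1.0, B about 0.33
}
```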


Strong/weak dependency pruning:

If A fires an alarm and both B and C have root-cause-level anomalies, with B being a weak dependency and C a strong dependency, we lean towards concluding that A's alarm was caused by C's anomaly.

5. Result output


The output indicates, for example, which recent releases are involved, what proportion each root-cause candidate accounts for, and which anomalies deserve attention.

6. Practical effects

(Figure: actual online fault scenario)

The figure above shows an actual online fault scenario: for example, when MySQL threads are abnormal and the load is particularly high, this is surfaced directly on the result output page.


After these changes, the proportion of faults that are slow to locate has dropped by 20%, and the accuracy of root-cause localization is 70%-80%.

4. Summary

  • Use fault MTTR metrics to operate, optimize, and build the monitoring system
  • Second-level monitoring mainly addresses overly long fault detection times
  • To assist fault localization under complex call chains and dependencies, confirm the health of the components and applications each application depends on, and compute the weights relating the observed anomalies to the fault

Q&A

Q1: What exactly is the metric placed in the span? What does the data model look like?

A: For example, suppose there is an interface called false and you need to monitor its QPS; the metric is named false-QPS. This metric is usually counted when a request enters. At the same time, we check whether there is an active trace and span, and if so, we put the metric name into the span so it is associated with the trace ID. Later, by cleaning the data, we can build indexes in ES and search for traces by metric.

The data model is fairly simple: a trace ID may contain many spans, and a span can be understood as a big JSON document; putting the metric into the span is what creates the association.
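
As a rough picture of that data model, a span document with its associated metric names might look like the struct below; the field names are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Span is an illustrative shape for the "big JSON" described above: the
// metric names recorded on the span are what gets indexed in ES so that a
// trace can be found from a metric.
type Span struct {
	TraceID   string   `json:"trace_id"`
	SpanID    string   `json:"span_id"`
	Operation string   `json:"operation"`
	Metrics   []string `json:"metrics"`
}

func main() {
	s := Span{
		TraceID:   "trace-id-placeholder",
		SpanID:    "span-id-placeholder",
		Operation: "POST /order/create", // illustrative operation name
		Metrics:   []string{"false-QPS"}, // metric name from the example above
	}
	b, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(b))
}
```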

Q2: What technology stacks or tool stacks are used in Intelligent Analytics?

A: Go is used at the language layer, and the algorithms include scheduling, outlier detection, weighting, and so on.

Q3: How do you ensure data accuracy in the metrics repository? What are the criteria?

A: Data accuracy is controlled by the client/SDK. For example, a snapshot generated at second 0 covers the data from seconds 0 to 59; no matter how the server pulls it, new data is written to a new snapshot, so the data sets do not interfere with each other.

Data accuracy requirements differ by business scenario. In most scenarios, losing one or two monitoring data points is not a big problem; in order-related scenarios, data loss is a much bigger issue.

Q4: What is the approximate proportion of different weights in the weight system?

A: At present we do not use fixed proportions; the overall approach is like a scoring system in which the different weights are added into a score and the highest-scoring candidates are finally surfaced. We can set a range for each weight, and a weight can escalate itself within its range depending on the situation.

Q5: How is the strength or weakness of a dependency determined? Is it through static configuration in advance, or judged dynamically at runtime based on rules?

A: Strong/weak dependency data is provided by another system. Qunar practices chaos engineering, and one part of it is strong/weak dependency testing. We run regular strong/weak dependency governance within the business lines and retain this data.

Q6: What online tools and systems do the alarm management and O&M teams use? How do you evaluate application business teams and quantify SLA metrics or fault statistics for performance reviews?

A: One of the most common ways in the industry to quantify SLA metrics is to use burndown charts to track failures. For example, if a team sets an SLO of four nines, that leaves only a little over 50 minutes of failure time per year (a year has 365 × 24 × 60 = 525,600 minutes, and 0.01% of that is about 52.6 minutes).

↓ Watch the replay of this live session

https://weixin.qq.com/sph/A4plNbhzq