Building and implementing a systematic data monitoring capability for the Zhihu community

Author: Flash Gene

The quality assurance system has two main parts: intercepting problems offline, before release, and recalling (detecting) problems online, after release.

Set goals

Background

An overall analysis of online faults and user feedback showed that, as the company's multiple business lines developed, business complexity kept rising. Existing monitoring was not correlated with business indicators, and functional problems and event-tracking ("buried point") problems took a long time to detect, so losses could not be stopped in time. A complete data monitoring system was therefore urgently needed to find online problems promptly and keep the business running stably.

Target

Starting from one business line, design a reasonable and accurate general plan for building a data monitoring system, then apply it to other business scenarios to guide the monitoring construction of different business lines and improve quality assurance capabilities.

The construction itself needs comprehensive and reliable indicators to measure its effect:

  • Establish a core business data monitoring system that covers 100% of core business indicators, performance indicators and other related indicators.
  • Ensure the quality and stability of core business indicators and core functions, covering abnormal pop-ups and abnormal interface display, and reduce the detection time for such problems to the hour level.
  • Establish a process specification for setting data monitoring indicators, so that all scenarios are standardized and unified.

Challenges

Building a data monitoring system poses two main challenges:

(1) Formulating an effective and reliable construction plan for the monitoring system:

  1. Determination of monitoring indicators: covering the monitoring indicators of core functions is an important prerequisite for the monitoring system.
  2. Monitoring dimensions and means: defining reasonable monitoring dimensions and means is the most important point in building the system.
  3. Statistical methods: each type of monitoring indicator needs a suitable statistical method; otherwise noise is easily introduced and decisions are misled.
  4. Alarm threshold: the alarm method and threshold must be set according to the actual business indicators and how their data fluctuates; otherwise false alarms and missed alarms are likely.
  5. Diagnosis and analysis: the value of the monitoring system lies in how quickly a problem can be diagnosed and located from the alarm information.

(2) The implementation and operation of the monitoring system:

Ensure that cross-team tasks are executed to achieve project goals and deliver value.

Solution

To address these two challenges, this chapter explains the overall solution in two parts: the construction plan of the monitoring system, and its implementation and operation.

(1) The construction plan of the monitoring system

The construction plan has four main parts: monitoring indicators, monitoring capabilities, monitoring construction, and problem handling.

Part I: Monitoring Metrics

The core idea is to sort out business scenarios and, combined with indicator stratification, determine the different types of monitoring indicators: business indicators, availability indicators, and performance indicators.

The business-driven, top-down hierarchical data monitoring system has three layers: business layer monitoring, application layer monitoring, and system layer monitoring.

Business layer monitoring provides the most intuitive indicators in the system. It reflects how users actually use the service and how healthy the service is, and it is directly linked to business results.

Application layer monitoring is the core of the system. It monitors the software operation indicators most closely related to service stability and business system performance. These fall into two categories, availability indicators and performance indicators, and include response time, traffic, error rate/success rate, and error codes.

    • Response time: measured on both the client and the server. Client-side response time is the end-to-end time observed on the client, including server-side execution time, network interaction time, and so on. Server-side response time includes only the execution time of the service provider.
    • Traffic: the number of requests flowing through the network, whether HTTP requests, RPC requests, or messages sent to a processing queue.
    • Success/failure rate: the proportion of calls to each functional interface that succeed or fail.
    • Error codes: the error codes returned in different states when requests to different functions fail.
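
The four application layer indicators above can be derived from per-request records. The following is a minimal sketch, assuming a hypothetical `RequestRecord` schema (the field names are illustrative and are not Zhihu's actual log format); it treats client-side time as server execution time plus network time, as described in the response time item.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    """One logged request (hypothetical schema, for illustration only)."""
    interface: str
    server_ms: float            # execution time of the service provider only
    network_ms: float           # network interaction time seen by the client
    error_code: Optional[str]   # None means the call succeeded

    @property
    def client_ms(self) -> float:
        # End-to-end time on the client = server execution + network interaction.
        return self.server_ms + self.network_ms


def summarize(records: list[RequestRecord]) -> dict:
    """Compute traffic, success rate, mean response times and error-code counts."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")
    failures = [r for r in records if r.error_code is not None]
    return {
        "traffic": total,
        "success_rate": (total - len(failures)) / total,
        "avg_client_ms": sum(r.client_ms for r in records) / total,
        "avg_server_ms": sum(r.server_ms for r in records) / total,
        "error_codes": dict(Counter(r.error_code for r in failures)),
    }


records = [
    RequestRecord("feed", 40.0, 20.0, None),
    RequestRecord("feed", 60.0, 20.0, None),
    RequestRecord("feed", 80.0, 40.0, "E500"),
    RequestRecord("feed", 20.0, 20.0, "E500"),
]
summary = summarize(records)
print(summary)
```

Keeping the client-side and server-side times as separate fields makes it possible to tell whether a slowdown comes from the service itself or from the network.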

System layer monitoring is the bottom layer of the system. It mainly concerns R&D and operations engineers, and focuses on indicators of the hardware that runs the software system.

System layer monitoring mainly covers the running state of machine resources and middleware. This part generally already has mature operations monitoring strategies and alarm specifications, so the focus here is on the first two layers, whose indicator systems are more closely tied to the business.

Part II: Monitoring Capabilities

Within the overall monitoring system, it is equally important to know what capabilities the existing platforms support and to select the right monitoring method for each indicator. Zhihu's existing monitoring falls mainly into the following types. Each type suits different monitoring indicators and data sources, and their capabilities are introduced in turn below.

Client real-time monitoring targets the core business tracking points derived from breaking down product indicators, together with the technical tracking points used to calculate application layer indicators. It mainly covers core business scenarios and fine-grained tracking indicators, business performance indicators, and business availability indicators.

Characteristics:

1. Because it relies on tracking data reported by the client, it is closer to what the user actually perceives. The tracking data is richer and can cover the user's whole operation flow, and it can reveal scenario problems that server-side monitoring cannot see, which makes comprehensive business monitoring possible.

2. Problems are found faster, and the response to possible online problems is more sensitive.

3. Because online traffic fluctuates widely, monitoring is harder to configure, and the monitoring system needs to reduce noise and false alarms.

Platform: Grafana + firing alarms, etc.

Client offline monitoring uses daily aggregated data from data warehouse tables and focuses on the business indicators that correspond to product indicators.

Characteristics:

1. The data source is data warehouse tables that have already been through preliminary processing. This suits business indicators that the product side tracks per day, and computing daily statistics on offline data reduces the impact of data volatility.

2. Because the data is offline T+1 data, it lags, so problems are found with a delay.

Platform: DQC (offline data analysis platform).

Server real-time monitoring mainly uses logs and similar information, such as request success rates, to monitor service availability and performance indicators: chiefly the request volume, time consumption, and success rate of each interface. By monitoring interface requests and instrumentation data on the server side, it focuses on whether the capability the application provides to users meets a given SLA, and it makes server-side problems easy to find.

Part III: Monitoring Construction

With the indicator stratification and the monitoring platform capabilities above as background, the core elements of monitoring construction are described below.

Grafana monitoring dashboard

The core elements of the Grafana dashboard configuration are the monitoring dimensions and the data aggregation method.

Monitoring dimensions

Reasonable monitoring dimensions give a clear, intuitive view of the data under each subdivision and make drill-down analysis effective. Dimensions are broken down into business-specific dimensions and general dimensions; the general dimensions are mainly the app version and the mobile operating system.

When setting dimensions, the load on the metric store also has to be considered; otherwise report queries become slow or fail.

  • Control the number of dimensions: the fewer the dimensions, the faster the query.
  • Control the number of values per dimension, and do not put high-cardinality variables into a metric, such as user ID, request IP, random ID, or task ID.
  • If a dimension is important but has many values, consider limiting its range of values. For example, the page dimension has many values, so it can be limited to the top pages by volume.
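
One way to limit a dimension's range of values can be sketched as follows. This is an illustration under stated assumptions, not the configuration used at Zhihu: the helper `cap_dimension` and the choice to fold the remaining values into an `"other"` bucket are hypothetical.

```python
from collections import Counter


def cap_dimension(values: list[str], top_n: int, other_label: str = "other") -> list[str]:
    """Keep the top_n most frequent values and fold the rest into other_label."""
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else other_label for v in values]


# Hypothetical page names reported with each event.
pages = ["home"] * 5 + ["answer"] * 3 + ["profile", "settings"]
capped = cap_dimension(pages, top_n=2)
print(Counter(capped))
```

After capping, the page dimension carries at most `top_n + 1` distinct values, no matter how many pages exist.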

Data aggregation methods

Monitoring data usually fluctuates, so processing it generally requires statistical methods. Common ones include the raw value (observation), maximum/minimum (max/min), average (avg), count, percentage (%), and percentile.

Each type of monitoring indicator needs a suitable statistical method; otherwise noise is easily introduced and decisions are misled. For example, if only the average response time of an application cluster is tracked, a few extremely high or low values can easily skew the mean. The 50th, 95th, and 99th percentiles are usually used instead to remove the influence of extremes and reflect the cluster's true response time.
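
The effect of a few extreme values on the mean can be seen in a small sketch. The sample data is made up for illustration, and the `percentile` helper uses the nearest-rank definition, which is one of several common definitions and may differ from what a given monitoring backend computes.

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up response times: 98 requests at 100 ms and 2 outliers at 5000 ms.
response_ms = [100.0] * 98 + [5000.0] * 2

mean_ms = statistics.mean(response_ms)
p50, p95, p99 = (percentile(response_ms, p) for p in (50, 95, 99))
print(mean_ms, p50, p95, p99)
```

Here the mean describes neither the typical request nor the slow tail, while the 50th and 95th percentiles show the typical experience and the 99th percentile exposes the outliers.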

Alert policy

The core of the alert policy is to determine the check items, preprocess the query result data, and set the threshold.

The check items depend mainly on the data aggregation and the indicator calculation configured in the Grafana dashboard. Common calculations include taking the value directly and year-on-year (same-period) comparison. For the available functions, see: https://graphite.readthedocs.io/en/latest/functions.html
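
A same-period comparison can be sketched as a relative change against the value at the same point in an earlier period. The helper names and the 30% drop limit below are assumptions for illustration; in Graphite such a comparison is typically assembled from its own functions (for example `timeShift`) rather than written in Python.

```python
def relative_change(current: float, previous: float) -> float:
    """Relative change of the current value versus the same point in the previous period."""
    if previous == 0:
        raise ValueError("previous value is zero; relative change is undefined")
    return (current - previous) / previous


def dropped_too_much(current: float, previous: float, max_drop: float = 0.3) -> bool:
    """True when the metric fell by more than max_drop compared with the previous period."""
    return relative_change(current, previous) < -max_drop


# Made-up numbers: clicks in the same hour, this week versus last week.
print(dropped_too_much(current=600, previous=1000))
print(dropped_too_much(current=900, previous=1000))
```

Comparing against the same period smooths out regular daily or weekly traffic patterns that a fixed absolute value would misread as anomalies.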

There are five types of data preprocessing methods: maximum, minimum, average, total, and last.

Threshold settings: thresholds are either single or dynamic. With a single threshold, an alarm is raised whenever the configured threshold is met. A dynamic threshold applies different alarm thresholds in different configured time ranges.
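
The preprocessing methods and the two threshold types can be sketched together. This is a minimal illustration: the schedule, the threshold values, and the rule "alert when a success rate falls below the threshold" are all assumptions, not the alarm platform's actual configuration.

```python
# The five preprocessing methods: each collapses the points of a query window into one value.
REDUCERS = {
    "max": max,
    "min": min,
    "avg": lambda points: sum(points) / len(points),
    "total": sum,
    "last": lambda points: points[-1],
}


def preprocess(points: list[float], method: str) -> float:
    """Collapse the points of one query window into a single value."""
    return REDUCERS[method](points)


def single_threshold_alert(value: float, threshold: float) -> bool:
    """Single threshold: alert whenever the value falls below one fixed threshold."""
    return value < threshold


# Hypothetical schedule of (start_hour, end_hour, threshold), end exclusive:
# a looser threshold at night when traffic is low, a stricter one in the daytime.
SCHEDULE = [(0, 8, 0.90), (8, 24, 0.98)]


def dynamic_threshold_alert(value: float, hour: int, schedule=SCHEDULE) -> bool:
    """Dynamic threshold: pick the threshold configured for the current time range."""
    for start, end, threshold in schedule:
        if start <= hour < end:
            return value < threshold
    raise ValueError(f"no threshold configured for hour {hour}")


success_rates = [0.99, 0.97, 0.95]           # made-up points from one query window
latest = preprocess(success_rates, "last")
print(single_threshold_alert(latest, 0.98))
print(dynamic_threshold_alert(latest, hour=3))
print(dynamic_threshold_alert(latest, hour=14))
```

With the same value, the dynamic threshold stays quiet at 03:00 but alerts at 14:00, which is the kind of time-dependent behavior a single threshold cannot express.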

After the Grafana dashboard and the alarm platform are configured, the data in each link must be checked and corrected. The alarm rules are then put through a trial run, after which the alarm policy is optimized and the data corrections are summarized.

Part IV: Problem Handling Center

On-duty mechanism

Create the on-duty group, determine its members and the on-duty policy, and configure them in the alarm rules, so that the person on duty responds in time and takes the corresponding action, and online problems do not go unhandled.

Alarm analysis

After an alarm notification arrives, its cause has to be analyzed manually. The following methods can be used:

  • Application layer metrics: for alarms such as failure rate and error rate, go straight to the error code monitoring report and locate the error code.
  • Business layer metrics: drill down into specific dimensions in the monitoring report to see which dimensions the abnormal metric occurs in, which narrows the scope of troubleshooting.
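
The drill-down step can be sketched as grouping the same events by each candidate dimension and comparing the failure rates. The event fields, version numbers, and data below are made up for illustration.

```python
from collections import defaultdict


def failure_rate_by(events: list[dict], dimension: str) -> dict[str, float]:
    """Failure rate per value of the given dimension."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if not event["ok"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Made-up events: all failures happen on app version 9.2, on both operating systems.
events = (
    [{"app_version": "9.1", "os": "iOS", "ok": True}] * 8
    + [{"app_version": "9.1", "os": "Android", "ok": True}] * 8
    + [{"app_version": "9.2", "os": "iOS", "ok": True}] * 2
    + [{"app_version": "9.2", "os": "iOS", "ok": False}] * 2
    + [{"app_version": "9.2", "os": "Android", "ok": True}] * 2
    + [{"app_version": "9.2", "os": "Android", "ok": False}] * 2
)

by_version = failure_rate_by(events, "app_version")
by_os = failure_rate_by(events, "os")
print(by_version)
print(by_os)
```

The failure rate is identical across operating systems but concentrated in one app version, so troubleshooting can be narrowed to that version.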

(2) Implementation and operation

  • Quality assurance operations: break the work into milestones to make the owner and deadline of each task explicit, and use a weekly project meeting to keep tasks on track and iterate on the plan.
  • Team collaboration and external linkage: set up a joint team with RD and PM for the quality construction of the data monitoring system, and keep driving the optimization of the monitoring system.
  • Business promotion: in the early stage, measure and optimize the construction plan through a pilot business, then roll it out from that single point to other businesses.

Summary of results

Overall, the construction of the data monitoring system has made phased progress, and the results are in line with expectations. The system improves our ability to recall and discover online problems, which makes the quality assurance system more comprehensive and reliable and ultimately serves online quality.

  1. System construction:

    In the community consumption scenario, a data monitoring system was built from 0 to 1 and rolled out to N+ businesses, with 100% coverage of core indicators:

    1. Business coverage: piloted in social services, then promoted on the consumer side, landing in N+ businesses in total.

    2. Indicator coverage: 100% of core indicators are covered.

    3. Dashboard and alarm configuration: 200+ business dashboards and 300+ alarms have been configured in total.

  2. Ability to find problems:

    In the community consumption scenario, running the data monitoring system has intercepted a cumulative * problems, of which * were effective.

    1. Client real-time monitoring: problems in core scenarios can now be perceived through alarms at the hour level.

    2. Client offline monitoring: the ability to find problems in product business indicators has been built.

    3. Server real-time monitoring: stability and quality problems on the server side can now be perceived within 10 minutes.

Author: Red Maruko Jingjing

Source: https://zhuanlan.zhihu.com/p/614526578