Background

As Zhihu's business has grown rapidly, so has its complexity. Server-side monitoring is currently scattered and fragmented, and problems take a long time to discover. A complete server-side monitoring system is therefore urgently needed to monitor services and keep the business running stably.

Goals

Establish a server-side monitoring and alerting system for the homepage to improve awareness of online problems and shorten the discovery cycle for server-side problems to the hour level.
Starting from this application, distill a general server-side monitoring solution that can guide the construction of server-side monitoring for other business lines.
Discover X+ valid problems through server-side monitoring.

Hierarchical monitoring metrics

To arrive at a complete and systematic set of monitoring metrics, this article takes a business-driven, top-down approach and divides the metrics into three layers: business layer, application layer, and system layer. The following is a hierarchical diagram of monitoring metrics:

[Figure: hierarchical diagram of monitoring metrics, captioned "Exploration and practice of server-side monitoring solutions"]

Business layer

Business layer monitoring covers four aspects: comprehensive data monitoring, business data monitoring, interface monitoring, and key node monitoring.

Comprehensive data monitoring: tracks the overall data of the application, such as the application SLA and the number of alerts at each level ("warning", "critical").
Business data monitoring: checks each day whether business data stays within its normal fluctuation range, using metrics such as the number of interface requests and the number of 5XX responses.
Interface monitoring: focuses on interface-level metrics to observe how each interface is running, for example the number of requests, response time, response success rate, and status codes.
Key node monitoring: break the service link down into key nodes and check whether the business logic, data processing, and data flow at each node meet expectations, so that the failing node can be located quickly through link monitoring when a problem occurs. The time spent in each stage of the link is also monitored to provide a reference for later performance optimization. In addition, scheduled tasks need to be monitored, with metrics such as how often they run, their success rate, and their duration.
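The interface metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used at Zhihu: the RequestRecord fields, the nearest-rank percentile, and the rule that only 5XX responses count as failures are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from math import ceil


@dataclass
class RequestRecord:
    """One observed request; the field names are illustrative assumptions."""
    interface: str
    status_code: int
    latency_ms: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer in 1..100)."""
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_interface(records, interface):
    """Aggregate request count, success rate, P95 latency and status codes."""
    rows = [r for r in records if r.interface == interface]
    if not rows:
        return None
    status_counts = Counter(r.status_code for r in rows)
    # Assumption: only 5XX responses are treated as failures.
    server_errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return {
        "request_count": len(rows),
        "5xx_count": server_errors,
        "success_rate": 1 - server_errors / len(rows),
        "p95_latency_ms": percentile([r.latency_ms for r in rows], 95),
        "status_codes": dict(status_counts),
    }
```

In practice these aggregates would normally come from the monitoring platform rather than from application code; the sketch only makes the definitions of the metrics concrete.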

Application layer

Application layer monitoring covers three aspects: log monitoring, third-party dependency monitoring, and middle-tier software monitoring.

Third-party dependency monitoring: in a microservice architecture, a single business request passes through multiple services, and the request link is long and complex. Third-party dependencies therefore need to be monitored: sort them into strong and weak dependencies according to the business links, and focus on metrics such as request volume, response time, and response success rate.
Log monitoring: a running service produces a large number of logs, and logs are the most important information for troubleshooting, so log monitoring reflects the running status of the service. The main metrics are the total number of logs at each level, the number of logs at each level on key links, and the proportion of error logs.
Middle-tier software monitoring: the stability of middle-tier services is key to keeping the application continuously available. Common middle-tier software includes MySQL, TiDB, HBase, Redis, Kafka, and Nginx. Each has its own characteristics, so the metrics that need to be monitored differ as well.

Middle-tier software	Metric category	Monitoring metrics
MySQL	Query throughput	QPS, number of read statements, and number of non-read statements for the different types of SQL statements
MySQL	Query execution performance	Average query execution time, number of statements that failed with errors, and slow queries
MySQL	Database connections	Number of currently open connections, number of currently running connections, number of connections refused because of server errors, number of failed attempts to connect to the server, and number of connections refused because the maximum number of connections was exceeded
TiDB	Overall query data	SQL statement execution time, number of SQL statements executed per second, and slow queries
TiDB	Query details	Number of transactions executed per second, transaction execution time, and number of transaction retries
Redis	Basic activity metrics	Client connections, QPS, read QPS, write QPS, number of slaves, total number of keys in the database, cache hit ratio, slow queries, time spent writing data, time spent reading data, and number of executions of each command per minute
Redis	Error metrics	Number of connections rejected because the maximum number of connections was exceeded, and number of failed key lookups
Kafka	Overall metrics	Number of partitions and message write QPS
Kafka	Broker metrics	Message write QPS, number of partitions on the broker, total time of FetchConsumer requests, total time of FetchFollower requests, total time of Produce requests, and number of threads on the broker
Kafka	Producer metrics	Average number of requests the producer sends to the broker per second, average number of responses the producer receives from the broker per second, total number of messages sent by retry, total number of messages that failed to send, and number of backlogged messages
Kafka	Consumer metrics	Average number of messages consumed per second and number of backlogged messages
HAProxy	Frontend metrics	QPS, number of bad requests, number of rejected requests, number of 4XX responses, and number of 5XX responses
HAProxy	Backend metrics	Response time, number of failed connections, number of rejected responses, and status codes
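As one concrete case from the table, the Redis cache hit ratio and the two error metrics can be derived from the cumulative counters that the Redis INFO command reports (keyspace_hits, keyspace_misses, rejected_connections). The sketch below assumes two INFO snapshots have already been parsed into dictionaries; how they are collected is left out, and the function name is hypothetical.

```python
def redis_interval_stats(prev, curr):
    """Derive per-interval Redis metrics from two snapshots of cumulative counters.

    prev and curr are dicts holding the INFO fields keyspace_hits,
    keyspace_misses and rejected_connections. Returns None if a counter
    went backwards, which usually means the server restarted in between.
    """
    hits = int(curr["keyspace_hits"]) - int(prev["keyspace_hits"])
    misses = int(curr["keyspace_misses"]) - int(prev["keyspace_misses"])
    rejected = int(curr["rejected_connections"]) - int(prev["rejected_connections"])
    if min(hits, misses, rejected) < 0:
        return None
    lookups = hits + misses
    return {
        # Share of key lookups served from the cache during the interval.
        "hit_ratio": hits / lookups if lookups else None,
        # Lookups for keys that did not exist (the failed key lookups in the table).
        "failed_lookups": misses,
        # Connections refused because the connection limit was reached.
        "rejected_connections": rejected,
    }
```

Working on the difference between two snapshots rather than on the raw counters keeps the ratio sensitive to recent behaviour; a lifetime ratio changes very slowly on a long-running instance.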

System layer

The system layer is mainly concerned with hardware and resource usage. This part of monitoring is owned by the operations team and is beyond the scope of this article.

General monitoring scheme

Building on the metrics above, the rest of this article describes how to establish business-driven server-side monitoring. The main steps are as follows:

Sort out business scenarios to determine the core business

Starting from the business scope, first draw up the overall business architecture diagram, and then use that diagram to identify the core business within it.

Sort out key links

The next step is to sort out the key links of each sub-business within the core business. This work is owned by the developers. The approach is to read through the service code and, starting from the entry point of business processing, identify the key nodes and the key operations each node performs, all the way to the end point of business processing, which usually exposes the final business data through an interface.

The granularity of key nodes is flexible and can be defined according to the actual service link. In general, the more important nodes are selected as key nodes.
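Once the key nodes are known, the time spent in each of them can be recorded with a small helper. The sketch below is one possible shape, assuming the nodes are wrapped by hand; the LinkTimer class and the node names used with it are hypothetical, and a tracing system would normally provide this data.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LinkTimer:
    """Records the duration and failure count of each key node on a link."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be tested
        self.durations = defaultdict(list)
        self.failures = defaultdict(int)

    @contextmanager
    def node(self, name):
        """Time the enclosed block and attribute it to the named key node."""
        start = self._clock()
        try:
            yield
        except Exception:
            self.failures[name] += 1
            raise
        finally:
            self.durations[name].append(self._clock() - start)

    def slowest_node(self):
        """Return the node with the largest total time, or None if empty."""
        totals = {name: sum(values) for name, values in self.durations.items()}
        return max(totals, key=totals.get) if totals else None
```

A request handler would wrap each stage in `with timer.node("recall"):` and so on, and the per-node durations then show which stage is worth optimizing first.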

Sort out third-party dependencies

In a microservice architecture, the call chain behind a single request is long. When an interface responds abnormally, the service link has to be checked layer by layer, so the third-party dependencies of the business need to be monitored so that problems can be located quickly.

There are three main ways to sort out third-party dependencies:

While sorting out the key business links, record the third-party dependencies of the business at the same time.
Use chaos experiments to identify third-party dependencies at the application level.
Use the Beidou platform -> Service Topology function to identify the third-party dependencies of the business.
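The distinction between strong and weak dependencies can be made concrete in code. In the minimal sketch below, which is an illustration rather than Zhihu's implementation, a failing weak dependency is degraded to a fallback value while a failing strong dependency propagates the error; both kinds count calls and failures so that a success rate can be monitored. The Dependency class and call_dependency helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    """A third-party dependency with simple call counters."""
    name: str
    strong: bool  # True: the request cannot succeed without it
    calls: int = 0
    failures: int = 0

    @property
    def success_rate(self):
        """Share of successful calls, or None before the first call."""
        if self.calls == 0:
            return None
        return 1 - self.failures / self.calls


def call_dependency(dep, func, fallback=None):
    """Invoke func() on behalf of dep and record the outcome.

    Failures of weak dependencies are swallowed and replaced by fallback;
    failures of strong dependencies are re-raised to the caller.
    """
    dep.calls += 1
    try:
        return func()
    except Exception:
        dep.failures += 1
        if dep.strong:
            raise
        return fallback
```

Chaos experiments are a natural way to check the classification: injecting a fault into a dependency marked as weak should leave the main request path working.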

Sort out & complete key logs

Sorting out the key links clarifies the path and logic of each key node. The developers then need to confirm that the logs at each key node are complete, and to document how the key logs correspond to the actual business.

Log levels fall into three categories: error, warning, and info. Developers can choose the appropriate level according to the actual business scenario.
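One way to make key logs countable per node and per level is to attach the node name to every log record. The sketch below uses Python's standard logging module; the "node" attribute, the LevelCountingHandler class, and the build_key_link_logger helper are assumptions made for illustration, and a real setup would ship the logs to a log platform instead of counting them in memory.

```python
import logging
from collections import Counter


class LevelCountingHandler(logging.Handler):
    """Counts log records by (key node, level name)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # The node name is passed by callers via extra={"node": ...}.
        node = getattr(record, "node", "unknown")
        self.counts[(node, record.levelname)] += 1

    def error_ratio(self):
        """Proportion of ERROR records among all records seen so far."""
        total = sum(self.counts.values())
        errors = sum(n for (_, level), n in self.counts.items() if level == "ERROR")
        return errors / total if total else 0.0


def build_key_link_logger(name="key_link"):
    """Create a logger whose records are counted by a LevelCountingHandler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example output quiet
    handler = LevelCountingHandler()
    logger.addHandler(handler)
    return logger, handler
```

A key node would then log with, for example, `logger.error("ranking failed", extra={"node": "rank"})`, which gives both the per-level totals and the error proportion mentioned earlier.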

Sort out the middle-tier software

Middle-tier software is indispensable for keeping the system running stably. There are two main ways to sort it out:

While sorting out the key business links, record the middle-tier software used by each core business at the same time.
In the "cloud effect" -> Application -> Resources view, list the middle-tier software used at the application level.

Determine monitoring metrics

With the business scenarios, core business links, third-party dependencies, key logs, and middle-tier software sorted out, the monitoring metrics for a single core business can be determined by combining them with the business-layer and application-layer metrics from the hierarchy described earlier.

Business-layer metrics

Comprehensive data: overall application data, such as the application SLA and the alert volume at each level.
Business data: whether day-level business data is normal, such as the total number of interface requests, data consumption, and the number of 5XX responses.
Interface data: interface-level metrics, such as the number of requests, response time, response success rate, and status codes.
Business key nodes: break the core service link down into key nodes, check whether data processing and data flow at each key node meet expectations, and track the time spent in each node; derive the monitoring metrics from these.
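Whether day-level business data is "normal" needs a working definition. A minimal sketch, assuming the baseline is the average of recent daily totals and that a deviation of up to 20% is acceptable (both are illustrative assumptions, not values from the article):

```python
from statistics import mean


def daily_fluctuation(history, today, tolerance=0.2):
    """Compare today's total with the average of previous daily totals.

    history: non-empty list of earlier daily totals (for example request counts).
    tolerance: allowed relative deviation from the baseline (assumed 20%).
    """
    if not history:
        raise ValueError("history must contain at least one day")
    baseline = mean(history)
    if baseline == 0:
        raise ValueError("baseline is zero; relative change is undefined")
    change = (today - baseline) / baseline
    return {
        "baseline": baseline,
        "change": change,
        "within_range": abs(change) <= tolerance,
    }
```

For metrics with a strong weekly pattern, comparing against the same weekday in previous weeks is usually a better baseline than a plain trailing average.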

Application layer metrics

Third-party dependencies: metrics for the third-party dependencies on the core link, such as their response time and response success rate.
Log monitoring: the number of logs on key links and the proportion of error logs while the service is running.
Middle-tier software: metrics for the middle-tier software the application depends on. Different middle-tier software needs different metrics; see the table in "Hierarchical monitoring metrics" for the detailed metric design.

Establish a monitoring dashboard

At present, server-side monitoring is spread across two platforms: the Beidou platform and the GF platform.

Beidou platform

The Beidou platform provides part of the monitoring capabilities, mainly the following:

Service interfaces: QPS, response time, and number of errors per minute.
Service overview: shows service-related data at the API and service dimensions for HAProxy, the service mesh, containers, upstream dependents, and downstream dependencies.
Dependency analysis: covers both the services that depend on the queried service and the services it depends on, based on metric data from the upstream or downstream services/interfaces. The data is displayed as the top 5 by requests per second, top 5 by errors per minute, top 5 by response time, and top 5 by share of response time, together with service/interface lists.
Container: service-level container monitoring, including pods, container CPU, container memory, MeshSidecar CPU, MeshSidecar memory, and service threads.
Middle-tier software monitoring: monitoring of the middle-tier software the application depends on. The metrics differ by software; for details, see the middle-tier software metric table in "Hierarchical monitoring metrics".

GF Platform

GF is the more mature monitoring platform at Zhihu, and it can be used to build server-side monitoring dashboards. See the GF dashboard configuration documentation.

GF dashboards are more flexible: the monitoring views are fully customized by the business side.

When building a monitoring dashboard, the business side can combine the Beidou platform and the GF platform and choose whichever monitoring method fits.

Configure alerts

The Firing platform is a mature alerting platform within Zhihu, so the alert configuration here is based on it.

The basic platform has already created a number of alert check items, which fall into two categories:

Interface-level alerts: interface 5XX, interface P95 latency
Infrastructure alerts: Kafka message backlog, abnormal container resource usage, high DNS query latency, high CPU utilization, high memory utilization, high disk utilization, high Redis memory utilization, high Redis CPU usage, and Redis master-slave synchronization failure

The core of building alerts is to determine the check item metrics and the alert thresholds.

Check item metrics

From the monitored metrics, select the key ones to serve as check items.

Alert threshold

Thresholds are either unified or custom. With a unified threshold, an alert fires whenever the configured threshold is reached. With custom thresholds, different thresholds apply in different configured time ranges, so alerting adapts to the time of day.
For each check item, the alert threshold can be calculated from the fluctuation trend of the metric over a period of time.
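The two kinds of threshold and the idea of deriving a threshold from history can be sketched as follows. This does not reflect the Firing platform's actual configuration model: the TimeWindowThreshold class, the helper names, and the mean plus k standard deviations rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import time
from statistics import mean, pstdev


@dataclass
class TimeWindowThreshold:
    """A custom threshold that applies between start and end each day."""
    start: time
    end: time
    threshold: float

    def covers(self, at):
        if self.start <= self.end:
            return self.start <= at < self.end
        # Window wraps past midnight, for example 22:00 to 06:00.
        return at >= self.start or at < self.end


def resolve_threshold(at, unified, windows=()):
    """Use the first matching custom window, else the unified threshold."""
    for window in windows:
        if window.covers(at):
            return window.threshold
    return unified


def should_alert(value, at, unified, windows=()):
    """Fire when the value reaches the threshold in force at that time."""
    return value >= resolve_threshold(at, unified, windows)


def suggest_threshold(samples, k=3.0):
    """Suggest a threshold as mean + k * population standard deviation."""
    return mean(samples) + k * pstdev(samples)
```

For example, a night-time window with a lower threshold lets a traffic-dependent metric alert at 03:00 on a value that would be unremarkable at noon. The suggested threshold is only a starting point and would still be corrected against real alerts.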

Run & Alert Threshold Tuning

Once the monitoring dashboard and the alerting system are in place, alerts need continuous attention in daily work. Create an on-call team, decide its members, and attach the on-call policy to the alert rules, so that someone on duty can respond promptly when an alert fires and keep the impact of the problem from spreading.

At the same time, alert thresholds need to be corrected as the business runs, to avoid both false alarms and missed alarms and to improve the accuracy of the alerting system.
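Threshold tuning is easier to reason about when false alarms and missed alarms are counted explicitly. The sketch below assumes each alert has been reviewed and labelled by the on-call engineer and that the check item alerts when a value is too high; the target values of 0.8 and 0.9 and the function names are illustrative assumptions.

```python
def alert_quality(true_alerts, false_alerts, missed_incidents):
    """Precision and recall of an alert rule over a review period.

    true_alerts: alerts that corresponded to a real problem
    false_alerts: alerts that fired without a real problem (false alarms)
    missed_incidents: real problems that produced no alert (missed alarms)
    """
    fired = true_alerts + false_alerts
    incidents = true_alerts + missed_incidents
    return {
        "precision": true_alerts / fired if fired else None,
        "recall": true_alerts / incidents if incidents else None,
    }


def tuning_hint(quality, min_precision=0.8, min_recall=0.9):
    """Suggest a direction for the threshold of a 'too high' style check."""
    precision, recall = quality["precision"], quality["recall"]
    noisy = precision is not None and precision < min_precision
    blind = recall is not None and recall < min_recall
    if noisy and blind:
        return "review check item"  # the metric itself may be a poor signal
    if noisy:
        return "raise threshold"    # too many false alarms
    if blind:
        return "lower threshold"    # too many missed alarms
    return "keep"
```

Reviewing these numbers on a regular schedule gives the threshold corrections a factual basis instead of relying on how noisy the alerts feel.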

The closed loop of alarm events is mainly divided into three stages:

Before the event: focus on discovering and preventing problems and on improving the efficiency of alert handling.
During the event: focus on identifying and resolving the problem quickly, restoring service, ensuring business continuity, and limiting the scope of impact.
After the event: focus on the root cause. Build up alert analyses over time, consolidate them into alert handling documents, and keep optimizing the monitoring and alerting strategy.

Author: Eiko

Source: https://zhuanlan.zhihu.com/p/620111232

