
An In-Depth Analysis of the Core Design Principles of Observation Cloud's Intelligent Monitoring

Author: Observation Cloud

Background

Monitoring highly distributed applications is challenging: a single application may depend on hundreds of services and infrastructure components across multiple cloud and on-premises environments, which makes it hard to spot errors, detect the sources of high latency, and identify the root cause of problems. Even with a robust monitoring and alerting system in place, infrastructure and applications change over time, so anomalous behavior is difficult to detect reliably. Yet for background services that run 24/7, monitoring alerts are the cornerstone of stable operation. Many developers have had this experience: afraid of missing a problem, they configure strict alerts on every metric of a service, and as a result receive a flood of invalid alerts every day. The flood gradually numbs their vigilance, the real problems are ignored when they first surface, and serious failures follow. This article explores how to improve the effectiveness of alerts and identify problems accurately without drowning in invalid alerts.

Intelligent monitoring systems play a vital role in modern IT infrastructure. They not only support business analysis and user-behavior analysis but also help locate the root cause of failures quickly. Intelligent monitoring can rapidly pinpoint anomalous nodes in business metrics and highly volatile metrics, and it is well suited to identifying the key dimensions of multi-dimensional metrics. In a microservices architecture, it finds anomalies quickly by analyzing service calls and resource dependencies.

Intelligent monitoring requires no detection thresholds or trigger rules; you only set the detection scope and the notification targets, and it is enabled with one click. Based on intelligent detection algorithms, it identifies anomalous data and predicts future trends. Thanks to its efficiency and accuracy, intelligent monitoring also plays an important role in observing and managing cloud platforms.

Alerting is the foundation of reliability

First, let's look at why alerting matters and why it deserves so much optimization effort. While we would all like our services to be fault-free, no system is 100% problem-free; we can only keep improving the reliability of our services. What we expect is to:

  • Know the current state of the service and stay in control of it
  • Discover problems immediately and quickly locate their cause

Achieving these two goals depends entirely on solid monitoring and alerting. Monitoring displays the complete running state of a service, but nobody can stare at a screen around the clock, let alone pay attention to everything at once. If you want to learn about the system state passively, alerts are the only way to be notified of abnormal situations automatically. Alerting is therefore one of the most important means for a team to monitor service quality and availability.

Real-world problems with alerting

1. It is hard to set appropriate static thresholds for metrics that change with the business

As the business changes, the relevant metrics tend to show seasonal patterns at hourly, daily, and weekly granularity. Because these metrics fluctuate by nature, static thresholds and year-over-year thresholds fit them poorly. And if you take the lazy route of choosing a single fixed threshold for all applications and interfaces, for metrics such as response time, error rate, and call volume, you will naturally generate a large number of false alarms.

2. The same metric needs different thresholds in different applications

For example, one application's response time may normally hover around 200 ms, so anything above 300 ms can be judged abnormal. But in real business scenarios, some interfaces serve a large volume of long-running requests and normally fluctuate around 500 ms, so an appropriate alert threshold might be closer to 600 ms. An application may have hundreds of interfaces, and the time cost of configuring appropriate thresholds for all of them is very high.

3. Metric thresholds change as the business evolves

As the company's business grows and new applications are launched, the normal ranges of some metrics keep shifting. If the thresholds are not updated in time, a large number of false alarms will follow.

Principles for setting alerts

Every time an alert fires, the on-call engineers must pause their work to look at it. Such interruptions badly hurt productivity and increase R&D costs, especially for engineers in the middle of development and debugging. So whenever we receive an alert, we want it to reflect a real anomaly; that is, the alert should not be a false positive (alerting on a normal state) if at all avoidable. And whenever an anomaly occurs, the alert should go out promptly; that is, the anomaly should not be missed (a false negative). False positives and false negatives are always a pair of competing metrics. Here are some principles for setting alerts:

  • Alerts are real: an alert must report a real phenomenon and show that your service is experiencing, or is about to experience, a problem
  • Alerts are descriptive: the alert content should describe the phenomenon in detail, such as what specific anomaly occurred on which server at what point in time
  • Alerts are actionable: receiving an alert should generally require some action, and alerts that require no action are best removed. Notify if and only if something needs to be done
  • Use conservative thresholds for new alerts: when first configuring alerts, expand the coverage of monitoring alerts as much as possible and choose conservative thresholds to avoid false negatives
  • Optimize alerts continuously: afterwards, analyze alert statistics and reduce false positives by muting, consolidating, adjusting thresholds, and reflecting the causes more accurately; this is a long-term process

For example, suppose an alert fires when the number of failed requests exceeds a threshold. The failures may have multiple causes; maliciously constructed requests, for instance, also trigger the failure alert. Such an alert is neither real nor actionable, because nothing actually needs to be handled. For cases like this, we should use request features to distinguish the causes of alerts as precisely as possible.

Introduction to intelligent monitoring

Common anomaly detection scenarios

Unsupervised methods

1. Max/min and volatility threshold method

Identify anomalies by checking the maximum and minimum values in the dataset, as well as their volatility (rate of change). Any data point that exceeds a predefined threshold is flagged as an anomaly.
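A minimal Python sketch of this method; the thresholds and sample data are purely illustrative:

```python
import numpy as np

def detect_by_bounds_and_volatility(series, lower, upper, max_rate):
    """Flag points outside [lower, upper] or whose point-to-point
    change exceeds max_rate. All thresholds here are illustrative."""
    x = np.asarray(series, dtype=float)
    anomalies = set(np.where((x < lower) | (x > upper))[0])
    rates = np.abs(np.diff(x))               # rate of change per step
    anomalies |= set(np.where(rates > max_rate)[0] + 1)
    return sorted(int(i) for i in anomalies)

cpu = [41, 43, 42, 44, 97, 45, 43]           # toy CPU-usage samples (%)
print(detect_by_bounds_and_volatility(cpu, lower=5, upper=90, max_rate=30))
# -> [4, 5]: point 4 breaches the upper bound; points 4 and 5 jump sharply
```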

2. The 3-sigma principle

Based on normal-distribution theory: assume the dataset follows a normal distribution, and treat any data point outside ±3 standard deviations of the mean as an anomaly.
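A sketch of the same idea, assuming approximately normally distributed data:

```python
import numpy as np

def three_sigma_anomalies(series):
    """Indices of points outside mean ± 3 standard deviations,
    assuming the data is roughly normally distributed."""
    x = np.asarray(series, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.where(np.abs(x - mu) > 3 * sigma)[0]

rng = np.random.default_rng(0)
latency = rng.normal(200, 10, 1000)    # e.g. response times around 200 ms
latency[500] = 400                     # inject one outlier
print(three_sigma_anomalies(latency))  # -> [500] (plus any natural excursions)
```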

3. Control chart theory (basic detectors)

Use control charts (e.g., Shewhart, CUSUM, and EWMA charts) to monitor data changes and determine whether the data stays within acceptable control limits; data points outside the control limits are considered anomalies.
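As one example, a sketch of an EWMA control chart; the smoothing weight and band width below are common textbook defaults, not values from this article:

```python
import numpy as np

def ewma_anomalies(series, lam=0.2, L=3.0):
    """EWMA control chart: smooth the series with weight `lam` and flag
    points where the smoothed value leaves the asymptotic control band
    mu ± L * sigma * sqrt(lam / (2 - lam))."""
    x = np.asarray(series, dtype=float)
    mu, sigma = x.mean(), x.std()
    band = L * sigma * np.sqrt(lam / (2 - lam))
    z, flagged = mu, []
    for i, v in enumerate(x):
        z = lam * v + (1 - lam) * z   # exponentially weighted moving average
        if abs(z - mu) > band:
            flagged.append(i)
    return flagged
```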

Supervised methods

1. Statistical regression

  • Linear regression: predict each data point with a linear regression model and mark it as an anomaly if the gap between the actual and predicted value exceeds a predefined threshold (see the sketch after this list).
  • Logistic regression: classify data with a logistic regression model to predict whether a data point is normal or abnormal.
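A sketch of the linear-regression variant; the residual multiplier k is an illustrative choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_residual_anomalies(y, k=3.0):
    """Fit a linear trend over time and flag points whose residual
    exceeds k times the residual standard deviation."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y)).reshape(-1, 1)   # time index as the only feature
    model = LinearRegression().fit(t, y)
    residuals = y - model.predict(t)
    return np.where(np.abs(residuals) > k * residuals.std())[0]
```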

2. Machine learning

  • Decision tree: classify data with a decision tree whose nodes split according to the importance of the variables.
  • Naive Bayes: based on Bayes' theorem with the assumption that features are mutually independent; computes the probability that a data point belongs to each class.
  • Random forest: an ensemble of decision trees that determines the classification of a data point through a voting mechanism (a sketch follows this list).
  • XGBoost: a decision tree ensemble based on gradient boosting, optimized for computational efficiency and model performance.
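A random-forest sketch on synthetic window features; in practice the feature vectors would come from a feature-engineering step and the labels from operator-confirmed incidents (both are placeholders here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: each row is a feature vector for one time window
# (mean, std, max, rate of change, ...); labels mark "anomalous" windows.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 4))
labels = (features[:, 0] + features[:, 2] > 2.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```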

3. Deep learning

  • DNN (Deep Neural Network): uses multi-layer neural networks to model data and capture complex feature relationships.
  • LSTM (Long Short-Term Memory network): a special RNN for time series data that captures long-term dependencies (a sketch follows this list).
  • DNN+Attention: introduces an attention mechanism into the DNN to sharpen the model's focus on important features.
  • LSTM+Attention: introduces an attention mechanism into the LSTM to sharpen the model's focus on important time points.
  • VAE (Variational Autoencoder): a generative model that performs anomaly detection by learning the latent distribution of the data.
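A minimal PyTorch sketch of the LSTM approach: train the forecaster with an MSE loss on sliding windows of the series, then flag points where the prediction error is unusually large. The architecture below is one reasonable assumption, not a vendor's model:

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Predict the next point of a series from the previous `window`
    points; at inference, a large gap between the prediction and the
    actual value signals an anomaly."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # forecast for the next time step
```

A common design choice is to train only on windows known to be normal, so that anomalous patterns produce large, detectable prediction errors.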

Introduction to Observation Cloud's intelligent monitoring solution

The intelligent monitoring solution provides comprehensive monitoring of IT infrastructure and applications in a series of steps. The overall implementation principles include data collection and preprocessing, feature engineering, predictive models, anomaly detection and validation, and alerting and notification. These steps work together to form a complete monitoring loop to ensure the stable and efficient operation of the user's system.

Data collection and pre-processing

1. Data collection

The intelligent monitoring system first collects time series data from the past 30 days, at one data point per minute. This data can include host resource usage, application performance metrics, user access logs, and more.

2. Data preprocessing

Once the data is collected, it must be preprocessed to ensure data quality. Preprocessing steps include data cleansing (removing noise and outliers), gap filling (handling missing values), data normalization, and more.
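A pandas sketch of these steps for a 1-minute-resolution series; the clip quantiles and interpolation limit are illustrative choices, not values from the source:

```python
import pandas as pd

def preprocess(series: pd.Series) -> pd.Series:
    """Illustrative cleanup of a metric series indexed by timestamp."""
    s = series.sort_index()
    s = s.resample("1min").mean()     # align to the 1-minute grid
    s = s.interpolate(limit=10)       # fill short gaps only
    # Clip extreme spikes so they don't distort baseline statistics.
    lo, hi = s.quantile(0.001), s.quantile(0.999)
    s = s.clip(lo, hi)
    # Min-max normalization to [0, 1].
    return (s - s.min()) / (s.max() - s.min())
```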

Feature engineering

On top of the preprocessed data, feature engineering extracts the key characteristics of the time series. These features, computed in the sketch after this list, include:

  • Maximum (Max) and Minimum (Min): the extreme values of the data.
  • Range: the difference between the maximum and the minimum.
  • Mean and Median: the central tendency of the data.
  • Variance and Standard Deviation: the dispersion of the data.
  • Kurtosis and Skewness: the shape of the data distribution.
  • Year-over-Year and Month-over-Month changes: the trend of the data.
  • Periodicity: the cyclical behavior of the data.
  • Autocorrelation Coefficient: the self-correlation of the data.
  • Coefficient of Variation: the ratio of the standard deviation to the mean.
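Most of these features can be computed directly with numpy and scipy; periodicity and the year-over-year / month-over-month comparisons need a second, time-aligned slice of history and are omitted from this sketch:

```python
import numpy as np
from scipy import stats

def extract_features(series) -> dict:
    """Compute the statistical features listed above for one series."""
    x = np.asarray(series, dtype=float)
    return {
        "max": x.max(),
        "min": x.min(),
        "range": x.max() - x.min(),
        "mean": x.mean(),
        "median": np.median(x),
        "variance": x.var(),
        "std": x.std(),
        "kurtosis": stats.kurtosis(x),   # excess (Fisher) kurtosis
        "skewness": stats.skew(x),
        # Lag-1 autocorrelation coefficient.
        "autocorr": np.corrcoef(x[:-1], x[1:])[0, 1],
        # Coefficient of variation: std / mean.
        "cv": x.std() / x.mean(),
    }
```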

Anomaly detection

The intelligent monitoring system collects time series data and uses confidence intervals built from historical data to predict the normal fluctuation range. It compares the characteristics of the current period's data with the historical data and checks whether the data leaves the predetermined confidence interval. If a data point falls outside this range, the system judges it anomalous and may trigger an alert; if it is within the normal range, the system keeps monitoring to ensure the stability and security of the real-time data.
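A simplified stand-in for this logic, using a per-time-slot mean ± k·std band over the 30-day history (the product's actual confidence intervals are not specified in this article):

```python
import numpy as np

def confidence_band(history, k=3.0):
    """history: array of shape (days, slots_per_day) holding the last
    N days of a metric, one column per time-of-day slot. Returns
    per-slot (lower, upper) bands of mean ± k*std."""
    mu = history.mean(axis=0)
    sigma = history.std(axis=0)
    return mu - k * sigma, mu + k * sigma

def is_anomalous(value, slot, lower, upper):
    """True if `value` at time-of-day `slot` leaves the band."""
    return not (lower[slot] <= value <= upper[slot])

# 30 days of history at one point per minute -> shape (30, 1440):
# lower, upper = confidence_band(history)
# is_anomalous(current_value, minute_of_day, lower, upper)
```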


(Figure: examples of anomalous data)

Alarms and notifications

When an anomaly is detected, the intelligent monitoring system sends an alert notification to the relevant personnel in real time. Notifications can be delivered through a variety of channels, such as email, SMS, and instant messaging apps, so that the relevant personnel can respond and handle the issue promptly.
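As an illustration, a minimal webhook notifier; the URL and payload schema are placeholders, since each real channel (email, SMS, IM) has its own API:

```python
import requests  # third-party HTTP client

def send_alert(webhook_url, metric, value, lower, upper):
    """Post an alert to a chat webhook. The payload schema here is a
    placeholder, not any specific product's format."""
    payload = {"text": (f"[ALERT] {metric} = {value:.2f} is outside the "
                        f"expected range [{lower:.2f}, {upper:.2f}]")}
    resp = requests.post(webhook_url, json=payload, timeout=5)
    resp.raise_for_status()
```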

Application scenarios

Application intelligent detection

Applications face users directly, and any performance issue (e.g., response-time delays or rising error rates) can significantly impact the user experience. Traditional application monitoring methods often struggle to cope with complex application architectures and rapidly changing load patterns.

Application intelligent detection detects and handles anomalies in a timely manner by monitoring various performance metrics of applications. For example, an intelligent monitoring system can detect a sudden increase in response time for an application and analyze whether the cause is due to a performance degradation of a dependent service. In this way, development and operations teams can respond quickly, optimize application performance, and ensure a stable user experience.

The generated anomaly event includes:
  • Exception summary: shows the tags of the currently anomalous application service, the details of the anomaly analysis report, and the distribution statistics of the outliers
  • Resource analysis: view the top 10 resources by request count, by error-request count, and by requests per second

Intelligent host detection

In modern IT infrastructure, the stability and performance of the host (server) directly affect the proper functioning of all applications and services. Abnormal usage of host resources such as CPU, memory, disk, network, and so on can cause system crashes or degraded performance.

Based on intelligent detection algorithms, the host's CPU and memory are checked periodically. By analyzing the root cause on hosts with CPU or memory anomalies, the system determines whether the host shows abnormal patterns such as sudden spikes, sudden drops, or interval rises, thereby monitoring the host's running state and stability. When the intelligent monitoring system identifies an anomaly, it notifies the relevant personnel for handling. With this detection, system administrators can quickly locate and resolve potential problems and keep hosts running stably.

The generated anomaly event includes:
  • Event content: shows the event content configured in the monitor
  • Exception summary: shows the hostname tag of the anomalous host, the details of the anomaly analysis report, and a time-series chart of the outlier values illustrating the anomalous trend
  • Anomaly analysis: a dashboard for viewing basic information such as the host's abnormal processes and CPU usage
  • Host details: view the host's integration status, system information, and cloud vendor information

Kubernetes intelligent detection

Kubernetes is a popular container orchestration platform that is widely used to manage and deploy containerized applications. In large cluster environments, Kubernetes' resource management and service status monitoring become particularly important.

Based on intelligent detection algorithms, Kubernetes intelligent detection regularly monitors key metrics such as the total number of pods, the number of pod restarts, and the API server QPS to detect and predict potential problems in the cluster. This approach not only identifies anomalous fluctuations in resource usage but also pinpoints the source of the problem through root cause analysis, whether it is a misconfiguration, insufficient resources, or excessive requests. With intelligent detection, the O&M team can adjust resource configurations or optimize applications in time to keep Kubernetes clusters running efficiently and stably.

The generated anomaly event includes:
  • Exception Summary: displays statistics on the distribution of the number of abnormal APIServer nodes in the current cluster.
  • Anomaly analysis: You can view the number of APIServer nodes, API QPS, number of read requests being processed, write request success rate, and number of write requests being processed.

Intelligent log detection

Log files record the operating status of systems and applications and are an important basis for diagnosing and troubleshooting problems. Traditional log monitoring often relies on manual inspection and simple rule matching, which is inefficient and prone to missing critical issues. Intelligent log detection uses advanced algorithms to analyze log data automatically and identify anomalous log entries and frequencies. For example, the intelligent monitoring system can detect that error logs occur frequently within a certain period and associate them with specific applications or services. Through intelligent log analysis, O&M personnel can quickly locate the cause of faults and improve the efficiency and accuracy of problem solving.

The generated anomaly event includes:
  • Exception Summary: You can view the tags of the current exception logs, the details of the exception analysis report, and the distribution of the number of error requests
  • Error Analysis: You can view the clustering information of error logs

The future of intelligent monitoring

A higher degree of automation

1. Intelligent self-optimizing monitoring system

In the future, intelligent monitoring systems will further enhance the degree of automation and reduce the dependence on human intervention. Through machine learning and artificial intelligence technology, the monitoring system is able to self-learn and optimize, and automatically adjust monitoring strategies and parameters. For example, the system can adaptively adjust detection algorithms and thresholds through the analysis of historical and real-time data to continuously optimize the accuracy and efficiency of anomaly detection.

2. Automate response and repair

In addition to automated monitoring, future systems will integrate automated response and remediation mechanisms. When an anomaly is detected, the monitoring system can automatically trigger predefined remediation scripts or policies to perform automated troubleshooting and resource adjustments, reducing the time and risk of human intervention.

More accurate anomaly detection

1. Advanced algorithms and deep learning

In the future, intelligent monitoring systems will introduce more advanced algorithms and deep learning technologies to improve the accuracy of anomaly detection. Through deep neural networks and sophisticated time series analysis models, the system can more accurately capture and predict anomalous patterns, reducing false positives and false negatives.

2. Multi-source data fusion

In the future, monitoring systems will integrate more data sources, including sensor data, user behavior data, and business metric data, improving the comprehensiveness and accuracy of anomaly detection through multi-dimensional data fusion and analysis. Fusing multi-source data gives the system a more complete understanding of anomalies, allowing it to locate root causes more precisely and offer more targeted solutions.

Enhanced user experience

1. Intelligent user interface

The intelligent monitoring systems of the future will provide more intelligent and user-friendly interfaces. Through natural language processing and intelligent-assistant technology, users will be able to interact with the monitoring system by voice or text and quickly obtain system status and anomaly information. For example, a user can ask, "What is the current health of the system?" and the system will automatically generate a detailed health report and recommendations.

2. Personalized monitoring and customized services

In the future, the monitoring system will provide more personalized monitoring services, and customize monitoring policies and alarm rules according to user needs and business characteristics. The system can automatically recommend the best monitoring configuration and optimization scheme by analyzing the user's usage habits and business models, so as to improve the user's management efficiency and experience.
