
The way of big data monitoring construction


01

Why you should do it

The era of big data began with the three classic papers Google published between 2003 and 2006: "The Google File System", "MapReduce: Simplified Data Processing on Large Clusters", and "Bigtable: A Distributed Storage System for Structured Data". After some twenty years of vigorous development, big data is now ubiquitous.

Given the 4V nature of big data, no single technical solution or component can cover every scenario and requirement. The technical architecture of big data is therefore relatively complex and involves many distributed and high-availability mechanisms. For example, if the HDFS NameNode is not deployed with HA and its process terminates abnormally, essentially the whole big data cluster goes down and almost all services become unavailable. A failure like this is fatal: your services are completely paralyzed and cannot recover quickly.

To ensure the stability and high availability of big data services, besides designing HA for the relevant services and components, we also need a complete monitoring and alerting scheme, so that hidden dangers and faults in the current big data services can be found and eliminated in time and their SLA can be guaranteed.

With that, let's begin the discussion. This article is about how to build monitoring and alerting for big data: which aspects to cover, and which technical solutions you can adopt in combination with your current production situation and existing standards and specifications.

02

Where to start

Before we start, it is worth introducing the technical architecture of big data; understanding how the stack is composed will help us find the right entry points for monitoring.

[Figure: overall big data technical architecture]

Looking at it from the bottom up, we can divide it into the following layers:

  1. Data source layer: the sources feeding the ODS layer of the data warehouse, such as app/web tracking logs, business data in MySQL/MongoDB, external files, and so on. This layer basically does not need to be monitored.
  2. Data collection layer: the layer that brings business data and tracking data into the data warehouse. We currently use Flink CDC for real-time integration of business data. At this layer you need to pay attention to whether your data integration tasks are running normally.
  3. Data storage layer: the collected business and tracking data is written into big data storage. Given the variety of storage requirements, this layer may contain more than just HDFS. Here you need to watch the health of the storage-related services and your storage usage.
  4. Data computing layer: mainly responsible for data computation, with both real-time processing by Flink and offline batch processing. At this layer you need to pay attention to whether computing tasks execute normally, whether execution times out, and whether Flink jobs are terminated abnormally. Work out the concrete monitoring items in conjunction with your own computing tasks.
  5. Scheduling engine layer: schedules your computing tasks periodically, again covering both Flink real-time jobs and offline batch jobs. Here you need to watch the health of the scheduling service itself and the scheduled execution of tasks; we will look at task execution together with the data computing layer, since scheduling is ultimately driving the same computation.
  6. Data service layer: provides data capabilities to the outside world. Different companies output these capabilities differently: a data visualization platform such as Grafana, external APIs, or a self-built BI platform. This layer mostly focuses on whether your data service capability is working normally, judged against your actual production situation.

So let's abstract it as follows:

Big data foundation

The big data foundation consists mainly of cluster-related services, including but not limited to HDFS, Hive, YARN, Spark, and so on. Together they form the base of the big data platform, on top of which we carry out data storage, computation, analysis, and other tasks.

In our experience, whether you are hosted with a third-party cloud vendor or self-built on CDH or HDP, the points that monitoring needs to focus on are the same. The main ones are as follows:

  • The health of the host instances
      • CPU
      • Memory
      • Disk usage, disk read/write
      • Network
      • ......
  • The health of the cluster services
      • The health of services such as Hive, HDFS, and YARN: service unavailability, process failures
      • Heap memory usage of each service
      • YARN tasks stuck in a pending state, YARN resource usage, insufficient YARN queue resources
      • Low success rate of Hive SQL execution
      • Key events such as process restarts and active/standby switchovers
      • ......

There is relatively more to do for cluster service health; each individual component may have its own types of monitoring, such as Hive's HMS, HiveServer2, and Hive sessions. We will not go into detail here; you can refer to the big data cluster monitoring offered by the various cloud vendors or to the official documentation of each component.
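As a concrete illustration of service-level checks, here is a minimal sketch that polls the YARN ResourceManager REST API for cluster metrics. The ResourceManager address and the alert thresholds are assumptions you would replace with your own.

```python
# Minimal sketch: poll the YARN ResourceManager REST API and flag suspicious cluster states.
# The ResourceManager URL and thresholds below are assumptions for illustration.
import requests

RM_URL = "http://resourcemanager:8088"  # hypothetical address

def check_yarn_cluster() -> list[str]:
    alerts = []
    metrics = requests.get(f"{RM_URL}/ws/v1/cluster/metrics", timeout=5).json()["clusterMetrics"]
    if metrics["unhealthyNodes"] > 0:
        alerts.append(f"{metrics['unhealthyNodes']} NodeManager(s) reported unhealthy")
    if metrics["appsPending"] > 20:  # queue backlog threshold is an assumption
        alerts.append(f"{metrics['appsPending']} applications pending, queues may be short of resources")
    if metrics["totalMB"] and metrics["allocatedMB"] / metrics["totalMB"] > 0.9:
        alerts.append("more than 90% of cluster memory is allocated")
    return alerts

if __name__ == "__main__":
    for line in check_yarn_cluster():
        print("[ALERT]", line)
```

The same pattern works for any component that exposes metrics over HTTP or JMX; in practice these checks usually live in Prometheus/Grafana rather than ad hoc scripts.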

Data integration

First, let's take a look at our data integration links; this helps clarify what needs to be monitored.

Behavioral tracking data

First of all, we defined a client tracking protocol based on our actual business situation. The protocol defines a complete event from several major aspects: user information, device information, event information, and application information, so that every app product in the company can report, store, and statistically analyze events over the whole link based on the same protocol.

When a user triggers click or browsing behavior in the app, an event conforming to the tracking protocol is generated. The logs are collected by nginx and sent to Kafka through Logstash; because our tracking volume is fairly large, growing by about 500 GB a day, we use Kafka here as a buffer.

After the tracking logs land in Kafka, we use Flink for real-time ETL and write them into the Kudu data acceleration layer for near-real-time statistical analysis.

So the monitoring to consider here covers the various links of the whole log collection chain, including but not limited to:

● Logstash service health

● Kafka service health

● Whether the reporting endpoint for tracking events is reachable

● ......

Even with all of the above links monitored, there is no 100% guarantee that a tracking problem will be found immediately. We have run into two other kinds of problems: in one case, the domain name used by the client's reporting address was blocked; in the other, the HTTPS certificate of the reporting address expired, so events could not be reported at all. In both cases all of your services are healthy, yet no events arrive. Therefore we also continuously monitor the fluctuation of the total number of reported events; depending on your business you can monitor each event at minute, hour, and day granularity, so that a large swing in event volume is noticed quickly.

Here's an example of this:

[Figure: examples of tracking-event volume monitoring dashboards]
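Beyond the dashboards above, a minimal programmatic sketch of such a fluctuation check might look as follows; how the counts are fetched (from Kudu, Kafka, etc.) is left out, and the change threshold is an assumption.

```python
# Minimal sketch: flag abnormal fluctuation of tracking-event volume by comparing the
# current window against the same window one day earlier. Data access is left abstract;
# the change threshold is an illustrative assumption.
def volume_alerts(current: dict[str, int], same_window_yesterday: dict[str, int],
                  max_change_ratio: float = 0.5) -> list[str]:
    alerts = []
    for event, count in current.items():
        baseline = same_window_yesterday.get(event, 0)
        if baseline == 0 and count == 0:
            continue
        if baseline == 0 or count == 0:
            alerts.append(f"{event}: volume dropped to zero or appeared from nothing "
                          f"(now {count}, yesterday {baseline})")
            continue
        change = abs(count - baseline) / baseline
        if change > max_change_ratio:
            alerts.append(f"{event}: volume changed {change:.0%} (now {count}, yesterday {baseline})")
    return alerts

# Usage with made-up counts; in practice they come from an aggregation over the event stream.
print(volume_alerts({"app_click": 1200, "page_view": 10},
                    {"app_click": 1000, "page_view": 900}))
```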

Business Data

Since the big data side cannot directly connect to the business DB for refined monitoring, we can only do the relevant monitoring and alerting at the data integration link and at the point where data enters the ODS layer.

We use Flink CDC to subscribe to the change stream of the entire business database. The first thing to pay attention to is therefore the health of the Flink CDC task itself: whether the Flink job is running normally, and so on. We will cover this in more detail in the real-time computing section below.

In addition, after Flink CDC writes business data into Kudu, we keep watching the latest data generation time of the business data, so that if the data has not been updated past a specified time we can find out promptly and intervene to confirm what happened.

Of course, you can also implement this check inside your Flink CDC tasks; the concrete implementation depends on your actual business situation and specifications.

In addition to this database-wide monitoring, after the business data enters the ODS layer of the data warehouse we also apply data quality monitoring to specific business tables, for example detecting when a single table's row count drops to zero or its data volume fluctuates abnormally. A freshness-check sketch is given below.
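A minimal sketch of such a freshness check, assuming access to the ODS layer via HiveServer2 with PyHive; the table name, timestamp column, and delay threshold are hypothetical.

```python
# Minimal sketch: alert when an ODS table has not received new data within a time window.
# Assumes a HiveServer2 endpoint reachable via PyHive; table/column names are hypothetical.
from datetime import datetime, timedelta
from pyhive import hive

def check_freshness(host: str, table: str, ts_col: str, max_delay: timedelta) -> str | None:
    conn = hive.Connection(host=host, port=10000)
    cur = conn.cursor()
    cur.execute(f"SELECT max({ts_col}) FROM {table}")
    latest = cur.fetchone()[0]  # e.g. '2024-01-01 08:30:00'
    if latest is None:
        return f"{table}: no data at all"
    latest_ts = datetime.strptime(str(latest), "%Y-%m-%d %H:%M:%S")
    if datetime.now() - latest_ts > max_delay:
        return f"{table}: latest record is {latest_ts}, older than the allowed delay {max_delay}"
    return None

# Usage (hypothetical table): alert if ods_order has seen no new rows for 2 hours.
print(check_freshness("hiveserver2-host", "ods.ods_order", "update_time", timedelta(hours=2)))
```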

Storage

HDFS

If you are using HDFS, then at the storage level we need to monitor the following points:

● DataNode disk failure: written files may be lost.

● The number of single-replica blocks exceeds the threshold: data with only one replica is easily lost when a node fails, and too many single-replica files threaten the safety of the HDFS file system.

● The number of blocks waiting to be replicated exceeds the threshold: HDFS may enter safe mode and stop providing write services, and lost block data cannot be recovered.

● Inappropriate data directory configuration: a data disk mounted on the root directory or another critical directory affects the performance of the HDFS file system.

● The number of HDFS files exceeds the threshold: too many files combined with insufficient disk storage may cause data writes to fail, and it also affects HDFS performance.

● The number of lost HDFS blocks exceeds the threshold: if HDFS data is lost, HDFS may enter safe mode and stop providing write services, and lost block data cannot be recovered.

● The disk space usage of a DataNode exceeds the threshold: HDFS data writes will be affected.

● The overall HDFS disk space usage exceeds the threshold: the cluster is short of capacity and data writes will be affected.

● ......
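Many of these indicators can be scraped from the NameNode JMX interface. A minimal sketch, assuming Hadoop 3.x with the NameNode web UI on port 9870; the address and thresholds are illustrative assumptions.

```python
# Minimal sketch: read FSNamesystem metrics from the NameNode JMX endpoint and flag problems.
# Assumes Hadoop 3.x (web UI on port 9870); address and thresholds are assumptions.
import requests

NAMENODE_URL = "http://namenode:9870"  # hypothetical address

def check_hdfs() -> list[str]:
    alerts = []
    bean = requests.get(
        f"{NAMENODE_URL}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem", timeout=5
    ).json()["beans"][0]
    if bean["MissingBlocks"] > 0:
        alerts.append(f"{bean['MissingBlocks']} blocks are missing")
    if bean["UnderReplicatedBlocks"] > 1000:  # threshold is an assumption
        alerts.append(f"{bean['UnderReplicatedBlocks']} blocks are under-replicated")
    used_ratio = bean["CapacityUsed"] / bean["CapacityTotal"]
    if used_ratio > 0.85:
        alerts.append(f"HDFS capacity usage is {used_ratio:.0%}")
    if bean["FilesTotal"] > 50_000_000:  # small-file pressure, threshold assumed
        alerts.append(f"HDFS holds {bean['FilesTotal']} files/directories")
    return alerts

if __name__ == "__main__":
    for a in check_hdfs():
        print("[ALERT]", a)
```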

Object storage

If you are using object storage, then congratulations: you basically do not need to care about the HDFS-related monitoring items above; all of that is handled by the object storage service.

Here are a few things you might want to focus on:

● Bucket usage

● Data lifecycle management strategy

● Security audits, such as the storage and rotation of access keys (AK/SK)

● ......
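If, for example, your object storage is AWS S3, bucket usage can be read from CloudWatch storage metrics; here is a minimal sketch in which the bucket name and threshold are assumptions, and other object stores expose comparable metrics.

```python
# Minimal sketch: read a bucket's size from CloudWatch storage metrics (AWS S3 example).
# The bucket name and threshold are assumptions for illustration.
from datetime import datetime, timedelta
import boto3

def bucket_size_bytes(bucket: str) -> float:
    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",
        Dimensions=[{"Name": "BucketName", "Value": bucket},
                    {"Name": "StorageType", "Value": "StandardStorage"}],
        StartTime=datetime.utcnow() - timedelta(days=2),
        EndTime=datetime.utcnow(),
        Period=86400,  # the metric is reported once per day
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return max(p["Average"] for p in points) if points else 0.0

if __name__ == "__main__":
    size = bucket_size_bytes("my-datalake-bucket")  # hypothetical bucket
    if size > 50 * 1024**4:                         # 50 TiB, threshold is an assumption
        print(f"[ALERT] bucket usage is {size / 1024**4:.1f} TiB")
```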

Compute

At the computing level we basically focus on how specific tasks execute; we will look at real-time and offline tasks separately.

Real time

Flink has become the de facto standard for real-time computing, so the real-time tasks here mainly refer to Flink jobs, and the points to focus on are as follows:

● Whether the job is terminated abnormally

● The number of job restarts

● Whether Kafka consumption is lagging

● Whether checkpoints succeed, how long they take, and how many have failed

● Whether there is backpressure or data skew

● Resource usage of the job itself

● Whether sink write time exceeds the threshold

● Collection of custom metrics


● ......

The following diagram is an example of our Flink task monitoring:

[Figure: example Flink task monitoring dashboards]

Monitoring of Flink jobs can be made more fine-grained in combination with Flink metrics.
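For the job-level checks above, the Flink JobManager REST API already exposes most of what is needed. A minimal sketch, in which the JobManager address and the checkpoint-failure threshold are assumptions:

```python
# Minimal sketch: use the Flink JobManager REST API to check job state and checkpoint health.
# The JobManager address and thresholds are assumptions for illustration.
import requests

FLINK_URL = "http://jobmanager:8081"  # hypothetical address

def check_flink_jobs() -> list[str]:
    alerts = []
    jobs = requests.get(f"{FLINK_URL}/jobs/overview", timeout=5).json()["jobs"]
    for job in jobs:
        name, jid, state = job["name"], job["jid"], job["state"]
        if state not in ("RUNNING", "FINISHED"):
            alerts.append(f"{name}: job state is {state}")
            continue
        ckpt = requests.get(f"{FLINK_URL}/jobs/{jid}/checkpoints", timeout=5).json()
        counts = ckpt.get("counts", {})
        if counts.get("failed", 0) > 3:  # failure threshold is an assumption
            alerts.append(f"{name}: {counts['failed']} checkpoints have failed")
    return alerts

if __name__ == "__main__":
    for a in check_flink_jobs():
        print("[ALERT]", a)
```

Backpressure, skew, and Kafka consumer lag are better read from Flink's own metrics reporters (for example via Prometheus) than from ad hoc REST polling.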

Offline

Offline tasks are mainly batch jobs, which are relatively simple: a run either succeeds, fails, or times out. So we mainly focus on the following points:

● The task is terminated abnormally

● The task execution timed out

● Average task execution time (used to tune timeouts)

● Long-tail tasks

● Tasks that take up too many resources

● ......
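As a sketch of how timeout and long-tail detection might look, the task-run history is taken here as a plain list of records; in practice it would come from your scheduler's metadata database, and the 2x factor is an assumption.

```python
# Minimal sketch: flag task runs that take much longer than the task's historical average.
# The run records are hypothetical; in practice they come from the scheduler's metadata DB.
from collections import defaultdict
from statistics import mean

def long_running_tasks(runs: list[dict], factor: float = 2.0) -> list[str]:
    history = defaultdict(list)
    for r in runs:
        history[r["task"]].append(r["duration_s"])
    alerts = []
    for task, durations in history.items():
        avg = mean(durations)
        latest = durations[-1]
        if len(durations) >= 5 and latest > factor * avg:
            alerts.append(f"{task}: latest run took {latest}s, {latest/avg:.1f}x its average {avg:.0f}s")
    return alerts

# Usage with made-up run records for a hypothetical task:
runs = [{"task": "dwd_orders", "duration_s": d} for d in (300, 320, 310, 290, 305, 900)]
print(long_running_tasks(runs))
```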


Scheduling

The scheduling service is relatively simple: on the premise that HA is in place, you mainly watch whether the scheduling service itself is abnormal. Taking DolphinScheduler as an example, we pay attention to:

● The status of the master node

● Worker node status

● Node-related loads
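A minimal liveness sketch that simply probes the service ports of the master and worker hosts; the hostnames and ports are deployment-specific assumptions, and DolphinScheduler also exposes richer state through its UI and its ZooKeeper registry.

```python
# Minimal sketch: probe scheduler master/worker ports to confirm the processes are listening.
# Hostnames and ports are assumptions; adjust them to your DolphinScheduler deployment.
import socket

NODES = {
    "master-1": ("master-1.internal", 5678),  # hypothetical master host/port
    "worker-1": ("worker-1.internal", 1234),  # hypothetical worker host/port
}

def check_scheduler_nodes() -> list[str]:
    alerts = []
    for name, (host, port) in NODES.items():
        try:
            with socket.create_connection((host, port), timeout=3):
                pass
        except OSError:
            alerts.append(f"{name}: cannot connect to {host}:{port}, node may be down")
    return alerts

if __name__ == "__main__":
    for a in check_scheduler_nodes():
        print("[ALERT]", a)
```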


Data Services

Data services are the data capabilities provided to the outside world, and they are the carrier of big data's value, so the monitoring related to data services deserves special attention.

We need to focus on the following:

● Whether the data service is available, e.g. whether Grafana can be accessed normally and whether the API service can be called normally

● Whether the data provided is accurate and whether any data is missing (elaborated in the data quality section below)

● Service response time, such as page load time and API call time

● ......
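A minimal availability and latency check for such data services, using plain HTTP probes; the URLs and the latency threshold are assumptions.

```python
# Minimal sketch: probe data-service endpoints for availability and response time.
# URLs and the latency threshold are assumptions for illustration.
import requests

ENDPOINTS = {
    "grafana": "http://grafana.internal:3000/api/health",  # hypothetical URLs
    "data-api": "http://data-api.internal/v1/ping",
}

def check_data_services(max_latency_s: float = 2.0) -> list[str]:
    alerts = []
    for name, url in ENDPOINTS.items():
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException as exc:
            alerts.append(f"{name}: request failed ({exc})")
            continue
        if resp.status_code != 200:
            alerts.append(f"{name}: HTTP {resp.status_code}")
        elif resp.elapsed.total_seconds() > max_latency_s:
            alerts.append(f"{name}: responded in {resp.elapsed.total_seconds():.1f}s")
    return alerts

if __name__ == "__main__":
    for a in check_data_services():
        print("[ALERT]", a)
```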

Data quality

Data quality monitoring is relatively complex, but it must be done: wrong data directly affects the business decisions and judgments that depend on it.

Following the data standard management practices formulated by DAMA, we need to monitor data quality from the following perspectives:

  1. Completeness: data completeness issues include incomplete model design, such as missing uniqueness constraints or references; incomplete data records, such as missing or unavailable records; and incomplete data attributes, such as attributes containing null values.
  2. Accuracy: also called reliability; used to analyze and identify inaccurate or invalid data. Unreliable data can lead to serious problems, flawed methods, and poor decisions.
  3. Timeliness: measures whether data is available when it is needed. The timeliness of data is directly related to the speed and efficiency of an enterprise's data processing and is a key indicator affecting business processing and management efficiency.
  4. Uniqueness: used to identify and measure duplicate and redundant data. Duplicate data is an important cause of broken business collaboration and untraceable processes, and it is the most basic data problem that data governance has to solve, for example duplicated business primary keys.
  5. Consistency: the data models of multi-source data are inconsistent, such as inconsistent naming, data structures, or constraint rules; data entities are inconsistent, such as inconsistent encoding, naming and meaning, classification hierarchy, or lifecycle; and data content conflicts arise when multiple copies of the same data exist.

On top of that, we need to break each dimension down into finer-grained checks, as sketched below.
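A minimal sketch of what such fine-grained checks could look like, expressed as SQL probes over a warehouse table; the table, columns, and thresholds are hypothetical, and in practice the rules are usually configured in a data quality platform rather than hard-coded.

```python
# Minimal sketch: express per-table quality rules as SQL probes that each return one number.
# Table, columns, and thresholds are hypothetical; executing the SQL is left abstract.
QUALITY_CHECKS = [
    {   # Completeness: key attributes should not be null
        "name": "ods_order.user_id null count",
        "sql": "SELECT COUNT(*) FROM ods.ods_order WHERE user_id IS NULL",
        "max_allowed": 0,
    },
    {   # Uniqueness: the business primary key must not be duplicated
        "name": "ods_order.order_id duplicates",
        "sql": """SELECT COUNT(*) FROM (
                      SELECT order_id FROM ods.ods_order GROUP BY order_id HAVING COUNT(*) > 1
                  ) t""",
        "max_allowed": 0,
    },
    {   # Timeliness / volume: yesterday's partition should not be empty
        "name": "ods_order yesterday row count",
        "sql": "SELECT COUNT(*) FROM ods.ods_order WHERE dt = date_sub(current_date(), 1)",
        "min_allowed": 1,
    },
]

def evaluate(check: dict, value: int) -> str | None:
    if "max_allowed" in check and value > check["max_allowed"]:
        return f"{check['name']}: {value} exceeds allowed maximum {check['max_allowed']}"
    if "min_allowed" in check and value < check["min_allowed"]:
        return f"{check['name']}: {value} below allowed minimum {check['min_allowed']}"
    return None

# Usage: run each check's SQL with your query engine and pass the scalar result to evaluate().
print(evaluate(QUALITY_CHECKS[1], 3))
```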

03

Conclusions and experiences

  1. Use mailing lists or subscriptions for alarm notifications, so that personnel changes do not leave alarms without a recipient.
  2. Route important alarms through Ruixiang Cloud as well, for SMS and phone alerts, to avoid delays in acknowledging and handling alarms on non-working days.
  3. Monitoring does not need to be big and all-encompassing; build it according to importance and SLA.
  4. Affected parties must subscribe to the relevant alarm notifications, to avoid the situation where an alarm seems unimportant to party A but matters a great deal to party B, or even affects the business.
  5. Monitoring, alerting, and handling should form a complete closed loop with knowledge accumulation, building up a knowledge base for alarm handling.

References:

[1] Percent Big Data Technology Team. Trillion-level Big Data Monitoring Platform Construction Practice. InfoQ. https://www.infoq.cn/article/XudrcZEUFhPJR7kfYNur

[2] Big Data Cluster Monitoring System Architecture. Juejin. https://juejin.cn/post/6967234979847733279

[3] ALM-13000 ZooKeeper Service Unavailable (versions 2.x and earlier). MapReduce Service (MRS), Huawei Cloud. https://support.huaweicloud.com/intl/zh-cn/usermanual-mrs/alm_13000.html

[4] Interpretation of the DAMA Body of Knowledge (12): Data Quality Management. Zhihu. https://zhuanlan.zhihu.com/p/208935690

Author | Feng Chengyang, senior big data development engineer

Source | WeChat public account: Micro Carp technical team

Source: https://mp.weixin.qq.com/s/TJk704oudI3mnhJgpjaLXg
