
The Implementation of Bilibili's Monitoring 2.0 Architecture

Background

As we all know, Metrics data is one of the cornerstones of observability. At the end of 2021, Bilibili completed the rollout of a unified monitoring platform based on a Prometheus + Thanos solution. However, with the rapid growth of Bilibili's business, metric data volume has also grown explosively, which not only affects the stability of the existing monitoring system (availability, cloud monitoring data quality, etc.) but also cannot meet the company-level observability and stability-construction (1-5-10) goal. The pain points of the current architecture are summarized as follows:

  1. Poor stability: Because alerting is computed locally on Prometheus, all instances of an application must be scraped by the same Prometheus when target monitoring instances are scheduled. Whenever a large application is released, metadata labels such as the pod name change, and rebuilding the time-series index consumes a large amount of memory, causing Prometheus to OOM.
  2. Poor query experience: Frequent OOMs often lead to data breakpoints. Combined with the limited query performance of Prometheus + Thanos, panel queries are often slow or time out, open API queries fail, and false alarms are triggered.
  3. Poor quality of cloud monitoring data: We use many virtual machines from different vendors, and even the same vendor spans multiple accounts and regions. Some self-built availability zones are connected to cloud availability zones and some are not, some links are private lines and some are not, and the network topology is complex, so monitoring data is often missing due to network failures. In addition, each account has its own independent Prometheus collection, so there are multiple cloud monitoring data sources and it is difficult for users to pick the one they need.

2.0 Architecture Design

Design ideas

  1. Separation of collection and storage: Because Prometheus couples collection and storage, it is difficult to quickly scale elastically and schedule different collection nodes when scheduling and distributing target monitoring instances. With collection separated from storage, target instances can be scheduled dynamically and collectors can be scaled elastically.
  2. Separation of storage and compute: Prometheus also couples storage and compute, but as the business grows, metric data volume and compute demand rarely grow in step. For example, storage may less than double while compute (query demand) more than doubles, so buying servers for a storage-compute-coupled architecture inevitably wastes resources. With storage separated from compute, writes, storage, and queries can each be scaled elastically on their own.
  3. Time-series database selection: We learned that VictoriaMetrics (hereinafter VM) is being adopted as a time-series database by more and more companies, and our research showed that it meets our needs in terms of write and query performance, distributed architecture, and operational efficiency. (The advantages behind the VM selection are not discussed in this article; some query optimizations for VM are introduced below.)
  4. Unit-based disaster recovery: Since most scenarios (95%) are based on the pull mode, each target monitoring instance needs to be assigned to different collectors according to certain scheduling rules. Previously there was no standard: some scenarios were scheduled by cluster, some by IDC, and some by instance name, so metric data could not form a closed loop within a single unit across the whole link from collection → transmission → storage → query. We therefore defined a new standard that schedules everything by zone, so that the whole link is unitized within a zone and disaster recovery can be achieved at the unit level.

Based on these core design ideas, we designed the monitoring 2.0 architecture. First, let's look at the overall functional architecture:

Overview of the functional architecture

[Figure: functional architecture overview]

As the functional architecture diagram shows, Metrics data is used not only in basic scenarios such as dashboards and alerting; some business scenarios and the release platform also rely on metric data to make process decisions. The rollout of the 2.0 architecture therefore faces the following challenges: the stability of the monitoring system itself, data availability, query performance, fault blast radius, and so on.

Overall architecture

[Figure: overall technical architecture]

The above is the overall technical architecture. The following sections focus on data sources, data collection, data storage, and data query.

Data source

Broadly speaking, the monitored scenarios fall into the PaaS layer (application monitoring, online/offline component monitoring, and some middleware monitoring) and the IaaS layer (self-built data center servers/virtual machines, container monitoring, and network monitoring). Previously, target monitoring instances were discovered in pull mode, which had two main problems:

  1. Delayed discovery of target monitoring instances: the pull mode has a fixed synchronization period, such as 30 seconds, so when a new monitoring instance appears it takes up to 30 seconds to be discovered, and its metrics arrive 30 seconds later than expected.
  2. High O&M costs: when the discovery interface provided by the business side has a problem, the business party usually does not notice it immediately, and we need to actively step in to help handle it.

To solve these problems, we changed from pull mode to push mode, so that the business side actively pushes the target instances that need to be monitored, as sketched below. At the same time, to let each business party manage its target instances more comprehensively, the metrics platform organizes monitoring access by integration task and displays the collection status, collection count, and collection time of each target instance in real time, further improving the visibility of monitoring access status.
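For illustration, here is a minimal Go sketch of what a push-style target registration might look like. The endpoint path, payload shape, and field names are assumptions made for this sketch, not the platform's actual API.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// TargetInstance is a hypothetical payload a business side might push
// to register an instance for scraping (field names are illustrative).
type TargetInstance struct {
	Job     string            `json:"job"`     // integration task name
	Zone    string            `json:"zone"`    // unit/zone used for scheduling
	Address string            `json:"address"` // host:port of the /metrics endpoint
	Labels  map[string]string `json:"labels"`  // extra labels attached at scrape time
}

func main() {
	t := TargetInstance{
		Job:     "demo-app",
		Zone:    "sh001",
		Address: "10.0.0.12:9100",
		Labels:  map[string]string{"env": "prod"},
	}
	body, _ := json.Marshal([]TargetInstance{t})

	// Hypothetical registration endpoint exposed by the metrics platform.
	resp, err := http.Post("http://metrics-platform.example/api/v1/targets/push",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("push failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("push status:", resp.Status)
}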

[Figure: integration task view with real-time collection status]

Data acquisition

Scheduling layer

The scheduling layer is mainly responsible for collection job configuration and for distributing target monitoring instances. To achieve a fully closed loop within a unit, the scheduling layer is split into level 1 scheduling (all collection configurations) and level 2 scheduling (collection configurations of the local data center).

Level 1 Scheduling (Master)

  1. Fetch the full collection job configuration and target monitoring instances from the database, and build in memory the collection configuration needed by each level 2 scheduler according to the scheduling configuration (zone-dimension scheduling).
  2. To keep the master data highly available, the in-memory snapshot is not updated when access to the backing database fails. In addition, when a job with more than 5k targets is deleted at once, or a diff removes more than 5k targets at once, the platform intercepts the change and requires a double check to prevent user misoperation (a simplified sketch of this protection follows this list).
  3. Through measures such as asynchronous processing, in-memory caching, and concurrency (multiple coroutines), the master's full scheduling time has been reduced from 50 seconds to under 10 seconds.
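A minimal sketch of the change-protection check described above, assuming a hypothetical 5k-target threshold; it is not the platform's actual implementation.

package main

import "fmt"

const maxTargetDiff = 5000 // hypothetical protection threshold

// checkDiff refuses to apply a new target snapshot when too many targets
// would disappear at once, forcing a manual double check instead.
func checkDiff(oldSet, newSet map[string]struct{}) error {
	removed := 0
	for t := range oldSet {
		if _, ok := newSet[t]; !ok {
			removed++
		}
	}
	if removed > maxTargetDiff {
		return fmt.Errorf("refusing snapshot: %d targets removed (> %d), double check required",
			removed, maxTargetDiff)
	}
	return nil
}

func main() {
	oldSet := map[string]struct{}{"a:9100": {}, "b:9100": {}}
	newSet := map[string]struct{}{"a:9100": {}}
	fmt.Println(checkDiff(oldSet, newSet)) // small diff, so nil is returned
}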

Level 2 Scheduling (Contractor)

  1. Using the collection cluster name + version number, it fetches the collection configuration of the local data center from the master. Multiple sets of level 2 schedulers are deployed to limit the blast radius of a failure.
  2. After obtaining the local data center's collection configuration, it uses Collector heartbeats to determine the currently healthy collection nodes (Collectors), schedules the collection configuration by instance and capacity, and assigns it to the corresponding collection nodes.
  3. When the master's collection configuration or the set of collection nodes changes, a new round of scheduling is triggered. To keep targets from being reshuffled at random, the scheduling algorithm still gives priority to the collection node a configuration was previously assigned to (a simplified sketch of this sticky assignment follows this list).
  4. How do we ensure metric data has no breakpoints or jitter when the Contractor is released and restarted?

    When the Contractor restarts, it maintains a global state variable in memory, waits for the Collectors to report all their targets before starting target scheduling, and only after scheduling completes and the state is set to ready does the Contractor API serve the Collectors.
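A simplified sketch of the sticky assignment idea: targets already bound to a healthy collector keep their assignment, and only unassigned targets are spread over the currently least-loaded collectors. Names and structure are illustrative, not the Contractor's actual code.

package main

import "fmt"

// assign keeps targets on their previous collector when that collector is
// still healthy, and places the rest on the currently least-loaded collector.
func assign(targets []string, prev map[string]string, healthy []string) map[string]string {
	alive := map[string]bool{}
	load := map[string]int{}
	for _, c := range healthy {
		alive[c] = true
		load[c] = 0
	}
	out := map[string]string{}
	var pending []string
	for _, t := range targets {
		if c, ok := prev[t]; ok && alive[c] {
			out[t] = c // sticky: avoid reshuffling already-assigned targets
			load[c]++
			continue
		}
		pending = append(pending, t)
	}
	for _, t := range pending {
		best := healthy[0]
		for _, c := range healthy {
			if load[c] < load[best] {
				best = c
			}
		}
		out[t] = best
		load[best]++
	}
	return out
}

func main() {
	prev := map[string]string{"t1": "collector-a"}
	fmt.Println(assign([]string{"t1", "t2", "t3"}, prev, []string{"collector-a", "collector-b"}))
}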

Collector

  1. The Collector is based on vmagent. It has two main functions: periodically report a heartbeat to the Contractor, and fetch the relevant collection configuration, call the reload API, and trigger vmagent to start collecting.
  2. During gray release of part of the traffic, we found that vmagent's memory usage was high; each pull scrape consumed additional memory, and enabling streaming collection (promscrape.streamParse=true) reduced memory by about 20%.
  3. vmagent itself smooths load by spreading scrapes randomly across the collection interval. For example, with a 30-second collection interval, after vmagent receives the configuration it may take up to 30 seconds before a target is scraped. So when a collector scales in or out and a target drifts to another collector, a metric breakpoint appears. We therefore designed a mechanism where the Collector does not exit immediately after receiving the exit signal: it only stops reporting its heartbeat, keeps collecting for one more cycle, and then exits, ensuring the metrics have no breakpoints (see the sketch after this list).
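A minimal sketch of the delayed-exit behavior described above: on SIGTERM the collector stops heartbeating but keeps scraping for one more interval before exiting. This illustrates the mechanism only; it is not the actual Collector code.

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

const scrapeInterval = 30 * time.Second // matches the collection interval

func main() {
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)

	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(5 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				// report heartbeat to the Contractor (omitted)
			}
		}
	}()

	<-stop
	close(done) // stop heartbeating so the Contractor reschedules our targets
	fmt.Println("heartbeat stopped; draining for one scrape interval")
	// Keep scraping for one more interval so the old and new owners overlap
	// and no metric points are lost during the hand-off.
	time.Sleep(scrapeInterval)
	fmt.Println("drain complete, exiting")
}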

Data storage

We store metrics with vmstorage. Since vmstorage involves many storage details, this section focuses on the storage of the index structures. First, let's look at vmstorage's core types.

  • MetricName stores the label key-value pairs. On write, vminsert serializes the labels of the actual metric data into a MetricName for the write request; on query, vmselect also serializes the label information in the result set via MetricName. It is essentially the label/meta content of a time series (ts) in actual storage. The on-disk serialization of a MetricName in vmstorage is as follows:
4 byte account id | 4 byte projectid | metricname(__name__) | 1 | tag1 k | 1 | tag1 v | 1 | .... | tagn k | 1 | tagn v | 1 | 2 |           
  • MetricId: a nanosecond timestamp and the unique key of a ts cluster, generated when the cluster of ts is first stored in vmstorage. 8 bytes.
  • TSID is also a unique key of a ts cluster. It consists of the metric's tenant, the metric name ID, the job/instance label value IDs, and the MetricId. The on-disk serialization of a TSID in vmstorage is as follows:
4 byte accountid | 4 byte projectid | 8 byte metricname(__name__) id | 4 byte job id | 4 byte instance id | 8 byte metricid           

Both TSID and MetricId are unique keys of a ts cluster. The difference is that MetricId is just an 8-byte timestamp, while the serialized TSID carries a lot of other identifying information besides the MetricId. When TSIDs are sorted lexicographically, metrics with the same name under the same tenant are laid out contiguously. Since metric queries typically concentrate on same-name metrics under the same tenant with similar label conditions, the TSID is effectively the key of vmstorage's data directory, and the core of the index part of a query is to obtain the final TSIDs from the query conditions.

Under the above three types, the following abstract index structures can be formed:

MetricName -> TSID:0 | MetricName | TSID
MetricId -> MetricName:3 | 4 byte accountid | 4 byte projectid | MetricId | MetricName
MetricId -> TSID:2 | 4 byte accountid | 4 byte projectid | MetricId | TSID           

At the same time, the core inverted index structure is abstracted as follows:

1 | 4 byte accountid | 4 byte projectid | metricname(__name__) | 1 | MetricId
1 | 4 byte accountid | 4 byte projectid | tag k | 1 |  tag v | 1 | MetricId           

Combining inverted indexes and index 3 gives you the following flow:

Based on the tenant information, metric name, and label information carried by the query, the required inverted-index prefixes can be formed, and under lexicographically sorted indexes the required MetricId sets can be found by binary search. Intersecting the MetricIds returned by the inverted index for each condition yields the set of MetricIds that satisfy all label conditions, and assembling them into the prefix of index 3 lets us look up the TSIDs that satisfy the final condition. When vmstorage executes a query, the core goal of the index part is to find the final set of required TSIDs from the conditions (a toy sketch of this flow follows).
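A toy Go sketch of the lookup flow: each label condition yields a posting list of MetricIds from the inverted index, the lists are intersected, and the surviving MetricIds are resolved to TSIDs via index 3. The real vmstorage works on prefix scans over its mergeset; this only mirrors the logic.

package main

import "fmt"

// intersect returns MetricIds present in every per-condition posting list,
// mimicking how label filters are combined against the inverted index.
func intersect(postings [][]uint64) []uint64 {
	if len(postings) == 0 {
		return nil
	}
	counts := map[uint64]int{}
	for _, list := range postings {
		for _, id := range list {
			counts[id]++
		}
	}
	var out []uint64
	for id, c := range counts {
		if c == len(postings) {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	// Posting lists for __name__="http_requests_total" and env="prod" (toy data).
	byName := []uint64{1, 2, 3, 5}
	byEnv := []uint64{2, 3, 7}
	metricIDs := intersect([][]uint64{byName, byEnv})

	// MetricId -> TSID lookup stands in for index 3 (tenant prefix omitted).
	metricIDToTSID := map[uint64]string{2: "tsid-2", 3: "tsid-3"}
	for _, id := range metricIDs {
		fmt.Println(id, "->", metricIDToTSID[id])
	}
}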

Combined with the abstract indexes above, VM reduces disk footprint through prefix-based compression, so its disk storage cost is about 40% lower than Prometheus.

In normal use, vmstorage delivers excellent performance. Taking one of our clusters as an example, a single 48C 256G vmstorage node is enough to support 400k writes per second and 20k queries per second. In this process, we tune the application's GOGC to balance the VM's resource usage against CPU. By default vmstorage sets GOGC to 30, which can cause frequent GC; when query and write load is saturated and this coincides with a merge of metric parts, queries can see relatively large delays for a short period and even fail. If memory is sufficient, increasing vmstorage's GOGC trades a certain amount of memory for more stable vmstorage operation (a generic illustration of this knob follows).
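As a reminder of what this knob controls at the Go runtime level (the GOGC environment variable corresponds to the percentage passed to debug.SetGCPercent): lower values cap heap growth and trigger GC more often, higher values trade memory for fewer GC cycles. This is a generic Go illustration, not vmstorage code.

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// GOGC=30 means the heap may only grow 30% over the live set before the
	// next GC runs: lower memory ceiling, but more frequent GC and more CPU.
	prev := debug.SetGCPercent(30)
	fmt.Println("previous GOGC:", prev)

	// Raising it (e.g. GOGC=100) lets the heap grow further between cycles,
	// trading memory for fewer GC pauses - the tuning described above.
	debug.SetGCPercent(100)
}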

Data queries

PromQL Auto-Replace Enhancements

When querying metrics day to day through Grafana against VictoriaMetrics, we often encounter panels that return data very slowly or fail outright because the query covers too much data. Even after optimizing the query conditions as much as possible, some of these panels still cannot be queried by normal means.

Let's take a promql statement as an example:

histogram_quantile(0.99, sum(rate(grpc_server_requests_duration_ms_bucket{app="$app",env=~"$env"}[2m])) by (le, method))           

This statement queries the p99 latency of all gRPC APIs of an app under a given env. Suppose app A has 2000 instances, 20 interfaces, 50 callers, and 5 le buckets, reporting every 30s: 2000 × 20 × 50 × 5 = 10 million series match at any single moment, and since the statement also covers a 2m time window (4 samples per series), the query touches about 40 million points in total. Even when such a query stays under the capacity limit, it affects the stability of the whole query data source. Following common Grafana practice we would pre-aggregate the entire statement, but the number of active apps at Bilibili is in the thousands, while only a few dozen apps have query performance problems with this statement.

Meanwhile, if the first parameter of this statement is changed to 0.5 or 0.9, the corresponding p50 and p90 metrics can be queried directly, but pre-aggregating the whole statement would not help the p50 and p90 queries, so the output of a single aggregation is not ideal. One way to think about this is: can we solve the query problem of panels containing this statement by aggregating only the part that has performance problems?

Let's start with how this statement is executed in vmselect. During execution, vmselect parses it into the following execution tree:

[Figure: execution tree of the example PromQL statement]

Let's walk through how the statement is executed. After the execution tree is parsed, the nodes are executed depth-first in order. The specific process is as follows:

  1. Metric data query: at the leaf node, a distributed query is issued to vmstorage based on the metric name and label filters to obtain the required raw metric data.
  2. rate function execution: the metric data from step 1 is grouped into series by identical label sets; within each series, the last point and the first point in the 2m time window are subtracted and divided by the time between them to obtain the rate result.
  3. sum aggregate function execution: the data from step 2 is partitioned by sum's group by key (le + method), and the series falling into the same partition are summed.
  4. histogram_quantile function execution: finally, the data of each partition from step 3 is traversed to compute the p99, yielding the final latency distribution.

Continuing with the app above (2000 instances, 20 interfaces, 50 callers, 5 le buckets, reporting every 30s), the number of metric points that must be queried in step 1 reaches 40 million. The rate function's time-window calculation in step 2 reduces those 40 million points to 10 million series. In step 3, the 10 million series are divided into 100 partitions by the le + method partition key, so the output shrinks to 100 accordingly. Step 4 outputs the same number of results as it receives.

So in the execution of the whole tree, step 1 is usually the performance bottleneck, and although step 2 greatly reduces the output, it still produces a huge number of series, so the performance problem is not fundamentally solved. If we could pre-aggregate the output of step 3 and use the pre-aggregated result in the query, the query would directly become a simple query over only the pre-aggregated series. Moreover, whether we compute p50, p90, or p99, the underlying calculation is identical up to the output of step 3, so if we pre-aggregate and persist the result of step 3, it can be reused by histogram_quantile with any quantile parameter, and the cost-effectiveness of a single pre-aggregation becomes considerable.

So, suppose the following PromQL for app A has been pre-aggregated, in real time or on a schedule, and the result of the expression below has been written out as a metric test_metric_app_A:

sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method)           

Then, when we actually query the p99 latency of app A, we would like the actually executed query to be:

histogram_quantile(0.99, test_metric_app_A)           

However, in Grafana's panel configuration, if one panel had to hold both the original query and several such optimized queries, the actual query experience would become very strange, and the overall experience could even be a net negative. So let's go back to the query execution tree of the original statement:

[Figure: original query execution tree with the replaceable subtree highlighted in yellow]

If we replace the highlighted (yellow) subtree with the test_metric_app_A query tree, the query optimization is completed seamlessly. To do that, we need to build the following mapping relationship on vmselect before the query:

[Figure: mapping from the original subtree to the pre-aggregated metric]

When a query is actually executed, once the execution tree has been parsed we can start from its root and check whether any subtree fully satisfies the mapping conditions; as soon as one does, the subtree of the original query statement can be replaced with the mapped tree.

[Figure: execution tree after the subtree has been replaced]

This looks like a perfect fit, but in the actual query process the substitution still has problems. In our original statement, env is used as a query condition, yet the partition key used in our pre-aggregation result is le + method, so env does not exist in the pre-aggregated metric. Our pre-aggregation statement therefore has to be modified to:

sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method, env)

With this kind of pre-aggregation, when we want to replace the execution tree of the query statement, we obviously cannot apply the mapping directly to the query's execution tree; instead, a new aggregate-function node has to be added on top of the existing aggregate function.

Rethinking the scenario: with the pre-aggregation above, what we need is to automatically add a layer of aggregation on top of the pre-aggregated metric to achieve our goal.

Our new mapping looks like this:

[Figure: updated mapping with the extra aggregation layer]

With this mapping in hand, when we traverse from the root again we find a mismatch on the partition key of the sum layer; however, the query's partition key set (le, method) is a subset of the mapped partition key set (le, method, env), so using the pre-aggregated result has no effect on the semantics of the statement. On this basis, the new subtree can be hung directly under this layer's aggregate function when replacing the subtree. The replacement result is as follows (a self-contained sketch of the replacement rule follows the figure):

[Figure: execution tree after replacement, with the query's own aggregation layer on top of the pre-aggregated metric]
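For concreteness, a self-contained Go sketch of the replacement rule: if the query's group-by keys are a subset of the pre-aggregation's keys, the sum(rate(metric[w])) by (...) subtree is swapped for the pre-aggregated metric wrapped in the query's own sum ... by (...). The mini AST here is hypothetical; the production version operates on vmselect's parsed execution tree.

package main

import "fmt"

// Node is a minimal stand-in for one node of a parsed PromQL execution tree.
type Node struct {
	Op       string   // "metric", "rate", "sum", "histogram_quantile", ...
	Metric   string   // set when Op == "metric"
	GroupBy  []string // set when Op == "sum"
	Children []*Node
}

// Rule maps a sum(rate(<SourceMetric>[w])) by (<GroupBy>) subtree to the
// metric that stores its pre-aggregated result.
type Rule struct {
	SourceMetric string
	GroupBy      []string
	PreAggMetric string
}

// subset reports whether every key in a also appears in b.
func subset(a, b []string) bool {
	set := map[string]bool{}
	for _, k := range b {
		set[k] = true
	}
	for _, k := range a {
		if !set[k] {
			return false
		}
	}
	return true
}

// rewrite swaps a matching subtree for sum(<PreAggMetric>) by (<query keys>)
// when the rule's partition keys cover the query's partition keys.
func rewrite(n *Node, r Rule) *Node {
	if n == nil {
		return nil
	}
	if n.Op == "sum" && len(n.Children) == 1 && n.Children[0].Op == "rate" &&
		len(n.Children[0].Children) == 1 &&
		n.Children[0].Children[0].Metric == r.SourceMetric &&
		subset(n.GroupBy, r.GroupBy) {
		// Keep the query's own group-by, but read from the pre-aggregated series.
		return &Node{Op: "sum", GroupBy: n.GroupBy,
			Children: []*Node{{Op: "metric", Metric: r.PreAggMetric}}}
	}
	for i, c := range n.Children {
		n.Children[i] = rewrite(c, r)
	}
	return n
}

func main() {
	// histogram_quantile(0.99, sum(rate(..._bucket[2m])) by (le, method))
	q := &Node{Op: "histogram_quantile", Children: []*Node{{
		Op: "sum", GroupBy: []string{"le", "method"},
		Children: []*Node{{Op: "rate", Children: []*Node{{
			Op: "metric", Metric: "grpc_server_requests_duration_ms_bucket"}}}},
	}}}
	r := Rule{
		SourceMetric: "grpc_server_requests_duration_ms_bucket",
		GroupBy:      []string{"le", "method", "env"},
		PreAggMetric: "test_metric_app_A",
	}
	q = rewrite(q, r)
	fmt.Println(q.Children[0].Children[0].Metric) // test_metric_app_A
}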

Once the execution-tree substitution described above is in place, PromQL can be rewritten automatically without semantic loss. Furthermore, with this transformation we can add extra dimensions to the pre-aggregated metrics wherever doing so does not inflate the pre-aggregation output, so the pre-aggregated result can be reused as a sub-query in more scenarios. The query transformation is completely invisible to users querying through Grafana: for apps whose queries use pre-aggregation, the parsed execution tree is replaced automatically, while query conditions that do not need optimization run directly against the original statement. In this way, queries with performance bottlenecks are optimized while keeping the resources consumed by pre-aggregation as small as possible. Moreover, even if the user modifies the query statement, as long as it contains a pre-aggregated sub-query the optimization still takes effect, maximizing the return on the resources spent on pre-aggregation.

P.S.: the avg aggregate function does incur a certain semantic loss, but the error is negligible in most scenarios.

PromQL-based Flink metric pre-aggregation

Regarding the pre-aggregation mentioned above: Bilibili's pre-aggregation was originally done through scheduled batch queries. Query-based aggregation puts a lot of pressure on both vmstorage on the storage side and vmselect on the query side, which may affect normal metric queries. From the pre-aggregation perspective, executing native PromQL aggregation also wastes a certain amount of memory, as exemplified by one of the PromQL statements mentioned above:

sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method, code)           

For this pre-aggregation statement, vmselect fetches from vmstorage all the metrics matching grpc_server_requests_duration_ms_bucket{app="A"} and keeps all their labels in memory; only when all the distributed queries have completed does the calculation run, and the final sum converges the data by the group by labels le, method, and code to produce the result. In this process vmselect's memory often hits a bottleneck before the sum even executes, especially because the rate function here computes over a 2m time window, which further increases the pressure on vmselect. Such queries also tend to become slow queries, so the aggregated data arrives with significant delay. Likewise, scheduled high-volume queries add pressure on vmstorage.

To avoid the pressure of these query requests on the VM cluster while preserving the effectiveness of pre-aggregation as much as possible, we tried using Flink for PromQL-based metric pre-aggregation.

We described earlier how PromQL is executed in vmselect. An expression like sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method, code) is parsed in vmselect into the following execution tree.

[Figure: execution tree of the pre-aggregation expression]

On the configuration side of the Flink pre-aggregation, we can parse this PromQL in the same way into the execution tree above. The Flink side does not need to care about the composition of the original PromQL statement; it only needs the JSON of the execution tree. What the configuration side does need to care about is saving the time-window information obtained during parsing as metadata, so the Flink job can pick a time-window configuration that meets the requirements for execution. A sketch of such a tree serialized to JSON follows.
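A sketch of what handing the parsed execution tree to Flink as JSON might look like; the node shape and field names are assumptions made for illustration (and shown in Go for consistency with the other sketches, although the consumer is a Flink job).

package main

import (
	"encoding/json"
	"fmt"
)

// ExprNode is a hypothetical JSON shape for one node of the parsed PromQL tree
// handed to the Flink job; the window is carried as metadata so the job can
// pick a matching time-window configuration.
type ExprNode struct {
	Op       string            `json:"op"`                 // "sum", "rate", "selector", ...
	Metric   string            `json:"metric,omitempty"`   // for selector nodes
	Matchers map[string]string `json:"matchers,omitempty"` // label filters
	GroupBy  []string          `json:"group_by,omitempty"` // for aggregation nodes
	WindowMs int64             `json:"window_ms,omitempty"`
	Children []*ExprNode       `json:"children,omitempty"`
}

func main() {
	tree := &ExprNode{
		Op: "sum", GroupBy: []string{"le", "method", "code"},
		Children: []*ExprNode{{
			Op: "rate", WindowMs: 120_000,
			Children: []*ExprNode{{
				Op:       "selector",
				Metric:   "grpc_server_requests_duration_ms_bucket",
				Matchers: map[string]string{"app": "A"},
			}},
		}},
	}
	b, _ := json.MarshalIndent(tree, "", "  ")
	fmt.Println(string(b))
}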

For this execution tree, we design the execution process from Flink's perspective.

Data filtering phase

In the first stage, we only need to match against the leaf nodes of the execution tree to pick, from the real-time metric stream, all the raw metric data matched by the PromQL.

Data partitioning phase

For sum(rate(grpc_server_requests_duration_ms_bucket{app="A"}[2m])) by (le, method, code), performing this 2m time-window aggregation normally requires holding the full raw data within the 2m window in memory, which is a real burden: the label set of a metric point typically reaches 1-2 KB, which is a lot of memory pressure for pre-aggregation. When optimizing the execution of the aggregation, the memory held by the metric points inside the in-memory time window is the core direction of optimization.

When we aggregate within partitioned time windows in Flink, what we rely on is the sum node of the execution tree: sum's group by keys are exactly the partition keys Flink needs, so we can reuse that logic, extract the corresponding labels of each metric as the partition key, and cache the data points as values inside Flink's time window for aggregation. However, the same memory bottleneck as in vmselect exists here. Looking at the statement again: for the metric grpc_server_requests_duration_ms_bucket, every label other than the le, method, and code labels involved in the partition key is not actually needed during execution, so we can discard all the remaining labels and keep only a unified uuid of the label set, which is used inside the rate computation to hash points into a bucket for aggregation.

For the composition of the partition key we can go even further and discard the label keys themselves: since the partition keys are always arranged in a fixed order, the partition key only needs to keep the specific label values, with the keys dropped; at execution time they can be taken in the order declared by the PromQL, i.e. the PromQL id plus the label values. And each individual metric point only needs to keep the label-set uuid, the timestamp, and the value. This optimization directly addresses the memory pressure inside the aggregation.

Take one of the metrics involved in this statement as an example: the partition key generated for a metric point is "promqlid (4 bytes, to locate the specific execution tree within the time window) + le value + method value + code value", and each metric point cached in the time window is "label-set uuid (4 bytes) + timestamp (8 bytes) + value (8 bytes)".

So the in-memory footprint of each cached point is fixed at 20 bytes, and the partition key shared by all points under the same partition is only 4 bytes plus the required value strings, which keeps memory consumption to a minimum. A sketch of this compact representation follows.
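A sketch of the compact in-window representation described above: the partition key keeps only the PromQL id and the ordered label values, and each cached point is 20 bytes (4-byte label-set uuid + 8-byte timestamp + 8-byte value). Field and function names are illustrative; in production this lives in the Flink job, shown here in Go for consistency with the other sketches.

package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// partitionKey keeps only the 4-byte promql id plus the ordered label values
// (le, method, code); the label keys are implied by the PromQL declaration order.
func partitionKey(promqlID uint32, labelValues ...string) string {
	var b strings.Builder
	var id [4]byte
	binary.BigEndian.PutUint32(id[:], promqlID)
	b.Write(id[:])
	for _, v := range labelValues {
		b.WriteByte(0) // separator
		b.WriteString(v)
	}
	return b.String()
}

// point is the fixed 20-byte payload cached inside the time window.
type point struct {
	SeriesUUID  uint32  // 4 bytes: identifies the original label set
	TimestampMs int64   // 8 bytes
	Value       float64 // 8 bytes
}

func main() {
	key := partitionKey(42, "100", "/api/v1/play", "200")
	p := point{SeriesUUID: 7, TimestampMs: 1700000000000, Value: 1.0}
	fmt.Printf("key=%q point=%+v\n", key, p)
}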

The execution phase of the time window

Once data filtering and partitioning are done, we enter the time-window operator stage, where we only need to perform a depth-first evaluation of the execution tree.

At this stage, because of Flink's time-window constraints, the concrete function implementations differ subtly from those in the VM.

The clearest example is the increase function. In the VM's implementation, its result is the last point of the current time window minus the last point of the previous time window; in Flink, since real-time data is already cached for 2 minutes, that implementation would cause extra memory consumption, and other functions would likewise need extra storage, so we chose Prometheus' increase behavior instead. In Prometheus' increase implementation, the increment over the period is estimated from the first and last points within the same window, extrapolated to the window length based on their slope (a simplified sketch follows).
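A simplified Go sketch of the extrapolation idea behind a Prometheus-style increase: take the delta between the first and last samples inside the window and scale it by the ratio of the window length to the time they actually cover. The real Prometheus implementation additionally handles counter resets and clamps the extrapolation at the window boundaries; those details are omitted here.

package main

import "fmt"

type sample struct {
	tMs int64
	v   float64
}

// increase estimates the counter increase over a window from the samples that
// fall inside it, by extrapolating the observed delta to the full window.
// Simplified: no counter-reset handling, no boundary clamping.
func increase(samples []sample, windowMs int64) float64 {
	if len(samples) < 2 {
		return 0
	}
	first, last := samples[0], samples[len(samples)-1]
	delta := last.v - first.v
	covered := float64(last.tMs - first.tMs)
	if covered <= 0 {
		return 0
	}
	return delta * float64(windowMs) / covered
}

func main() {
	// 4 samples at 30s spacing inside a 2m window.
	s := []sample{{0, 10}, {30_000, 14}, {60_000, 18}, {90_000, 22}}
	fmt.Println(increase(s, 120_000)) // 12 * 120000/90000 = 16
}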

Similarly, in many window-based functions the VM pulls in data from the previous time window to make the result more accurate; such an optimization would cost extra resources in Flink, so all similar functions in Flink are implemented following Prometheus' design instead.

Also, since we discard most labels in the earlier stage, functions such as topk that need to retain the original label information before and after are not supported.

Data reporting phase

The pre-aggregated metric data obtained in the previous stage is the result we need; it can be written directly to vminsert in Flink's sink via the remote write protocol, as sketched below.
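A minimal standalone sketch of writing one pre-aggregated sample over the Prometheus remote write protocol (snappy-compressed protobuf), assuming the github.com/prometheus/prometheus/prompb and github.com/golang/snappy packages. The vminsert address and tenant path are illustrative and depend on the cluster setup; in production this logic would live in the Flink sink rather than a standalone program.

package main

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// One pre-aggregated sample of the hypothetical result metric.
	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "test_metric_app_A"},
				{Name: "le", Value: "100"},
				{Name: "method", Value: "/api/v1/play"},
			},
			Samples: []prompb.Sample{{Value: 3.14, Timestamp: 1700000000000}},
		}},
	}

	raw, err := req.Marshal()
	if err != nil {
		panic(err)
	}
	body := snappy.Encode(nil, raw) // remote write bodies are snappy-compressed

	httpReq, _ := http.NewRequest(http.MethodPost,
		"http://vminsert:8480/insert/0/prometheus/api/v1/write", bytes.NewReader(body))
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		fmt.Println("remote write failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("remote write status:", resp.Status)
}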

With the design above, our PromQL-based Flink pre-aggregation can be configured quickly against whatever PromQL is the actual query bottleneck. In practice, a Flink job with 100C 400G can handle the time-window caching and computation of 300 million metric points every 2 minutes, which greatly reduces the resource requirements of pre-aggregation compared with scheduled pre-aggregation and meets our expectation of maximizing resource utilization while keeping queries transparent.

Query optimization benefits

Automatic query rewriting plus the dedicated Flink pre-aggregation lets us target governance at the specific PromQL statements that actually cause trouble. With very small resource consumption, slow queries longer than 20 seconds are reduced by 90% and query data-source resource usage by 50%, on average per day, bringing very high returns at a very small cost.

Data visualization

At present we mainly use Grafana to build monitoring dashboards. For historical reasons we had stayed on an older version (v6.7.x), but newer Grafana versions bring many feature iterations and performance optimizations, so we decided to upgrade to v9.2.x. The version upgrade came with the following challenges:

  1. Upgrading from v6.7.x to v9.2.x involves multiple breaking changes (besides the officially documented ones, some custom data source/panel plugins also have breaking changes).
  2. High deployment cost of the old Grafana: it was previously deployed on a physical machine in a rather black-box way, and we had added an nginx + grafana auth service for auth-proxy authentication, which further increased the deployment and operation cost.

In the face of the challenges and problems of version upgrades, we did the following:

  1. Before the upgrade we did extensive testing and verification, used scripts to fix data incompatibilities caused by the breaking changes, and replaced the new version's ES plugin with the opensearch data source plugin.
  2. For deployment, the entire Grafana build and deployment script is now managed in a git repository, so version changes are no longer a black box; at the same time, nginx + grafana auth + grafana are built into an all-in-one image for containerized deployment, further reducing deployment cost.
  3. In the new Grafana, if the Prometheus data source is configured as version > v2.37.x, variable lookups switch from the Series API to the Label Values API by default, improving query performance by roughly 10x (2s → 200ms).

After the upgrade, panel loading performance and the overall query experience improved significantly.

Overall benefits

  • After switching from Prometheus to the VM architecture, p90 query latency dropped by more than 10x.
  • 1.7 million+ collection targets are now supported, all scheduled and collected by zone, achieving unit-level disaster recovery.
  • By adding only disk resources, the collection interval for application monitoring was tightened from 60 seconds to 30 seconds across the board, meeting the 1-minute detection requirement of the 1-5-10 goal.
  • Exceptions such as metric breakpoints and OOM alarms are reduced by more than 90%.
  • Current write throughput is 44M/s and query throughput 48k/s, and with query optimization and acceleration the p90 query latency has dropped to the millisecond level (300 ms).

Cloud monitoring solution

Currently, cloud monitoring has the following pain points:

  1. Due to the network environment across cloud vendors, or across regions within the same vendor, collection fails and monitoring data is missing.
  2. Each account has its own independent Prometheus collection, so there are multiple cloud monitoring data sources, and it is difficult for users to select the one they need.

The cloud monitoring solution is similar to the IDC collection solution. To enable unified data-source queries and alert computation, we use remote write to send the cloud data collected by Prometheus back to the IDC storage cluster. The architecture is as follows:

[Figure: cloud monitoring architecture]

Compared with IDC collection solutions, cloud monitoring has the following differences:

1. The Contractor supports pulling the collection configuration needed by its zone over the public network.

2. Why use Prometheus instead of vmagent for acquisition?

  • Low migration cost: historically, data on the cloud was collected by Prometheus, and since the current Collector implementation is pluggable, reusing Prometheus for cloud collection keeps the cost low.
  • Prometheus local data: the data collected on the cloud is sent back to the IDC storage cluster via remote write, while 1 day of data is kept locally, so local data can still be queried when remote write fails.

3. Why was the vm-auth component introduced?

Because data from the cloud travels back to the data center over the public network, writing it directly to vminsert via remote write poses security risks, so we added vm-auth to perform tenant authentication and traffic scheduling on the back-to-origin traffic. The vm-auth configuration is as follows:

[Figure: vm-auth configuration]

Benefits

  • After all cloud monitoring is scheduled by zone, the quality of cloud data improved greatly, and on-call incidents caused by missing cloud monitoring data dropped by more than 90%.
  • Data-source queries are unified: 20+ cloud data sources converge into one.

Planning for the future

  1. Longer metric retention: data is stored for 15 days by default, and we want to provide longer retention for business users for analysis, review, resource estimation, and other scenarios.
  2. Finer-grained metric collection: application monitoring currently defaults to one point every 30 seconds, and a small number of scenarios support 5 seconds.
  3. Stronger self-monitoring: there is currently a separate self-monitoring link, integrated with the O&M platform, that provides basic monitoring and alerting for the monitoring system itself; we hope to cover all self-monitoring scenarios and O&M SOPs in the future.
  4. Iteration of the metrics platform: the platform currently provides basic capabilities such as monitoring access, collection configuration management, and monitoring object queries; we plan to add capabilities such as write/query blocking and whitelisting, and hope to use large language models, enhanced with metric metadata, to achieve text2promql (translating natural language to PromQL and automatically generating PromQL annotations).

Author: Bao Senle, chopped pepper

Source-WeChat public account: Bilibili Technology

Source: https://mp.weixin.qq.com/s/gTB_hEXJQ2gz_oP7VN3-dg