
Governance efficiency increased by 77%! Demystifying the best practices based on DataLeap real-time health scores

Author: ByteDance Data Platform

Background

An enterprise's real-time data warehouse team builds its real-time data warehouse through data collection, integration, computation, and storage, giving the business fast, accurate, and reliable real-time data analysis and decision support. The team currently runs tens of thousands of real-time tasks that depend on a variety of components (such as Flink, Yarn, Abase, and Doris). With many developers whose development habits and experience vary widely, problems with task stability and resource waste are frequent. Task governance is therefore imperative, but the governance process still faces the following contradictions:

1. The contradiction between business stages and data governance

A business can be roughly divided into two stages: the development stage and the maturity stage. Development stage: the product iterates constantly, requirements keep growing, and the number of real-time tasks keeps increasing. This is also the stage of building trust with the business, so the quality of real-time tasks comes first and quality assurance is prioritized over cost control. Maturity stage: the real-time team must not only ensure the quality of the data warehouse but also pay attention to the reasonable allocation and utilization of resources and their cost.

2. The contradiction between labor cost and data governance

Governing real-time tasks has always been expensive because of their technical complexity and the fact that they run online, and engineers are often torn between data governance and business needs. Since real-time task governance inevitably takes energy away from business support, improving governance efficiency, reducing governance cost, and freeing up people's time is a particular concern.

3. The contradiction between governance problems and measurability

In general, problematic real-time tasks can be identified by screening rules and then governed in a centralized campaign. This solves the problem for one phase, but it cannot quantify how healthy a task is or how urgently it needs governance, so governance is not sustainable. An evaluation system is therefore needed to assess the health of the data warehouse and to keep driving governance through the resulting scores.

Enter DataLeap Real-Time Health Score

DataLeap Real-time Health Score is a one-stop real-time data governance solution that integrates governance evaluation, goal setting, governance promotion, governance efficiency improvement, and effect quantification. It supports precise governance, reduces governance costs, and ensures the overall standardization and stability of data.

The real-time health score solution can be roughly divided into four modules: meta-warehouse construction, governance item accumulation, score calculation, and platform governance.

1. Meta-warehouse construction

The health score meta-warehouse holds the metadata related to tasks. It is the underlying data that health score processing depends on, and it covers the stability, quality, standardization, cost, and SLA of tasks.

Metadata type | Description
Stability metadata | GC, failover, checkpoint (cp), state, backpressure, data skew, etc.
Quality metadata | Timeliness, accuracy, indicator monitoring coverage, etc.
Standardization metadata | Task configuration, component configuration, alarm configuration, etc.
Cost metadata | Queue resources, computing resources, storage resources, etc.
SLA metadata | Component SLAs, data SLAs, task SLAs, etc.

2. Governance item accumulation

Governance rules are a set of general rules accumulated by the Flink team from the engine perspective and by the real-time data warehouse teams of each BP from the business perspective. They can be reused quickly to uncover cost waste and quality risks in real-time tasks. More and more teams are joining real-time governance, contributing more governance experience and distilling more general rules, which in turn attracts more teams and forms a positive cycle. At present, there are 14 quality rules and 2 cost rules.
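
One way to picture a governance rule is as a named predicate over task metadata. The sketch below is only an illustration under stated assumptions: the `Task` fields, the `GovernanceRule` class, the `scan` helper, and the 50% waste threshold are all hypothetical and are not taken from DataLeap.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """Hypothetical slice of task metadata from the meta-warehouse."""
    name: str
    has_alarm: bool        # whether any alarm is configured
    cpu_allocated: float   # cores requested by the task
    cpu_peak_used: float   # peak cores actually used


@dataclass(frozen=True)
class GovernanceRule:
    name: str
    category: str                  # "quality" or "cost"
    hits: Callable[[Task], bool]   # True when the task violates the rule


# Two illustrative rules; the 50% waste threshold is an assumption.
RULES = [
    GovernanceRule("alarm not configured", "quality",
                   lambda t: not t.has_alarm),
    GovernanceRule("CPU resource waste", "cost",
                   lambda t: t.cpu_peak_used < 0.5 * t.cpu_allocated),
]


def scan(tasks: List[Task], rules: List[GovernanceRule]) -> Dict[str, List[str]]:
    """Return, for each rule, the names of the tasks that hit it."""
    return {rule.name: [t.name for t in tasks if rule.hits(t)] for rule in rules}


tasks = [
    Task("orders_join", has_alarm=False, cpu_allocated=10, cpu_peak_used=2),
    Task("clicks_agg", has_alarm=True, cpu_allocated=4, cpu_peak_used=3.5),
]
print(scan(tasks, RULES))
```

Keeping each rule as data plus a predicate means a new team could contribute a rule without touching the scanning logic, which is one plausible way to get the quick reuse described above.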


3. Score Calculation

Explanation of terms:

Weight of governance items: each governance item has a weight that reflects the importance of its rule, for example "CPU resource waste" = 40 and "queue configuration is not standardized" = 15.

Task level coefficient: each task level has its own coefficient; the higher the level, the greater its impact on the score, for example D1/D2 = 10, D3 = 5, D4 = 3, D5 = 1.

At present, the real-time health score covers two evaluation systems, the quality score and the cost score, and the overall result is the mean of the two. Each evaluation system uses a point-deduction algorithm, whose scoring logic is simple and easy to interpret, and scores can be computed at any granularity from tasks and individuals up to departments and the company.

  • Quality score calculation

Formula (points deducted for one governance item):

$$\text{deduction} = \frac{\sum(\text{task level coefficient of the tasks that hit the governance item})}{\sum(\text{task level coefficient of all tasks})} \times \text{weight of the governance item}$$

For example:

There are 1,000 tasks in total, and ∑ (task level coefficient of all tasks) = 2500.

Among them, 100 tasks hit the "alarm not configured" governance item, and ∑ (task level coefficient of the tasks that hit the governance item) = 500.

The "alarm not configured" governance item (weight: 15) therefore deducts 500 / 2500 * 15 = 3 points.

Quality score = 100 - 3 = 97 points
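
The quality deduction above can be reproduced in a few lines of Python. This is a minimal sketch, not the platform's implementation: the function names are hypothetical, and the level mix (100 D3 tasks hit, plus 550 D4 and 350 D5 tasks) is an assumption chosen only so that the coefficient sums match the 500 and 2500 in the example.

```python
from typing import Iterable

# Task level coefficients quoted in the article.
LEVEL_COEFFICIENT = {"D1": 10, "D2": 10, "D3": 5, "D4": 3, "D5": 1}


def coefficient_sum(levels: Iterable[str]) -> int:
    """Sum the level coefficients of a collection of tasks."""
    return sum(LEVEL_COEFFICIENT[level] for level in levels)


def quality_deduction(hit_levels, all_levels, weight: float) -> float:
    """Points deducted for one governance item (point-deduction system)."""
    return coefficient_sum(hit_levels) * weight / coefficient_sum(all_levels)


# Assumed level mix: 100 hit tasks at D3, 550 other tasks at D4, 350 at D5.
hit_levels = ["D3"] * 100
all_levels = hit_levels + ["D4"] * 550 + ["D5"] * 350

deduction = quality_deduction(hit_levels, all_levels, weight=15)
quality_score = 100 - deduction
print(deduction, quality_score)  # 3.0 97.0
```

Because the ratio is weighted by level coefficient rather than by task count, a single unhealthy D1 task costs as many points as ten unhealthy D5 tasks.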

  • Cost score calculation

Formula (points deducted for one governance item):

$$\text{deduction} = \frac{\sum(\text{CPU allocation of the tasks that hit the governance item})}{\sum(\text{CPU allocation of all tasks})} \times \text{weight of the governance item}$$

For example:

There are 1,000 tasks in total, and ∑ (CPU allocation of all tasks) = 25,000.

Among them, 100 tasks hit the "CPU resource waste" governance item, and ∑ (CPU allocation of the tasks that hit the governance item) = 10,000.

The "CPU resource waste" governance item (weight: 40) therefore deducts 10,000 / 25,000 * 40 = 16 points.

Cost score = 100 - 16 = 84 points
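
The cost deduction follows the same pattern, weighted by CPU allocation instead of task level. The sketch below also combines the result with the quality score of 97 from the earlier example, using the rule that the overall health score is the mean of the two. The function names are hypothetical; only the numbers come from the article.

```python
def cost_deduction(hit_cpu: float, total_cpu: float, weight: float) -> float:
    """Points deducted for one cost governance item, weighted by CPU allocation."""
    if total_cpu <= 0:
        raise ValueError("total_cpu must be positive")
    return hit_cpu * weight / total_cpu


def health_score(quality_score: float, cost_score: float) -> float:
    """Overall health score: the mean of the quality score and the cost score."""
    return (quality_score + cost_score) / 2


# 10,000 of 25,000 allocated CPUs belong to tasks that hit "CPU resource waste".
cost_score = 100 - cost_deduction(10_000, 25_000, weight=40)
overall = health_score(97, cost_score)
print(cost_score, overall)  # 84.0 90.5
```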

4. Platform Governance

The real-time health score platform provides efficient governance capabilities through three modules: governance panorama, governance workbench, and governance assistance.

  • Governance panorama: Provides dashboards such as health score trends, cost item governance trends, and the distribution of issues to be governed, so that health trends can be observed.
  • Governance workbench: Provides tools such as governance item details, recommended parameters, one-click governance, and post-governance monitoring to improve governance efficiency.
  • Governance assistance: Provides governance broadcast cards and custom scenario governance tools to cover more governance scenarios.

Level 1 item | Level 2 item | Description
Governance panorama | Health score | Shows the current health score of each business line or individual.
Governance panorama | Health score trend | Shows the trend of health scores, including the cost score and the quality score.
Governance panorama | Cost item governance trend | Shows the trends of tasks governed, tasks to be governed, CPU saved, and CPU still to be saved for cost items.
Governance panorama | Quality item governance trend | Shows the trends of the number of tasks to be governed and the number of tasks governed.
Governance panorama | Distribution of issues to be governed | Shows the number of issues to be governed and the points deducted under each rule.
Governance workbench | Governance item details | Lists the tasks to be governed, filterable by rule item, task level, task type, task owner, etc.
Governance workbench | Recommended governance parameters | Suggests optimized parameters for the governance items hit by each task.
Governance workbench | One-click batch governance | Applies the recommended governance parameters to multiple tasks in one batch. The time to govern a single task dropped from 15 minutes to 30 seconds.
Governance workbench | Post-governance monitoring dashboard | After governance is completed, pushes the LAG monitoring dashboard of the governed tasks so their operation can be observed.
Governance assistance | Governance broadcast card | Pushes a governance card to each owner every day, reporting the current cost score, quality score, number of cost items to be governed, number of quality items to be governed, and the previous day's governance activity.
Governance assistance | Custom scenario governance | Lets businesses define their own governance items to cover personalized and uncommon governance scenarios.

Real-time governance project


An enterprise data platform needs to reduce cost, increase efficiency, and guarantee stability, yet it suffers from problems such as wasted CPU, unconfigured alarms, non-standard queue usage, and high CPU usage in day-to-day tasks. The real-time data warehouse team and the DataLeap team therefore set up a governance project, with a virtual team and a governance POC mechanism, to break governance goals down from top to bottom, respond quickly to governance blockers, push governance forward, coordinate governance resources, and ultimately ensure the goals are met.

Members of the virtual team keep track of the health scores of each business line and assess the risk of missing the goals. When governance progress is at risk, they promptly discuss the difficulties and obstacles with the business governance POC, and the virtual team develops new tools or new governance plans to help the POC overcome them, so that each business direction reaches its quarterly goals.

1. Real-time cost project

The real-time tasks of the data platform wasted a large amount of resources: 3.8k+ tasks were flagged for resource waste, with 279,000+ CPU cores to be governed. To address this, a real-time cost project was set up and a virtual support team was formed that worked closely with each business to help with resource waste governance. In total, 1.15k resource-wasting tasks were governed, and the CPU resources to be governed dropped from 279,000+ cores to 177,000+ cores.

2. Real-time quality project

At the same time, the quality and stability of real-time tasks on the data platform carried a variety of hidden risks, such as excessive CPU usage, unconfigured alarms, non-standard queue usage, and data skew. To address these, multiple parties jointly set up a real-time quality project, distilled 11 quality rules, helped the enterprise data platform find 3k+ quality issues, and drove quality governance on the platform, completing governance of 1.1k of them.

3. Quarterly governance gains

Indicator definitions:

Improvement rate of one-click governance duration: one-click governance reduces the time to govern a task from 15 minutes to 0.5 minutes, an improvement rate of 96.5%.

One-click governance scenario coverage:

$$\text{coverage} = \frac{\text{number of one-click governance issues}}{\text{number of all governance issues}}$$

Governance efficiency: improvement rate of one-click governance duration * one-click governance scenario coverage
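
The three indicators compose as a simple product, which the following sketch makes explicit. The function names are hypothetical. Note that computing the improvement rate directly from 15 and 0.5 minutes gives roughly 96.7%, slightly above the reported 96.5%, so the final line uses the reported rate and the reported 80% coverage rather than a recomputed value.

```python
def improvement_rate(before_minutes: float, after_minutes: float) -> float:
    """Relative reduction in the time needed to govern one task."""
    return (before_minutes - after_minutes) / before_minutes


def scenario_coverage(one_click_issues: int, all_issues: int) -> float:
    """Share of governance issues that one-click governance can handle."""
    return one_click_issues / all_issues


def governance_efficiency(rate: float, coverage: float) -> float:
    """Governance efficiency = duration improvement rate * scenario coverage."""
    return rate * coverage


# Reported figures: 96.5% improvement rate and 80% scenario coverage.
efficiency = governance_efficiency(0.965, 0.80)
print(f"{efficiency:.0%}")  # 77%
```

Multiplying the two rates reflects that the time saving only applies to the issues one-click governance actually covers; the remaining issues are still governed manually.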

The benefits are as follows:

  • The data platform's Q3 health score rose from 80.57 (the addition of 9 new governance items had lowered the score) to 81.85.
  • 1.11k+ quality item issues were governed (including clearing the "alarm not configured" issues to zero and 700+ "high CPU usage" issues).
  • One-click governance scenario coverage reached 80% and the improvement rate of one-click governance duration reached 96.5%, so governance efficiency increased by 77%.

