
The preliminary practice of data analysis in Zhihu's commercial quality assurance

Author: Flash Gene

Background

The Zhihu client ships a new release every week, and the front-end and back-end projects carry many requirements and iterate quickly. Although the client's modules are componentized, some coupling is unavoidable, so changes made by other teams can break commercial features. An online fault in the client can only be fixed by shipping a new version, new versions reach users slowly, and a fault in a commercial feature directly causes a large financial loss. How can problems be found as early as possible and online failures reduced when time is tight and the code is complex? Testing and verifying with core data is the approach Zhihu's commercial testing team is currently trying: by sorting out the core data, the team adds a data check to the project process as the last link of quality assurance.

How to build a core data model

Building a core data model requires a deep understanding of the business and a focus on the product's core indicators. The advertising data flow on the client consists of the following steps: the client requests an ad, the engine issues the ad, the client loads the ad, the user scrolls until the ad becomes visible, the user clicks the ad, and the user converts on the ad's landing page. Together these steps form the funnel model shown in Figure 1. The absolute values of these indicators change with the number of users the client covers, so they are of little use for judging client quality. Dividing each pair of adjacent numbers gives the fill rate, loading rate, viewability rate, click-through rate, and conversion rate. These ratios are not affected by changes in user numbers, and they are the main business indicators we use to judge the quality of the client.


Figure 1 Advertising funnel model
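The adjacent-stage division described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual implementation: the stage names, the rate names as dictionary keys, and the counts are all assumptions made up for the example.

```python
from typing import Dict

# Funnel stages in order. These names are illustrative, not Zhihu's real event names.
FUNNEL_STAGES = ["requested", "issued", "loaded", "visible", "clicked", "converted"]
RATE_NAMES = [
    "fill_rate",
    "loading_rate",
    "viewability_rate",
    "click_through_rate",
    "conversion_rate",
]


def funnel_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Divide each stage count by the count of the stage directly before it."""
    rates = {}
    for name, upstream, downstream in zip(RATE_NAMES, FUNNEL_STAGES, FUNNEL_STAGES[1:]):
        denominator = counts[upstream]
        # An empty upstream stage yields NaN instead of raising ZeroDivisionError.
        rates[name] = counts[downstream] / denominator if denominator else float("nan")
    return rates


# Made-up counts for one client version on one day.
example_counts = {
    "requested": 10000,
    "issued": 8000,
    "loaded": 7600,
    "visible": 5700,
    "clicked": 285,
    "converted": 57,
}
rates = funnel_rates(example_counts)
```

Because each rate is a ratio of two counts from the same population, doubling every count in `example_counts` leaves `rates` unchanged, which is the property that makes these ratios usable for comparing versions with different user coverage.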

In addition to the business metrics above, we also track performance metrics on the client, such as the crash rate, image loading time, the white-screen duration of splash ads, and the loading speed of landing pages.

The core metrics of the server fall into three groups: resource usage, such as CPU, memory, threads, network, and disk I/O; service performance, such as throughput, average response time, 99th-percentile response time, failure rate, and timeout rate; and business data, such as consumption, click-through rate, filtering, and delivery.

How the data is used

Data can assist testing at every stage of the project process, and data checking should be a mandatory part of that process.

Requirements review stage

During requirements review, check whether every step of the funnel model has event tracking designed for it, whether each tracking event fires at a reasonable time, and whether each event's definition is unique. For example, an app-download ad has two forms: one-click download from the ad card, and download by clicking after entering the landing page. The tracking events for the two paths need to be defined so that they can be told apart.

Grayscale phase of the client

The client's runtime environment is complex, and functional testing covers only a limited set of scenarios. Comparing the core data of the grayscale version with that of the online version helps determine whether the client is working normally. Figure 2 shows the grayscale report for a client version. In addition, newly opened ad slots are trial-launched during the grayscale period, which quickly verifies whether click-through rate and other data meet expectations. It is worth stressing that obtaining and displaying the data must be intuitive and easy: if the queries are complex and each comparison requires several rounds of manual processing, data testing is hard to put into practice.


Figure 2 Grayscale report of the client version
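A comparison like the one in the grayscale report can be automated with a simple threshold check. The sketch below is only an assumption about how such a check might look: the function name, the 5% relative threshold, and the sample rates are hypothetical, and a real report would also need to account for sample size.

```python
from typing import Dict, List, Tuple


def compare_versions(
    online: Dict[str, float],
    grayscale: Dict[str, float],
    max_relative_change: float = 0.05,  # hypothetical threshold
) -> List[Tuple[str, float]]:
    """Return (metric, relative change) pairs whose change exceeds the threshold."""
    flagged = []
    for metric, baseline in online.items():
        if metric not in grayscale or baseline == 0:
            continue  # nothing to compare against
        change = (grayscale[metric] - baseline) / baseline
        if abs(change) > max_relative_change:
            flagged.append((metric, change))
    return flagged


# Made-up rates for the online version and the grayscale version.
online_rates = {
    "loading_rate": 0.95,
    "click_through_rate": 0.050,
    "download_success_rate": 0.90,
}
grayscale_rates = {
    "loading_rate": 0.94,
    "click_through_rate": 0.051,
    "download_success_rate": 0.72,
}
flagged = compare_versions(online_rates, grayscale_rates)
```

With these sample numbers only `download_success_rate` is flagged, since it falls by about 20% relative to the online version while the other two metrics move by about 1% and 2%.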

Server rollout stage

A separate launch checklist can be drawn up for each backend service, covering, for example, service resource metrics, performance metrics, and business metrics. During the rollout, monitor these metrics in real time to determine whether the rollout is succeeding.
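One possible shape for such a checklist is a list of metric limits evaluated against the values observed during the rollout. The sketch below is illustrative only: the `Check` class, the metric names, and every limit are assumptions, and the observed values would in practice come from a monitoring system rather than a literal dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Check:
    metric: str
    limit: float
    higher_is_worse: bool = True  # e.g. timeout rate; set False for click-through rate


def failed_checks(checklist: List[Check], observed: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or breach their limit."""
    failures = []
    for check in checklist:
        value = observed.get(check.metric)
        if value is None:
            failures.append(check.metric)  # a missing metric is treated as a failure
        elif check.higher_is_worse and value > check.limit:
            failures.append(check.metric)
        elif not check.higher_is_worse and value < check.limit:
            failures.append(check.metric)
    return failures


# Hypothetical checklist mixing resource, performance, and business metrics.
checklist = [
    Check("cpu_usage", 0.80),
    Check("p99_response_ms", 200.0),
    Check("timeout_rate", 0.01),
    Check("click_through_rate", 0.04, higher_is_worse=False),
]
observed = {
    "cpu_usage": 0.55,
    "p99_response_ms": 240.0,
    "timeout_rate": 0.002,
    "click_through_rate": 0.05,
}
failures = failed_checks(checklist, observed)
```

Here `failures` contains only `p99_response_ms`. A non-empty result during the small-traffic stage would be the signal to pause or roll back.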

After the project goes live

After the project is launched, analyze promptly whether the data meets expectations. For a feature released as an A/B experiment, analyze the data to decide whether to roll it out to full traffic. For example, for a requirement to speed up the landing page, verify after the release whether the ad's landing page arrival rate has improved and whether the improvement has reached the target.

Regular data inspections

Regular data inspections can catch online indicators that deteriorate slowly, so that they can be corrected in time. Examples include server memory usage that keeps rising, a timeout rate that keeps climbing, and a client crash rate that grows as a new version's coverage expands.
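A slow drift is easy to miss when looking at one day at a time. One simple way to surface it, offered here as an assumption rather than the team's method, is to fit a least-squares line through recent daily values and look at the slope. The memory figures below are made up.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their index (change per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


# Made-up daily peak memory usage of one service, in GB.
daily_memory_gb = [4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6]
slope = trend_slope(daily_memory_gb)
```

For this series the slope is about 0.1 GB per day. An inspection job could alert when the slope stays above a chosen limit for several consecutive runs; the limit itself would have to be tuned per metric.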

How to analyze the data

The model for data analysis is shown in Figure 3. For ease of description, it is divided into four levels.

  • Level 1 is the result: every change is ultimately reflected in a change in consumption.
  • Level 2 is the metrics, one for each layer of the funnel model.
  • Level 3 is the dimensions: each Level 2 metric can be grouped and aggregated by any Level 3 dimension or combination of dimensions.
  • Level 4 is the variables: the factors that cause the data to change, such as changes in traffic, service launches, client releases, account changes by large customers, and policy adjustments such as frequency control.

Figure 3 Data analysis model

A change in Level 1 is affected by all of the Level 2 metrics and is hard to analyze directly, so the usual first step is to find the metric whose change is most obvious. Once the metric is identified, aggregate it by the Level 3 dimensions to find the dimension in which the change appears. Once the dimension is identified, the variable is much easier to determine. For example, if the loading rate aggregated by hour changes abruptly, the cause is usually a backend service launch. If it changes gradually and the trend matches the speed at which a new version is spreading, the cause is the client version. Here are some examples:

Case 1: During the grayscale release of a client version, the download success rate, one of the conversion metrics, drops significantly.

Analysis Method:

1. Compare the data of the grayscale version and the online version: only the grayscale version changes. This fixes the Level 4 variable: the problem lies in the new version.

2. Aggregate by the Level 3 dimensions: the metrics have decreased for all ads.

3. Aggregate by network environment: the download success rate has dropped significantly on 4G. At this point, check the client's 4G download logic.
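The drill-down in Case 1 amounts to grouping the same raw rows by one dimension at a time and comparing the two versions within each group. The sketch below shows this for the network dimension only. The row layout, field names, and numbers are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]


def success_rate_by(rows: Iterable[Row], dimension: str) -> Dict[Tuple[str, str], float]:
    """Download success rate keyed by (dimension value, version)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]
    for row in rows:
        key = (row[dimension], row["version"])
        totals[key][0] += row["successes"]
        totals[key][1] += row["attempts"]
    return {key: s / a for key, (s, a) in totals.items() if a}


def largest_drop(rates: Dict[Tuple[str, str], float]) -> str:
    """Dimension value where the grayscale rate falls furthest below the online rate."""
    drops = {
        value: rate - rates[(value, "grayscale")]
        for (value, version), rate in rates.items()
        if version == "online" and (value, "grayscale") in rates
    }
    return max(drops, key=drops.get)


# Made-up download attempts, already counted per network and version.
rows = [
    {"network": "wifi", "version": "online", "attempts": 1000, "successes": 900},
    {"network": "wifi", "version": "grayscale", "attempts": 1000, "successes": 890},
    {"network": "4g", "version": "online", "attempts": 1000, "successes": 880},
    {"network": "4g", "version": "grayscale", "attempts": 1000, "successes": 600},
]
rates_by_network = success_rate_by(rows, "network")
suspect = largest_drop(rates_by_network)
```

With this sample data `suspect` is `"4g"`. Passing a different field name to `success_rate_by` repeats the same comparison along another dimension, provided the rows carry that field.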

Case 2: During the grayscale release of a client version, the loading rate of a third-party channel skyrockets.

Analysis Method:

1. Compare the grayscale version and the online version: both skyrocket on January 2.

2. Analyze the January 1 data for the grayscale version and the online version: the data is normal, so the problem is most likely unrelated to the client.

3. Aggregate the January 1 data by hour: the data starts to plummet at 19:00, so check the server launch records around 19:00.

4. Continue aggregating the upstream metrics of the funnel model by dimension to narrow down which services are involved.
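The hourly step in Case 2 can be sketched as two small helpers: one finds the first hour in which a series jumps, and the other lists the launches recorded near that time. Everything concrete here is hypothetical: the 30% step threshold, the one-hour window, the service names, the year, and the hourly values.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Sequence, Tuple


def first_abrupt_hour(hourly: Sequence[float], max_relative_step: float = 0.3) -> Optional[int]:
    """Index of the first hour that differs from the previous hour by more than the threshold."""
    for hour in range(1, len(hourly)):
        previous = hourly[hour - 1]
        if previous and abs(hourly[hour] - previous) / previous > max_relative_step:
            return hour
    return None


def releases_near(
    releases: List[Tuple[str, datetime]],
    moment: datetime,
    window: timedelta = timedelta(hours=1),
) -> List[str]:
    """Names of releases whose timestamp lies within the window around the moment."""
    return [name for name, at in releases if abs(at - moment) <= window]


# Made-up hourly values for one day: stable until 18:00, then a sharp change at 19:00.
hourly_values = [0.95] * 19 + [0.50] * 5
change_hour = first_abrupt_hour(hourly_values)

# Hypothetical backend launch records for the same day.
launch_records = [
    ("service-a", datetime(2019, 1, 1, 18, 50)),
    ("service-b", datetime(2019, 1, 1, 9, 0)),
]
candidates = releases_near(launch_records, datetime(2019, 1, 1, change_hour))
```

In this example `change_hour` is 19 and `candidates` contains only `service-a`, which would be the first launch to examine.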

Building a data report for each Level 4 variable speeds up troubleshooting. For example:

  • Consumption comparison table for key customers. Comparing yesterday's top-N advertisers by consumption with today's quickly locates changes in consumption.
  • Advertiser account operation logs. Changes to an ad's targeting conditions or bid affect how the ad is issued, which in turn changes consumption.
  • Backend launch records, experiment traffic records, and client release records. When the data changes abruptly, these make it easy to find the project that went live.
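As an illustration of the first report, the sketch below compares the top-N advertisers by consumption on two days and sorts them by how much their consumption moved. The advertiser names, amounts, and the choice of N = 3 are invented for the example.

```python
from typing import Dict, List, Tuple


def top_n(spend: Dict[str, float], n: int) -> List[str]:
    """Advertisers with the highest consumption, largest first."""
    return sorted(spend, key=spend.get, reverse=True)[:n]


def consumption_changes(
    yesterday: Dict[str, float],
    today: Dict[str, float],
    n: int = 3,
) -> List[Tuple[str, float]]:
    """Change (today minus yesterday) for advertisers in either day's top N, biggest move first."""
    advertisers = set(top_n(yesterday, n)) | set(top_n(today, n))
    changes = [(a, today.get(a, 0.0) - yesterday.get(a, 0.0)) for a in advertisers]
    return sorted(changes, key=lambda item: abs(item[1]), reverse=True)


# Made-up daily consumption per advertiser.
yesterday_spend = {"A": 1000.0, "B": 800.0, "C": 500.0, "D": 100.0}
today_spend = {"A": 400.0, "B": 820.0, "C": 510.0, "D": 90.0, "E": 550.0}
changes = consumption_changes(yesterday_spend, today_spend)
```

Taking the union of both days' top N matters: advertiser `A` drops out of today's top three and `E` appears only today, and both would be missed by looking at a single day's ranking.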

The effect of the data test

During client grayscale releases, comparing the core data across versions has uncovered a number of data anomalies, and data analysis located the problems quickly, safeguarding release quality. With data inspection added to the server launch process, problems found in the small-traffic stage were rolled back in time, avoiding multiple online failures.

Author: Zhongli Sugar-free

Source: https://zhuanlan.zhihu.com/p/57013245
