Preface

Evaluating the quality of results has always been a headache in software testing, because in many cases that quality is subjective. How to evaluate a product objectively, through sampling, comparison, scoring, user data, and similar means, is a problem worth thinking about for test engineers.

With the continuous growth of Zhihu's DAU, Zhihu search has become an important entry point for traffic distribution. The quality bar for Zhihu search keeps rising, so evaluating its effectiveness has become imperative.

This article mainly introduces the evaluation methods and processes commonly used in the industry, the methods and tools used in Zhihu search evaluation, the problems encountered, and the construction of the evaluation platform. I hope it offers some guidance to those who are getting started with evaluation, especially search evaluation.

The basic evaluation process

The basic evaluation process consists of choosing an evaluation method, evaluation sampling, evaluation scraping, manual annotation and quality inspection, statistical analysis and reporting, and weakness review. Each is introduced below.

Evaluation Methodology:

Ppage(per-page)

This method is often used in relatively formal, comprehensive competitive evaluations. The page is evaluated as a whole, and the main purpose is to assess its overall quality. Scoring can be divided into overall impression, relevance, ranking, page quality, and so on.

PI(per-item)

This method applies when a list of results is returned and each item has to be scored individually. It is common in search evaluation and is the method used in Zhihu search evaluation. Items can be given different weights through DCG scoring, and the weighted scores are combined into a single score for an overall evaluation.
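A minimal sketch of how per-item scores can be combined into a DCG score. The article does not say which DCG variant or score scale Zhihu uses, so the `1 / log2(rank + 1)` discount below is the common textbook form and is an assumption, as are the function names.

```python
import math
from typing import Sequence


def dcg(scores: Sequence[float]) -> float:
    """Discounted cumulative gain.

    `scores` are the manual per-item scores in ranked order.
    Assumption: the item at 1-based rank i is weighted by 1 / log2(i + 1).
    """
    return sum(s / math.log2(i + 1) for i, s in enumerate(scores, start=1))


def ndcg(scores: Sequence[float]) -> float:
    """DCG divided by the DCG of the same scores in ideal (descending) order."""
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0
```

Because of the discount, a good result at rank 1 contributes more than the same result at rank 5, which is what lets a single number reflect both relevance and ranking.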

SBS(side by side)

As the name suggests, the evaluation is comparative: comparison with competing products, comparison between versions, and so on. Indicators are defined first, and then scoring or direct comparison is used to judge which of the two sides is better.
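One simple way to summarize a side-by-side evaluation is a win/tie/loss tally over queries. This is only an illustrative sketch: the function name and the assumption that each side already has one numeric score per query are not from the article.

```python
from typing import Dict, Iterable, Tuple


def sbs_summary(pairs: Iterable[Tuple[float, float]]) -> Dict[str, int]:
    """Tally a side-by-side comparison.

    Each pair is (score_a, score_b) for one query, where a and b are the
    two sides being compared (two products or two versions).
    """
    tally = {"a_wins": 0, "ties": 0, "b_wins": 0}
    for score_a, score_b in pairs:
        if score_a > score_b:
            tally["a_wins"] += 1
        elif score_a < score_b:
            tally["b_wins"] += 1
        else:
            tally["ties"] += 1
    return tally
```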

Grid (gongge) evaluation

This method is suitable for comparing manual judgments against the program's actual output. It is mostly used in special-purpose evaluations that assess a single function of the system. Reference: confusion matrix.
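For a yes/no function, comparing manual judgment with program output amounts to filling in a 2x2 confusion matrix. The sketch below assumes boolean labels and treats the manual judgment as ground truth; the function names are illustrative.

```python
from typing import Dict, Sequence


def confusion_counts(human: Sequence[bool], program: Sequence[bool]) -> Dict[str, int]:
    """Count the four cells of a 2x2 confusion matrix.

    `human` is the manual judgment (treated as ground truth) and `program`
    is what the system actually did, one entry per sample.
    """
    if len(human) != len(program):
        raise ValueError("human and program must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for h, p in zip(human, program):
        if h and p:
            counts["tp"] += 1
        elif p:
            counts["fp"] += 1
        elif h:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def precision_recall(counts: Dict[str, int]) -> Dict[str, float]:
    """Precision and recall from the counts; 0.0 when a denominator is zero."""
    predicted = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    return {
        "precision": counts["tp"] / predicted if predicted else 0.0,
        "recall": counts["tp"] / actual if actual else 0.0,
    }
```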

Evaluation sampling

Random sampling

Advantages: Fully reflects the real distribution of user needs (such as time, region, etc.)
Disadvantages: Some long-tail and localized problems are not well exposed

Stratified sampling

Queries are divided by query volume into popular, mid-range, and long-tail layers, and each layer is sampled separately according to a set proportion

Advantages: Allows targeted analysis of individual layers
Disadvantages: The real proportions of user needs cannot be fully restored
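A minimal sketch of stratified sampling by query volume. The cutoffs that separate popular, mid-range, and long-tail queries, the per-layer quotas, and all names below are hypothetical; the article does not give the actual thresholds or proportions.

```python
import random
from typing import Dict, List, Optional


def stratified_sample(
    query_counts: Dict[str, int],
    quotas: Dict[str, int],
    popular_cutoff: int,
    tail_cutoff: int,
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Split queries into layers by volume, then sample each layer separately.

    query_counts maps query text to its query volume.
    quotas maps layer name ("popular", "mid", "long_tail") to sample size.
    Assumption: volume >= popular_cutoff is popular, volume <= tail_cutoff
    is long-tail, and everything in between is mid-range.
    """
    rng = random.Random(seed)
    layers: Dict[str, List[str]] = {"popular": [], "mid": [], "long_tail": []}
    for query, volume in query_counts.items():
        if volume >= popular_cutoff:
            layers["popular"].append(query)
        elif volume <= tail_cutoff:
            layers["long_tail"].append(query)
        else:
            layers["mid"].append(query)
    # Sort each layer so a fixed seed gives a reproducible sample.
    return {
        name: rng.sample(sorted(queries), min(quotas.get(name, 0), len(queries)))
        for name, queries in layers.items()
    }
```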

Deduplication sampling

Deduplicating user queries before sampling covers long-tail queries better, but the proportions in the sample then differ considerably from the real distribution of user queries
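The difference between random sampling and deduplication sampling can be sketched as follows, assuming the query log is a plain list with one entry per search request (the function names are illustrative).

```python
import random
from typing import List, Optional


def random_sample(query_log: List[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Sample from the raw log: frequent queries are picked proportionally more often."""
    rng = random.Random(seed)
    return rng.sample(query_log, min(k, len(query_log)))


def dedup_sample(query_log: List[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Sample from the distinct queries: every query has the same chance of selection."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(query_log))  # deduplicate, keeping first-seen order
    return rng.sample(unique, min(k, len(unique)))
```

With a log in which one query accounts for most requests, `random_sample` mostly returns that query, while `dedup_sample` gives the rare queries equal weight. That is why it covers the long tail better but no longer mirrors real traffic.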

Vertical sampling

Sampling locally by certain characteristics (e.g., category, composition, length, DIFF) covers and exposes localized types of problems better, but it cannot describe the overall picture

Evaluation scraping

Python script scraping

In daily evaluation, Python scripts are mainly used to fetch search results from Zhihu and from competing engines and to compare them. All scraping goes through the corresponding APIs.

Daily automated evaluation of recall results

The main strategy is to randomly sample 1000 long-tail search queries every day and send each one to three search engines: Zhihu, Competitor 1, and Competitor 2. The results that the competitors' top 10 lists have in common are recorded, and if more than 3 of them do not appear in Zhihu's top 20 results, the query is recorded as a badcase.
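The badcase rule can be sketched as below. This reads "identical results" as the results present in both competitors' top 10, and it assumes each result is represented by an identifier that is comparable across engines (for example a normalized URL); the article specifies neither detail, and the function and parameter names are hypothetical.

```python
from typing import Sequence


def is_badcase(
    zhihu_results: Sequence[str],
    competitor1_results: Sequence[str],
    competitor2_results: Sequence[str],
    competitor_top_n: int = 10,
    zhihu_top_n: int = 20,
    max_missing: int = 3,
) -> bool:
    """Return True if the query should be recorded as a badcase.

    Each argument is one engine's ranked list of result identifiers for
    the same query.
    """
    top1 = competitor1_results[:competitor_top_n]
    top2 = set(competitor2_results[:competitor_top_n])
    # Results that both competitors return in their top N.
    shared = [r for r in top1 if r in top2]
    zhihu_top = set(zhihu_results[:zhihu_top_n])
    # Shared competitor results that Zhihu fails to return in its top N.
    missing = [r for r in shared if r not in zhihu_top]
    return len(missing) > max_missing
```

A daily job would call this once per sampled query and store the queries for which it returns True.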

The results of the specific evaluation are as follows:

Badcase

Details of the number of queries with the problem above

Manual annotation and quality inspection

For search evaluation, we currently use the PI and SBS methods: results are manually compared and scored one by one, the per-result scores are combined into a DCG score for each query, and finally a test report is generated.

Evaluation platform related:

Statistical analysis and reporting

Weakness review

Compare strengths and weaknesses against competitors
Problem categorization
Requirement generation
Weakness review and scheduling
Badcase regression validation after optimization

Example:

Requirement generation:

Badcase regression validation:

Summary & Outlook

In general, evaluation suits testing where the returned results are not deterministic and the focus is on the quality of the effect. In search especially, how to quantify search quality is worth careful thought.

Overall, Zhihu's search evaluation is still in its infancy. Evaluations currently run about twice a quarter, and we are still exploring how to optimize the scoring strategy. We already have our own evaluation platform, which can provide stable evaluation results, badcases, and optimization suggestions.

Going forward, we hope that repeated practice will keep improving the direction and methods of evaluation and the soundness of the scoring. On the platform side, we hope to take on more evaluation needs and expand the platform's coverage beyond search.

Author: Little Ball Ball

Source: https://zhuanlan.zhihu.com/p/65835152

Zhihu search evaluation practice