
Talk about A/B experiments

Author: Everyone is a Product Manager
Editor's note: A/B experiments are an important engine that most businesses rely on, yet teams run into all kinds of problems when putting them to use. This article walks through A/B experiments, the core engine of data-driven work, covering the design process, the rollout rhythm, and results analysis. Recommended for anyone who wants to understand A/B experiments.

A/B experimentation is the core engine of data-driven decision-making. Most businesses currently rely on it to make decisions, yet they encounter a variety of problems when running experiments in practice.

Let's talk about A/B experiments together.

First, the A/B experiment design process

Let's start with the experiment design process, which revolves around four core questions:

Question 1: What is the randomization (bucketing) unit?

Most randomization is done at the user level. Common user identifiers are the login ID, the device ID, and the anonymous user ID (cookie). Apart from cookies, which are unstable over time, login IDs and device IDs are stable in the long run.

Bucketing method: many A/B experiment platforms support bucketing, mostly by applying a hash function to the device ID or login ID. Some platforms bucket by ID tail number instead; in that case you need to check whether the samples behind each tail number are balanced, for example whether some tail number is used for experiments so often that its sample becomes skewed.
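As a rough illustration of hash-based bucketing, here is a minimal sketch in Python (not any particular platform's implementation; the salt, bucket count, and function names are assumptions):

```python
import hashlib

def assign_bucket(unit_id: str, experiment_salt: str, n_buckets: int = 100) -> int:
    """Deterministically map a randomization unit (e.g. a device ID) to a bucket.

    Salting with the experiment name keeps assignments independent across
    experiments, so the same user can land in different buckets elsewhere.
    """
    key = f"{experiment_salt}:{unit_id}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()      # stable hash of salt + ID
    return int(digest, 16) % n_buckets         # fold the hash into a bucket index

def assign_variant(unit_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    """Split buckets into treatment and control according to the traffic share."""
    n_buckets = 100
    bucket = assign_bucket(unit_id, experiment_salt, n_buckets)
    return "treatment" if bucket < treatment_share * n_buckets else "control"

# The same device ID always gets the same variant within a given experiment.
print(assign_variant("device-12345", experiment_salt="new-checkout-flow-v1"))
```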

Question 2: What is our target group?

Thinking through the target group of the experiment, and the background and goal behind it, is the core problem of experiment design. If the experiment targets a specific group, meaning you only want to run it on users with a certain characteristic, then the trigger conditions become especially important: badly chosen trigger conditions can introduce problems such as survivorship bias and make the experiment results untrustworthy.

Question 3: How large is the sample size required for the experiment?

The required sample size determines whether the experiment has enough statistical power, which directly affects the accuracy of the results. The larger the sample, the higher the power and the more credible the results, but also the more resources consumed; if the sample is too small, the power is insufficient and the results cannot be trusted. The minimum sample size can be calculated with the following formula:

[Figure: minimum sample size formula]
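As a general reference (the exact form depends on the metric; a proportion metric uses p(1−p) in place of σ²), the standard two-sample approximation for the minimum sample size per group is:

$$
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}}
$$

Here δ is the minimum detectable effect, σ² the metric's variance, α the two-sided significance level, and 1−β the desired power. With the conventional α = 0.05 and power 0.8, this reduces to roughly 16σ²/δ² per group.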

Question 4: How long does the experiment take to run?

In an online experiment, users enter over time: the longer it runs, the more users it accumulates, and statistical power usually increases. Because users visit repeatedly, the accumulation of users over time may be sublinear; that is, if N users arrive on day one, the cumulative number of users by day two is usually less than 2N. An experiment that runs for only one day is therefore biased toward high-frequency active users.

Weekend user populations also differ from weekday ones, producing a weekend effect, and the same goes for seasonality.

Some experiments also show a larger or smaller novelty effect in the early stage, which affects the metrics, so it is recommended to run an experiment for at least one week.
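To translate the sample-size requirement into a running time, a rough sketch (the baseline rate, minimum detectable lift, and daily traffic below are made-up numbers, and daily user accumulation is treated as linear for simplicity) might look like this:

```python
import math
from scipy.stats import norm

def min_sample_size_per_group(p_baseline: float, min_detectable_lift: float,
                              alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate minimum users per group for a conversion-rate metric."""
    p1 = p_baseline
    p2 = p_baseline * (1 + min_detectable_lift)     # relative lift
    z_alpha = norm.ppf(1 - alpha / 2)                # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n_per_group = min_sample_size_per_group(p_baseline=0.10, min_detectable_lift=0.03)
daily_eligible_users = 20_000                        # hypothetical daily traffic
days = math.ceil(2 * n_per_group / daily_eligible_users)
print(n_per_group, "users per group, roughly", days, "days of traffic")
# Per the caveats above, run at least a full week even if the raw
# sample-size math says fewer days would be enough.
```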

Second, the A/B experiment rollout rhythm

It is common to control the unknown risks of a new feature release through a staged rollout, balancing speed, quality, and risk.


The goal of the first stage is to reduce risk: set up a small test population, run the experiment on it to probe the risk, watch real-time or near-real-time results to learn as early as possible whether the change is risky, and roll back quickly if there is a problem.

The goal of the second stage is to ensure the quality of the experiment. We recommend keeping it running for at least a week; if there is a primacy or novelty effect, it needs even longer. If the experiment runs for only one day, the results will skew toward heavy users. Empirically, if no primacy or novelty effect is found, the additional benefit of each extra day of running beyond a week gets smaller and smaller.

The goal of the third stage is to make a decision from the experiment: by analyzing its core metrics, decide whether to launch the change to all users or abandon it.

If the experiment reaches statistical significance before its planned end, experience says not to launch to full traffic early. The statistical assumption is usually that the test is performed once, at the end of the experiment; stopping early because the result already looks significant violates that assumption and produces some false successes.
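A quick simulation (a minimal sketch with made-up parameters, not a rigorous treatment) illustrates how checking significance every day and declaring a win at the first "significant" result inflates the false-positive rate even when there is no true effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_days, users_per_day = 2000, 14, 500
false_positive_final, false_positive_peeking = 0, 0

for _ in range(n_sims):
    # No true effect: both groups draw from the same conversion rate.
    a = rng.binomial(1, 0.10, size=n_days * users_per_day)
    b = rng.binomial(1, 0.10, size=n_days * users_per_day)
    stopped_early = False
    for day in range(1, n_days + 1):
        k = day * users_per_day
        _, p = stats.ttest_ind(a[:k], b[:k])
        if p < 0.05:
            stopped_early = True              # a daily peek would declare a "win" here
    if stopped_early:
        false_positive_peeking += 1
    _, p_final = stats.ttest_ind(a, b)
    if p_final < 0.05:
        false_positive_final += 1             # single test at the planned end

print("test once at the end, false positive rate:", false_positive_final / n_sims)
print("peek every day, false positive rate:      ", false_positive_peeking / n_sims)
```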

Third, A/B experiment results analysis

Pitfall 1: The sample size is unbalanced

The first step of experiment analysis is to check whether the treatment and control samples are balanced: with an equal split, treatment-group UV / control-group UV should be close to 1. If the treatment and control sample sizes are unbalanced, something has probably gone wrong in the experiment pipeline, and no other metric should be trusted. Sample imbalance has many possible causes, mainly the following:

  • Browser redirects. A common A/B mechanism is to redirect the treatment group to another page, which often leads to an unbalanced sample ratio. The main reasons are: (a) performance differences: treatment users go through an extra redirect, which may be fast or slow; (b) redirect asymmetry: after landing on the new page, treatment users can favorite, share, go back, and so on, while control users cannot. The control group should therefore also go through a redirect page, so that both groups receive the same handling apart from the change being tested.
  • Residual or carryover effects. New experiments usually involve new code, so the error rate is relatively high. A new experiment may hit an unexpected problem, be aborted, and then be quickly fixed and relaunched. Re-randomizing at relaunch breaks user continuity, moving users from one group to another and affecting sample balance.
  • Bugs in the user randomization process: the ramp-up process may contain bugs that leave the treatment and control groups unbalanced.
  • Bad trigger conditions. Trigger conditions should cover every user who could possibly be affected; triggering on attributes that may themselves be changed by the experiment will also unbalance the samples.

So how do you identify sample imbalance? A few checks, plus a minimal statistical sanity check sketched after the list:

  • Verify that there is no difference upstream of the bucketing point or the trigger point;
  • Verify that the experiment's bucketing configuration is correct;
  • Walk down the data funnel to see whether any step introduces the imbalance;
  • Look at the share of each user segment in the sample;
  • Check intersections with other experiments.
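One simple way to flag such an imbalance, often called a sample ratio mismatch check, is a chi-square goodness-of-fit test of the observed group sizes against the intended split. The counts below are hypothetical:

```python
from scipy.stats import chisquare

# Hypothetical observed user counts and an intended 50/50 split.
observed = [50_812, 49_188]                  # treatment UV, control UV
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Sample ratio mismatch likely (p = {p_value:.2e}); do not trust other metrics yet.")
else:
    print(f"No evidence of sample ratio mismatch (p = {p_value:.3f}).")
```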

Pitfall 2: The analysis unit differs from the randomization unit

When analyzing page clicks, a core metric is CTR (click-through rate). There are two common ways to calculate CTR, and their analysis units are not the same.

The first, CTR1 = total clicks / total exposures.

The second, CTR2, first computes each user's CTR and then averages it over all users.

If the randomization unit is the user, then the first method's analysis unit differs from the bucketing unit, which violates the independence assumption and complicates the variance calculation.

For example, in the table below there are 10,000 users. Assuming every group of 1,000 users has the same exposures and clicks, CTR1 = 7.4% while CTR2 = 30.4%; CTR1 is clearly dominated by outliers.

[Table: per-group exposures and clicks for the 10,000-user example]

Neither definition is right or wrong; both are useful definitions of CTR. But different definitions give different results, and in practice both metrics usually go onto the dashboard.
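The two definitions are easy to compute side by side. The sketch below uses made-up per-segment numbers (not the ones in the table above), just to show how a minority of heavy users pulls the two CTRs apart:

```python
# Hypothetical data: 9,000 light users and 1,000 heavy users.
light = {"users": 9_000, "exposures_per_user": 2, "clicks_per_user": 1}
heavy = {"users": 1_000, "exposures_per_user": 200, "clicks_per_user": 2}

total_clicks = sum(s["users"] * s["clicks_per_user"] for s in (light, heavy))
total_exposures = sum(s["users"] * s["exposures_per_user"] for s in (light, heavy))
ctr1 = total_clicks / total_exposures            # ratio of totals (exposure-level)

per_user_ctrs = (
    [light["clicks_per_user"] / light["exposures_per_user"]] * light["users"]
    + [heavy["clicks_per_user"] / heavy["exposures_per_user"]] * heavy["users"]
)
ctr2 = sum(per_user_ctrs) / len(per_user_ctrs)   # average of per-user CTRs (user-level)

print(f"CTR1 = {ctr1:.1%}, CTR2 = {ctr2:.1%}")   # heavy users dominate CTR1 but not CTR2
```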

Pitfall 3: The dilution effect

When you calculate the experiment effect on the triggered population, the effect must be diluted back to the entire user base. If revenue increases by 3% on 10% of users, does overall revenue increase by 10% × 3% = 0.3%? Generally not: the overall effect could be anywhere between 0% and 3%.

For example, if the change targets the 10% of users on the site who spend the least, and these users spend only 10% of what the average user spends, then a 3% revenue increase for this segment lifts overall revenue by 3% × 10% × 10% = 0.03%.
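The dilution arithmetic can be written as a one-line helper; a minimal sketch using the numbers from the example above:

```python
def diluted_lift(segment_lift: float, segment_revenue_share: float) -> float:
    """Overall lift when a change only moves one segment's revenue."""
    return segment_lift * segment_revenue_share

# The segment is 10% of users spending 10% of what the average user spends,
# so its share of total revenue is 10% * 10% = 1%.
segment_revenue_share = 0.10 * 0.10
print(f"{diluted_lift(0.03, segment_revenue_share):.2%}")   # 0.03%
```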


Pitfall 4: Interference between experiment units

We assume that each unit in the experiment is independent and does not affect the others, but in reality users influence one another through social networks, interactions on the same content, and so on.

For example, a social app has a "People you may know" feature, and the treatment group gets a better recommendation strategy that prompts users to send more invitations. But the users who receive these invitations may be in the control group. If the evaluation metric is the total number of invitations, invitations rise in both the treatment and the control group, so the measured difference between them is smaller than the true effect. And since a person can only know a limited number of people, the new algorithm may perform better at first but settle at a lower equilibrium in the long run because of the limited supply of people to recommend.

One approach is to first define the specific behaviors through which the effect spreads, and only when those behaviors are affected examine whether they move downstream metrics; at the same time, analyze the ecosystem value those behaviors bring, and base the final decision on their value and contribution to the ecosystem.

Pitfall 5: Misleading confounding factors

Confounding factors are factors related to both the study factor (the exposure) and the outcome; if they are unevenly distributed between the compared populations, they can distort (obscure or exaggerate) the true link between the exposure and the outcome.


For example, many products observe a phenomenon: users who see more error messages usually churn less!


So, can you reduce user churn by showing users more errors? Of course not. The correlation is driven by a common cause: level of usage. Heavy users see more errors and also churn less.
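A small simulation with entirely made-up numbers shows how usage level, as a common cause, can make "sees more errors" look protective against churn even though errors have no effect within either usage group:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
heavy_user = rng.random(n) < 0.3                       # 30% heavy users

# Heavy users see more errors AND churn less; within each group,
# errors have no causal effect on churn at all.
errors_seen = rng.poisson(np.where(heavy_user, 8, 1))
churned = rng.random(n) < np.where(heavy_user, 0.05, 0.40)

overall = np.corrcoef(errors_seen, churned.astype(float))[0, 1]
within_heavy = np.corrcoef(errors_seen[heavy_user], churned[heavy_user].astype(float))[0, 1]
within_light = np.corrcoef(errors_seen[~heavy_user], churned[~heavy_user].astype(float))[0, 1]

print(f"overall correlation: {overall:+.3f}")          # clearly negative
print(f"within heavy users:  {within_heavy:+.3f}")     # close to zero
print(f"within light users:  {within_light:+.3f}")     # close to zero
```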

Likewise, if you find that a new feature reduces churn, you need to ask whether the feature itself works, or whether highly active users simply churn less and are also more likely to use new features.

For questions like this, split users by activity level and analyze the experiment within each stratum. A/B experiments need to control variables, but when there are too many things to control, the experiment becomes unwieldy. It is worth thinking carefully about exactly which variables to control; sometimes what gets controlled away is precisely what you wanted to measure, and the experiment falls short.

Sometimes the results of an experiment do not give a clear answer, yet a decision still has to be made. In that case we need to identify the factors to consider, especially how those factors affect where we set the boundaries for practical significance and statistical significance. These considerations then provide a basis for decision-making, rather than being confined to the single decision at hand.

This article was originally published by @Li Zhaojie on Everyone is a Product Manager. Reprinting without permission is prohibited.

The title image is from Unsplash, under the CC0 license.
