Note: To find out more about split tests, see Split tests.
Results from Meta advertising always have random variation. You may get 245 conversions in a week, but if you were to run the same campaign again you might get 231 conversions instead, without changing anything.
Therefore, when interpreting the results, it is important to ensure that the differences are so large that they cannot be explained by random variation alone. When we can say this with enough confidence, we say that the difference is statistically significant.
In this article, we discuss how to make sense of the results of your Facebook split tests with the help of statistics.
Power analysis when creating an ad study
The first step happens before the ad study has even started: estimating how much data you are likely to need to reach a statistically significant result. The estimation tool is built into the ad study creation dialog, and you need to supply a few details to get a reliable estimate.
- Metric and Conversion goal define what you want to measure. It is possible that there is a statistically significant difference for CPA but not for Conversion Rate. This happens, for example, if both campaigns have received the same number of clicks and conversions, but one campaign has accomplished this with only half the spend.
- The selection made in this step is used as the default when estimating the statistical significance once the results start coming in.
- The conversion goal selected initially is your primary account reporting goal. To change it, go to the Reporting section.
- The smallest interesting difference is the most important factor in calculating how much data you will need to collect. The smaller the difference you want to detect, the more data you need to collect to distinguish it from random variation. In most cases, values between 10% and 20% give the best compromise between the cost of the ad study and the value gained from its learnings.
- Confidence level defines how certain you want to be that the difference you find is a true difference and not random variation. If there is no difference at all between the ad study cells and the confidence level is set to 95%, there is a 5% probability that a statistically significant difference is detected anyway. A larger value means that the outcome is more likely to be correct, but also that more data must be collected to reach a statistically significant result.
- Because randomness is at the core of statistical testing, it is not possible to predict exactly how much data is needed. Maybe you get lucky and 300 conversions are enough, or you might need 800 instead. Statistical power allows you to explore this uncertainty. The default value of 80% means that by collecting the indicated number of conversions you will get a statistically significant result with 80% probability (assuming the difference is exactly as large as the smallest interesting difference). With 20% probability, you would need to collect more data. Note that statistical power is only used to calculate the estimate and does not affect the actual calculation of statistical significance at the end of the ad study (a rough sketch of this kind of estimate is shown at the end of this section).
- CPA can be filled in to estimate the total cost of the ad study. We pre-fill this field using historical data from your ad account, if there is enough data.
The displayed number of required conversions (or clicks in the case of CTR) is always the total number of conversions across all cells. If you add new cells, the estimate increases accordingly. The estimate is calculated assuming the total budget for the ad study is split proportionally to cell sizes.
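The exact numbers come from the tool's Bayesian calculations, but the way the inputs interact can be illustrated with a rough classical approximation. The sketch below is a simplification under assumed conditions (two equally sized cells, a count-based metric treated as a Poisson rate); the function name and parameters are illustrative, not part of the product.

```python
# Rough classical approximation of the power analysis described above.
# The product uses Bayesian calculations, so treat this only as an
# order-of-magnitude illustration of how the inputs interact.
from scipy.stats import norm

def required_conversions_per_cell(smallest_interesting_diff=0.10,
                                  confidence_level=0.95,
                                  power=0.80):
    """Approximate conversions needed in each of two equally sized cells to
    detect a relative difference of `smallest_interesting_diff` (0.10 = 10%)
    in a count-based metric, comparing two Poisson rates."""
    z_alpha = norm.ppf(1 - (1 - confidence_level) / 2)  # ~1.96 for 95% confidence
    z_beta = norm.ppf(power)                            # ~0.84 for 80% power
    return 2 * (z_alpha + z_beta) ** 2 / smallest_interesting_diff ** 2

# Halving the smallest interesting difference roughly quadruples the data needed:
for diff in (0.20, 0.10, 0.05):
    print(f"{diff:.0%}: ~{required_conversions_per_cell(diff):.0f} conversions per cell")
```

The loop shows why the smallest interesting difference dominates the estimate: halving it roughly quadruples the required data, while changing the confidence level or statistical power has a much smaller effect.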
While the ad study is running
You can analyze the statistical significance of an ad study by clicking the test's name in Home → Ad Studies:
When the ad study is running, the most interesting question is whether to stop it (either because a difference has been found or because there does not seem to be one) or whether to collect more data. To help you make the decision, we always show a clear recommendation to either stop or continue the ad study. Note that you can modify the end date as long as the ad study is running.
The recommendation to continue or stop depends on the values of the smallest interesting difference and confidence level given when the ad study was created.
When the ad study has ended
After the ad study has ended, you can see which ad study cell was the best one, assuming you have collected enough data to draw conclusions. You access the results the same way as in the previous section. There are three possible outcomes:
- There is a statistically significant difference. In this case, you should implement the better variant.
- There is no statistically significant difference. You can implement either variant. It is not certain which is better, and the difference is most likely too small to be of practical importance anyway.
- There is not enough data to draw conclusions. There might be a difference that is big enough to be interesting, but the ad study ended before enough data was collected to estimate this with enough confidence. If you still want to know which alternative is better, you should create a new ad study and run it longer (and with a larger budget) to collect more data.
If a difference is found, you will also see information about how large the difference is. For an example, see the screenshot below.
In the above example:
- It is almost certain that CTR is at least 0.100% smaller in "Cell 1: Countertop"
- However, there is also a small probability that CTR is more than 0.114% smaller
- Because CTR is a percentage, these differences are given as percentage points
- If the split test were examining CPA instead, the differences would be shown as monetary values. In this case, CTR was 0.583% in "Cell 2: Virtual" and 0.476% in "Cell 1: Countertop", which gives an absolute difference of 0.107 percentage points.
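The arithmetic behind those figures can be reproduced directly from the two CTRs; the short snippet below only restates the numbers from the example (the relative difference line is added here purely for comparison and does not appear in the report).

```python
# CTRs from the example above, expressed in percent.
ctr_virtual = 0.583      # "Cell 2: Virtual"
ctr_countertop = 0.476   # "Cell 1: Countertop"

absolute_diff_pp = ctr_virtual - ctr_countertop        # 0.107 percentage points
relative_diff = absolute_diff_pp / ctr_virtual * 100   # ~18% relative difference

print(f"{absolute_diff_pp:.3f} percentage points, {relative_diff:.0f}% relative")
```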
When the ad study has three or more cells, there are more outcomes. For example, it is possible that there is no statistically significant difference between the two best cells, but both are better than the other cells.
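To give a sense of how multi-cell comparisons can behave, here is a minimal sketch that estimates, for each cell, the probability that it has the highest conversion rate. The cell data, the Beta-posterior model, and the uniform prior are illustrative assumptions for this example, not the product's actual calculation.

```python
# Estimate P(cell is best) for three hypothetical cells via Monte Carlo
# sampling from Beta posteriors of each cell's conversion rate.
import numpy as np

rng = np.random.default_rng(0)
cells = {                     # hypothetical (conversions, clicks) per cell
    "Cell 1": (120, 10_000),
    "Cell 2": (150, 10_000),
    "Cell 3": (148, 10_000),
}

# Posterior of each cell's conversion rate with a uniform Beta(1, 1) prior.
samples = np.column_stack([
    rng.beta(conv + 1, clicks - conv + 1, size=100_000)
    for conv, clicks in cells.values()
])

# How often each cell has the highest sampled conversion rate.
best_counts = np.bincount(samples.argmax(axis=1), minlength=len(cells))
for name, count in zip(cells, best_counts):
    print(f"{name}: P(best) ≈ {count / len(samples):.2f}")
```

With numbers like these, the sketch typically shows two cells with similar probabilities of being best and one cell clearly behind, which is the kind of multi-cell outcome described above.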
FAQ
Where can I see p-values?
Our calculations are based on Bayesian statistics and p-values are not relevant in this context. As to why we prefer to use Bayesian statistics instead of classical statistical tests, the main reasons are:
- Most people want to calculate statistical significance while the ad study is still running, not only after it has ended. However, if you do this with classical statistical tests and stop the ad study as soon as a statistically significant result is reached, you actually affect the result of the test. The fact that testing alone can change the outcome might sound surprising. The reason it happens is that the outcome always fluctuates during the ad study: the more often you check, the more likely you are to check at a moment when the result happens to be statistically significant (the simulation after this list illustrates the effect). Just using Bayesian statistics does not magically make the problem go away, but it allows us to be more flexible and reduce the problem to a negligible level. For more information on the topic, see this article on repeated significance testing errors.
- Another problem is that p-values are misunderstood more often than not. A large p-value is often taken to mean that there is no difference; however, it could also mean that you simply do not have enough data yet to draw conclusions. Bayesian statistics allows us to calculate quantities that are more intuitive and more useful than p-values: how likely it is that there is a difference, and how large the difference is. These are, after all, what most people really want to know.
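The repeated-testing problem from the first point can be demonstrated with a short simulation: two identical campaigns, a classical two-proportion z-test after every batch of traffic, and an early stop as soon as the test looks significant. All the numbers here (conversion rate, batch size, number of checks) are made up for illustration.

```python
# Simulate "peeking": even with no real difference between the campaigns,
# stopping at the first significant classical test inflates the false
# positive rate well above the nominal 5%.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_rate, batch, checks, runs = 0.02, 1_000, 50, 2_000
z_crit = norm.ppf(0.975)                   # two-sided test at 95% confidence
false_positives = 0

for _ in range(runs):
    conv_a = conv_b = n = 0
    for _ in range(checks):
        conv_a += rng.binomial(batch, true_rate)
        conv_b += rng.binomial(batch, true_rate)
        n += batch
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
            false_positives += 1           # stopped early on a spurious "win"
            break

print(f"False positive rate with peeking: {false_positives / runs:.1%} (nominal: 5%)")
```

A single test at the very end of the study would stay close to the nominal 5%; it is the repeated checking combined with early stopping that inflates the error rate, which is the problem the Bayesian approach lets us reduce to a negligible level.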
I want to know if there is a difference of any size
Are you sure about that? Suppose Campaign A has a 0.1% higher conversion rate than Campaign B. This means that when Campaign B gets 1000 conversions, Campaign A gets 1001, on average. In any individual test the outcome would be somewhat different because of random variation, which is why you would need to collect approximately 4 million conversions in each campaign to conclude that Campaign A is indeed a tiny amount better. There are probably better ways to spend your time and money.
From a purely theoretical point of view, there is almost always some difference. However, the difference can be so small that it is irrelevant for all practical purposes. There are a million other changes you could make that would have a bigger impact. When it is unlikely that a large enough difference will be found, we show a recommendation to stop the ad study so you do not waste time and money pursuing differences that are not relevant in practice.
Why is it important to define the smallest interesting difference?
The smallest interesting difference is used to decide when enough data has been collected and the ad study should be stopped. In loose terms, data is collected until we know the difference with sufficient precision. The exact definition is somewhat involved, but if you are still reading this you probably want to know anyway.
Let θ represent the metric whose difference we want to analyze. Given two ad study cells, A and B, we first estimate the posterior distributions of this metric in each cell, θA and θB. Using these distributions, we can calculate the distribution of the relative difference between θA and θB, and then find the width of the 95% highest density interval (HDI) of this distribution (the value 95% corresponds to the selected confidence level). The ad study should be stopped when this width becomes smaller than the smallest interesting difference.
Because this stopping criterion does not depend on how large the difference is but only on how well the difference can be estimated, it is possible to estimate statistical significance even while the ad study is running without affecting the outcome.
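As a concrete illustration of this stopping rule for a conversion-rate metric, the sketch below samples the posterior of the relative difference between two cells and checks the width of its 95% HDI against the smallest interesting difference. The cell data and the Beta-posterior model are assumptions made for this example; the product's actual model is not necessarily the same.

```python
# Monte Carlo sketch of the HDI-width stopping rule described above.
import numpy as np

def hdi_width(samples, mass=0.95):
    """Width of the narrowest interval that contains `mass` of the samples."""
    s = np.sort(samples)
    n_in = int(np.ceil(mass * len(s)))
    return (s[n_in - 1:] - s[:len(s) - n_in + 1]).min()

rng = np.random.default_rng(2)
conv_a, clicks_a = 230, 12_000          # hypothetical data for cell A
conv_b, clicks_b = 260, 12_000          # hypothetical data for cell B

# Beta posteriors for each cell's conversion rate (uniform prior).
theta_a = rng.beta(conv_a + 1, clicks_a - conv_a + 1, size=200_000)
theta_b = rng.beta(conv_b + 1, clicks_b - conv_b + 1, size=200_000)
relative_diff = (theta_b - theta_a) / theta_a

smallest_interesting_difference = 0.10
width = hdi_width(relative_diff)
print(f"95% HDI width of the relative difference: {width:.2f}")
print("Stop the ad study" if width < smallest_interesting_difference
      else "Keep collecting data")
```

With data like this the interval is still wide, so more data would be collected; as conversions accumulate the posteriors narrow and the width eventually drops below the threshold.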
Why do I see a warning about uneven spend between cells?
You see the warning because we think the results are not reliable due to uneven spend between the cells.
Uneven spend means that the cells have not spent in proportion to the cell sizes. When one cell has used less than its share of the budget, Facebook has been able to focus the delivery on the people more likely to convert, while the cell spending more has had to expand its reach. In other words, higher spend means higher average cost. This happens due to Facebook's lowest cost pacing, and as a result, the CPA estimates are not directly comparable if total spend is not proportional to cell sizes.
We show the warning only when it is likely that the uneven spend changes the conclusion of the test. In practice, this is done by correcting the CPA estimates for the uneven spend by modeling price elasticity. The model scales CPAs up or down depending on the difference in spend. The statistical test is then performed again with the corrected CPAs. If this changes the conclusion, a warning is shown.
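To make the idea of the correction more concrete, here is one way such a scaling could look under a simple constant-elasticity assumption. The functional form, the elasticity value, and the numbers are all illustrative assumptions for this sketch, not the product's actual model.

```python
# Illustrative spend correction: assume average CPA grows with spend as
# CPA ∝ spend ** elasticity, and scale each cell's observed CPA to the CPA
# it would have had at its fair (proportional) share of the budget.
def corrected_cpa(observed_cpa, actual_spend, fair_spend, elasticity=0.3):
    return observed_cpa * (fair_spend / actual_spend) ** elasticity

# Cell A underspent, so its observed CPA looks artificially cheap;
# cell B overspent, so its observed CPA looks artificially expensive.
print(corrected_cpa(observed_cpa=8.0, actual_spend=400, fair_spend=500))   # ~8.55
print(corrected_cpa(observed_cpa=10.0, actual_spend=600, fair_spend=500))  # ~9.47
```

If the comparison between the corrected CPAs leads to a different conclusion than the comparison between the observed CPAs, that is when the warning is shown.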
If you see the warning, and the spend difference is large, it is very likely that the results are not reliable. In this case it is best to create a new ad study and try to correct any mistakes in the configuration that are causing the difference in spend. For example, make sure total budgets are proportional to cell sizes and bids are identical between cells.
If you see the warning but the spend difference is small, collecting more data may resolve the problem over time.