- To learn more about split tests, see Split Tests
When analyzing split test results, it is important to remember that there must be an equal amount of spend in each cell for a fair comparison. If the cell with less spend has better performance, that is probably just because it spent less, not because it was better. However, if one cell spent more and still performed better, it is the real winner!
In this article, we talk about things to consider when analyzing Facebook split test results, both while they are running and after they have ended.
View split tests in the 'Ad Studies' section
- In the top navigation bar, click on the Home button
- From the list of available options, click on "Ad Studies"
- On the ad study list, you can use the Type search filter to only show split tests
Analyze results while the test is running
Keep an ad study running until it has gathered enough valid data to give you meaningful insight. You can see the progress of each ad study both on the ad study list and on the ad study page. If an ad study's progress is below 100%, the ad study page also shows that you don't have enough data yet to view any results.
A single ad study can measure multiple cells with multiple metrics in each
- When the progress is at 100%, there's at least one combination that has enough data to show the results
- Other combinations might still be gathering data before you get the results
Tip: Hover over the progress bar to view how many combinations there are and how many have meaningful results already.
The results are always a value combined with a probability, for example, "The CPA is at least €9.07 larger in cell 1 compared to cell 2 with 95% probability".
Even after you get the initial results that there's a statistical difference between the cells (or that there isn't), you can get even more precise data by letting the test run longer. However, after 100% progress has been reached, the general conclusion that there either is or is not a significant difference between the cells won't change.
Analyze results after the test has ended
After the ad study has ended, it is marked as Completed, and you can see which cell was the winning one, assuming you have collected enough data to draw conclusions.
There are three possible outcomes:
- There's a statistically significant difference: In this case, you should implement the better variant more widely
- There's no statistically significant difference: You can implement either variant. It is not certain which is better, and the difference is most likely too small to be of practical importance anyway.
- There's not enough data to draw conclusions:
  - There might be a difference that is big enough to be interesting, but the ad study ended before enough data was collected to estimate this with enough confidence.
  - If you still want to know which alternative is better, create a new ad study and run it longer (and with a larger budget) to collect more data.
When the ad study has three or more cells, there are more possible outcomes. For example, it is possible that there is no statistically significant difference between the two best cells, but that both are better than the other cells. It is also possible that there is a statistically significant difference for CPA but not for Conversion Rate. This happens, for example, if both campaigns have received the same number of clicks and conversions, but one campaign has accomplished this with only half the spend.
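To make this concrete, here is a minimal sketch with made-up spend, click, and conversion numbers (not taken from any real ad study) showing how two cells can end up with identical conversion rates but very different CPAs:

```python
# Hypothetical cell results: identical clicks and conversions, but Cell B spent half as much.
cells = {
    "Cell A": {"spend": 1000.0, "clicks": 2000, "conversions": 100},
    "Cell B": {"spend": 500.0,  "clicks": 2000, "conversions": 100},
}

for name, c in cells.items():
    cpa = c["spend"] / c["conversions"]          # cost per acquisition
    cvr = c["conversions"] / c["clicks"] * 100   # conversion rate, in %
    print(f"{name}: CPA = {cpa:.2f}, conversion rate = {cvr:.2f}%")

# Both cells show a 5.00% conversion rate, but Cell B's CPA (5.00) is half of
# Cell A's (10.00), so a test can detect a clear CPA difference while finding
# no difference at all in conversion rate.
```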
Analyze the differences between cells: An Example
If statistically significant differences are found, you are also shown information about how large they are:
The comparison plot shows differences between the cells with three different levels of certainty.
In the above example:
- With 95% probability, CTR is at least 0.100% smaller in "Cell 1: Countertop" than in "Cell 2: Virtual"
- There is also a 5% probability that CTR is more than 0.114% smaller
- In other words, with 90% probability, the difference in CTR is between 0.100% and 0.114%
- The best estimate for the difference is represented by the median, which in this example is 0.107%. This is the estimate also used by Facebook, and it means that the true difference is equally likely to be higher or lower than this value.
The differences in CTR are given as percentage points because CTR is itself a percentage; if CPA were selected, the differences would be shown as monetary values.
In addition to the plot, there is a textual summary of the differences between the cells. From the “Show results with” selector, you can choose whether the textual summary shows the difference thresholds at 5%, 50%, or 95% probability. By default, they are shown with 95% probability.
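If you want to explore this kind of comparison on your own exported numbers, here is a minimal sketch of the same idea. It assumes a simple Beta posterior for each cell's CTR and uses made-up impression and click counts; the model actually used for ad study results may differ in its details.

```python
import random

random.seed(0)

# Made-up raw data for the two cells in the example above.
cells = {
    "Cell 1: Countertop": {"impressions": 200_000, "clicks": 2_300},
    "Cell 2: Virtual":    {"impressions": 200_000, "clicks": 2_520},
}

def posterior_ctr_samples(impressions, clicks, n=50_000):
    """Draw samples from a Beta(1 + clicks, 1 + impressions - clicks) posterior for CTR."""
    return [random.betavariate(1 + clicks, 1 + impressions - clicks) for _ in range(n)]

ctr_1 = posterior_ctr_samples(**cells["Cell 1: Countertop"])
ctr_2 = posterior_ctr_samples(**cells["Cell 2: Virtual"])

# Difference in percentage points (Cell 2 minus Cell 1), sorted so we can read off quantiles.
diffs = sorted((c2 - c1) * 100 for c1, c2 in zip(ctr_1, ctr_2))

def quantile(p):
    return diffs[int(p * len(diffs))]

print(f" 5% threshold: {quantile(0.05):.3f} pp")  # difference is at least this with 95% probability
print(f"       median: {quantile(0.50):.3f} pp")  # single best estimate of the difference
print(f"95% threshold: {quantile(0.95):.3f} pp")  # difference exceeds this with only 5% probability
```

Under this simplified model, the three printed values play the same roles as the 95%-probability threshold, the median, and the 5%-probability threshold discussed above.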
"Not statistically significant"
What does it mean when we say that a difference is "not statistically significant"? Let's first be clear about what this phrase does not mean: It does not mean that there is no difference at all.
For example, suppose you are testing two campaigns, A and B, and you run the study until both campaigns reach roughly 100 conversions:
- Campaign A yields 96 conversions during the test
- Campaign B yields 104 conversions
In the above test, the result will be "not statistically significant", despite Campaign B yielding roughly 8% more conversions.
In essence, "not statistically significant" means that you do not yet have enough data to reliably determine whatever difference there might be between your cells, and, crucially, you do not have enough data to conclude which cell is better.
Important: Note that you have still learned something useful by running the ad study: if Campaign B had been 50% better, you would have gotten a different result already, so it is unlikely that there is a large difference in performance.
Keeping all of this in mind when running ad studies can easily get overwhelming, and misunderstanding statistical significance is common even in science. This is exactly why it is generally recommended to define the smallest interesting difference already when creating an ad study: it makes the results easier to interpret.
- If the smallest interesting difference had been 10% in the above example, you would have seen a recommendation to continue the ad study, because, given the data so far, it is still possible that there is a difference larger than 10%
- On the other hand, if the smallest interesting difference had been 40%, you would have seen a recommendation to stop the ad study, because, given the data so far, it is unlikely that there is a difference larger than 40% (see the sketch below)
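To illustrate these two recommendations, here is a rough sketch that models each campaign's conversion count as a Poisson rate with a flat prior (a simplification; this is not necessarily the exact model used for ad study results) and asks how likely it still is that the difference exceeds a given smallest interesting difference:

```python
import random

random.seed(0)

conversions_a, conversions_b = 96, 104  # observed conversions from the example above
n_samples = 100_000

# With a (near-)flat prior, the posterior of a Poisson rate with k observed
# events is approximately Gamma(shape=k, scale=1).
rate_a = [random.gammavariate(conversions_a, 1.0) for _ in range(n_samples)]
rate_b = [random.gammavariate(conversions_b, 1.0) for _ in range(n_samples)]

# Relative improvement of Campaign B over Campaign A in each posterior sample.
rel_diff = [(b - a) / a for a, b in zip(rate_a, rate_b)]

for smallest_interesting_difference in (0.10, 0.40):
    p = sum(d > smallest_interesting_difference for d in rel_diff) / n_samples
    print(f"P(B is more than {smallest_interesting_difference:.0%} better than A) ≈ {p:.2f}")
```

With these counts, the probability of a difference larger than 10% comes out far from zero (continue the study), while the probability of a difference larger than 40% is close to zero (stop the study).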
Apply learnings
How to implement changes based on the study results depends on the ad study, and how well the results can be generalized. Let's use a creative split test as an example:
- If you learned that for a prospecting campaign, showing the prize in the image performs better than not showing the prize, you can gradually implement this across your prospecting campaigns to minimize risk. Here, "gradually" means that you implement the change for a couple of audiences at a time. Alternatively, you can implement the change across all prospecting campaigns at once. Which approach is better depends on your risk tolerance.
FAQ
Where are p-values?
Our calculations are based on Bayesian statistics and p-values are not relevant in this context. As to why we prefer to use Bayesian statistics instead of classical (frequentist) statistical tests, the main reasons are:
- Most people want to calculate statistical significance while the ad study is running, not only after it has ended. However, if you do this with classical statistical tests and stop the ad study as soon as a statistically significant result is reached, you will actually affect the result of the test. The fact that checking alone can change the outcome might sound surprising; the reason is that the results always fluctuate during the ad study, and the more often you check them, the more likely you are to check at a moment when the result happens to be statistically significant. Just using Bayesian statistics does not magically make the problem go away, but it allows us to be more flexible and reduce the problem to a negligible level (the simulation after this list illustrates the effect). You can find more information here and here.
- Another problem is that p-values are, more often than not, misunderstood. A large p-value is often misunderstood to mean that there is no difference; however, it could also mean that you just do not have enough data yet to draw conclusions. Bayesian statistics allows us to calculate quantities that are more intuitive and more useful than p-values: how likely it is that there is a difference, and how large the difference is. These are, after all, what most people really want to know.
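To see the first problem in action, here is a small self-contained simulation (not our production code): two campaigns with identical true conversion rates, a classical two-proportion z-test checked after every batch of users, and the test stopped at the first "significant" result.

```python
import math
import random

random.seed(1)

def z_test_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided two-proportion z-test; True if p-value < alpha."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = abs(conv_a / n_a - conv_b / n_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return p_value < alpha

TRUE_RATE = 0.05   # both campaigns convert identically, so any "winner" is a false positive
BATCH = 250        # users added to each campaign between checks
CHECKS = 20        # how many times the results are peeked at
RUNS = 500         # number of simulated ad studies

false_positives = 0
for _ in range(RUNS):
    conv_a = conv_b = n = 0
    for _ in range(CHECKS):
        n += BATCH
        conv_a += sum(random.random() < TRUE_RATE for _ in range(BATCH))
        conv_b += sum(random.random() < TRUE_RATE for _ in range(BATCH))
        if z_test_significant(conv_a, n, conv_b, n):
            false_positives += 1
            break  # stop at the first "significant" peek, as an impatient tester would

print(f"False positive rate with peeking: {false_positives / RUNS:.1%} (nominal 5%)")
```

Even though there is no real difference between the campaigns, stopping at the first significant peek produces a false positive rate several times the nominal 5%. The Bayesian approach described above is what lets us keep checking the results while the ad study runs and still keep this distortion at a negligible level.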
I want to know if there is a difference of any size!
Are you sure about that? Suppose that Campaign A has a 0.1% higher conversion rate than Campaign B.
This means that when Campaign B gets 1000 conversions, Campaign A gets 1001 — on average, that is. In any individual test, the outcome would be something different because of random variation, which is why you would need to collect approximately 4 million conversions in each campaign to be able to conclude that Campaign A indeed is better by a tiny amount. There are probably better ways to spend your time and money.
From a purely theoretical point of view, there is almost always some difference. However, the difference can be so small that it is irrelevant for all practical purposes. There are a million other possible changes you could make that would have a bigger impact on your performance. When it is unlikely that a large enough difference will be found, we show a recommendation to stop the ad study so you do not waste time and money pursuing differences that are not relevant in practice.
For more details on budgeting for Facebook split tests, see our Knowledge Base article on Planning a Facebook split test.
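For a rough sense of where a figure like 4 million conversions comes from, here is a back-of-the-envelope sketch. It treats the two conversion counts as Poisson and asks how many conversions are needed before a relative difference δ stands out from random noise; the exact number depends on the confidence and power you assume, so treat this as an order-of-magnitude estimate rather than the exact calculation used for ad studies.

```python
# To tell apart two Poisson counts whose means differ by a relative amount delta,
# the expected gap (n * delta) must exceed the noise in the difference of the
# counts (about z * sqrt(2 * n)), which gives n >= 2 * z**2 / delta**2.
delta = 0.001  # a 0.1% relative difference in conversion rate

for confidence, z in [("80%", 0.84), ("90%", 1.28), ("95%", 1.64)]:
    n = 2 * z**2 / delta**2
    print(f"{confidence} one-sided confidence: ~{n / 1e6:.1f} million conversions per campaign")
```

Whichever confidence level you pick, the answer lands in the millions of conversions per campaign, which is why chasing tiny differences is rarely worth the time or budget.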
Why is it important to define the smallest interesting difference?
The smallest interesting difference is used to decide when enough data has been collected so the ad study can be stopped. In brief, data is collected until we know the difference with sufficient precision. The exact definition is somewhat involved, but if you are still reading this, you probably want to know anyway.
In statistical terms, let θ represent the metric whose difference we want to analyze. Given two ad study cells, A and B, we first estimate the posterior distributions of this metric in each cell, θA and θB. Using these distributions, we can calculate the distribution of the relative difference of θA and θB, and then find the width of the 95% highest density interval (HDI) of this distribution (the value 95% corresponds to the confidence level selected when creating the ad study). The ad study should be stopped when this width becomes smaller than the smallest interesting difference.
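As an illustration, here is a minimal sketch of this stopping rule for a conversion-rate metric. It uses Beta posteriors and a central percentile interval as a stand-in for a true HDI (for roughly symmetric posteriors the two are close); the data is made up and the production implementation may differ in its details.

```python
import random

random.seed(0)

def interval_width_of_relative_difference(conv_a, n_a, conv_b, n_b,
                                          confidence=0.95, samples=50_000):
    """Approximate width of the central `confidence` interval of (theta_B - theta_A) / theta_A."""
    theta_a = [random.betavariate(1 + conv_a, 1 + n_a - conv_a) for _ in range(samples)]
    theta_b = [random.betavariate(1 + conv_b, 1 + n_b - conv_b) for _ in range(samples)]
    rel_diff = sorted((b - a) / a for a, b in zip(theta_a, theta_b))
    lo = rel_diff[int((1 - confidence) / 2 * samples)]
    hi = rel_diff[int((1 + confidence) / 2 * samples)]
    return hi - lo

SMALLEST_INTERESTING_DIFFERENCE = 0.10  # 10%, chosen when creating the ad study

# Hypothetical observed data: conversions out of clicks for cells A and B.
width = interval_width_of_relative_difference(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
print(f"95% interval width for the relative difference: {width:.3f}")

if width < SMALLEST_INTERESTING_DIFFERENCE:
    print("Stop the ad study: the difference is estimated precisely enough.")
else:
    print("Keep collecting data.")
```

With the made-up counts above, the interval is still wider than the 10% smallest interesting difference, so the sketch would recommend collecting more data.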
Because this stopping criterion does not depend on how large the difference is, but only on how well the difference can be estimated, it is possible to estimate statistical significance even while the ad study is running without affecting the outcome. A more elaborate explanation can be found here.