Ad study is the umbrella term for split tests and lift tests on Facebook. This article covers best practices for both.
Test one thing at a time
To make sure you can interpret the results, test only one thing at a time.
For example, if you want to test whether manual bidding is better than automatic bidding, create two campaigns that are identical except that one uses manual bidding and the other uses automatic bidding. When you observe a difference you will know exactly what caused it.
It is easy to set up campaigns like this in Smartly. First create one campaign, then clone it, and change just the one thing you want to test in the cloned campaign.
It is also good to avoid making changes during the test – unless, of course, the changes are the one thing you want to test. For the same reason, try to avoid using optimization strategies, as they make automatic changes.
Make sure the cells spend equally (or proportionally)
The scale of your advertising (and therefore its spend/budget) affects Facebook's pacing algorithm, and with it the CPA. With larger spend, Facebook needs to reach a wider audience, not just the people most likely to convert. If one cell spends more than the other, its CPA will naturally be higher even if there are no other differences.
Unless you are specifically testing different budgets, you should set budgets proportional to the cell sizes. For example, in a 90%–10% split, your budgets should also be split 90%–10%. This way, the expected reach relative to audience size and the realized prices should be equal across cells.
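As a quick illustration, here is a minimal sketch (plain arithmetic, not a Smartly feature; the cell names and numbers are made up) of computing budgets proportional to cell sizes:

```python
# Minimal sketch: split a total daily budget in proportion to the ad study cell sizes.
def proportional_budgets(total_budget, cell_splits):
    """Return a per-cell budget proportional to each cell's audience share."""
    total_share = sum(cell_splits.values())
    return {cell: total_budget * share / total_share
            for cell, share in cell_splits.items()}

# A 90%-10% split of a daily budget of 1000 (in your account currency):
print(proportional_budgets(1000, {"cell_A": 0.9, "cell_B": 0.1}))
# -> {'cell_A': 900.0, 'cell_B': 100.0}
```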
This means that you should not compare performance between ad sets that are in the same campaign using Campaign Budget Optimization (CBO). CBO would allocate a different spend to each ad set, so their results would not be comparable. Instead, create an ad study that compares separate CBO campaigns, so that they all spend equally.
This also makes it quite hard to test campaign setups that are steered by bids, i.e. setups that "spend as much as possible as long as the CPA is good". In these setups, it is equally important that the cells maintain the same average CPA; you can then compare which cell delivered a higher volume at that CPA.
Run the ad study for at least one week
Performance can differ between days of the week, especially between working days and the weekend. By running the ad study for at least one full week, you get a more representative picture of its performance.
In general, you need a lot of data to prove that one campaign is better than another, so run your tests long enough and with a big enough budget. The Power Analysis tool will help you estimate how much spend is needed, and the Statistical Significance Calculator will tell you during and after the test whether the results are statistically significant.
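For a rough sense of the numbers involved, the sketch below runs a standard two-proportion sample-size calculation (normal approximation). It is not the Power Analysis tool itself; the baseline conversion rate and the lift you want to detect are assumptions you supply yourself.

```python
# Minimal sketch: how many users per cell are needed to detect a given relative lift
# in conversion rate, using the standard two-proportion normal approximation.
from statistics import NormalDist

def users_needed_per_cell(p_base, lift, alpha=0.05, power=0.80):
    """Users per cell to detect a relative `lift` over baseline conversion rate `p_base`."""
    p_test = p_base * (1 + lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_power = NormalDist().inv_cdf(power)           # desired statistical power
    p_bar = (p_base + p_test) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_power * (p_base * (1 - p_base) + p_test * (1 - p_test)) ** 0.5) ** 2
    return int(numerator / (p_test - p_base) ** 2) + 1

# Example: 2% baseline conversion rate, trying to detect a 10% relative improvement.
n = users_needed_per_cell(0.02, 0.10)
print(n)                # ~80,000+ users per cell
print(round(n * 0.02))  # i.e. on the order of 1,600 conversions per cell
```

With these example assumptions, even a modest 10% improvement requires conversions in the thousands per cell, which is why test length and budget matter so much.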
Collect enough data
Just because Campaign A received 14 conversions and Campaign B only 10 does not mean that Campaign A is better: the difference might be due to random variation instead of a true difference in performance. The only way to tell the two apart is to collect enough data. When you create an ad study in Smartly, you will automatically see an estimate of how much data you are likely to need. And while the ad study is running, we will tell you when it can be stopped. You can read more about statistical significance in the section below.
Most people tend to underestimate the magnitude of random variation. A good rule of thumb is that differences smaller than 2 ⋅ √CV are not statistically significant (CV = number of conversions). For example, if one campaign receives 30 conversions and the other 37, the difference is not statistically significant (7 < 2 ⋅ √30 ≈ 11). You would need at least 41 conversions in the second campaign before the difference would be even close to statistically significant – and that would require the second campaign to be almost 40% better than the first one.
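A minimal sketch of that rule of thumb, using the numbers from the example above (plain Python; this is a crude heuristic, not a formal significance test):

```python
# Rule of thumb: a difference smaller than 2 * sqrt(CV) is unlikely to be
# statistically significant, where CV is the number of conversions.
from math import sqrt

def roughly_significant(conversions_a, conversions_b):
    """Crude heuristic based on the 2 * sqrt(CV) rule of thumb (not a formal test)."""
    threshold = 2 * sqrt(min(conversions_a, conversions_b))
    return abs(conversions_a - conversions_b) > threshold

print(roughly_significant(30, 37))  # False: 7 < 2 * sqrt(30) ≈ 10.95
print(roughly_significant(30, 41))  # True, but only barely: 11 > 10.95
```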
Make sure that your events are firing properly
All too often, advertisers start running an ad study and notice during the test that their Pixel events are not firing. This naturally invalidates the test and wastes money.
Set yourself a reminder before the test ends
This way, you have a chance to extend the test if it has not yet collected enough conversions to reach statistical significance. Note that the end date can only be extended before the test ends.
Don't compare an old campaign to a new one
Campaigns that have already been running have had time to collect data and learn which people are most likely to convert. It is therefore not fair to compare such a campaign against one that was just started.
Instead, clone the original campaign twice to create two new campaigns to use in the ad study.
Don't use the same posts
Never use the same posts in the campaigns being compared, especially if you are specifically testing different creatives. Comments and likes affect ad performance: if one campaign attracts likes and comments thanks to a better creative or a better audience, that goodwill would leak into the other campaign's results, too.
AA/BB testing: just don't do it
AA/BB testing (that is, running two identical copies of both campaigns) is sometimes suggested as a way to measure variance of results while running the test. We do not recommend AA/BB testing for the following reasons:
- Running two identical copies is not enough to give a reliable estimate of variation. You would need at least 20 identical copies for each variant to do that.
- Variance can be estimated quite accurately anyway, without running extra copies. In fact, any test of statistical significance does exactly that.
"Not statistically significant"
What does it mean when we say that a difference is "not statistically significant"? Let's first be clear about what this phrase does not mean. It does not mean that there is no difference at all.
For example, suppose you are testing two campaigns, A and B, and Campaign B truly performs 10% better. You then run the ad study until you get roughly 100 conversions in each campaign, say 96 in Campaign A and 104 in Campaign B. Result? No statistically significant difference, even though we know that Campaign B is 10% better!
Here is a better way to interpret "not statistically significant": it means that you do not yet have enough data to distinguish whatever difference there might be, and crucially, you do not have enough data to conclude which campaign is better. (Note, however, that you have learned something by running the ad study: if Campaign B had been 50% better, you would already have gotten a different result, so it is unlikely that there is a large difference.)
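As a sanity check on the example above, here is a minimal sketch that runs an ordinary chi-square test on the 96-vs-104 split (scipy is an assumption about your tooling, and this is just one reasonable test, not the exact method used in ad study reports):

```python
# With equal budgets and no true difference, conversions should split roughly 50/50.
# A chi-square test asks how surprising the observed 96 vs 104 split would be.
from scipy.stats import chisquare

statistic, p_value = chisquare([96, 104])
print(round(statistic, 2), round(p_value, 2))  # ~0.32 and ~0.57: nowhere near significant
```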
Keeping all of this in mind when running ad studies can easily get overwhelming, and misunderstanding statistical significance is common even in science. (Read related articles: "Statistical significance and its part in science downfalls", "Statisticians issue warning over misuse of P values", "Scientific method: Statistical errors")
This is exactly why we ask you to define the smallest interesting difference when creating an ad study: it allows us to give you more understandable results. If the smallest interesting difference had been 10% in the example above, you would see a recommendation to continue the ad study, because given the data so far it is still possible that there is a difference larger than 10%. On the other hand, if the smallest interesting difference had been 40%, you would see a recommendation to stop the ad study, because it is unlikely that there is a difference larger than 40%.
Adding and removing campaigns or ad sets in an active ad study
It is possible to add and remove campaigns and ad sets in active ad study cells. Pay special attention when making these changes so that you do not accidentally harm your ad study results.
Add or remove campaigns in an active ad study when you want to:
- add an entirely new campaign to an ad study (e.g. you're running a lift study with all your campaigns and want to launch a new campaign).
- fix an ad study if you made a mistake when creating it.
- create the ad study first, and only then create the campaigns and add them to the study cells.
Avoid:
- removing an already active campaign – instead, consider just pausing it but leaving it as part of the study cell.
- adding an already active campaign – instead, consider cloning it and adding the newly created copy.
Consider:
- launching campaigns in a paused state, to give Facebook time to review and approve all ads before going live. That way, when the campaign goes live, all ads are approved and can start delivering at the same time, which keeps the test results clean.