When a Test Rollout Shows Different Results

Test Winner Goes Flat

I received this question: "We ran a campaign where we observed an uplift with a change in button text. When we "rolled" it out with 5% default 95% change, the new data is showing that it's under performing. Have you observed situations like these before?"

Here's my response:
Occasionally, yes. There are a lot of things that can cause this situation, so it's impressive you kept a 5% holdout to be able to confirm the test results (or in this case, possibly discredit them).

Let's assume everything was done "by the book" in terms of the test setup and analysis. Even so, there is still at least a 1 in 20* chance the initial "winner" was really not a winner but appeared so due to sampling error. (*Assuming you ran a statistical test and picked winners when the results showed p < 0.05.)

I say "at least 1 in 20" because these stats tests are designed around the idea that there is an underlying "truth" that is stable. In the natural sciences there is usually an underlying stable truth rooted in the laws of nature. In marketing scenarios however, the underlying truth can have a great deal of fluctuation. The population of site visitors changes from day to day due to seasonality, promotions, competitive influences, and so on.
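To make the "1 in 20" baseline concrete, here is a minimal sketch that simulates A/A tests, where both recipes share the same stable conversion rate, and counts how often a pooled two-proportion z-test still declares a winner at p < 0.05. The function names, the 10% conversion rate, and the sample sizes are my own illustrative assumptions, not details from the test in question.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(visitors=1000, rate=0.10, alpha=0.05,
                           simulations=1000, seed=0):
    """Share of A/A tests (no true difference) that look 'significant'."""
    rng = random.Random(seed)
    false_winners = 0
    for _ in range(simulations):
        # Both recipes draw from the SAME underlying conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        if two_proportion_p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_winners += 1
    return false_winners / simulations
```

Calling aa_false_positive_rate() should come back close to alpha, and that is with a perfectly stable "truth". Anything that moves the underlying rate between the test and the rollout only adds to it.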

Aside from that, there are a number of ways marketers tend to use design of experiments and statistics that further degrade the integrity of the results (or at least our interpretation). Without knowing anything about this particular test yet, here are several things that, in general, tend to add uncertainty to the interpreted results:

  1. # of recipes. When you have more than 1 challenger recipe (A/B/N tests) you have more comparisons that are being made. Therefore you have more statistical tests being run. Therefore you have more opportunities for sampling error to occur. It is sometimes possible to correct for this (e.g., by requiring a smaller p-value before calling a winner) but in practice it is rarely done.
  2. # of segments. When you look at your results data by segment (e.g., PPC vs SEO, new vs returning) you again have more comparisons and more chance for statistical error. Also, the hypotheses being tested here are "post hoc" which is less than ideal.
  3. Peeking. When you look at your test results every day for 2 or 3 weeks, each time you look there is a chance of sampling error occurring. From a statistical purity perspective you should only run your statistical analysis once. Again, it's possible to correct for these things (not too easily, however), but few, if any, do. I must note here that you can and should look at your results immediately after the test goes live to look for potential bugs in the test or the reporting. But that is different from looking at the results with the purpose of making a call. When you will make that call is what should be predetermined before the test runs (e.g., "How many participants will I roll this out to before I stop it and make a call?").
  4. Bias. This one is related to the last point. When you look at the data multiple times, you are presumably able to "call it" at any one of those looks. That introduces subjectivity. If there are any biases or incentives to finding winners, it then introduces the possibility that sometimes we might "call it" during those looks when it comes out statistically significant. But results sometimes drift in and out of statistical significance, and the ability to call it at any time degrades the integrity of the statistical approach.
  5. P value. Earlier I assumed the p value was 0.05 or below to call it a winner. Usually we find in business that everyone is happy to call it a winner with much larger p-values, 0.10 or 0.15 for example. So now the Type 1 error rates are 1 in 10 or about 1 in 7. Hopefully you run lots of tests, and if you are happy to call a test with p = 0.18 (82% confidence) a winner, then essentially 1 in 5 of your winners will be due to sampling error, not a true effect.
  6. Choice of statistical test. There are a variety of different statistical tests designed for different purposes and different underlying data types and test designs. Marketers tend to assume that every test they run can be analyzed with the same statistical test. That is not quite true.
  7. Parameters in statistical tests. Even if you select the right statistical test, there are settings or correction terms that can be applied to make them more accurate given the particular data set and hypothesis.
  8. Beginning of test influences. Many practitioners of optimization have observed that the results you get in the first week or two of a test may not be indicative of the ultimate results when you roll it out.
    • One reason for this could be if you are including people who have already been to the site previously. Possibly they are reacting to a change - any change - not necessarily the new creative or experience itself. Seeing new things can make the site feel fresh. Or conversely the change can confuse long-time visitors about where to go to get what they need. They may stumble at first trying to "re-learn" how to navigate your site.
    • Conversely, if you are including new visitors only in your test then there can be a ramp-up period needed depending on the purchase cycle required. For a considered purchase, the conversion rate may be much lower in the first week (or even month) of a new-visitors only test. You'd like to believe that the relative conversion rates are indicative of future performance even if the absolute rates are lower. However it's always possible that the challenger recipe does a better job with new visitors who are ready to purchase "today" but a poorer job with prospects earlier in the funnel who won't purchase for a month. The reverse could be true as well. In those scenarios, the first couple of weeks of a test may differ from the "true" test results both in absolute and relative terms.
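Two of the points above (more comparisons, and peeking) lend themselves to a quick sketch. The first two functions show the standard arithmetic for how the chance of at least one false winner grows with the number of comparisons, assuming the comparisons are independent, along with the Bonferroni correction as one way to compensate. The third simulates A/A tests that are checked daily and counts how often a "winner" appears at any look versus only at the final look. All names, the 14-day duration, the traffic level, and the 10% conversion rate are illustrative assumptions of mine.

```python
import math
import random
from statistics import NormalDist


def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Stricter per-comparison threshold (Bonferroni correction)."""
    return alpha / comparisons


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def peeking_false_positive_rate(days=14, daily_visitors=200, rate=0.10,
                                alpha=0.05, simulations=500, seed=1):
    """Return (rate of 'winners' at ANY daily look, rate at the FINAL look only).

    Both recipes share the same true conversion rate, so every 'winner'
    counted here is a false positive.
    """
    rng = random.Random(seed)
    any_look = 0
    final_look = 0
    for _sim in range(simulations):
        conv_a = conv_b = n = 0
        significant_at_some_look = False
        p = 1.0
        for _day in range(days):
            # Accumulate one more day of traffic for each recipe.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            n += daily_visitors
            p = p_value(conv_a, n, conv_b, n)  # the daily "peek"
            if p < alpha:
                significant_at_some_look = True
        any_look += significant_at_some_look
        final_look += p < alpha
    return any_look / simulations, final_look / simulations
```

For example, family_wise_error_rate(0.05, 3) works out to about 0.14, so three challenger comparisons at p < 0.05 carry roughly a 1 in 7 chance of at least one false winner. In the peeking simulation, the "any look" rate should come out well above the "final look" rate, which is the cost of being able to call it at any time.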

Finally, it should be noted that the most popular practices in using A/B testing in business rely on the branch of statistics called "frequentist" statistics. For many decades frequentist statistics won the hearts and minds of scientists as the gold standard. However, the tide has been turning, and today many statisticians argue that the competing branch, Bayesian statistics, is a more reliable method of estimating truths from sample data.
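For a flavor of what the Bayesian alternative looks like, here is a minimal sketch of one common approach: model each recipe's conversion rate with a Beta posterior and estimate the probability that the challenger truly beats the default. The uniform Beta(1, 1) prior, the function name, and the number of draws are assumptions I'm making for illustration; a real analysis would choose the prior deliberately.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each recipe: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws
```

The output reads as a direct statement such as "there is an X% chance the challenger is better", which is often what a business stakeholder assumes a p-value means. It does not remove the other problems listed above (shifting populations, beginning-of-test effects), so the holdout remains useful either way.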

To conclude, yes, there are numerous blind alleys on the path to truth from statistical sampling. Results may vary. The practice of keeping a holdout of, say, 5% of traffic for some time after you roll out the "winner" is a solid practice which I strongly support. It has its own downsides (it may leave less traffic available for other tests) but, depending on your application, may be worthwhile. As a final note on this topic, the "multi-armed bandit" approach to A/B testing (used by Google AdWords) is similar to this in that a small amount of traffic will drip to the "loser" recipes to see if anything has changed.
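To illustrate the "drip" idea, here is a minimal sketch of epsilon-greedy, one of the simplest multi-armed bandit strategies. I am not claiming this is the algorithm any particular ad platform uses; the function name, the epsilon value, and the conversion rates are hypothetical. Most traffic goes to whichever recipe currently looks best, while a small fraction (epsilon) is still spread at random so the apparent losers keep getting checked.

```python
import random


def epsilon_greedy(true_rates, visitors, epsilon=0.05, seed=0):
    """Simulate epsilon-greedy traffic allocation across recipes.

    true_rates: hypothetical underlying conversion rate of each recipe.
    Returns (times shown, conversions) per recipe.
    """
    rng = random.Random(seed)
    shown = [0] * len(true_rates)
    converted = [0] * len(true_rates)
    for _ in range(visitors):
        if rng.random() < epsilon or 0 in shown:
            # Explore: the small "drip" of traffic to a random recipe
            # (also used until every recipe has been shown at least once).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: send the visitor to the best observed conversion rate.
            arm = max(range(len(true_rates)),
                      key=lambda i: converted[i] / shown[i])
        shown[arm] += 1
        if rng.random() < true_rates[arm]:
            converted[arm] += 1
    return shown, converted
```

With epsilon=0.05 the exploration share plays a role similar to the 5% holdout: if the "winner" stops performing, the trickle of traffic to the other recipes is what lets you notice.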
