When a Test Rollout Shows Different Results

Test Winner Goes Flat

I received this question: "We ran a campaign where we observed an uplift with a change in button text. When we 'rolled' it out with a 5% default / 95% change split, the new data is showing that it's underperforming. Have you observed situations like these before?"

Here's my response:
Occasionally, yes. There are a lot of things that can cause this situation, so it's impressive you kept a 5% holdout to be able to confirm the test results (or in this case, possibly discredit them).

Let's assume everything was done "by the book" in terms of the test setup and analysis. Even so, there is still at least a 1 in 20* chance the initial "winner" was really not a winner but appeared so due to sampling error. (*Assuming you ran a statistical test and picked winners when the result was p < 0.05.)

I say "at least 1 in 20" because these stats tests are designed around the idea that there is an underlying "truth" that is stable. In the natural sciences there is usually an underlying stable truth rooted in the laws of nature. In marketing scenarios however, the underlying truth can have a great deal of fluctuation. The population of site visitors changes from day to day due to seasonality, promotions, competitive influences, and so on.

Running Multiple Tests At A Time

The most dogged misperception about split testing is that you should only run one test at a time. For example, there are two tests conceived by two different people in your organization – one on the home page and one in the checkout – and someone says, “We can’t run them at the same time because we won’t get clean results.” Should you believe them? It depends. Most of the time you can run tests simultaneously without worrying, and having that flexibility allows you to learn more quickly.

There are scenarios where you have to be careful, however. Let’s set up one of those examples.

Test #113: Home Page
Recipe A (Style Blue) vs. Recipe B (Style Red)
Test #114: Checkout
Recipe A (Style Red) vs. Recipe B (Style Purple)

Let’s say that, tested in isolation, Test #113 shows a higher conversion rate with Style Blue, and in isolation Test #114 shows a higher conversion rate with Style Purple. The sketch below shows one way to keep an eye on how the two tests combine.
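If you do run them at the same time, here is a minimal sketch (hypothetical code, assuming independent cookie-style random assignment) of how you could tally the four style combinations so you can later check whether the two tests interact:

// Minimal sketch: assign each visitor independently to Test #113 and
// Test #114 and tally conversions per combination. Slicing the results
// by this 2x2 grid is what lets you spot an interaction (say, Blue on
// the home page clashing with Purple in checkout) that neither test
// would reveal on its own.
var cells = {};  // e.g. "113-A|114-B" -> { visitors: 0, conversions: 0 }

function assignAndRecord(converted) {
  var r113 = Math.random() < 0.5 ? '113-A' : '113-B';  // Blue vs. Red
  var r114 = Math.random() < 0.5 ? '114-A' : '114-B';  // Red vs. Purple
  var key = r113 + '|' + r114;
  if (!cells[key]) cells[key] = { visitors: 0, conversions: 0 };
  cells[key].visitors++;
  if (converted) cells[key].conversions++;
}

// Simulated usage: 100,000 visitors converting at a flat 3% rate
for (var i = 0; i < 100000; i++) assignAndRecord(Math.random() < 0.03);
console.log(cells);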

Estimating Sample Sizes by Experiment Effect Size

If you are running an A/B test there is an inherent trade-off between how long you run the test and how precise your answer will be.

The reference chart below should give you some guidance. Find the baseline conversion rate on the left-hand side that is closest to your conversion rate. Then select a column that approximates how big an impact you guess your experimental version will have on your conversion rate. Next, locate the number of visitors needed PER recipe for a two-recipe test in the first grid of numbers. If you'd rather estimate based on the number of conversions in the baseline recipe, look at the second set of numbers, or the number of conversions in the challenger in the third set of numbers.


Sample Size by Effect Size


Overall, the smaller the effect size you need to detect, the more traffic you need. If you don't have that traffic volume, you'll only be able to detect effects that are larger. Also, the smaller your baseline conversion rate is, the more traffic you need to detect the same effect size.
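If you'd rather compute a number than read it off a chart, here is a minimal sketch of the standard two-proportion sample size formula, assuming a two-sided test at p < 0.05 and 80% power (the chart may use slightly different assumptions):

// Minimal sketch of the standard two-proportion sample size formula,
// assuming a two-sided alpha of 0.05 (z = 1.96) and 80% power (z = 0.84).
// baselineRate: control conversion rate, e.g. 0.03
// relativeLift: expected relative improvement, e.g. 0.10 for +10%
function visitorsPerRecipe(baselineRate, relativeLift) {
  var p1 = baselineRate;
  var p2 = baselineRate * (1 + relativeLift);
  var zAlpha = 1.96;  // two-sided 95% confidence
  var zBeta = 0.84;   // 80% power
  var variance = p1 * (1 - p1) + p2 * (1 - p2);
  var n = Math.pow(zAlpha + zBeta, 2) * variance / Math.pow(p2 - p1, 2);
  return Math.ceil(n);
}

// Example: 3% baseline, hoping to detect a 10% relative lift
console.log(visitorsPerRecipe(0.03, 0.10));  // roughly 53,000 visitors per recipe

Notice how the required traffic grows as the lift you want to detect shrinks or as the baseline rate drops, which is exactly the pattern described above.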

That's totally random!

Actually, no it's not.

When you run a test and split the traffic "evenly," it's unlikely you will get exactly the same number of visitors in each recipe. The most common method of splitting traffic is via a random number stored in a cookie. Some other methods are used as well: "round robin" and "time splitting." Assuming you are using a random number generator, there are two main sources of variance off a completely even split. The first is that the random number generator isn't completely random. In this simulator we stick to a common method of generating a fairly random number, which does a pretty good job, as you can see when you increase the sample size per round. We must admit, however, that computers can't generate something purely random, but they can get very close. In JavaScript we flip a coin like:

// Returns 1 or 2, each with (roughly) equal probability
var randomnumber = Math.floor(Math.random() * 2) + 1;

The remaining variance is due to "sampling error". The sampling error is the variation that occurs due to the fact that you are only sampling some members of a larger population, and those samples may not be perfect representations of the larger population. This is also known as statistical noise. In this simulator you can adjust the sample size per round and visually see the kinds of variations you can expect to get when using samples of this size.
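As a rough stand-in for the simulator, here is a minimal sketch that flips the same coin as above for a round of visitors and reports how far the split drifts from 50/50; larger rounds drift less, and that shrinking wobble is the sampling error:

// Minimal sketch: flip the same coin as above for one round of visitors
// and report what share landed in recipe A. The drift away from 0.50 is
// the sampling error, and it shrinks as the round gets bigger.
function simulateSplit(visitorsPerRound) {
  var recipeA = 0;
  for (var i = 0; i < visitorsPerRound; i++) {
    var randomnumber = Math.floor(Math.random() * 2) + 1;
    if (randomnumber === 1) recipeA++;
  }
  return recipeA / visitorsPerRound;
}

// Example: compare the wobble at two sample sizes
console.log(simulateSplit(100));    // can land anywhere from roughly 0.40 to 0.60
console.log(simulateSplit(10000));  // usually within about 0.49 to 0.51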

You may try my random number generator simulation which will open in a new window. This will allow you to play with the sample size and see the impact on the variance that is due mainly to sampling error.


Sample Size Calculator & Statistical Power

Calculating sample size is an important first step in the process of running an A/B split test. Many people ignore this step or do not understand the purpose it serves. Let’s explore why to calculate sample size and how to do it.

What is Statistical Power
First, let’s explain why to calculate sample size. It’s all about the test’s “power.” By using a sample size calculator before your marketing experiment, you can ensure a certain power. What is statistical power? It is the chance that your test will detect a difference if that difference exists. A test with a small sample size may have low power, and so may a test that is trying to pick up a very weak signal.
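For the curious, here is a minimal sketch of the power calculation itself, assuming a two-sided z-test at p < 0.05 and equal traffic per recipe; the function names and defaults are my own, not taken from any particular sample size calculator:

// Minimal sketch of a power calculation, assuming a two-sided z-test at
// alpha = 0.05 and equal traffic per recipe. Given the baseline rate, the
// lift you hope to detect, and the visitors you can afford per recipe,
// it returns the probability the test will flag the difference.
function normalCdf(x) {
  // Abramowitz-Stegun style approximation of the standard normal CDF
  var t = 1 / (1 + 0.2316419 * Math.abs(x));
  var d = 0.3989423 * Math.exp(-x * x / 2);
  var p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 +
          t * (-1.821256 + t * 1.330274))));
  return x > 0 ? 1 - p : p;
}

function power(baselineRate, relativeLift, visitorsPerRecipe) {
  var p1 = baselineRate;
  var p2 = baselineRate * (1 + relativeLift);
  var se = Math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / visitorsPerRecipe);
  var zAlpha = 1.96;  // two-sided 95% confidence
  return normalCdf(Math.abs(p2 - p1) / se - zAlpha);
}

// Example: 3% baseline, +10% relative lift, 20,000 visitors per recipe
console.log(power(0.03, 0.10, 20000));  // roughly 0.40, an underpowered test

Running the numbers before the test tells you whether your planned sample size gives you a reasonable chance of detecting the lift you care about, or whether you are about to run a coin-flip of an experiment.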

What to Consider When Building Your Own A/B Test System

Sometimes Adobe Test & Target is the right tool for your A/B testing needs. Or maybe Google Website Optimizer fits the need (and budget) better. How about SiteSpect or Optimost? But what if your particular business requires you to build your own A/B testing or MVT framework from scratch? That's the situation that confronted me during the past year. It can feel a little daunting, but what I've learned is that if you stick with it, it can be worth the effort to code your own website optimization system.

Why would you want to build your own website optimization platform?

When you’re optimizing a landing page for leads or sales, and those leads or sales are going to happen right away, an off-the-shelf solution may be just perfect. Adobe Test & Target will track how many people come into each test recipe, and how many people convert. You’ve got a conversion rate on your control and your test treatments. Voila – you’ve got a working A/B testing solution.

For subscription businesses or software-as-a-service (SaaS) businesses, this may not be so simple. A given recipe may outperform control on some

Trader Joe's Gives a Lesson in Measuring A/B Test Success

I was in Trader Joe’s the other day. My kid was at a little drawing table they have set up with markers and blank paper. My wife was in the cashier’s line. I was watching my kid, but other than that I was pretty useless. So I had time to really take in all the beautiful signage in the store.

“Are all these signs handmade?” I asked the three friendly-looking employees behind the Customer Service desk.
