Estimating Sample Sizes by Experiment Effect Size

If you are running an A/B test, there is an inherent trade-off between how long you run the test and how precise your answer will be.

The reference chart below should give you some guidance. Find the baseline conversion rate on the left-hand side that is closest to your conversion rate. Then select the column that approximates how big an impact you expect your experimental version to have on your conversion rate. Next, locate the number of visitors needed PER recipe for a two-recipe test in the first grid of numbers. If you'd rather estimate based on the number of conversions in the baseline recipe, look at the second set of numbers; for the number of conversions in the challenger, look at the third set.

Sample Size by Effect Size


Overall, the smaller the effect size you need to detect, the more traffic you need. If you don't have that traffic volume, you'll only be able to detect effects that are larger. Also, the smaller your baseline conversion rate is, the more traffic you need to detect the same effect size.

Let's take an example. Say your baseline is a 1% conversion rate, and you'd like to be able to detect any change of 10% or greater. Then you're going to require 163,095 visitors in each recipe. If you run the test to that sample size, you'll have pretty good statistical power to detect a change of 10% or greater. The statistical power assumed in producing this chart was 80%. That means that 4 out of 5 times you run such an experiment, you will get a statistically significant result (p<0.05) after running the statistical test IF the true effect size is 10% or greater.
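If you'd rather compute these numbers directly than read them off the chart, here is a minimal sketch of the standard two-proportion sample-size formula (two-sided z-test, normal approximation). This is an assumption about how the chart was produced, not a guarantee, but with an alpha of 0.05 and 80% power it should land very close to the 163,095 figure above:

```python
from math import ceil, sqrt
from statistics import NormalDist  # stdlib normal distribution (Python 3.8+)

def sample_size_per_recipe(p_base, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed PER recipe in a two-recipe test to detect a given
    relative lift over a baseline conversion rate (normal approximation)."""
    norm = NormalDist()
    p_alt = p_base * (1 + relative_lift)
    p_bar = (p_base + p_alt) / 2
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # critical value, two-sided test
    z_beta = norm.inv_cdf(power)            # quantile for desired power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base)
                                 + p_alt * (1 - p_alt))) ** 2
    return ceil(numerator / (p_base - p_alt) ** 2)

# 1% baseline, 10% relative lift -> roughly 163,095 visitors per recipe
print(sample_size_per_recipe(0.01, 0.10))
```

Swapping in your own baseline rate and minimum detectable effect reproduces any cell of the chart to within rounding.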

But what if the experimental recipe DID have an impact, but it was only 2%? If you ran the experiment above, putting 163,095 visitors in each recipe, but the true effect size was only 2%, then it is unlikely you would get a statistically significant result at that sample size. If, before you run the test, you determine you need to detect an effect as small as 2%, then you should plan on running the test for much longer. When an effect actually exists but you get a statistically non-significant result, that is called a "Type II" error. Of course, no one can tell you whether you're making that error. You have to plan your tests knowing what your comfort level is for Type I and Type II errors at a programmatic level.
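You can put a number on "unlikely" with the same normal approximation. This sketch (assuming the same alpha of 0.05) computes the power you'd actually have at 163,095 visitors per recipe if the true lift were only 2% on a 1% baseline:

```python
from math import sqrt
from statistics import NormalDist  # stdlib normal distribution (Python 3.8+)

norm = NormalDist()
n = 163_095             # per-recipe sample size planned for a 10% lift
p1, p2 = 0.01, 0.0102   # baseline vs. a true relative lift of only 2%
alpha = 0.05

se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)  # SE of the difference
z_crit = norm.inv_cdf(1 - alpha / 2)
z_effect = abs(p2 - p1) / se
# Probability the test statistic clears the critical value (either tail)
power = norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)
print(f"power = {power:.1%}")
```

The power comes out under 10%, so roughly 9 times out of 10 you'd miss a real 2% effect at that sample size, which is exactly the Type II error risk described above.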

What does this mean for your testing program? Most people, once they grasp how this works, start to wonder how they can "think bigger". In other words, how can we focus our testing on just the "big bets" that will really move the needle in big ways?

But what if a 1% improvement in conversion rate translates into a million dollars in revenue for you? Certainly then you'd love to know, for each change, whether it moves the needle even 1%. However, depending on the underlying numbers in your business, the opportunity cost of running the test that long - potentially locking up other tests or changes you could be making - may still push you to "think bigger".

It'd be great if each time we ran a test we could know for certain whether we moved the needle by 1% or greater. This is certainly the reality for companies like Facebook, Google, and eBay that have so much traffic they really can detect such tiny changes. The bulk of web sites out there, however, may have to think bigger -- or else be very, very patient waiting to rack up large sample sizes.

P.S. If these numbers look funny to you and you'd like a second opinion, here's a link I found to another good post on estimating sample size.
