## Sample Size Calculator & Statistical Power

Calculating sample size is an important first step in running an A/B split test. Many people skip this step or do not understand the purpose it serves. Let’s explore why you should calculate sample size and how to do it.

What is Statistical Power
First let’s explain why you should calculate sample size. It’s all about the test’s “power”. By using a sample size calculator before your marketing experiment, you can ensure the test has a certain power. What is statistical power? It is the chance that your test will detect a difference if that difference exists. A test with a small sample size may have low power. So may a test that is trying to pick up a very weak signal.

An experiment’s “power” is 1 minus the Type II error rate. A Type II error is when you declare that the challenger had no effect when in fact it did. So if you’re OK with “missing” a real effect 20% of the time, then the test’s statistical power is 80%.
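
To make the definition concrete, here is a minimal R sketch that estimates power by simulation: it runs many imaginary A/B tests in which the challenger really is better and counts how often the difference is detected. The traffic figure, conversion rates, seed, and number of simulated tests are all assumptions chosen for illustration, and the significance check is a hand-rolled pooled two-proportion z-test rather than any particular calculator's method.

```r
# Estimate power by simulation: the share of simulated tests that reach
# significance when a true difference exists. All inputs are assumptions.
set.seed(42)        # arbitrary seed, only for reproducibility
n      <- 2000      # visitors per recipe (assumed)
p_ctrl <- 0.25      # true control conversion rate (assumed)
p_chal <- 0.29      # true challenger conversion rate (assumed)
alpha  <- 0.05      # Type I error rate
n_sims <- 2000      # number of simulated experiments

# Simulated conversion counts for each recipe in each experiment
conv_a <- rbinom(n_sims, n, p_ctrl)
conv_b <- rbinom(n_sims, n, p_chal)

# Pooled two-proportion z-test, two-sided, computed for all experiments at once
p_pool   <- (conv_a + conv_b) / (2 * n)
se       <- sqrt(p_pool * (1 - p_pool) * 2 / n)
z        <- (conv_b / n - conv_a / n) / se
p_values <- 2 * pnorm(-abs(z))

simulated_power <- mean(p_values < alpha)   # share of tests that found the effect
type2_rate      <- 1 - simulated_power      # share of tests that missed it
print(c(power = simulated_power, type2 = type2_rate))
```

With these assumed inputs the simulated power should come out at roughly 80%, which means the real effect is missed in roughly 20% of the simulated tests.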

Example A/B Test

Let’s say you have a website that is fairly popular, but the page you are doing your test on is a few clicks away from the home page. It only gets a small percentage of your site’s overall traffic. What you want to test on that page is the product description copy. Perhaps the control version talks about “features” and the challenger talks about “benefits”. This marketing copy may have a small positive impact, but you believe the text changes alone are unlikely to, say, double your conversion rate. Maybe it’s reasonable to assume that if the new copy has an effect, it will be around a 3% lift. This is the hypothesized Lift Threshold.

The question is: do you have enough traffic to do this A/B test? Or, put another way, how long will it take you to run this test and still have a reasonable chance of picking out the “signal” from the noise, if that signal really exists?

This is where a good sample size / power calculator comes into play. You’ll enter in a few numbers, and then you can make a call whether it makes sense to run this test or not.
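
As a rough sketch of that call, the R snippet below turns a required sample size into a test duration. The 10% baseline conversion rate and the 500 visitors per day are hypothetical values picked for illustration (the example above does not state them); only the 3% lift comes from the scenario. It uses R's built-in `power.prop.test` with 80% power and a 0.05 significance level.

```r
# How long would the example test need to run? Inputs marked "assumed"
# are hypothetical and should be replaced with your own numbers.
baseline       <- 0.10   # control conversion rate (assumed)
lift           <- 0.03   # 3% relative lift, from the example scenario
daily_visitors <- 500    # visitors per day reaching the test page (assumed)

challenger <- baseline * (1 + lift)

# Leave n out so power.prop.test solves for the sample size per recipe
res <- power.prop.test(p1 = baseline, p2 = challenger,
                       sig.level = 0.05, power = 0.80)

n_per_recipe <- ceiling(res$n)
days_needed  <- ceiling(2 * n_per_recipe / daily_visitors)  # two recipes share the traffic

cat("Visitors needed per recipe:", n_per_recipe, "\n")
cat("Days needed at current traffic:", days_needed, "\n")
```

Under these assumptions the required sample runs to well over 100,000 visitors per recipe, which at 500 visitors a day would take more than a year. That is the kind of result that tells you to rethink the test before you start it.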

How to Use a Sample Size Calculator / Statistical Power Calculator

Here’s the information you need when using such a calculator. You will need ALL BUT ONE of the following:

Baseline Proportion: This is the conversion rate of the control.

Challenger Proportion: Whatever conversion rate you need to be able to detect. When expressed as a percentage change, this is called your Lift Threshold. Your test will also detect anything more extreme than that – that is, more different from your control. But you will NOT have a good chance of detecting a conversion rate less extreme.

Type I error: This is the chance of declaring a difference when there really isn’t one. It is the significance level of the test, commonly set at 0.05.

Power: This is 1 minus your Type II error (the chance of declaring there is NOT a difference when there really is one). Typically experimenters like to run tests that have at least 80% power.

Traffic: This is the sample size, meaning the number of visitors in each recipe. These calculators are typically geared for two-recipe tests. If you have more recipes, they may not be appropriate or you’ll have to adjust the sample sizes.

The calculator will solve for whatever number you left blank.

Here’s a nice online sample size / power calculator you can use. Of course, stats programs such as SPSS, SAS, Minitab, and R can also calculate power.

Using R to Calculate Sample Size & Power

If you want to use the open-source program R to do this calculation, here are some worked examples.

Example:
Recipe A: 6 conversions, n=24 (rate = 25%)
Recipe B: 7 conversions, n=24 (rate = 29%)
Total n = 48

Here we’re solving for power.

> power.prop.test(n = 24, p1 = 0.25, p2=0.29, alternative='two.sided', sig.level=0.05)

Two-sample comparison of proportions power calculation

n = 24
p1 = 0.25
p2 = 0.29
sig.level = 0.05
power = 0.04951966
alternative = two.sided

NOTE: n is number in *each* group

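With only 24 visitors per recipe, power is about 5%, so this test would almost never detect a move from 25% to 29%. Here we’re solving instead for the sample size (in each group) needed to reach 80% power: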

> power.prop.test(p1 = 0.25, p2=0.29, alternative='two.sided', sig.level=0.05, power=.80)

Two-sample comparison of proportions power calculation

n = 1932.588
p1 = 0.25
p2 = 0.29
sig.level = 0.05
power = 0.8
alternative = two.sided

NOTE: n is number in *each* group
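
If you are curious where that sample size comes from, the sketch below reproduces it by hand using the standard normal-approximation formula for comparing two proportions, then compares the result with `power.prop.test`. The variable names are illustrative only, and the two numbers should agree only up to the small tolerance of R's numerical root finder.

```r
# Reproduce the sample size by hand with the normal approximation.
p1    <- 0.25   # baseline proportion
p2    <- 0.29   # challenger proportion
alpha <- 0.05   # Type I error rate (two-sided)
power <- 0.80

z_alpha <- qnorm(1 - alpha / 2)   # critical value for the two-sided test
z_beta  <- qnorm(power)           # quantile matching the desired power
p_bar   <- (p1 + p2) / 2          # average of the two proportions

n_manual <- ((z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
              z_beta  * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2))^2

# Same inputs through the built-in function, for comparison
n_r <- power.prop.test(p1 = p1, p2 = p2, sig.level = alpha, power = power)$n

print(c(manual = n_manual, power.prop.test = n_r))
```

Either way, you would round up and plan for about 1,933 visitors in each recipe.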

Running an A/B Test With a Predefined Sample Size
So now you’ve put in your assumptions and the calculator has spit out the sample size you need in order to have 80% power at your lift threshold.

Now you need to run the test until you reach that sample size, then stop and perform your statistical test to find your p-value. If the p-value is below, say, 0.05, you will reject the null hypothesis. If not, you will not reject the null hypothesis, but you will know that your test was powerful enough to detect a change as large as your lift threshold 80% of the time.
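
Here is a minimal sketch of that final step in R, using the built-in `prop.test`. The conversion counts are made up for illustration, and the sample size of 1,933 per recipe is the rounded-up figure from the R example earlier.

```r
# Analyze the test once, after the planned sample size has been reached.
# The conversion counts below are hypothetical.
visitors    <- c(control = 1933, challenger = 1933)
conversions <- c(control = 480,  challenger = 545)

# Two-sided test of equal proportions; by default prop.test applies
# Yates' continuity correction
result <- prop.test(conversions, visitors)

reject_null <- result$p.value < 0.05
cat("p-value:", signif(result$p.value, 3), "\n")
cat("Reject the null hypothesis:", reject_null, "\n")
```

With these made-up counts the p-value falls below 0.05, so the null hypothesis would be rejected. The key design choice is that the test is evaluated once, at the sample size chosen in advance.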

What if you know you are constrained to running the test inside of, say, two weeks? If you don't have the liberty of "solving for" sample size, then you can solve for Lift Threshold. This requires a bit of working backwards but you can do it. First figure out how much traffic you get in those two weeks, then plug that into the calculator. Keep your Power at .80 and your Type I error at .05. Input your baseline conversion rate. Now the calculator will solve for the Challenger Proportion, the conversion rate in the challenger recipe. What this tells you is that "within 2 weeks your test has an 80% chance of finding a conversion rate change as different as... [whatever number it spit out]." If that number is outrageous and you don't believe the challenger has a snowball's chance in hell of producing that result, then maybe you should rethink the test.
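
Working backwards like this can be sketched in R by leaving `p2` out of `power.prop.test`, so that it solves for the challenger conversion rate. The two-week traffic figure of 6,000 visitors and the 25% baseline are assumptions for illustration.

```r
# Solve for the detectable challenger rate given a fixed amount of traffic.
visitors_two_weeks <- 6000                    # total traffic in two weeks (assumed)
n_per_recipe       <- visitors_two_weeks / 2  # split evenly across two recipes
baseline           <- 0.25                    # control conversion rate (assumed)

# p2 is left out, so power.prop.test solves for it
res <- power.prop.test(n = n_per_recipe, p1 = baseline,
                       sig.level = 0.05, power = 0.80)

detectable_rate <- res$p2
lift_threshold  <- (detectable_rate - baseline) / baseline  # relative lift

cat("Detectable challenger rate:", round(detectable_rate, 4), "\n")
cat("Lift threshold:", round(100 * lift_threshold, 1), "%\n")
```

Under these assumptions the detectable challenger rate comes out a bit above 28%, a relative lift somewhere above 10%. If you only expect a lift of a few percent, two weeks of this traffic is not enough.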

Conclusion
Understanding the test's power prevents you from putting a lot of time and effort into a test that doesn't have a good shot at finding a difference. And if you use it on all your tests, you get consistency in what your test results mean, and that's not a bad thing either.

If you want to learn more about the steps involved with running an A/B test using statistics, check out this online course: A/B Testing: Test Design & Statistics.