What to Consider When Building Your Own A/B Test System

Sometimes Adobe Test & Target is the right tool for your A/B testing needs. Or maybe Google Website Optimizer fits the need (and budget) better. How about SiteSpect or Optimost? But what if your particular business requires you to build your own A/B testing or MVT framework from scratch? That's the situation that confronted me during the past year. It can feel a little daunting, but what I've learned is that if you stick with it, coding your own website optimization system can be well worth the effort.

Why would you want to build your own website optimization platform?

When you’re optimizing a landing page for leads or sales, and those leads or sales are going to happen right away, an off-the-shelf solution may be just perfect. Adobe Test & Target will track how many people come into each test recipe, and how many people convert. You’ve got a conversion rate on your control and your test treatments. Voila – you’ve got a working A/B testing solution.

For subscription businesses or software-as-a-service (SaaS) businesses, this may not be so simple. A given recipe may outperform control on some short-term measure like signups, but then underperform on a longer-term measurement like retention or lifetime value. If you've got a ton of customer data that lives in your internal database and you want to connect that with the originating test treatments over a longer period of time, you may have to go rogue and roll your own experimentation platform. Trying to integrate your proprietary data into the pre-built solutions is possible, but may not save you much effort. In fact, most of the really huge web companies that do a ton of testing have all built their own systems (e.g., Yahoo, eBay, Netflix, Zynga).

What are the requirements of your home-grown optimization platform?

This post is only the start of a discussion around the questions you should consider before you get into your project. You won't find all the answers here. That's a deeper dive, and the solutions will vary depending on the business and the development team, but the questions should be fairly universal across businesses.

How are you going to randomize visitors into test recipes?

This is a fairly easy one, but there are a couple of different options. Hill climbing? Hashing on the session ID? Round robin? (Again, I won't discuss all the pros and cons of the solutions, simply raise some key questions.)
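
If you go the hashing route, the core of it can be quite small. Here's a minimal sketch in Python (the function and test names are just illustrative, not from any particular framework): hashing the session ID together with the test name gives every visitor a stable assignment without any server-side state.

import hashlib

def assign_recipe(session_id, test_name, recipes):
    # Hash the session ID together with the test name so the same visitor
    # always lands in the same recipe, without storing anything server-side.
    digest = hashlib.sha256(f"{test_name}:{session_id}".encode()).hexdigest()
    return recipes[int(digest, 16) % len(recipes)]

# Example: a hypothetical signup-page test with a control and two variants.
print(assign_recipe("sess-8c41", "signup_headline_test", ["control", "variant_a", "variant_b"]))

Round robin or a pure random draw works too; the hash just saves you from having to persist the assignment before the cookie is written.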

How do you treat returning visitors?

A brand new visitor comes to your site, the system you build looks at the test recipes that are live on the site, and it assigns this visitor at random to one of them. You store the ID of the test recipe in a cookie. When the visitor returns to the site sometime later, he gets placed back into that recipe based on the information in his cookie.

But what if that recipe is no longer live? Maybe that test has ended. What do you do? Well of course you place him in one of the tests that are live at the time he revisits. But if he converts now, to which test do you attribute the conversion – the original, the new, a blend, or none?

More broadly, do you build into your system a trailing period during which returning visitors are still kept in their original test, but the test is closed off to new entrants? Off-the-shelf tools often include this modality, but it is something that can be forgotten (or just deemed unimportant) when building your own.
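
To make that concrete, here's one way the lookup might work, sketched in Python. The registry, field names, and dates are hypothetical; the point is that each test carries both an end date and a "closed to new entrants" window, and returning visitors are handled differently from new ones.

import random
from datetime import datetime, timedelta

# Hypothetical test registry; in practice this would live in a database or admin tool.
TESTS = {
    "signup_headline_test": {
        "recipes": ["control", "variant_a"],
        "end_date": datetime(2011, 6, 30),
        "trailing_days": 14,  # length of the window when the test is closed to new entrants
    },
}

def recipe_for_visitor(cookie, test_name, now):
    test = TESTS[test_name]
    stored = cookie.get(test_name)
    closes_to_new = test["end_date"] - timedelta(days=test["trailing_days"])
    if stored in test["recipes"] and now <= test["end_date"]:
        return stored                          # returning visitor keeps the original recipe
    if now <= closes_to_new:
        return random.choice(test["recipes"])  # new entrant while the test is still open
    return "control"                           # test over (or closed to new entrants): default experience

The attribution question raised above still has to be answered in your reporting layer; this lookup only decides what experience to serve.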

How do you correct for cookie acceptance issues?

What if the user doesn't accept any cookies? This could be a big problem when you have a lot of bots or crawlers coming to your site. Can you check for this and exclude them from the test metrics?

What if the visitor accepts cookies but frequently deletes them? Harder to work around, but some possibilities (e.g., Flash cookies) may exist for you.
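
For the bot and no-cookie cases, a first-pass filter on the tracking side might look something like this (the user-agent list and field names are illustrative, and real bot filtering is considerably messier):

import re

# Common crawler user-agent fragments; real lists are much longer and change constantly.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def should_count_visitor(user_agent, request_cookies, test_name):
    if BOT_PATTERN.search(user_agent or ""):
        return False   # likely a crawler: exclude from test metrics
    if test_name not in request_cookies:
        return False   # our cookie never came back, so the assignment can't be tracked
    return True

print(should_count_visitor("Googlebot/2.1", {}, "signup_headline_test"))  # False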

What are your needs for filtering the traffic that qualifies for your tests?

What if you really only care about truly new traffic in your tests? It's often the case that long-term customers or frequent visitors are less influenced by changes to your site, since they've already built up a pretty strong impression of your offering. Some of the companies that do lots of A/B testing only include new traffic (or new subscribers) for most of their experiments.

If you do choose to include only “new” traffic in your tests, the issues outlined above related to cookie acceptance may resurface. How do you really know who is new? Once you start down this path, you may decide you can exclude anyone who comes to the site, logs in to an existing account, and has not signed up for an account in that session (conclusion: they must be an existing customer, so exclude them).

But if you haven't designed your A/B testing system for this, it could be difficult to add later. After all, that's not trivial. It's no longer a capture-as-you-go situation. Now there's post-processing on all the traffic. You have to wait until the session is over, then pore through the events that occurred, categorize the visits, and then decide whether to include the data in your summary tables or not.
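
As a sketch of what that post-processing might look like (the event names are made up for illustration), the classification rule described above boils down to a single pass over a finished session's events:

def include_in_new_visitor_test(session_events):
    # Rule described above: a visitor who logged in to an existing account but
    # never signed up during the session is treated as an existing customer.
    event_types = {e["type"] for e in session_events}
    return not ("login" in event_types and "signup" not in event_types)

# Example: a session with only a login event is excluded from the summary tables.
print(include_in_new_visitor_test([{"type": "page_view"}, {"type": "login"}]))  # False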

Of course you may care more about retention, customer satisfaction or lifetime value. Then you want to include only existing customers.

Another possible qualification criterion is geography or language. Website internationalization (I18N) is typically on a different development cycle than the site optimization team. So perhaps your tests are only going out in English. In that case, it doesn't make much sense to include your German traffic.

Similarly, you may want to qualify traffic into tests by browser. The rationale here is that it is time consuming and costly to “browser test” every test variant. If you can stick to just the most popular browsers, you can get your creatives through development more quickly and not worry about bugs cropping up due to browser issues. If your traffic is large enough, and the results should carry across all traffic, this may be worth the trade-off in the running time of the experiment.

Or you may want to qualify traffic based on device type. The computer is no longer the only crucial device out there. When you port your UI over to iPhone, iPad, Android, Wii, and who knows what else, the experiments you run will be platform specific.
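
However you slice it, these qualification rules tend to reduce to a per-test eligibility check. A rough sketch, with field names and rule sets that are purely illustrative:

# Hypothetical qualification rules for one test.
QUALIFICATION = {
    "languages": {"en"},
    "browsers": {"chrome", "firefox", "safari"},  # only these variants were browser-tested
    "devices": {"desktop"},                       # this test's creatives are desktop-only
}

def qualifies(visitor, rules=QUALIFICATION):
    return (
        visitor.get("language") in rules["languages"]
        and visitor.get("browser") in rules["browsers"]
        and visitor.get("device") in rules["devices"]
    )

print(qualifies({"language": "de", "browser": "chrome", "device": "desktop"}))  # False: German traffic excluded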

How will you be counting unique visitors?

Ideally you can evaluate a test based on unique visitors (really, unique cookies or unique named accounts). But you need to store a lot of the underlying data to make this happen, and then compute summaries at report run time. Your developers may wonder if keeping daily summary data by test recipe and conversion step is sufficient. It's easier to code, store, and report on. But what if one test recipe causes more frequent visits over many days, but doesn't increase conversion rate? You may wind up with the wrong answer if your data model is deduplicating only by day and not over the full period of the test.
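
Here's a toy illustration of the difference, with made-up event records keyed by cookie ID. Deduplicating over the whole test period counts the repeat visitor once; a per-day summary would count her on every day she showed up and inflate the visitor count for the recipe that drives more repeat visits.

from collections import defaultdict

# Made-up assignment/conversion records; the field names are illustrative only.
events = [
    {"cookie": "c1", "recipe": "variant_a", "converted": False, "day": "2011-03-01"},
    {"cookie": "c1", "recipe": "variant_a", "converted": True,  "day": "2011-03-04"},
    {"cookie": "c2", "recipe": "control",   "converted": True,  "day": "2011-03-02"},
]

# Unique over the whole test period: each cookie counts once per recipe,
# no matter how many days it visited.
visitors, converters = defaultdict(set), defaultdict(set)
for e in events:
    visitors[e["recipe"]].add(e["cookie"])
    if e["converted"]:
        converters[e["recipe"]].add(e["cookie"])

for recipe in sorted(visitors):
    print(recipe, f"{len(converters[recipe]) / len(visitors[recipe]):.0%}")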

How many tests do you need to run?

Are you OK running only one A/B test at a time? If not, the complexity of your problem has just increased quite a bit. The cheat is to build a system where you can run many tests, but each visitor can only be in one test at a time. You're now dividing up your eligible traffic by the number of tests, slowing down each one. Unless you are one of the 25 biggest sites on the web, that may not be workable. For many types of experimental questions you can reasonably assume independence and put one visitor in multiple tests; that is, if your tracking and reporting can support it. Getting into multiple tests for the same visitor may lead you to build a hierarchical mapping of experiences to decide which wins in a conflict. Or else steer clear of the conflicts at the test design and scheduling phase.
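
If you take the scheduling route, even a very simple conflict check can help. In this sketch (the registry and area names are invented for illustration), each test declares the page areas it touches, and a new test can't go live against an area that's already claimed:

# Hypothetical registry of live tests and the page areas they modify.
LIVE_TESTS = {
    "signup_headline_test": {"homepage_hero"},
    "pricing_grid_test": {"pricing_page"},
}

def can_schedule(new_test, areas):
    for name, claimed in LIVE_TESTS.items():
        overlap = claimed & areas
        if overlap:
            print(f"{new_test} conflicts with {name} on {sorted(overlap)}")
            return False
    return True

print(can_schedule("homepage_promo_test", {"homepage_hero"}))  # False: area already claimed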


How good is your data quality?

The last thing you want is to put all the work into creating a good test hypothesis, pushing it through management approvals and development cycles, and then find at analysis time that the data is bad. One or two incorrectly administered or tracked tests can destroy the confidence your peers will have in the results from future tests. What safeguards can you put in place to ensure all components of the optimization platform are performing their job? Running A/A tests may be a piece of the answer, but that only diagnoses certain kinds of failures. What are the ways a test can be corrupted? Also, can something pass QA in a development environment, but crash when it goes live in production? How can you eliminate those scenarios?
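
One concrete safeguard in the A/A spirit is a sample-ratio check: compare how many visitors actually entered each recipe against the intended split. A rough sketch using scipy's chi-square test (the function name and threshold are just illustrative):

from scipy.stats import chisquare

def split_looks_healthy(observed_counts, intended_weights, alpha=0.001):
    # A broken randomizer or cookie bug usually shows up as a lopsided split.
    total = sum(observed_counts)
    expected = [w * total for w in intended_weights]
    _, p_value = chisquare(observed_counts, f_exp=expected)
    return p_value >= alpha  # a tiny p-value means the split is very unlikely under correct bucketing

# Example: a 50/50 A/A test that came out 52,000 vs. 48,000 should be investigated before trusting any results.
print(split_looks_healthy([52000, 48000], [0.5, 0.5]))  # False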

Obviously there are many more questions you'll need to consider when designing your own optimization platform. What types of reporting are needed? Without an independent company and vetted documentation, how dependent will you be on the architect of the system? What sort of admin interface is required for business managers? On and on….


Is there a middle ground between building and buying?

Perhaps you don't need to write it all from scratch. Although that's the route we took, you may choose to start from a code base and customize only as necessary. If your platform is Ruby on Rails, you have a couple of choices for pre-built A/B testing frameworks; ABingo and SevenMinuteAbs are two of them. Similar packages may be available for your platform, whether Java or PHP.

I hope this was helpful to get the wheels turning. If you’ve built your own A/B testing system, I’d love to hear from you what’s worked, what hasn’t, and what you’ve learned from the experience.

