• We’ve heard that A/B tests are awesome and all, but should we use them for everything?
  • And if not, then when should we run an A/B test?


  • We should treat A/B testing like any other decision-making tool: it has costs and benefits, and we should use it when the benefits outweigh the costs.
  • Like all other tools, it can yield incorrect results (false positives or false negatives), and it carries costs (implementation complexity and time, waiting for results).
  • When deciding whether to run an A/B test, we should weigh the tradeoffs versus alternative ways to make a decision.

Decision making tools

Here’s a quick summary of a handful of decision making tools and when to use each. At TaskRabbit, we generally like to use gut calls, qualitative research, and A/B tests to make most business and product decisions.

Couple of observations:
  • As a startup, we’re OK with trading off decision-making confidence for speed (i.e., if an A/B test is looking particularly hairy to set up and run, then we’re perfectly fine with making the call based on gut or small-sample qualitative research).
  • The folks who work at TaskRabbit generally develop pretty good intuition about where the problems/opportunities are in our space because we use our own service a lot. So, if a gut call can instantaneously yield an ~80%-confidence decision, then running an A/B test for a week to gain an incremental ~10% of confidence is a pretty high price to pay.
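That "week of waiting" is mostly a sample-size problem. As a rough sketch of why small lifts take so long to detect, here's the standard two-proportion power formula in Python — the baseline conversion rate, lift, and traffic figures below are made-up illustrations, not TaskRabbit data:

```python
# Rough per-variant sample size needed to detect a conversion lift.
# All numbers here are hypothetical, for illustration only.
from math import sqrt, ceil

def sample_size(p_base, lift, z_alpha=1.645, z_beta=0.84):
    """Per-variant n to detect p_base -> p_base + lift.
    Defaults: 90% confidence (two-sided alpha = 0.10), 80% power."""
    p1, p2 = p_base, p_base + lift
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / lift ** 2
    return ceil(n)

# Detecting a 5% -> 6% conversion lift needs thousands of users
# per variant -- hence the week (or more) of waiting.
print(sample_size(p_base=0.05, lift=0.01))
```

Plug in your own traffic numbers: if the required n divided by your daily visitors per variant comes out to weeks, a gut call or a quick qualitative session may be the better trade.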

* = our preferred decision-making methods

Coin Flip

Pros:
  • Instantaneous decisions

Cons:
  • 50/50 chance of a correct decision
  • Only handles binary choices; not so good with multivariate permutations

Gut Call*

Pros:
  • Fast decisions
  • You could be the next Steve Jobs if your gut calls are right 90% of the time

Cons:
  • Difficult to justify decisions to a skeptical audience
  • Incorrect decisions made this way get attributed to you, personally

Majority Vote

Pros:
  • Team buy-in
  • Accesses and summarizes an entire group's expertise

Cons:
  • Can turn into a slow, political decision-making process
  • Groupthink can derail rational decision making

Qualitative Research*

Pros:
  • Get to know users and build empathy with their concerns
  • Observing users is a great way to discover and surface unknown problems/opportunities
  • Very difficult to argue with video footage or direct quotes from real users

Cons:
  • The research subjects you choose may not be representative of the entire user base
  • Recruiting research subjects and running sessions takes time and budget

Survey

Pros:
  • Good way to gauge the sentiment of a large sample of people
  • Medium-speed decisions; it's fairly easy/fast to recruit and drive traffic to a survey

Cons:
  • What people say they would do in a situation often turns out to be different from what they'll actually do when faced with a real, working product or a real, live human being
  • A skeptical audience will argue with your survey design

A/B Test*

Pros:
  • Results can be made as confident as you need by collecting a larger sample (most teams use a 90%+ confidence level)
  • Collects very reliable data on real users interacting with a live product
  • Very difficult to argue with statistically significant results

Cons:
  • Takes time, code, and sometimes marketing coordination to set up an A/B test
  • Once a test is running, it takes a while (a week or more) to reach a large enough sample size
  • Can only run a limited number of A/B tests at one time
  • It's a bit like panning for gold – many/most tests will be inconclusive, i.e., will not yield statistically significant differences
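For the "statistically significant results" row above, the usual check for a conversion-rate A/B test is a two-proportion z-test. Here's a minimal sketch — the traffic and conversion numbers are invented for illustration, and real tests should also account for how long the test ran and how the significance threshold was chosen:

```python
# Sketch: checking an A/B test result with a two-proportion z-test.
# Conversion counts below are hypothetical, not real data.
from math import sqrt, erf

def z_test(conv_a, n_a, conv_b, n_b):
    """Return (z statistic, two-sided p-value) for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical week of traffic: control converts 5%, variant 6%
z, p = z_test(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
# At a 90% confidence level, the result is significant if p < 0.10
```

Note the "panning for gold" point still applies: with small samples or small lifts, p will often land above the threshold and the test is simply inconclusive.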

Hope that helps! Please feel free to comment…
