An experiment, not a shortcut — and knowing the difference is what separates useful results from misleading ones.
What it is
A/B testing is a controlled experiment that compares two versions of a design — a control (A) and a variant (B) — to determine which performs better against a pre-defined metric. Traffic is split between the two versions, and the result is measured once enough data has accumulated to reach statistical confidence.
The key phrase is "pre-defined metric." A test without a specified success criterion isn't an experiment — it's a fishing expedition. You'll find something that "won," but you won't know if it matters or if you're just looking at noise.
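One common way to split traffic is to hash a stable user identifier, so each user sees the same version on every visit. The sketch below is only an illustration under that assumption: the function name `assign_variant`, the experiment label, and the 50/50 default are hypothetical and not taken from any particular testing tool.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (variant).

    Hypothetical helper: `experiment` is an arbitrary label and `share_b`
    is the fraction of traffic that should see the variant.
    """
    # Hash the experiment label together with the user ID so the same user
    # can land in different arms across different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Turn the first 4 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "B" if bucket < share_b else "A"
```

Calling `assign_variant("user-42", "checkout-layout")` returns the same arm every time for that user, which keeps the experience consistent and the measurement clean.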
What A/B testing can and can't tell you
A/B tests are very good at telling you what users do differently when the experience changes. They're poor at telling you why.
If version B reduces drop-off at a checkout step, you know the variant performs better on that metric. You don't know whether users found the new layout clearer, whether the changed copy reduced anxiety, or whether the button placement was the driver. You get the signal without the mechanism — which makes it hard to generalize the learning or apply it to a different context.
For the "why" layer, combine A/B tests with usability testing. One tells you that something works; the other tells you why.
The statistics you actually need to understand
Two things matter before calling a winner: statistical significance and practical significance.
Statistical significance tells you whether the difference between A and B is unlikely to be random chance. The conventional threshold is 95% confidence: if A and B truly performed the same, a difference at least this large would show up less than 5% of the time. Checking results repeatedly and stopping as soon as one version is "winning" is called peeking, and it inflates false positive rates significantly. Calculate required sample size before the test runs, not after.
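As a rough illustration of calculating sample size upfront, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. The function name, the 80% power default, and the example numbers are assumptions chosen for illustration; a dedicated calculator or statistics library may use a slightly different formula.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH arm for a two-sided test.

    baseline: expected conversion rate of the control, e.g. 0.10
    mde_abs:  smallest absolute lift worth detecting, e.g. 0.01 (one point)
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    # Critical values for the chosen significance level and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

With a hypothetical 10% baseline and a one-point minimum lift, `sample_size_per_arm(0.10, 0.01)` comes out at roughly fifteen thousand users per arm. Halving the lift you want to detect roughly quadruples that number, which is why low-traffic pages struggle to support small-effect tests.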
Practical significance is whether the difference is large enough to act on. A test that reaches significance with a 0.2% improvement in conversion may not be worth shipping. Statistical significance doesn't tell you whether a result matters, only whether it's unlikely to be noise.
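The two checks can be kept separate in code. The sketch below runs a two-proportion z-test for statistical significance, then compares the confidence interval for the lift against a minimum worthwhile lift. The function name, the `min_lift_abs` parameter, and the rule "the whole interval must clear the threshold" are illustrative assumptions, not a standard; the threshold itself is a product decision.

```python
import math
from statistics import NormalDist


def compare_conversion(conv_a: int, n_a: int, conv_b: int, n_b: int,
                       alpha: float = 0.05, min_lift_abs: float = 0.0) -> dict:
    """Two-sided z-test on conversion rates plus a practical-significance check."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Statistical significance: pooled standard error under "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval for the lift: unpooled standard error.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {
        "diff": diff,
        "p_value": p_value,
        "ci": ci,
        "statistically_significant": p_value < alpha,
        # Assumed rule: even the low end of the interval must be worth shipping.
        "practically_significant": ci[0] >= min_lift_abs,
    }
```

For example, with made-up counts of 100,000 conversions out of 1,000,000 users in A and 102,000 out of 1,000,000 in B, and a minimum worthwhile lift of half a percentage point (`min_lift_abs=0.005`), the result is statistically significant but not practically significant: the 0.2-point lift is very unlikely to be noise, yet it is too small to clear the bar.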
UX A/B testing vs. conversion optimization
These often get conflated, but they come from different traditions with different goals.
Conversion rate optimization (CRO) focuses narrowly on measurable outcomes — sign-ups, purchases, clicks — sometimes at the expense of the broader experience. UX A/B testing is more interested in whether a design change improves the experience as a whole, even when that's harder to quantify.
A CRO test might show that a manipulative dark pattern increases sign-ups in the short term. A UX lens asks whether those users are satisfied and retained, or whether you've traded long-term trust for a short-term metric lift.
Mistakes that make results worthless
- Underpowered tests. Without enough traffic, a test is likely to miss a real effect, and any "win" it does report is more likely to be noise or an exaggerated estimate. Sample size should be calculated upfront.
- Testing too many things at once. Changing three elements between A and B makes it impossible to know which one drove the result.
- Stopping early. Calling a winner the first time one version pulls ahead violates the statistics the method depends on.
- Shipping without understanding. A result you can't explain is a result you can't build on. If you don't know why B won, you can't apply the learning anywhere else.
- Novelty effects. Users sometimes behave differently with new things for a short window. Short-term lifts occasionally disappear once the initial curiosity fades.
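The cost of stopping early is easy to see in a simulation. The sketch below runs repeated A/A tests, where both arms share the same true conversion rate so any declared "winner" is a false positive, and compares two policies: stop at the first interim look that crosses the 95% threshold, or evaluate only once at the planned sample size. All parameters (rate, sample size, look frequency, number of simulations) are arbitrary illustrative choices.

```python
import math
import random


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two arms of equal size n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def false_positive_rates(rate: float = 0.10, n_per_arm: int = 2000,
                         look_every: int = 100, n_sims: int = 500,
                         z_crit: float = 1.96, seed: int = 0):
    """Return (rate with peeking, rate with a single final look).

    n_per_arm should be a multiple of look_every so the last look
    coincides with the planned end of the test.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        crossed_at_any_look = False
        z = 0.0
        for n in range(1, n_per_arm + 1):
            # Both arms convert at the same true rate: there is no real winner.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % look_every == 0:
                z = z_stat(conv_a, conv_b, n)
                if abs(z) > z_crit:
                    crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        final_hits += abs(z) > z_crit
    return peeking_hits / n_sims, final_hits / n_sims
```

Running `false_positive_rates()` typically shows the single final look staying near the nominal 5%, while the peeking policy declares a false winner several times as often. If interim looks are genuinely needed, sequential testing methods designed for repeated checks are the appropriate tool rather than a fixed-sample threshold.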