Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing

Ron Kohavi, Diane Tang & Ya Xu
2020 · Cambridge University Press

Source: https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59

Kohavi ran experimentation at Amazon, Microsoft and Airbnb; Tang and Xu built the experimentation programmes at Google and LinkedIn respectively.

Between them they have run tens of thousands of experiments, and the book condenses that experience into a practical manual on running A/B tests without fooling yourself.

The most useful chapters are the ones on pitfalls — multiple testing, sample ratio mismatch, novelty effects, network interference — precisely the failure modes that product teams discover painfully over years.

For product direction this is the most authoritative book on experimentation currently in print.

Treat it as reference material: long sections are dense, the indexing is good, and most teams will need different parts at different stages.

Central argument

Kohavi, Tang, and Xu argue that most organisations running A/B tests are systematically deceiving themselves, and that rigorous experimentation requires institutional discipline as much as statistical knowledge. The central thesis is that trustworthy results depend on correctly diagnosing failure modes — sample ratio mismatch, novelty effects, multiple testing, network interference — that silently corrupt experiment validity long before anyone questions the analysis. The book makes the case that experimentation at scale is an engineering and organisational problem, not merely a methodological one, and that the difference between teams that learn from experiments and teams that merely run them lies in how rigorously they operationalise these safeguards.

Critique

The book's authority comes partly from its provenance — Amazon, Microsoft, Google, LinkedIn — and that same provenance is a limitation: the frameworks are calibrated for organisations with millions of daily active users, mature data infrastructure, and dedicated experimentation platforms. Teams working at lower traffic volumes, in B2B contexts, or without dedicated tooling will find that many of the statistical power thresholds and detection mechanisms described are practically unreachable, and the book offers limited guidance on what principled experimentation looks like under those constraints. There is also a relative silence on the qualitative complement to controlled experiments — when quantitative signals are too weak or too slow, the book has little to say about how to make product decisions.
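
The traffic constraint can be put in rough numbers with a widely used rule of thumb for sample size: about 16 times the variance divided by the squared minimum detectable difference, per variant, for roughly 80% power at a 5% significance level. The sketch below applies it to a conversion-rate metric; the function name users_per_variant and the example rates are hypothetical, and the rule is an approximation rather than a full power analysis.

```python
import math


def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Approximate users needed per variant using n = 16 * variance / delta^2
    (rule of thumb for ~80% power at a two-sided 5% significance level)."""
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    delta = baseline_rate * relative_lift           # absolute difference to detect
    return math.ceil(16 * variance / delta ** 2)


if __name__ == "__main__":
    # Hypothetical scenario: 5% baseline conversion rate.
    for lift in (0.10, 0.05, 0.01):
        n = users_per_variant(0.05, lift)
        print(f"relative lift {lift:.0%}: about {n:,} users per variant")
```

Under this rule, a 5% baseline conversion rate and a 5% relative lift call for on the order of 120,000 users per variant, and the requirement grows with the inverse square of the lift. That is why small effects are out of reach for low-traffic products.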

Why it matters for product

For a CPO, the chapters on pitfalls function as an organisational audit tool: if your team cannot explain sample ratio mismatch or has no protocol for multiple comparisons, your experimentation programme is generating false confidence rather than signal, which means strategic bets are being validated on corrupted evidence. The book frames experimentation as infrastructure, something that requires investment in platforms, governance, and statistical literacy before it returns reliable learning, and that framing directly informs decisions about when and how much to invest in experimentation capability versus other forms of product discovery. It also sharpens the metrics conversation: the authors' treatment of guardrail metrics versus success metrics gives product leaders a concrete vocabulary for negotiating what a test is actually trying to prove.
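
The multiple-comparisons risk mentioned above is easy to quantify in a simplified setting. The sketch below assumes the metrics are statistically independent (rarely true in practice) and uses the Bonferroni correction as one simple, conservative protocol; the function names are hypothetical and this is not presented as the book's recommended procedure.

```python
def chance_of_false_positive(num_metrics: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_metrics independent tests is
    significant at level alpha when no true effect exists."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_threshold(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric significance threshold that keeps the family-wise
    error rate at or below alpha."""
    return alpha / num_metrics


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(f"{k:>2} metrics: "
              f"P(at least one false positive) = {chance_of_false_positive(k):.2f}, "
              f"Bonferroni threshold = {bonferroni_threshold(k):.4f}")
```

With twenty independent metrics and no correction, a spurious "win" somewhere on the scorecard is more likely than not, which is the false confidence the paragraph above warns about.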