Doing Data Science: Straight Talk from the Frontline
Source: https://www.oreilly.com/library/view/doing-data-science/9781449363871/
Based on Rachel Schutt's Columbia University course and Cathy O'Neil's experience as a data scientist at various startups, this book captures the discipline of data science at the moment it was coalescing from statistics, machine learning, and domain expertise into something with its own identity.
Each chapter addresses a practical topic — exploratory data analysis, regression, classification, recommendation engines, MapReduce — but what distinguishes the book is its honesty about the human and organisational dimensions: what gets lost in the pipeline, how models encode assumptions, why communication with non-technical stakeholders matters as much as the algorithm.
O'Neil, who would later write Weapons of Math Destruction, already shows her concern with the ethical implications of modelling decisions.
The book reads less like a textbook and more like an oral history of a discipline being invented, told by practitioners who are candid about what they do not yet understand.
Central argument
O'Neil and Schutt argue that data science is not a unified technical discipline but a practice assembled from statistics, machine learning, and domain knowledge, held together by judgment calls that are rarely made explicit. Their central claim is that the real difficulty in data science lies not in the algorithms but in the human decisions surrounding them: what assumptions are baked into a model, what gets dropped in the pipeline, how findings are communicated to people who will act on them without understanding them. O'Neil, foreshadowing her later work, insists that modelling choices are never neutral — they encode priorities and carry consequences that practitioners must own.
Critique
Because the book emerged from a single Columbia course and a specific moment in the field's formation, its scope is shaped by a particular institutional and cultural context — elite academia intersecting with New York tech startups — that it does not interrogate. Practitioners working in regulated industries, large enterprises, or non-Western contexts may find the organisational candour less transferable than it appears. The book's oral-history format, while its greatest strength for texture, also means that its normative claims about how data science should be practised remain implicit rather than systematically argued.
Why it matters for product
For a chief product officer (CPO), the book's most actionable argument is that the distance between a model and a decision is where organisational failure lives: data teams can produce technically sound outputs that product and business stakeholders misread, misapply, or strip of the assumptions behind them. This has direct implications for how product leaders should structure data partnerships: not as a service relationship where analysts deliver numbers on request, but as a collaborative process where model assumptions and limitations are surfaced as part of the deliverable. The book also puts early pressure on the metrics layer of product development. If the measures you are optimising encode unexamined choices, the roadmap built on top of them inherits those choices invisibly.