Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Dat Tran & Douwe Kiela
2026 · arXiv preprint (2604.02460)

Source: https://arxiv.org/abs/2604.02460

The paper that forced the multi-agent debate to control for the variable it should have controlled for from the start: computational budget.

Tran and Kiela gave single-agent and multi-agent LLM systems identical reasoning token budgets and measured performance on multi-hop reasoning tasks across three model families.

The single agent matched or outperformed multi-agent variants in nearly every condition.

The theoretical backbone is the Data Processing Inequality: every inter-agent handoff can only lose information, never create it.

The study also identifies the regime where multi-agent becomes competitive — degraded contexts where no single agent can maintain coherence — turning a binary debate into a boundary question.

For product leaders deploying agentic systems, the core lesson is Brooks's law restated for the AI era: coordination has costs, and those costs are invisible until you measure them.

Central argument

Tran and Kiela argue that multi-agent LLM systems (MAS) offer no inherent architectural advantage over single-agent systems (SAS) for multi-hop reasoning: their apparent gains are largely an artifact of spending more thinking tokens. Using the Data Processing Inequality as a theoretical foundation, they show that any message passed between agents is a lossy transformation of the full context, so a message cannot carry more information about the correct answer than the context it was derived from. Empirically, across Qwen3, DeepSeek-R1-Distill, and Gemini 2.5, SAS consistently matches or outperforms MAS once thinking token budgets are held equal. MAS becomes competitive only when a single agent's context utilization degrades, for instance under long or noisy inputs.
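The Data Processing Inequality step can be checked numerically on a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes a chain Y -> C -> M, where Y is the correct answer, C is the full context, and M is the message one agent hands to the next. The joint distribution `JOINT_YC` and the `lossy_message` function are invented for the example. The inequality states that I(Y; M) <= I(Y; C) for any such message.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Return I(A; B) in bits for a joint distribution {(a, b): probability}."""
    pa = defaultdict(float)
    pb = defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def push_through(joint_yc, summarize):
    """Return the joint of (Y, M), where M = summarize(C) is the handoff message."""
    joint_ym = defaultdict(float)
    for (y, c), p in joint_yc.items():
        joint_ym[(y, summarize(c))] += p
    return dict(joint_ym)


# Hypothetical toy joint: answer Y in {0, 1}, context C in {0, 1, 2, 3}.
# Low context values mostly indicate Y = 0, high values mostly indicate Y = 1.
JOINT_YC = {
    (0, 0): 0.30, (0, 1): 0.15, (0, 2): 0.04, (0, 3): 0.01,
    (1, 0): 0.01, (1, 1): 0.04, (1, 2): 0.15, (1, 3): 0.30,
}


def lossy_message(c):
    """Hypothetical lossy handoff: keep only the low bit of the context."""
    return c % 2


i_context = mutual_information(JOINT_YC)
i_message = mutual_information(push_through(JOINT_YC, lossy_message))
i_lossless = mutual_information(push_through(JOINT_YC, lambda c: c))

print(f"I(Y; C) = {i_context:.3f} bits")
print(f"I(Y; M) = {i_message:.3f} bits (lossy handoff)")
print(f"I(Y; M) = {i_lossless:.3f} bits (lossless handoff)")
```

A lossless handoff preserves all of the information about Y, while the lossy one keeps strictly less in this toy case. The inequality bounds what a message can contain; it does not say how well a real model uses the context it is given, which is where the degraded-context regime enters.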

Critique

The theoretical argument hinges on the assumption of 'perfect context utilization' by the single agent, yet the authors themselves acknowledge that real LLMs suffer attention dilution, positional bias, and 'lost in the middle' effects that violate this assumption in production settings. This creates a tension: the information-theoretic case for SAS supremacy holds in an idealized regime that rarely exists at scale, while the conditions under which MAS becomes competitive—degraded context—are precisely the conditions most commonly encountered with complex, long-horizon product tasks. The study also focuses exclusively on multi-hop reasoning benchmarks, leaving open whether its conclusions generalize to tasks involving parallel tool use, long-running workflows, or specialization across knowledge domains, which are common MAS deployment scenarios.

Why it matters for product

For a CPO designing AI-powered product capabilities, this paper directly challenges the reflexive instinct to decompose complex tasks into multi-agent pipelines as a proxy for sophistication or scalability. Because coordination overhead introduces information loss, architectural complexity should be justified by a concrete context-degradation problem, not assumed to be beneficial. Before investing in orchestration infrastructure, product teams should first ask whether a single capable model with a well-structured prompt and sufficient compute already saturates performance on their task. The paper's diagnostic on API-based budget artifacts also warns that vendor-reported token usage may not reflect actual reasoning compute, which has direct implications for cost modeling and build-vs-buy decisions in AI feature development.
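One way a team might put this into practice is to refuse to compare architectures unless their thinking-token spend is roughly equal. The sketch below is an assumption-laden illustration, not the paper's evaluation harness: `RunResult`, `is_budget_matched`, `compare`, and the 5% tolerance are hypothetical, and the token counts are assumed to come from your own measurement and not from vendor-reported usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One evaluated configuration; every field here is hypothetical."""
    name: str
    accuracy: float        # fraction correct on your own evaluation set
    thinking_tokens: int   # reasoning tokens, summed over all agents in the run


def is_budget_matched(a, b, tolerance=0.05):
    """Return True when the two runs' thinking-token totals differ by at most `tolerance`."""
    larger = max(a.thinking_tokens, b.thinking_tokens)
    if larger == 0:
        return True
    return abs(a.thinking_tokens - b.thinking_tokens) / larger <= tolerance


def compare(single, multi, tolerance=0.05):
    """Compare accuracy only when both runs spent a similar thinking budget."""
    if not is_budget_matched(single, multi, tolerance):
        return "not comparable: thinking-token budgets differ"
    if multi.accuracy > single.accuracy:
        return "multi-agent ahead at equal budget"
    return "single agent matches or beats multi-agent at equal budget"


# Made-up numbers for illustration only.
unmatched = compare(
    RunResult("single", accuracy=0.70, thinking_tokens=10_000),
    RunResult("multi", accuracy=0.75, thinking_tokens=30_000),
)
matched = compare(
    RunResult("single", accuracy=0.70, thinking_tokens=10_000),
    RunResult("multi", accuracy=0.68, thinking_tokens=10_200),
)
print(unmatched)
print(matched)
```

In the first comparison the multi-agent run scores higher but spent three times the thinking tokens, so the check declines to attribute the gain to architecture.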
