The Hidden Cost of Agentic Failure – O’Reilly

Agentic AI has clearly moved beyond buzzword status. McKinsey’s November 2025 survey shows that 62% of organizations are already experimenting with AI agents, and the top performers are pushing them into core workflows in the name of efficiency, growth, and innovation.

However, this is also where things can get uncomfortable. Everyone in the field knows LLMs are probabilistic. We all track leaderboard scores, but then quietly ignore that this uncertainty compounds when we wire multiple models together. That’s the blind spot. Most multi-agent systems (MAS) don’t fail because the models are bad. They fail because we compose them as if probability doesn’t compound.

The Architectural Debt of Multi-Agent Systems

The hard truth is that improving individual agents does very little to improve overall system-level reliability once errors are allowed to propagate unchecked. The core problem of agentic systems in production isn’t model quality alone; it’s composition. Once agents are wired together without validation boundaries, risk compounds.

In practice, this shows up in looping supervisors, runaway token costs, brittle workflows, and failures that appear intermittently and are nearly impossible to reproduce. These systems often work just well enough to pass benchmarks, then fail unpredictably once they are placed under real operational load.

If you think about it, every agent handoff introduces a chance of failure. Chain enough of them together, and failure compounds. Even strong models with a 98% per-agent success rate can quickly degrade overall system success to 90% or lower. Each unchecked agent hop multiplies failure probability and, with it, expected cost. Without explicit fault tolerance, agentic systems aren’t just fragile. They are economically problematic.

This is the key shift in perspective. In production, MAS shouldn’t be thought of as collections of intelligent components. They behave like probabilistic pipelines, where every unvalidated handoff multiplies uncertainty and expected cost.

This is where many organizations are quietly accumulating what I call architectural debt. In software engineering, we are comfortable talking about technical debt: development shortcuts that make systems harder to maintain over time. Agentic systems introduce a new form of debt. Every unvalidated agent boundary adds probabilistic risk that doesn’t show up in unit tests but surfaces later as instability, cost overruns, and unpredictable behavior at scale. And unlike technical debt, this one doesn’t get paid down with refactors or cleaner code. It accumulates silently, until the math catches up with you.

The Multi-Agent Reliability Tax

If you treat each agent’s task as an independent Bernoulli trial, a simple experiment with a binary outcome of success (p) or failure (q = 1 − p), probability becomes unforgiving. Once you start building MAS, you are at the mercy of the product rule of reliability. In systems engineering, this effect is formalized by Lusser’s law: when independent components execute in sequence, overall system success is the product of their individual success probabilities. While this is a simplified model, it captures the compounding effect that is otherwise easy to underestimate in composed MAS.

Consider a high-performing agent with a single-task accuracy of p = 0.98 (98%). If you apply the product rule for independent events to a sequential pipeline, you can model how your total system accuracy unfolds. That is, if each agent succeeds with probability p_i, its failure probability is q_i = 1 − p_i. Applied to a multi-agent pipeline, this gives you:

$$P(\text{system success}) = \prod_{i=1}^{N} p_i$$

Table 1 illustrates how errors propagate through your agent system without validation.

| # of agents (n) | Per-agent accuracy (p) | System accuracy (pⁿ) | Error rate |
|---|---|---|---|
| 1 agent | 98% | 98.0% | 2.0% |
| 3 agents | 98% | 94.1% | 5.9% |
| 5 agents | 98% | 90.4% | 9.6% |
| 10 agents | 98% | 81.7% | 18.3% |
Table 1. System accuracy decay in a sequential multi-agent pipeline without validation
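The values in Table 1 follow directly from the product rule; a few lines of Python reproduce them:

```python
# Reproduce Table 1: system accuracy under the product rule,
# assuming independent per-agent success probabilities.
def system_accuracy(p: float, n: int) -> float:
    """Probability that all n sequential agents succeed."""
    return p ** n

for n in (1, 3, 5, 10):
    acc = system_accuracy(0.98, n)
    print(f"{n:2d} agents: system accuracy {acc:.1%}, error rate {1 - acc:.1%}")
```

Even at p = 0.98 per hop, the 10-agent pipeline lands at roughly 81.7% end to end, matching the table.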

In production, LLMs aren’t 98% reliable at producing structured outputs on open-ended tasks. Such tasks have no single correct output, so correctness must be enforced structurally rather than assumed. Once an agent introduces a wrong assumption, a malformed schema, or a hallucinated tool result, every downstream agent conditions on that corrupted state. This is why you should insert validation gates that break the product rule of reliability.

From Stochastic Hope to Deterministic Engineering

If you introduce validation gates, you change how failure behaves inside your system. Instead of allowing one agent’s output to become the unquestioned input for the next, you force every handoff to pass through an explicit boundary. The system no longer assumes correctness. It verifies it.

In practice, you’d enforce schemas at generation time with libraries like Pydantic and Instructor. Pydantic is a data validation library for Python that lets you define a strict contract for what is allowed to pass between agents: types, fields, ranges, and invariants are checked at the boundary, and invalid outputs are rejected or corrected before they can propagate. Instructor moves that same contract into the generation step itself, forcing the model to retry until it produces a valid output or exhausts a bounded retry budget. Once validation exists, the reliability math fundamentally changes. If validation catches failures with probability v, each hop becomes:
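A minimal sketch of such a validation gate, using Pydantic. The `ResearchResult` schema and the stubbed retry loop are illustrative assumptions; in a real pipeline, Instructor would re-prompt the model on each retry rather than re-validating the same payload:

```python
from pydantic import BaseModel, Field, ValidationError

# Contract for what may pass between agents: types, fields, ranges,
# and invariants are all checked at the boundary.
class ResearchResult(BaseModel):
    topic: str
    confidence: float = Field(ge=0.0, le=1.0)  # invariant: must lie in [0, 1]
    sources: list[str] = Field(min_length=1)   # invariant: at least one source

def validated_handoff(raw_output: dict, max_retries: int = 3) -> ResearchResult:
    """Reject invalid outputs before they propagate downstream.
    Illustrative only: a real retry would re-prompt the model (as Instructor
    does); here the stub simply re-validates the same payload."""
    last_error = None
    for _ in range(max_retries):
        try:
            return ResearchResult.model_validate(raw_output)
        except ValidationError as e:
            last_error = e  # bounded retry budget: re-prompt and try again
    raise RuntimeError(f"Validation gate exhausted retries: {last_error}")

ok = validated_handoff({"topic": "MAS reliability", "confidence": 0.9,
                        "sources": ["survey"]})
print(ok.topic, ok.confidence)
```

The important design point is that the gate sits *between* agents: a downstream agent only ever sees a `ResearchResult` that has passed the contract, never raw model output.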

$$p_{\text{effective}} = p + (1 - p) \cdot v$$

Again assume a per-agent accuracy of p = 0.98, but now with a validation catch rate of v = 0.9. You get:

$$p_{\text{effective}} = 0.98 + 0.02 \cdot 0.9 = 0.998$$

The 0.02 · 0.9 term reflects recovered failures, since these events are disjoint. Table 2 shows how this changes your system’s behavior.

| # of agents (n) | Per-agent accuracy (p) | System accuracy (pⁿ) | Error rate |
|---|---|---|---|
| 1 agent | 99.8% | 99.8% | 0.2% |
| 3 agents | 99.8% | 99.4% | 0.6% |
| 5 agents | 99.8% | 99.0% | 1.0% |
| 10 agents | 99.8% | 98.0% | 2.0% |
Table 2. System accuracy decay in a sequential multi-agent pipeline with validation
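The same arithmetic in code, folding the validation term into each hop:

```python
# Effective per-hop reliability with a validation gate:
# p_effective = p + (1 - p) * v, since failures caught with
# probability v are recovered rather than propagated.
def effective_accuracy(p: float, v: float) -> float:
    return p + (1 - p) * v

p_eff = effective_accuracy(0.98, 0.9)
print(f"p_effective = {p_eff:.3f}")  # 0.998

# Reproduce Table 2: decay is now far slower than in Table 1.
for n in (1, 3, 5, 10):
    acc = p_eff ** n
    print(f"{n:2d} agents: system accuracy {acc:.1%}, error rate {1 - acc:.1%}")
```

A 10-agent pipeline with validation ends up roughly where a single unvalidated agent started: about a 2% error rate.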

Comparing Table 1 and Table 2 makes the effect explicit: validation fundamentally changes how failure propagates through your MAS. It’s no longer naive multiplicative decay; it’s controlled reliability amplification. If you want a deeper, implementation-level walkthrough of validation patterns for MAS, I cover it in AI Agents: The Definitive Guide. You can also find a notebook in the GitHub repository that runs the computations from Table 1 and Table 2. Now, you might ask what you can do if you can’t make your models 100% perfect. The good news is that you can make the system more resilient through specific architectural shifts.

From Deterministic Engineering to Exploratory Search

While validation keeps your system from breaking, it doesn’t necessarily help the system find the right answer when the task is difficult. For that, you need to move from filtering to searching: give your agents a way to generate multiple candidate paths, replacing fragile one-shot execution with a controlled search over alternatives. This is commonly referred to as test-time compute. Instead of committing to the first sampled output, the system allocates additional inference budget to explore multiple candidates before making a decision. Reliability improves not because your model is better but because your system delays commitment.

At the simplest level, this doesn’t require anything sophisticated. Even a basic best-of-N strategy already improves system stability. For instance, if you sample multiple independent outputs and select the best one, you reduce the chance of committing to a bad draw. This alone is often enough to stabilize brittle pipelines that fail under single-shot execution.
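A small simulation illustrates why best-of-N helps. Assumptions: a stubbed sampler with per-draw success probability 0.7 and a judge that always recognizes a good candidate; under a perfect judge, the failure rate drops from 1 − p to (1 − p)ⁿ:

```python
import random

def sample_candidate(p_good: float, rng: random.Random) -> tuple[str, bool]:
    """Stub LLM call: returns a candidate and whether it is actually correct."""
    ok = rng.random() < p_good
    return ("good answer" if ok else "bad answer", ok)

def best_of_n(p_good: float, n: int, rng: random.Random) -> bool:
    """Sample n independent candidates and commit to the best one
    (a perfect judge is assumed, so success = any good draw)."""
    return any(sample_candidate(p_good, rng)[1] for _ in range(n))

rng = random.Random(42)
trials = 20_000
single = sum(best_of_n(0.7, 1, rng) for _ in range(trials)) / trials
best5 = sum(best_of_n(0.7, 5, rng) for _ in range(trials)) / trials
print(f"single-shot: {single:.3f}, best-of-5: {best5:.3f}")
```

With p = 0.7, the analytical best-of-5 success rate is 1 − 0.3⁵ ≈ 0.998, and the simulation lands close to that; a judge that is merely good rather than perfect lands somewhere between the two numbers.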

One effective approach to selecting the best of multiple samples is to use a framework like RULER. RULER (Relative Universal LLM-Elicited Rewards) is a general-purpose reward function that uses a configurable LLM-as-judge along with a ranking rubric you can adjust to your use case. This works because ranking several related candidate solutions is easier than scoring each one in isolation. Seeing multiple solutions side by side allows the LLM-as-judge to identify deficiencies and rank candidates accordingly. Now you get evidence-anchored verification: The judge doesn’t just agree; it verifies and compares outputs against each other. This acts as a “circuit breaker” for error propagation by resetting your failure probability at every agent boundary.
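The ranking idea can be sketched independently of RULER’s actual API. Everything below is a hypothetical stand-in: `stub_judge` plays the role of an LLM-as-judge applying a rubric to the whole group at once:

```python
# Sketch of groupwise ranking: score candidates relative to each other
# rather than in isolation. The judge sees the whole group at once,
# which makes deficiencies easier to spot than when scoring one at a time.
def rank_candidates(candidates: list[str], judge) -> list[str]:
    """Return candidates ordered best-first by a relative judge."""
    scores = judge(candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]

def stub_judge(group: list[str]) -> list[float]:
    """Hypothetical rubric: prefer answers that ground their claim in a reason."""
    return [1.0 if "because" in c else 0.0 for c in group]

ranked = rank_candidates(["42", "42, because 6*7=42"], stub_judge)
print(ranked[0])  # the evidence-anchored answer wins
```

In a real deployment, the judge call would be an LLM invocation with your rubric in the prompt; the surrounding selection logic stays exactly this simple.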

Amortized Intelligence with Reinforcement Learning

As a next possible step, you could use group-based reinforcement learning (RL), such as group relative policy optimization (GRPO)1 or group sequence policy optimization (GSPO)2, to turn that search into a learned policy. GRPO works at the token level, while GSPO works at the sequence level. You can take the “golden traces” found by your search, that is, the successful reasoning paths, and use them to adjust your base agents. Now you aren’t just filtering errors anymore; you’re training the agents to avoid making them in the first place, because your system internalizes those corrections into its own policy. The key shift is that successful decision paths are retained and reused rather than rediscovered repeatedly at inference time.
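At the heart of GRPO is a group-relative advantage: each sampled trace is scored against the mean and spread of its own group, with no learned value network. A minimal sketch of that computation (the population standard deviation is used here for illustration; implementations vary in normalization details):

```python
import statistics

# Group-relative advantage as used in GRPO-style training:
# A_i = (r_i - mean(r)) / std(r). Above-average traces get positive
# advantage (reinforced); below-average traces get negative advantage.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Four sampled traces for the same prompt; the "golden trace" scored 1.0.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 2) for a in advs])  # [1.41, -1.41, 0.0, 0.0]
```

The golden trace receives the largest positive advantage, so its decision path is reinforced, which is exactly the “retain and reuse instead of rediscover” shift described above.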

From Prototypes to Production

If you want your agentic systems to behave reliably in production, I recommend you approach agentic failure in this order:

  • Introduce strict validation between agents. Enforce schemas and contracts so failures are caught early instead of propagating silently. 
  • Use simple best-of-N sampling or tree-based search with lightweight judges such as RULER to score multiple candidates before committing. 
  • If you need consistent behavior at scale, use RL to teach your agents to behave more reliably for your specific use case.

The reality is you won’t be able to fully eliminate uncertainty in your MAS, but these methods give you real leverage over how uncertainty behaves. Reliable agentic systems are built by design, not by chance.


References

  1. Zhihong Shao et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” 2024, https://arxiv.org/abs/2402.03300.
  2. Chujie Zheng et al. “Group Sequence Policy Optimization,” 2025, https://arxiv.org/abs/2507.18071.
