
For most of the history of software engineering, we’ve built systems around a simple and comforting assumption: Given the same input, a program will produce the same output. When something went wrong, it was usually because of a bug, a misconfiguration, or a dependency that wasn’t behaving as advertised. Our tools, testing strategies, and even our mental models evolved around that expectation of determinism.
AI quietly breaks that assumption.
As large language models and AI services make their way into production systems, they often arrive in familiar shapes. There’s an API endpoint, a request payload, and a response body. Latency, retries, and timeouts all look manageable. From an architectural distance, it feels natural to treat these systems like libraries or external services.
In practice, that familiarity is misleading. AI systems behave less like deterministic components and more like nondeterministic collaborators. The same prompt can produce different outputs, small changes in context can lead to disproportionate shifts in results, and even retries can change behavior in ways that are difficult to reason about. These characteristics aren’t bugs; they’re inherent to how these systems work. The real problem is that our architectures often pretend otherwise. Instead of asking how to integrate AI as just another dependency, we need to ask how to design systems around components that do not guarantee stable outputs. Framing AI as a nondeterministic dependency turns out to be far more useful than treating it like a smarter API.
One of the first places where this mismatch becomes visible is retries. In deterministic systems, retries are usually safe. If a request fails due to a transient issue, retrying increases the chance of success without changing the outcome. With AI systems, retries don’t simply repeat the same computation. They generate new outputs. A retry might fix a problem, but it can just as easily introduce a different one. In some cases, retries quietly amplify failure rather than mitigate it, all while appearing to succeed.
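Here is a minimal sketch of what that implies in practice, assuming a hypothetical `call_model` function and an `is_valid` check: because every attempt can produce a different output, each one has to be validated on its own, and giving up has to be an explicit outcome rather than something that happens by accident.

```python
# A minimal sketch: retrying an AI call only helps if each new output is
# re-validated. `call_model` and `is_valid` are hypothetical placeholders.
import random
from typing import Optional

def call_model(prompt: str) -> str:
    """Stand-in for a nondeterministic model call; each invocation may differ."""
    return random.choice(['{"status": "ok"}', "Sorry, I can't help with that."])

def is_valid(output: str) -> bool:
    """Stand-in validation: here, 'valid' simply means the output looks like JSON."""
    return output.strip().startswith("{")

def retry_with_validation(prompt: str, attempts: int = 3) -> Optional[str]:
    """Each retry produces a *new* output, so validate every attempt
    and give up explicitly instead of assuming repetition converges."""
    for _ in range(attempts):
        output = call_model(prompt)
        if is_valid(output):
            return output
    return None  # the caller decides the fallback; a retry is not a guarantee

if __name__ == "__main__":
    result = retry_with_validation("Summarize the order status as JSON.")
    print(result if result is not None else "fell back to a deterministic path")
```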
Testing reveals a similar breakdown in assumptions. Our existing testing strategies depend on repeatability. Unit tests validate exact outputs. Integration tests verify known behaviors. With AI in the loop, those strategies quickly lose their effectiveness. You can test that a response is syntactically valid or conforms to certain constraints, but asserting that it is “correct” becomes far more subjective. Matters get even more complicated as models evolve over time. A test that passed yesterday may fail tomorrow without any code changes, leaving teams unsure whether the system regressed or simply changed.
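One way to keep tests meaningful is to assert properties of the output rather than exact strings. The sketch below, built around a hypothetical `summarize_ticket` function standing in for a model-backed call, checks structure, required fields, and bounds instead of literal text:

```python
# A minimal sketch of constraint-based testing: assert properties of the
# output (shape, required fields, bounds) rather than exact strings.
# `summarize_ticket` is a hypothetical function wrapping a model call.
import json
import unittest

def summarize_ticket(ticket_text: str) -> str:
    """Hypothetical AI-backed function; imagine this calls a model."""
    return json.dumps({"summary": ticket_text[:50], "priority": "low"})

class TestSummarizeTicket(unittest.TestCase):
    def test_output_conforms_to_contract(self):
        raw = summarize_ticket("Customer cannot log in after password reset.")
        data = json.loads(raw)                       # must be parseable JSON
        self.assertIn("summary", data)               # required fields present
        self.assertIn("priority", data)
        self.assertIn(data["priority"], {"low", "medium", "high"})
        self.assertLess(len(data["summary"]), 200)   # bounded length, not exact text

if __name__ == "__main__":
    unittest.main()
```

Tests like this can still fail when a model changes, but they fail for reasons the team can articulate: a broken contract rather than a mismatched string.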
Observability introduces an even subtler challenge. Traditional monitoring excels at detecting loud failures. Error rates spike. Latency increases. Requests fail. AI-related failures are often quieter. The system responds. Downstream services continue. Dashboards stay green. Yet the output is incomplete, misleading, or subtly wrong in context. These “acceptable but wrong” outcomes are far more damaging than outright errors because they erode trust gradually and are difficult to detect automatically.
Once teams accept nondeterminism as a first-class concern, design priorities begin to shift. Instead of trying to eliminate variability, the focus moves toward containing it. That often means isolating AI-driven functionality behind clear boundaries, limiting where AI outputs can influence critical logic, and introducing explicit validation or review points where ambiguity matters. The goal isn’t to force deterministic behavior from an inherently probabilistic system but to prevent that variability from leaking into parts of the system that aren’t designed to handle it.
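As a rough illustration of such a boundary, the sketch below (using hypothetical names like `classify_refund_request`) parses and validates model output at the edge, so that only a typed, constrained value ever reaches core logic; anything that doesn't fit the contract is escalated rather than passed through:

```python
# A minimal sketch of a boundary: AI output is parsed and validated at the
# edge, and only a typed, checked value crosses into core logic.
# `classify_refund_request` is a hypothetical model-backed helper.
from dataclasses import dataclass

ALLOWED_ACTIONS = {"approve", "deny", "escalate"}

@dataclass(frozen=True)
class RefundDecision:
    action: str
    reason: str

def classify_refund_request(text: str) -> str:
    """Hypothetical model call returning free-form text."""
    return "action: escalate | reason: amount exceeds policy threshold"

def refund_boundary(text: str) -> RefundDecision:
    """Translate unvalidated model output into a constrained domain object.
    Anything that doesn't fit the contract is escalated, never passed through."""
    raw = classify_refund_request(text)
    fields = dict(part.strip().split(": ", 1) for part in raw.split("|"))
    action = fields.get("action", "").strip()
    if action not in ALLOWED_ACTIONS:
        return RefundDecision("escalate", "unparseable model output")
    return RefundDecision(action, fields.get("reason", "").strip())

if __name__ == "__main__":
    print(refund_boundary("Please refund my duplicate charge of $42."))
```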
This shift also changes how we think about correctness. Rather than asking whether an output is correct, teams often need to ask whether it is acceptable for a given context. That reframing can be uncomfortable, especially for engineers accustomed to precise specifications, but it reflects reality more accurately. Acceptability can be constrained, measured, and improved over time, even if it can’t be perfectly guaranteed.
Observability needs to evolve alongside this shift. Infrastructure-level metrics are still necessary, but they’re no longer sufficient. Teams need visibility into outputs themselves: how they change over time, how they vary across contexts, and how those variations correlate with downstream outcomes. This doesn’t mean logging everything, but it does mean designing signals that surface drift before users notice it. Qualitative degradation often appears long before traditional alerts fire, if anyone is paying attention.
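Even crude output-level signals can surface drift early. The sketch below is illustrative only, with made-up thresholds: it compares a rolling window of recent outputs against a baseline window using something as simple as average response length, which is cheap to compute and often shifts when model behavior changes.

```python
# A minimal sketch of output-focused observability: track simple, cheap
# signals about model outputs over time and flag drift against a baseline.
# The signal (output length) and the tolerance are illustrative choices.
from collections import deque
from statistics import mean

class OutputDriftMonitor:
    def __init__(self, window: int = 500, tolerance: float = 0.25):
        self.baseline_lengths: deque = deque(maxlen=window)
        self.recent_lengths: deque = deque(maxlen=window)
        self.tolerance = tolerance

    def record_baseline(self, output: str) -> None:
        self.baseline_lengths.append(len(output))

    def record(self, output: str) -> None:
        self.recent_lengths.append(len(output))

    def drifting(self) -> bool:
        """Compare recent outputs to the baseline; a large relative shift in
        average length is a crude but early signal that behavior changed."""
        if not self.baseline_lengths or not self.recent_lengths:
            return False
        base, recent = mean(self.baseline_lengths), mean(self.recent_lengths)
        return abs(recent - base) / base > self.tolerance

if __name__ == "__main__":
    monitor = OutputDriftMonitor()
    for _ in range(100):
        monitor.record_baseline("a typical, reasonably detailed answer")
    for _ in range(100):
        monitor.record("terse reply")          # outputs suddenly much shorter
    print("drift detected:", monitor.drifting())
```

Real systems would track richer signals, such as refusal rates, validation failures, or downstream correction rates, but the shape is the same: observe the outputs, not just the infrastructure around them.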
One of the hardest lessons teams learn is that AI systems don’t offer guarantees in the way traditional software does. What they offer instead is probability. In response, successful systems rely less on guarantees and more on guardrails. Guardrails constrain behavior, limit blast radius, and provide escape hatches when things go wrong. They don’t promise correctness, but they make failure survivable. Fallback paths, conservative defaults, and human-in-the-loop workflows become architectural features rather than afterthoughts.
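A minimal sketch of that pattern, with hypothetical names like `ai_suggest_reply` and an in-memory review queue standing in for real infrastructure: try the AI path, check the result against guardrails, and fall back to a conservative default plus a human-in-the-loop escape hatch rather than letting an unchecked output flow downstream.

```python
# A minimal sketch of a guardrail: try the AI path, validate the result,
# and fall back to a conservative default or a human review queue instead
# of letting an unchecked output flow downstream. Names are hypothetical.
from typing import Optional

REVIEW_QUEUE: list = []

def ai_suggest_reply(message: str) -> Optional[str]:
    """Hypothetical model-backed suggestion; may return None on failure."""
    return None  # simulate an unusable or missing suggestion

def passes_guardrails(reply: str) -> bool:
    """Illustrative checks: bounded length and no unsupported promises."""
    return len(reply) < 500 and "guaranteed" not in reply.lower()

def reply_to_customer(message: str) -> str:
    suggestion = ai_suggest_reply(message)
    if suggestion and passes_guardrails(suggestion):
        return suggestion
    # Conservative default plus a human-in-the-loop escape hatch.
    REVIEW_QUEUE.append(message)
    return "Thanks for reaching out. A support agent will follow up shortly."

if __name__ == "__main__":
    print(reply_to_customer("Can you promise my refund arrives tomorrow?"))
    print("queued for review:", len(REVIEW_QUEUE))
```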
For architects and senior engineers, this represents a subtle but important shift in responsibility. The challenge isn’t choosing the right model or crafting the perfect prompt. It’s reshaping expectations, both within engineering teams and across the organization. That often means pushing back on the idea that AI can simply replace deterministic logic, and being explicit about where uncertainty exists and how the system handles it.
If I were starting again today, there are a few things I would do earlier. I would document explicitly where nondeterminism exists in the system and how it’s managed rather than letting it remain implicit. I would invest sooner in output-focused observability, even if the signals felt imperfect at first. And I would spend more time helping teams unlearn assumptions that no longer hold, because the hardest bugs to fix are the ones rooted in outdated mental models.
AI isn’t just another dependency. It challenges some of the most deeply ingrained assumptions in software engineering. Treating it as a nondeterministic dependency doesn’t solve every problem, but it provides a far more honest foundation for system design. It encourages architectures that expect variation, tolerate ambiguity, and fail gracefully.
That shift in thinking may be the most important architectural change AI brings, not because the technology is magical but because it forces us to confront the limits of determinism we’ve relied on for decades.