Keep Deterministic Work Deterministic – O’Reilly

This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O’Reilly Radar.

The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
—Tom Cargill, Bell Labs

One of the experiments I’ve been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation where an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses those strategy descriptions to decide how to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.

Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer’s turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player’s total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.

There’s a useful way to think about reliability problems like that: the March of Nines. Getting an LLM-based system to 90% reliability is the first nine, and it’s the “easy” one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Each nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.

Here’s a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I’ll wait.

Prompt 1: Track a running “score” through a 7-step game. Do not use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.

CRITICAL INSTRUCTION: You must reply with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Do not list the words you counted, do not explain your reasoning, and do not write any other text. Just the equation.

Start with a score of 10. I’ll give you the first step in the next prompt.

Prompt 2: “The sudden blizzard chilled the small village communities.” Add the number of words containing double letters (two of the exact same letter back-to-back, like ‘tt’ or ‘mm’).

Prompt 3: “The clever engineer needed seven perfect pieces of cheese.” If your score is ODD, add the number of words that contain EXACTLY two ‘e’s. If your score is EVEN, subtract the number of words that contain EXACTLY two ‘e’s. (Do not count words with one, three, or zero ‘e’s).

Prompt 4: “The good sailor joined the eager crew aboard the wooden boat.” If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like ‘ea’, ‘oo’, or ‘oi’). If your score is 10 or less, multiply your score by this number.

Prompt 5: “The quick brown fox jumps over the lazy dog.” Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).

Prompt 6: “Three brave kings stand under black skies.” If your score is an ODD number, subtract the number of words that have exactly 5 letters. If your score is an EVEN number, multiply your score by the number of words that have exactly 5 letters.

Prompt 7: “Look down, you shy owl, go fly away.” Subtract the number of words that contain NONE of these letters: a, e, or i.

Prompt 8: “Green apples fall from tall trees.” If your score is greater than 15, subtract the number of words containing the letter ‘a’. If your score is 15 or less, add the number of words containing the letter ‘l’.

The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is 60. Here’s the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).

I ran this twice in parallel (using ChatGPT 5.3 Instant) and got two completely different wrong answers. Neither run reached the correct score of 60:

| Step | Correct | Run 1 (transcript) | Run 2 (transcript) |
|---|---|---|---|
| 1. Double letters | 10 + 6 = 16 | 10 + 2 = 12 ❌ | 10 + 5 = 15 ❌ |
| 2. Exactly two ‘e’s | 16 − 4 = 12 | 12 − 4 = 8 ❌ | 15 + 4 = 19 ❌ |
| 3. Consecutive vowels | 12 − 7 = 5 | 8 × 7 = 56 ❌ | 19 − 5 = 14 ❌ |
| 4. Third letter vowel | 5 + 5 = 10 | 56 + 5 = 61 ❌ | 14 + 3 = 17 ❌ |
| 5. Exactly 5 letters | 10 × 7 = 70 | 61 − 7 = 54 ❌ | 17 − 4 = 13 ❌ |
| 6. No a, e, or i | 70 − 7 = 63 | 54 − 7 = 47 ❌ | 13 − 3 = 10 ❌ |
| 7. Words with ‘a’ or ‘l’ | 63 − 3 = 60 | 47 − 3 = 44 ❌ | 10 + 4 = 14 ❌ |

The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (found 2 double-letter words instead of 6) but actually got the later counts right. It didn’t matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.

Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That’s closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.

What makes this insidious: Both runs look the same from the outside. Each step produced a plausible answer, and both runs produced final results. Without the answer key (or some tedious manual checking), you’d have no way of knowing that Run 1 was a near-miss derailed by a single early error and Run 2 was wrong at nearly every step. This is typical of any process where the output of one LLM call becomes the input for the next one.

These failures don’t demonstrate the March of Nines itself—that’s specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It’s possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise, one you can easily try yourself, that demonstrates the underlying problem that makes the march so hard: cascading failures. Each step asks the model to count letters inside words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don’t actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step’s result determines the next step’s operation, so a single miscount in Step 1 cascades through the entire sequence.

I also want to be clear about exactly what a deterministic version of this simulation looks like. Luckily, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:

Prompt 9: Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.

Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because now it’s generating deterministic logic instead of trying to count characters through its tokenizer.
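For reference, here’s a sketch of what such a script might look like—my own version, not output from Prompt 9—that applies the seven rules and reproduces the answer key:

```python
import re

VOWELS = set("aeiou")

def words(sentence):
    """Split a sentence into lowercase words, stripping punctuation."""
    return re.findall(r"[a-z]+", sentence.lower())

def count(sentence, pred):
    """Count the words in a sentence that satisfy a predicate."""
    return sum(1 for w in words(sentence) if pred(w))

score = 10

# Step 1: add words containing a double letter.
score += count("The sudden blizzard chilled the small village communities.",
               lambda w: any(a == b for a, b in zip(w, w[1:])))           # 10 + 6 = 16

# Step 2: odd -> add, even -> subtract words with exactly two 'e's.
n = count("The clever engineer needed seven perfect pieces of cheese.",
          lambda w: w.count("e") == 2)
score = score + n if score % 2 else score - n                             # 16 - 4 = 12

# Step 3: > 10 -> subtract, otherwise multiply by words with consecutive vowels.
n = count("The good sailor joined the eager crew aboard the wooden boat.",
          lambda w: any(a in VOWELS and b in VOWELS for a, b in zip(w, w[1:])))
score = score - n if score > 10 else score * n                            # 12 - 7 = 5

# Step 4: add words whose third letter is a vowel.
score += count("The quick brown fox jumps over the lazy dog.",
               lambda w: len(w) >= 3 and w[2] in VOWELS)                  # 5 + 5 = 10

# Step 5: odd -> subtract, even -> multiply by words with exactly 5 letters.
n = count("Three brave kings stand under black skies.", lambda w: len(w) == 5)
score = score - n if score % 2 else score * n                             # 10 * 7 = 70

# Step 6: subtract words containing none of a, e, or i.
score -= count("Look down, you shy owl, go fly away.",
               lambda w: not set(w) & set("aei"))                         # 70 - 7 = 63

# Step 7: > 15 -> subtract words with 'a', otherwise add words with 'l'.
s7 = "Green apples fall from tall trees."
score = (score - count(s7, lambda w: "a" in w) if score > 15
         else score + count(s7, lambda w: "l" in w))                      # 63 - 3 = 60

print(score)  # 60
```

Note that every branch—odd/even, greater-than-or-not—is an ordinary conditional here, so the cascading-failure risk that wrecked both chat runs simply doesn’t exist.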

Reproducing a cascading failure in a chat

I deliberately engineered the exercise earlier to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters inside tokens. Future models might do a much better job with this specific kind of failure, but the cascading failure problem doesn’t go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn’t.

I also specifically asked the model to show only the equation and skip all intermediate reasoning to prevent it from using chain of thought (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step by step (for example, listing the words it counted and explaining why each one qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you’ll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But “half as many errors” is still not zero. Plus, it’s expensive: It costs more tokens and more time. A Python script that counts double letters gets the right answer on every run, instantly, for zero AI API costs (or, if you’re running the AI locally, for orders of magnitude less CPU usage). That’s the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand it to code.

Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren’t like that. You can’t write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a pipeline, or a reproducible series of steps (some deterministic, some requiring an LLM) that lead to a single result, where each step’s output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.

LLM pipelines are especially susceptible to the March of Nines

I’ve been spending a lot of time thinking about LLM pipelines, and I suspect I’m in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step—whether that’s a content generation pipeline, a data processing chain, or a simulation—you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I’ve been running it hundreds of times per iteration.
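The multiplication is easy to see with made-up numbers. This sketch uses hypothetical per-step reliabilities; the point is the compounding, not the specific values:

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in an independent chain succeeds."""
    return per_step ** steps

# Even a chain of quite reliable steps passes far less often than any one step.
print(f"{chain_reliability(0.98, 7):.3f}")  # 0.868: seven 98% steps ~= 87% overall
print(f"{chain_reliability(0.90, 7):.3f}")  # 0.478: seven 90% steps fail most runs
```

This also shows why pulling a step out of the LLM matters so much: a deterministic step contributes a factor of 1.0 and drops out of the product entirely.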

The blackjack pipeline in Octobatch, an open source batch orchestrator for multistep LLM workflows that I introduced in “The Accidental Orchestrator.”

That’s a screenshot of the blackjack pipeline in Octobatch, the tool I built to run these pipelines at scale. That pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions—and how I really learned the hard way that the March of Nines wasn’t just a theoretical problem but something I could watch happening in real time across hundreds of data points.

Running pipelines at scale made the failures obvious and immediate, which, for me, really underscored an effective approach to minimizing the cascading failure problem: make deterministic work deterministic. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a five, and an eight add up to 23 doesn’t require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That’s arithmetic and a lookup table—work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.
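Both examples from that paragraph fit in a few lines of ordinary Python. This is a sketch—the strategy table below is an illustrative fragment of hard-total basic strategy, not Octobatch’s actual validation code:

```python
CARD_VALUES = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8,
               "9": 9, "10": 10, "J": 10, "Q": 10, "K": 10, "A": 11}

def hand_total(cards):
    """Sum the cards, demoting aces from 11 to 1 while the hand busts."""
    total = sum(CARD_VALUES[c] for c in cards)
    aces = cards.count("A")
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

# Fragment of basic strategy: (player hard total, dealer upcard) -> action.
BASIC_STRATEGY = {(15, 10): "hit", (15, 6): "stand", (12, 4): "stand"}

print(hand_total(["J", "5", "8"]))   # 23 -- a bust, verified by arithmetic
print(BASIC_STRATEGY[(15, 10)])      # hit -- so standing on 15 vs. 10 is a deviation
```

Neither function can miscount, skip a step, or apply its own intuitions—which is exactly the property the LLM calls lacked.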

Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in “AI, MCP, and the Hidden Costs of Data Hoarding.” Teams dump everything into the AI’s context because the AI can handle it—until it can’t. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But “mostly works” is expensive and slow, and a short script does it perfectly. Better yet, the AI can write that script for you—which is exactly what Prompt 9 demonstrated.

Getting cascading failures out of the blackjack pipeline

I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That’s why I’m writing this article—the iteration arc turned out to be one of the clearest illustrations I’ve found of how the principle works in practice.

I addressed failures two ways, and the distinction matters.

Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn’t require an API call, so it’s free, instant, and 100% reproducible. There’s a math verification step that uses code to recalculate totals from the actual cards dealt and compares them against what the LLM reported, and a strategy compliance step that checks the player’s first action against a deterministic lookup table. Neither of those steps requires the AI to make a judgment call; when I originally ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.
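The rough shape of those two verification steps looks like this—the function names and dict shapes are mine for illustration, not Octobatch’s actual API:

```python
def verify_math(cards, reported_total, card_values):
    """Recompute the total from the cards actually dealt and compare
    it against whatever total the LLM reported for the hand."""
    actual = sum(card_values[c] for c in cards)
    return {"actual": actual, "reported": reported_total,
            "passed": actual == reported_total}

def verify_compliance(player_total, dealer_upcard, first_action, table):
    """Check the player's first action against the strategy lookup table."""
    expected = table[(player_total, dealer_upcard)]
    return {"expected": expected, "actual": first_action,
            "passed": expected == first_action}

# The LLM reported 16 for a nine and a six -- deterministic code catches it.
result = verify_math(["9", "6"], 16, {"9": 9, "6": 6})
print(result)  # {'actual': 15, 'reported': 16, 'passed': False}
```

The key design choice is that the verifier never trusts the LLM’s numbers: it recomputes everything from the ground-truth cards and only compares at the end.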

Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought format forced the LLM to show its work instead of jumping to conclusions. The rigid dealer output structure made it mechanically difficult to skip the dealer’s turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don’t eliminate the LLM from the step—they make the LLM more reliable within it.

But before any of that mattered, I had to face the uncomfortable fact that measurements themselves can be wrong, especially when relying on AI to take those measurements. For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, a lot of runs were obviously wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce their results didn’t have adequate guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn’t add up. If you let probabilistic behavior into a step that should be deterministic, the output will look plausible and the system will report success, but you have no way to know something’s wrong until you go looking for it.

Once I fixed the bug, the real pass rate emerged: 31%. Here’s how the next seven iterations played out:

  • Restructuring the data (31% → 37%). The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from looking at what was actually failing and asking whether the LLM needed to be doing that work at all.
  • Chain of thought arithmetic (37% → 48%). Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it’s also more expensive because it uses more tokens and takes more time.
  • Replacing the LLM validator with deterministic code (48% → 79%). This was the single biggest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I’d given it. But there’s a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.
  • Rigid output format (79% → 81%). The LLM kept skipping the dealer’s turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.
  • Overriding the model’s priors (81% → 84%). One strategy required hitting on 18 against a high dealer card, which any conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn’t help. Explaining why the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.
  • Switching models (84% → 94%). I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.

Find the best ways to earn your nines

If you’re building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one recognition was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.

I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc—48% to 79%—came from replacing an LLM validator with a 10-line expression.

Here’s the bottom line for me: If you can write a short function that does the job, don’t give it to the LLM. I initially reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I realized it wasn’t at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.

At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they may just be the next nine that I need to earn.

The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer turns out to be a kind of documentation you write for AI to read, not humans—and it changes the way you think about what a user manual is for.
