AI Is Writing Our Code Faster Than We Can Verify It

This is the third article in a series on agentic engineering and AI-driven development. Read part one here and part two here, and look for the next article on April 30 on O’Reilly Radar.

Here’s the dirty secret of the AI coding revolution: most experienced developers still don’t really trust the code the AI writes for us.

If I’m being honest, that’s not actually a particularly well-guarded secret. It feels like every day there’s a new breathless “I don’t have a lick of development experience but I just vibe coded this amazing application” article. And I get it—articles like that get so much engagement because everyone is watching carefully as the drama of AIs getting better and better at writing code unfolds. We’ve had decades of shows and movies, from WarGames to Hackers to Mr. Robot, portraying developers as reclusive geniuses doing mysterious but incredible stuff with computers. The idea that we’ve coded ourselves out of existence is fascinating to people.

The flip side of that pop-culture phenomenon is that when there are problems caused by agentic engineering gone wrong (like the equally popular “I trusted an AI agent and it deleted my entire production database” articles), everyone seems to find out about it. And, unfortunately, that newly emerging trope is much closer to reality. Most of us who do agentic engineering have seen our own AI-generated code go off the rails. That’s why I built and maintain the Quality Playbook, an open-source AI skill that uses quality engineering techniques that go back over fifty years to help developers working in any language verify the quality of their AI-generated code. I was as surprised as anyone to discover that it actually works.

I’ve talked often about how we need a “trust but verify” mindset when using AI to write code. In the past, I’ve mostly focused on the “trust” aspect, finding ways to help developers feel more comfortable adopting AI coding tools and using them for production work. But I’m increasingly convinced that our biggest problem with AI-driven development is that we don’t have a reliable way to check the quality of code from agentic engineering at scale. AI is writing our code faster than we can verify it, and that is one of AI’s biggest problems right now.

A false choice

After I got my first real taste of using AI for development in a professional setting, it felt like I was being asked to make a critical choice: either I had to outsource all of my thinking to the AI and just trust it to build whatever code I needed, or I had to review every single file it generated line by line.

A lot of really good, really experienced senior engineers I’ve talked to feel the same way. A small number of experienced developers fully embrace vibe coding and basically fire off the AI to do what it needs to, depending on a combination of unit tests and solid, decoupled architecture (and a little luck, maybe) to make sure things go well. But more frequently, the senior, experienced engineers I’ve talked to, folks who’ve been developing for a really long time, go the other way. When I ask them if they’re using AI every day, they’ll almost always say something like, “Yeah, I use AI for unit tests and code reviews.” That’s almost always a tell that they don’t trust the AI to build the really important code that’s at the core of the application. They’re using AI for things that won’t cause production bugs if they go wrong.

I think this excerpt from a recent (and excellent) article in Ars Technica, “Cognitive surrender” leads AI users to abandon logical thinking, sums up how many experienced developers feel about working with AI:

When it comes to large language model-powered tools, there are generally two broad categories of users. On one side are those who treat AI as a powerful but sometimes faulty service that needs careful human oversight and review to detect reasoning or factual flaws in responses. On the other side are those who routinely outsource their critical thinking to what they see as an all-knowing machine.

I agree that those are two options for dealing with AI. But I also believe that’s a false choice. “Cognitive surrender,” as the research referenced by the article puts it, is not a good outcome. But neither is reviewing every line of code the AI writes, because that’s so effort-intensive that we may as well just write it all ourselves. (And I can almost hear some of you asking, “What’s so bad about that?”)

This false choice is what drives a lot of good, experienced senior engineers away from AI-driven development today. We see those two options, and both are terrible. And that’s why I’m writing this article (and the next few in this Radar series) about quality.

Some shocking numbers about AI coding tools

The Quality Playbook is an open-source skill for AI coding tools like GitHub Copilot, Cursor, Claude Code, and Windsurf. You point it at a codebase, and it generates a complete quality engineering infrastructure for that project: test plans traced to requirements, code review protocols, integration tests, and more. More importantly, it brings back quality engineering practices that much of the industry abandoned decades ago, using AI to do a lot of the quality-related work that used to require a dedicated team.

I built the Quality Playbook as part of an experiment in AI-driven development and agentic engineering, building an open-source project called Octobatch and writing about the process in this ongoing Radar series. The playbook emerged directly from that experiment. The ideas behind it are over fifty years old, and they work.

Along the way, I ran into a shocking statistic.

We already know that many (most?) developers these days use AI coding tools like GitHub Copilot, Claude Code, Gemini, ChatGPT, and Cursor to write production code. But do we trust the code those tools generate? “Trust in these systems has collapsed to just 33%, a sharp decline from over 70% in 2023.”

That quote is from a Gemini Deep Research report I generated while doing research for this article. 70% dropping to 33%—that sounds like a massive collapse, right?

The thing is, when I checked the sources Gemini referenced, the truth wasn’t nearly as clear-cut. That “over 70% in 2023” number came from a Stack Overflow survey measuring how favorably developers view AI tools. The “33%” number came from a Qodo survey asking whether developers trust the accuracy of AI-generated code. Gemini grabbed both numbers, stripped the context, and stitched them into a single decline narrative. No single study ever measured trust dropping from over 70% to 33%. Which means we’ve got an apples-to-oranges comparison, and it might even technically be accurate (sort of?), but it’s not really the headline-grabber that it seemed to be.

So why am I telling you about it?

Because there are two important lessons from that “shocking” stat. The first is that the overall idea rings true, at least for me. Almost all of us have had the experience of generating code with AI faster than we can verify it, and we ship features before we fully review them.

The second is that when Gemini created the report, the AI fabricated the most alarming version of the story from real but unrelated data points. If I’d just cited it without checking the sources, there’s a pretty good chance it would have been published, and you might even have believed it. That’s ironically self-referential, because it’s literally the trust problem the survey is supposedly measuring. The AI produced something that looked authoritative, felt correct, and was wrong in ways that only careful verification could catch. If you want to understand why so many developers don’t fully trust AI-generated code, you just watched it happen.

One reason many of us don’t trust AI-generated code is that there’s a growing gap between how fast AI can generate code and how well we can verify that the code actually does what we intended. The usual response to this verification gap is to adopt better testing tools. And there are plenty of them: test stub generators, diff reviewers, spec-first frameworks. These are useful, and they solve real problems. But they generally share a blind spot: they work with what the code does, not with what it’s supposed to do. Luckily, the intent is sitting right there: in the specs, the schemas, the defensive code, the history of the AI chats about the project, even the variable names and filenames. We just need a way to use it.
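To make that blind spot concrete, here’s a minimal, hypothetical Python sketch. The `apply_discount` function and the spec it supposedly violates are invented for illustration; the point is the difference between a test generated from the code and a test traced to a spec:

```python
def apply_discount(price: float, pct: float) -> float:
    # Bug relative to the (assumed) spec: discounts over 30% should be
    # rejected, but the implementation silently caps them at 50% instead.
    return round(price * (1 - min(pct, 50.0) / 100), 2)

def test_generated_from_code() -> bool:
    # Derived from the implementation: it freezes current behavior,
    # silent cap and all, so it is always "green."
    return apply_discount(100.0, 80.0) == 50.0

def test_traced_to_spec() -> bool:
    # Derived from the (assumed) requirement: an 80% discount must fail.
    try:
        apply_discount(100.0, 80.0)
        return False  # no error raised: the code violates the spec
    except ValueError:
        return True

print(test_generated_from_code())  # True: the code does what the code does
print(test_traced_to_spec())       # False: the code doesn't do what we intended
```

Both tests exercise the same call; only the spec-traced one can expose the gap between implementation and intent.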

AI-driven development needs its own quality practices, and the discipline we need already exists. It was just (unfairly) considered too expensive to use… until AI made it cheap.

(Re-)introducing quality engineering

There’s a difference between knowing that code works and knowing that it does what it’s supposed to do. It’s the difference between “does this function return the right value?” and “does this system fulfill its purpose?”—and as it turns out, that’s one of the oldest problems in software engineering. In fact, as I talked about in a previous Radar article, Prompt Engineering Is Requirements Engineering, it was the source of the original “software crisis.”

The software crisis was the term people used across our industry back in the 1960s as they came to grips with large software projects around the world that were routinely delivered late, over budget, and that didn’t do what they were supposed to do. At the 1968 NATO Software Engineering Conference—the conference that introduced the term “software engineering”—some of the top experts in the industry argued that the crisis was caused by developers and their stakeholders having trouble understanding the problems they were solving, communicating those needs clearly, and making sure that the systems they delivered actually met their users’ needs. Nearly two decades later, Fred Brooks made the same argument in his pioneering essay, No Silver Bullet: no tool can, on its own, eliminate the inherent difficulty of understanding what needs to be built and communicating that intent clearly. And now that we talk to our AI development tools the same way we talk to our teammates, we’re more susceptible than ever to that underlying problem of communication and shared understanding.

An important part of the industry’s response to the software crisis was quality engineering, a discipline built specifically to close the gap between intent and implementation by defining what “correct” means up front, tracing tests back to requirements, and verifying that the delivered system actually does what it’s supposed to do. For years it was standard practice for software engineering teams to include quality engineering phases in all projects. But few teams today do traditional quality engineering. Understanding why it got left behind by so many of us, and, more importantly, what it can do for us now, can make a huge difference for agentic engineering and AI-driven development today.

Starting in the 1950s, three thinkers built the intellectual foundation that manufacturing used to become dramatically more reliable.

  • W. Edwards Deming argued that quality is built into the process, not inspected in after the fact. He taught us that you don’t test your way to a good product; you design the system that produces it.
  • Joseph Juran defined quality as fitness for use: not just “does it work?” but “does it do what it’s supposed to do, under real conditions, for the people who actually use it?”
  • Philip Crosby made the business case: quality is free, because building it in costs less than finding and fixing defects after the fact.

By the time I joined my first professional software development team in the 1990s, these ideas were standard practice in our industry.

These ideas revolutionized software quality, and the people who put them into practice were called quality engineers. They built test plans traced to requirements, ran functional testing against specifications, and maintained living documentation that defined what “correct” meant for each part of the system.

So why did all of this disappear from most software teams? (It’s still alive in regulated industries like aerospace, medical devices, and automotive, where traceability is mandated by law, and a few brave holdouts throughout the industry.) It wasn’t because it didn’t work. Quality engineering got cut because it was perceived as expensive. Crosby was right that quality is free: the cost of building it in is far more than made up for by the savings you get from not finding and fixing defects later. But the costs come at the beginning of the project and the savings come at the end. In practice, that means when the team blows a deadline and the manager gets angry and starts looking for something to cut, the testing and QA activities are easy targets because the software already seems to be complete.

On top of the perceived expense, quality engineering required specialists. Building good requirements, designing test plans, and planning and running functional and regression testing are real, technical skills, and most teams simply didn’t have anyone (or, more specifically, the budget for anyone) who could do those jobs.

Quality engineering may have faded from our projects and teams over time, but the industry didn’t just give up on many of its best ideas. Developers are nothing if not resourceful, and we built our own quality practices—three of the most popular are test-driven development, behavior-driven development, and agile-style iteration—and these are genuinely good at what they do. TDD keeps code honest by making you write the test before the implementation. BDD was specifically designed to capture requirements in a form that developers, testers, and stakeholders can all read (though in practice, most teams strip away the stakeholder involvement and it devolves into another flavor of integration testing). Agile iteration tightens the feedback loop so you catch problems earlier.

Those newer quality practices are practical and developer-focused, and they’re less expensive to adopt than traditional quality engineering in the short run because they live inside the development cycle. The upside of those practices is that development teams can generally implement them on their own, without asking for permission or requiring experts. The tradeoff, however, is that those practices have limited scope. They verify that the code you’re writing right now works correctly, but they don’t step back and ask whether the system as a whole fulfills its original intent. Quality engineering, on the other hand, establishes the intent of the system before the development cycle even begins, and keeps it up to date and feeds it back to the team as the project progresses. That’s a huge piece of the puzzle that got lost along the way.

Those highly effective quality engineering practices got cut from most software engineering teams because they were viewed as expensive, not because they were wrong. When you’re doing AI-driven development, you’re actually running into exactly the same problem that quality engineering was built to solve. You have a “team”—your AI coding tools—and you need a structured process to make sure that team is building what you actually intend. Quality engineering is such a good fit for AI-driven development because it’s the discipline that was specifically designed to close that gap between what you ask for and what gets built.

What nobody expected is that AI would make it cheap enough in the short run to bring quality engineering back to our projects.

Introducing the Quality Playbook

I’ve long suspected that quality engineering would be a perfect fit for AI-driven development (AIDD), and I finally got a chance to test that hypothesis. As part of my experiment with AIDD and agentic engineering (which I’ve been writing about in The Accidental Orchestrator and the rest of this series), I built the Quality Playbook, a skill for AI tools like Cursor, GitHub Copilot, and Claude Code that lets you bring these highly effective quality practices to any project, using AI to do the work that used to require a dedicated quality engineering team. Like other AI skills and agents, it’s a structured document that plugs into an AI coding agent and teaches it a specific capability. You point it at a codebase, and the AI explores the code, reads whatever specifications and documentation it can find, and generates a complete quality infrastructure tailored to that project. The Quality Playbook is now part of awesome-copilot, a collection of community-contributed agents (and I’ve also opened a pull request to add it to Anthropic’s repository of Claude Code skills).

What does “quality infrastructure” actually mean? Think about what a quality engineering team would build if you hired one. A good quality engineer would start by defining what “correct” means for your project: what the system is supposed to do, grounded in your requirements, your domain, what your users actually need. From there, they’d write tests traced to those requirements, build a code review process that checks whether the code implements what it’s supposed to, design integration tests that verify the whole system works together, and set up an audit process where independent reviewers check the code against its original intent.

That’s what the playbook generates. Developers using AI tools have been rediscovering the value of requirements, and spec-driven development (SDD) has become very popular. But you don’t need to be practicing strict SDD to use the playbook. It infers your project’s intent from whatever artifacts are available: chat logs, schemas, README files, code comments, and even defensive code patterns. If you have formal specs, great; if not, the AI pieces together what “correct” means from the evidence it can find.

Once the playbook figures out the intent of the code, it creates quality infrastructure for the project. Specifically, it generates ten deliverables:

  • Exploration and requirements elicitation (EXPLORATION.md): Before the playbook writes anything, it spends an entire phase reading the code, documentation, specs, and schemas, and writes a structured exploration document that maps the project’s architecture and domain. The most common failure mode in AI-generated quality work is producing generic content that could apply to any project. The exploration phase forces the AI to ground everything in this specific codebase, and serves as an audit trail: if the requirements end up wrong, you can trace the problem back to what the exploration discovered or missed.
  • Testable requirements (REQUIREMENTS.md): The most important deliverable. Building on the exploration, a five-phase pipeline extracts the actual intent of the project from code, documentation, AI chats, messages, support tickets, and any other project artifacts you can give it. The result is a specification document that a new team member or AI agent can read top-to-bottom and understand the software. Each requirement is tagged with an authority tier and linked to use cases that become the connective tissue tying requirements to integration tests to bug reports.
  • Quality constitution (QUALITY.md): Defines what “correct” means for your specific project, grounded in your actual domain. Every standard has a rationale explaining why it matters, because without the rationale, a future AI session will argue the standard down.
  • Spec-traced functional tests: Tests generated from the requirements, not from source code. That difference matters: a test generated from source code verifies that the code does what the code does, while a test traced to a spec verifies that the code does what you intended.
  • Three-pass code review protocol with bug reports and regression tests: Three mandatory review passes, each using a different lens: structural review with anti-hallucination guardrails, requirement verification (where you catch things the code doesn’t do that it was supposed to), and cross-requirement consistency checking. Every confirmed bug gets a regression test and a patch file.
  • Consolidated bug report (BUGS.md): Every confirmed bug with full reproduction details, severity calibrated to real-world impact, and a spec basis citing the specific documentation the code violates. Maintainers respond differently to “your code violates section X.Y of your own spec” than to “this looks like it might be a bug.”
  • TDD red/green verification: For each confirmed bug, a regression test runs against unpatched code (must fail), then the fix is applied and the test reruns (must pass). When you tell a maintainer “here’s a test that fails on your current code and passes with this one-line fix,” that’s qualitatively different from a bug report.
  • Integration test protocol: A structured test matrix that an AI agent can pick up and execute autonomously, without asking clarifying questions. Every test specifies the exact command, what it proves, and specific pass/fail criteria. Field names and types are read from actual source files, not recalled from memory, as an anti-hallucination mechanism.
  • Council of Three multi-model spec audit: Three independent AI models audit the codebase against the requirements. The triage uses confidence weighting, not majority vote: findings reported by all three models are near-certain, findings reported by two are high-confidence, and findings from only one get a verification probe rather than being dismissed. The most valuable findings are often the ones only one model catches.
  • AGENTS.md bootstrap file: A context file that future AI sessions read first, so they inherit the full quality infrastructure. Without it, every new session starts from zero. With it, the quality constitution, requirements, and review protocols carry forward automatically across every session that touches the codebase.
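As a rough illustration of the confidence-weighted triage the Council of Three audit uses, here’s a minimal Python sketch. The model names, finding IDs, and tier labels are invented for illustration, not taken from the playbook itself:

```python
def triage(findings: list[tuple[str, str]]) -> dict[str, str]:
    """Confidence-weight findings from three independent model audits.

    `findings` is a list of (model_name, finding_id) pairs. Instead of a
    majority vote, the number of distinct models reporting a finding sets
    its confidence tier; singletons get a verification probe, not dismissal.
    """
    models_per_finding: dict[str, set[str]] = {}
    for model, finding in findings:
        models_per_finding.setdefault(finding, set()).add(model)

    tiers = {1: "verify-with-probe", 2: "high-confidence", 3: "near-certain"}
    return {f: tiers[len(models)] for f, models in models_per_finding.items()}

report = triage([
    ("model-a", "BUG-1"), ("model-b", "BUG-1"), ("model-c", "BUG-1"),
    ("model-a", "BUG-2"), ("model-c", "BUG-2"),
    ("model-b", "BUG-3"),
])
print(report)
# {'BUG-1': 'near-certain', 'BUG-2': 'high-confidence', 'BUG-3': 'verify-with-probe'}
```

The key design choice is that a singleton finding is routed to a probe rather than discarded, which is what preserves the valuable bugs only one model catches.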

The third option

I started this article by talking about a false choice: either we surrender our judgment to the AI, or we get stuck reviewing every line of code it writes. The reality is much more nuanced, and, in my opinion, a lot more interesting, if we have a trustworthy way to verify that the code we build with the AI actually does what we intended. It’s not a coincidence that this is one of the oldest problems in software engineering, and not surprising that AI can help us with it.

The Quality Playbook leans heavily on classic quality engineering techniques to do that verification. Those techniques work very well, and that gives us the more nuanced option: using AI to help us write our code, and then using it to help us trust what it built.

That’s not a gimmick or a paradox. It works because verification is exactly the kind of structured, specification-driven work that AI is good at. Writing tests traced to requirements, reviewing code against intent, checking that the system does what it’s supposed to do under real conditions. These are the things quality engineers used to do across the whole industry (and still do in the highly regulated parts of it). They’re also things that AI can do well, as long as we tell it what “correct” means.
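The red/green regression discipline described in the deliverables list is a good example of how mechanical this verification work is. Here’s a minimal Python sketch; the `normalize` functions and the requirement they’re checked against are invented for illustration:

```python
def normalize_unpatched(s: str) -> str:
    return s.lower()          # hypothetical bug: leaves surrounding whitespace

def normalize_patched(s: str) -> str:
    return s.strip().lower()  # the one-line fix

def regression_test(normalize) -> bool:
    # Traced to the (assumed) requirement: usernames are compared
    # case- and whitespace-insensitively.
    return normalize("  Alice ") == "alice"

# Red: the test must FAIL on the unpatched code, proving it detects the bug.
assert regression_test(normalize_unpatched) is False
# Green: the same test must PASS once the fix is applied.
assert regression_test(normalize_patched) is True
print("red/green verified")
```

Because every step has an unambiguous pass/fail outcome, an AI agent can run the whole loop without human judgment calls, and a human only needs to spot-check the requirement the test was traced to.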

The experienced engineers I talked about at the beginning of this article, the ones who only use AI for unit tests and code reviews, aren’t wrong to be cautious. They’re right that we can’t just trust whatever output the AI spits out. But limiting AI to just the “safe” parts of our projects keeps us from taking advantage of such an important set of tools. The way out of this quagmire is to build the infrastructure that makes the rest of it trustworthy too. Quality engineering gives us that infrastructure, and AI makes it cheap enough to actually use on all of our projects every day.

In the next few articles, I’ll show you what happened when I pointed the Quality Playbook at real, mature open-source codebases and it started finding real bugs, how the playbook emerged from my AI-driven development experiment, what the quality engineering mindset looks like in practice, and how we can learn important lessons from that experience that apply to all of our projects.

The Quality Playbook is open source and works with GitHub Copilot, Cursor, and Claude Code. It’s also available as part of awesome-copilot. You can try it out today by downloading it into your project and asking the AI to generate the quality playbook. The whole process takes about 10-15 minutes for a typical codebase. I’ll cover more details on running it in future articles in this series.
