The following is Part 3 of 3 from Addy Osmani’s original post “Context Engineering: Bringing Engineering Discipline to Prompts.” Part 1 can be found here and Part 2 here.
Context engineering is crucial, but it’s just one component of a larger stack needed to build full-fledged LLM applications—alongside things like control flow, model orchestration, tool integration, and guardrails.
In Andrej Karpathy’s words, context engineering is “one small piece of an emerging thick layer of non-trivial software” that powers real LLM apps. So while we’ve focused on how to craft good context, it’s important to see where that fits in the overall architecture.
A production-grade LLM system typically has to handle many concerns beyond just prompting, for example:
- Problem decomposition and control flow: Instead of treating a user query as one monolithic prompt, robust systems often break the problem down into subtasks or multistep workflows. For instance, an AI agent might first be prompted to outline a plan, then in subsequent steps be prompted to execute each step. Designing this flow (which prompts to call in what order, how to decide branching or looping) is a classic programming task—except the “functions” are LLM calls with context. Context engineering fits here by making sure each step’s prompt has the info it needs, but the decision to have steps at all is a higher-level design. This is why you see frameworks where you essentially write a script that coordinates multiple LLM calls and tool uses. (A minimal plan-and-execute sketch appears after this list.)
- Model selection and routing: You might use different AI models for different jobs. Perhaps a lightweight model for simple tasks or preliminary answers, and a heavyweight model for final solutions. Or a code-specialized model for coding tasks versus a general model for conversational tasks. The system needs logic to route requests to the appropriate model. Each model might have different context length limits or formatting requirements, which the context engineering must account for (e.g., truncating context more aggressively for a smaller model). This aspect is more engineering than prompting: think of it as matching the tool to the job. (The routing sketch after this list shows how simple this can start out.)
- Tool integrations and external actions: If your AI can perform actions (like calling an API, querying a database, opening a web page, or running code), your software needs to manage those capabilities. That includes providing the AI with a list of available tools and instructions on usage, as well as actually executing those tool calls and capturing the results. As we discussed, the results then become new context for further model calls. Architecturally, this means your app often has a loop: prompt model → if model output indicates a tool to use → execute tool → incorporate result → prompt model again. Designing that loop reliably is a challenge (see the tool-loop sketch after this list).
- User interaction and UX flows: Many LLM applications involve the user in the loop. For example, a coding assistant might propose changes and then ask the user to confirm applying them. Or a writing assistant might offer a few draft options for the user to pick from. These UX decisions affect context too. If the user says “Option 2 looks good but shorten it,” you need to carry that feedback into the next prompt (e.g., “The user chose draft 2 and asked to shorten it.”). Designing a smooth human-AI interaction flow is part of the app, though not directly about prompts. Still, context engineering supports it by ensuring each turn’s prompt accurately reflects the state of the interaction (like remembering which option was chosen or what the user edited manually).
- Guardrails and safety: In production, you have to consider misuse and errors. This might include content filters (to prevent toxic or sensitive outputs), authentication and permission checks for tools (so the AI doesn’t, say, delete a database just because an instruction told it to), and validation of outputs. Some setups use a second model or rules to double-check the first model’s output. For example, after the main model generates an answer, you might run another check: “Does this answer contain any sensitive info? If so, redact it.” Those checks themselves can be implemented as prompts or as code. Either way, they often add instructions into the context (a system message like “If the user asks for disallowed content, refuse” is part of many deployed prompts), so the context might always include some safety boilerplate. Balancing that (ensuring the model follows policy without compromising helpfulness) is yet another piece of the puzzle. (The final sketch after this list shows one way to wire up such a post-check.)
- Evaluation and monitoring: Suffice it to say, you need to constantly monitor how the AI is performing. Logging every request and response (with user consent and privacy in mind) allows you to analyze failures and outliers. You might incorporate real-time evals—e.g., scoring the model’s answers against certain criteria and, if the score is low, automatically having the model try again or routing to a human fallback (the same post-check sketch below includes a retry and a human fallback path). While evaluation isn’t part of generating a single prompt’s content, it feeds back into improving prompts and context strategies over time. Essentially, you treat the prompt and context assembly as something that can be debugged and optimized using data from production.
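To make the control-flow point concrete, here is a minimal plan-and-execute sketch in Python. The `call_llm` helper, the model names, and the numbered-plan convention are assumptions standing in for whatever SDK and prompt format you actually use.

```python
# Sketch: break a task into a plan with one LLM call, then execute each step
# with its own focused prompt. call_llm() is a stand-in for your model client.

def call_llm(model: str, messages: list[dict]) -> str:
    # Placeholder: swap in your provider's SDK call. Canned text keeps the
    # sketch runnable end to end.
    return "1. Outline the change\n2. Write the code\n3. Add tests"

def solve(task: str) -> list[str]:
    # Step 1: ask for a short numbered plan instead of a one-shot answer.
    plan_text = call_llm("planner-model", [
        {"role": "system", "content": "Break the task into short, numbered steps."},
        {"role": "user", "content": task},
    ])
    steps = [line.split(".", 1)[1].strip()
             for line in plan_text.splitlines() if "." in line]

    # Step 2: execute each step with only the context that step needs
    # (the task, the step itself, and the results so far).
    results: list[str] = []
    for step in steps:
        results.append(call_llm("worker-model", [
            {"role": "system", "content": "Complete this one step of the plan."},
            {"role": "user", "content": f"Task: {task}\nPrior results: {results}\nStep: {step}"},
        ]))
    return results

print(solve("Add input validation to the signup form"))
```

The branching, looping, and ordering here are ordinary code; only the “functions” happen to be model calls.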
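Routing can start out as little more than a lookup keyed on task type, with the context trimmed to fit the chosen model’s window. The model names and size limits below are invented for illustration, and character counts stand in for proper token counting.

```python
# Sketch: pick a model per task type and trim the context to fit its window.
# Names and limits are illustrative, not real products.

MODELS = {
    "chat":   {"name": "small-general-model",    "max_context_chars": 8_000},
    "code":   {"name": "code-specialized-model", "max_context_chars": 32_000},
    "review": {"name": "large-reasoning-model",  "max_context_chars": 100_000},
}

def route(task_type: str, context: str) -> tuple[str, str]:
    model = MODELS.get(task_type, MODELS["chat"])
    # Smaller model, more aggressive truncation: keep only the most recent context.
    trimmed = context[-model["max_context_chars"]:]
    return model["name"], trimmed

name, ctx = route("code", "...long repository context...")
print(name, len(ctx))
```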
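The prompt → tool → result → prompt loop might be structured like the sketch below. The JSON tool-call convention and the `call_llm`/`run_tool` helpers are assumptions; a real system would use the provider’s native function-calling format and a proper tool registry.

```python
import json

# Sketch of the agent loop: prompt the model; if it asks for a tool, run the
# tool, feed the result back in as new context, and prompt again.

def call_llm(messages: list[dict]) -> str:
    # Placeholder for your model client; returns a canned final answer here.
    return "All tests pass."

def run_tool(name: str, args: dict) -> str:
    tools = {"run_tests": lambda a: "3 passed, 0 failed"}  # allow-listed tools only
    return tools[name](args)

def agent(task: str, max_turns: int = 5) -> str:
    messages = [
        {"role": "system",
         "content": 'Reply with JSON {"tool": ..., "args": ...} to use a tool, or plain text to finish.'},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):        # bound the loop so a confused model can't spin forever
        reply = call_llm(messages)
        try:
            call = json.loads(reply)  # did the model ask for a tool?
        except ValueError:
            return reply              # plain text means it's done
        result = run_tool(call["tool"], call.get("args", {}))
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Stopped after too many tool calls."

print(agent("Run the test suite and summarize the results"))
```

Most of the reliability work lives around this loop: validating tool arguments, capping iterations, and deciding exactly what goes back into the context.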
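Guardrails and lightweight evals often share the same shape: a second pass over the draft answer that can let it through, retry, or escalate. The checker prompt, the 0-10 score format, and the threshold below are illustrative assumptions rather than a recommended policy.

```python
# Sketch: generate a draft answer, run a second check over it, and retry or
# fall back to a human if the check fails. Prompts and scoring are illustrative.

def call_llm(messages: list[dict]) -> str:
    # Placeholder: swap in your provider's SDK. Canned replies keep the sketch runnable.
    last = messages[-1]["content"]
    return "9" if last.startswith("Rate 0-10") else "Refunds are accepted within 30 days."

def draft_answer(question: str) -> str:
    return call_llm([{"role": "user", "content": question}])

def score_answer(question: str, answer: str) -> int:
    reply = call_llm([{
        "role": "user",
        "content": f"Rate 0-10 how well this answer addresses the question "
                   f"without leaking sensitive information.\nQ: {question}\nA: {answer}\nScore:",
    }])
    try:
        return int(reply.strip())
    except ValueError:
        return 0  # an unparseable score counts as a failed check

def answer_with_checks(question: str, threshold: int = 7) -> str:
    for _ in range(2):                       # one retry before giving up
        answer = draft_answer(question)
        if score_answer(question, answer) >= threshold:
            return answer
    return "Escalated to a human reviewer."  # fallback path

print(answer_with_checks("What is the refund window?"))
```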
We’re really talking about a new kind of application architecture. It’s one where the core logic involves managing information (context) and adapting it through a series of AI interactions, rather than just running deterministic functions. Karpathy listed elements like control flows, model dispatch, memory management, tool use, verification steps, etc., on top of context filling. All together, they form what he jokingly calls “an emerging thick layer” for AI apps—thick because it’s doing a lot! When we build these systems, we’re essentially writing metaprograms: programs that choreograph another “program” (the AI’s output) to solve a task.
For us software engineers, this is both exciting and challenging. It’s exciting because it opens capabilities we didn’t have—e.g., building an assistant that can handle natural language, code, and external actions seamlessly. It’s challenging because many of the techniques are new and still in flux. We have to think about things like prompt versioning, AI reliability, and ethical output filtering, which weren’t standard parts of app development before. Within that larger stack, context engineering lies at the heart of the system: If you can’t get the right information into the model at the right time, nothing else will save your app. But as we’ve seen, even perfect context alone isn’t enough; you need all the supporting structure around it.
The takeaway is that we’re moving from prompt design to system design. Context engineering is a core part of that system design, but it lives alongside many other components.
Conclusion
Key takeaway: By mastering the assembly of complete context (and coupling it with solid testing), we can increase the chances of getting the best output from AI models.
For experienced engineers, much of this paradigm is familiar at its core—it’s about good software practices—but applied in a new domain. Think about it:
- We always knew garbage in, garbage out. Now that principle manifests as “bad context in, bad answer out.” So we put more work into ensuring quality input (context) rather than hoping the model will figure it out.
- We value modularity and abstraction in code. Now we’re effectively abstracting tasks to a high level (describe the task, give examples, let AI implement) and building modular pipelines of AI + tools. We’re orchestrating components (some deterministic, some AI) rather than writing all logic ourselves.
- We practice testing and iteration in traditional dev. Now we’re applying the same rigor to AI behaviors, writing evals and refining prompts as one would refine code after profiling.
In embracing context engineering, you’re essentially saying, “I, the developer, am responsible for what the AI does.” It’s not a mysterious oracle; it’s a component I need to configure and drive with the right data and rules.
This mindset shift is empowering. It means we don’t have to treat the AI as unpredictable magic—we can tame it with solid engineering techniques (plus a bit of creative prompt artistry).
Practically, how can you adopt this context-centric approach in your work?
- Invest in data and knowledge pipelines. A big part of context engineering is having the data to inject. So build that vector search index of your documentation, or set up that database query that your agent can use. Treat knowledge sources as core features in development. For example, if your AI assistant is for coding, make sure it can pull in code from the repo or reference the style guide. A lot of the value you’ll get from an AI comes from the external knowledge you supply to it. (A toy retrieval sketch follows this list.)
- Develop prompt templates and libraries. Rather than ad hoc prompts, start creating structured templates for your needs. You might have a template for “answer with citation” or “generate code diff given error.” These become like functions you reuse. Keep them in version control. Document their expected behavior. This is how you build up a toolkit of proven context setups. Over time, your team can share and iterate on these, just as they would on shared code libraries. (See the template sketch after this list.)
- Use tools and frameworks that give you control. Avoid “just give us a prompt, we do the rest” solutions if you need reliability. Opt for frameworks that let you peek under the hood and tweak things—whether that’s a lower-level library like LangChain or a custom orchestration layer you build. The more visibility and control you have over context assembly, the easier it is to debug when something goes wrong.
- Monitor and instrument everything. In production, log the inputs and outputs (within privacy limits) so you can analyze them later. Use observability tools (such as LangSmith) to trace how context was built for each request. When an output is bad, trace back and see what the model saw—was something missing? Was something formatted poorly? This will guide your fixes. Essentially, treat your AI system as a somewhat unpredictable service that you need to monitor like any other: dashboards for prompt usage, success rates, and so on. (A bare-bones logging sketch follows this list.)
- Keep the user in the loop. Context engineering isn’t just about machine-to-machine information; it’s ultimately about solving a user’s problem. Often, the user can provide context if asked the right way. Think about UX designs where the AI asks clarifying questions or where the user can provide extra details to refine the context (like attaching a file, or selecting which section of the codebase is relevant). The term “AI-assisted” goes both ways—AI assists the user, but the user can assist the AI by supplying context. A well-designed system facilitates that. For example, if an AI answer is wrong, let the user correct it and feed that correction back into the context for next time.
- Train your team (and yourself). Make context engineering a shared discipline. In code reviews, start reviewing prompts and context logic too. (“Is this retrieval grabbing the right docs? Is this prompt section clear and unambiguous?”) If you’re a tech lead, encourage team members to surface issues with AI outputs and brainstorm how tweaking the context might fix them. Knowledge sharing is key because the field is new—a clever prompt trick or formatting insight one person discovers can likely benefit others. I’ve personally learned a ton just from reading others’ prompt examples and postmortems of AI failures.
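To make the first point concrete, here is a toy retrieval step: embed your docs, find the chunks closest to the question, and hand them to the model as context. The character-frequency “embedding” is a deliberate placeholder; a real pipeline would use an embedding model and a vector store.

```python
import math

# Sketch: a tiny in-memory "vector index" over your own docs, used to pull the
# most relevant chunks into the prompt. embed() is a placeholder, not a real model.

def embed(text: str) -> list[float]:
    vec = [0.0] * 26  # crude character-frequency vector; swap in a real embedding model
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

DOCS = [
    "Style guide: functions use snake_case and must have docstrings.",
    "Deploys run from the main branch via the release pipeline.",
    "All user input must be validated on the server before use.",
]
INDEX = [(doc, embed(doc)) for doc in DOCS]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

context = "\n".join(retrieve("How should I name my functions?"))
print(f"Answer using only this context:\n{context}")
```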
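For the templates point, prompt templates can start life as ordinary functions with documented slots, checked into version control like any other shared code. The template names and wording here are illustrative.

```python
# Sketch: prompt templates as small, documented, reusable functions. The exact
# wording is illustrative; the point is that templates are versioned and shared.

def answer_with_citation(question: str, sources: list[str]) -> list[dict]:
    """Answer a question using only the given sources, citing them as [n]."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return [
        {"role": "system", "content": "Answer using only the sources; cite them as [n]."},
        {"role": "user", "content": f"Sources:\n{numbered}\n\nQuestion: {question}"},
    ]

def code_diff_from_error(error_log: str, file_snippet: str) -> list[dict]:
    """Propose a minimal unified diff that fixes the given error."""
    return [
        {"role": "system", "content": "Return a minimal unified diff; change nothing else."},
        {"role": "user", "content": f"Error:\n{error_log}\n\nRelevant code:\n{file_snippet}"},
    ]

messages = answer_with_citation(
    "What is our refund window?",
    ["Refunds are accepted within 30 days of purchase."],
)
print(messages[-1]["content"])
```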
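And for the monitoring point, even simple structured logs of what went into each request’s context go a long way when you trace a bad output back to what the model actually saw. The field names below are one plausible schema, not a standard.

```python
import json
import time
import uuid

# Sketch: log which pieces of context were assembled for each model call, plus
# the output, so bad answers can be traced back to what the model saw.

def log_llm_call(model: str, context_parts: dict[str, str], prompt: str,
                 output: str, path: str = "llm_calls.jsonl") -> None:
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model,
        # How big was each context ingredient? (Sizes only; redact content as needed.)
        "context_sizes": {name: len(text) for name, text in context_parts.items()},
        "prompt_chars": len(prompt),
        "output_preview": output[:200],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_llm_call(
    model="example-model",
    context_parts={"retrieved_docs": "(docs text)", "chat_history": "(history)", "system_rules": "(rules)"},
    prompt="(assembled prompt text)",
    output="(model answer)",
)
```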
As we move forward, I expect context engineering to become second nature—much like writing an API call or a SQL query is today. It will be part of the standard repertoire of software development. Already, many of us don’t think twice about doing a quick vector similarity search to grab context for a question; it’s just part of the flow. In a few years, “Have you set up the context properly?” will be as common a code review question as “Have you handled that API response properly?”
In embracing this new paradigm, we don’t abandon the old engineering principles—we reapply them in new ways. If you’ve spent years honing your software craft, that experience is incredibly valuable now: It’s what allows you to design sensible flows, spot edge cases, and ensure correctness. AI hasn’t made those skills obsolete; it’s amplified their importance in guiding AI. The role of the software engineer is not diminishing—it’s evolving. We’re becoming directors and editors of AI, not just writers of code. And context engineering is the technique by which we direct the AI effectively.
Start thinking in terms of what information you provide to the model, not just what question you ask. Experiment with it, iterate on it, and share your findings. By doing so, you’ll not only get better results from today’s AI but also be preparing yourself for the even more powerful AI systems on the horizon. Those who understand how to feed the AI will always have the advantage.
Happy context-coding!
I’m excited to share that I’ve written a new AI-assisted engineering book with O’Reilly. If you’ve enjoyed my writing here, you may be interested in checking it out.