Our previous article framed the Model Context Protocol (MCP) as the toolbox that provides AI agents with tools, and Agent Skills as the materials that teach agents how to complete tasks. This is different from pre- or post-training, which determine a model’s general behavior and expertise. Agent Skills do not “train” agents. They soft-fork agent behavior at runtime, telling the model how to perform the specific tasks it needs.
The term soft fork comes from open source development. A soft fork is a backward-compatible change that does not require upgrading every layer of the stack. Applied to AI, this means skills modify agent behavior through context injection at runtime rather than changing model weights or refactoring AI systems. The underlying model and AI systems stay unchanged.
The architecture maps cleanly to how we think about traditional computing. Models are CPUs—they provide raw intelligence and compute capability. Agent harnesses like Anthropic’s Claude Code are operating systems—they manage resources, handle permissions, and coordinate processes. Skills are applications—they run on top of the OS, specializing the system for specific tasks without modifying the underlying hardware or kernel.
You don’t recompile the Linux kernel to run a new application. You don’t rearchitect the CPU to use a different text editor. You install a new application on top, using the CPU’s intelligence exposed and orchestrated by the OS. Agent Skills work the same way. They layer expertise on top of the agent harness, using the capabilities the model provides, without updating models or changing harnesses.
This distinction matters because it changes the economics of AI specialization. Fine-tuning demands significant investment in talent, compute, data, and ongoing maintenance every time the base model updates. Skills require only Markdown files and resource bundles.
How soft forks work
Skills achieve this through three mechanisms—the skill package format, progressive disclosure, and execution context modification.
The skill package is a folder. At minimum, it contains a SKILL.md file with frontmatter metadata and instructions. The frontmatter declares the skill’s name, description, allowed-tools, and version, followed by the actual expertise: context, problem-solving approaches, escalation criteria, and patterns to follow.
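A minimal SKILL.md might look like the following sketch. The skill name, description text, and section headings here are hypothetical, and real frontmatter may carry additional fields:

```markdown
---
name: quarterly-close
description: Prepares quarter-end financial summaries from raw ledger exports
allowed-tools: Read, Write
version: 0.1.0
---

# Quarterly close

## Context
Finance exports ledgers as CSV at quarter end; totals must reconcile.

## Approach
1. Validate the export (row counts, date range) before summarizing.
2. Follow the reporting template bundled with the skill.

## Escalation
If totals do not reconcile, stop and flag the discrepancy instead of adjusting.
```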

skill-creator package. The frontmatter lives at the top of Markdown files. Agents choose skills based on their descriptions.

The folder can also include reference documents, templates, resources, configurations, and executable scripts. It contains everything an agent needs to perform expert-level work for the specific task, packaged as a versioned artifact that you can review, approve, and deploy as a .zip or .skill file bundle.

skill-creator. skill-creator contains SKILL.md, LICENSE.txt, Python scripts, and reference files.

Because the skill package format is just folders and files, you can use all the tooling we have built for managing code: track changes in Git, roll back bugs, maintain audit trails, and follow the best practices of the software engineering development life cycle. The same format is also used to define subagents and agent teams, meaning a single packaging abstraction governs individual expertise, delegated workflows, and multi-agent coordination alike.
Progressive disclosure keeps skills lightweight. Only the frontmatter of SKILL.md loads into the agent’s context at session start. This respects the token economics of limited context windows. The metadata contains name, description, model, license, version, and, critically, allowed-tools. The full skill content loads only when the agent determines relevance and decides to invoke it. This is similar to how operating systems manage memory: applications load into RAM when launched, not all at once. You can have dozens of skills available without overwhelming the model’s context window, and the behavioral modification is present only when needed, never permanently resident.
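The two-phase loading can be sketched in a few lines. This is an illustrative simplification, not the harness’s actual implementation: it assumes frontmatter delimited by `---` lines with flat `key: value` pairs.

```python
# Sketch of progressive disclosure: only frontmatter is parsed at session
# start; the full skill body is read only when the skill is invoked.

def read_frontmatter(skill_md_text):
    """Parse the frontmatter block and return (metadata, body)."""
    _, frontmatter, body = skill_md_text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body

def session_start(skill_files):
    """At session start, only lightweight metadata enters the context window."""
    return [read_frontmatter(text)[0] for text in skill_files]

def invoke(skill_md_text):
    """Full instructions load only when the agent deems the skill relevant."""
    return read_frontmatter(skill_md_text)[1]

example = """---
name: pdf-report
description: Extracts tables from PDFs and formats them as Markdown
allowed-tools: Read, Write
---
Full step-by-step instructions go here.
"""

print(session_start([example])[0]["name"])  # metadata only at startup
print(invoke(example).strip())              # body loaded on demand
```

The key property is that `session_start` never touches the body, so dozens of skills cost only a few lines of metadata each until one is actually invoked.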

Execution context modification controls what skills can do. When an agent invokes a skill, the permission system narrows to the scope of the skill’s definition, specifically the model and allowed-tools declared in its frontmatter, and reverts after execution completes. A skill can use a different model and a different set of tools from the parent session. This sandboxes the permission environment so skills get only scoped access, not arbitrary system control, ensuring the behavioral modification operates within boundaries.
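The narrow-then-revert behavior resembles a scoped context swap. A minimal sketch, assuming a hypothetical Session object (the class and field names are illustrative, not a real harness API):

```python
# Sketch of execution context modification: permissions narrow to the skill's
# declared scope for the duration of the invocation, then revert.
from contextlib import contextmanager

class Session:
    def __init__(self, model, allowed_tools):
        self.model = model
        self.allowed_tools = allowed_tools

@contextmanager
def skill_scope(session, skill_meta):
    """Temporarily apply the skill's model and allowed-tools, restoring on exit."""
    saved = (session.model, session.allowed_tools)
    session.model = skill_meta.get("model", session.model)
    session.allowed_tools = skill_meta.get("allowed-tools", session.allowed_tools)
    try:
        yield session
    finally:
        # Revert even if the skill raises, so permissions never leak.
        session.model, session.allowed_tools = saved

parent = Session(model="opus", allowed_tools={"Read", "Write", "Bash"})
with skill_scope(parent, {"model": "haiku", "allowed-tools": {"Read"}}) as s:
    assert s.allowed_tools == {"Read"}  # narrowed while the skill runs
assert parent.allowed_tools == {"Read", "Write", "Bash"}  # restored afterward
```

The `try`/`finally` is the important design choice: the parent session’s permissions are restored even when the skill fails mid-execution.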
This is what separates skills from earlier approaches. OpenAI’s custom GPTs and Google’s Gemini Gems are useful but opaque, nontransferable, and impossible to audit. Skills are readable because they are Markdown. They are auditable because you can apply version control. They are composable because skills can stack. And they are governable because you can build approval workflows and rollback capability. You can read a SKILL.md to understand exactly why an agent behaves a certain way.
What the data shows
Building skills is easy with coding agents. Knowing whether they work is the hard part. Traditional software testing does not apply: you cannot write a unit test asserting that expert behavior occurred. The output might be correct while the reasoning was shallow, or the reasoning might be sophisticated while the output has formatting errors.
SkillsBench is a benchmarking effort and framework designed to address this. It uses a paired evaluation design in which the same tasks are evaluated with and without skill augmentation. The benchmark contains 85 tasks, stratified across domains and difficulty levels. By comparing the same agent on the same task with the only variable being the presence of a skill, SkillsBench isolates the causal effect of skills from model capability and task difficulty. Performance is measured using normalized gain, the fraction of possible improvement the skill actually captured.
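As described, normalized gain measures how much of the remaining headroom a skill captures. A minimal sketch, assuming scores normalized to a 0–1 scale (the exact SkillsBench implementation may differ):

```python
def normalized_gain(baseline, with_skill, max_score=1.0):
    """Fraction of the possible improvement the skill actually captured.

    A skill that lifts a task from 0.5 to 0.75 captures half of the
    remaining headroom (0.25 of a possible 0.5), i.e. a gain of 0.5.
    """
    headroom = max_score - baseline
    if headroom == 0:
        return 0.0  # already at ceiling; nothing left to improve
    return (with_skill - baseline) / headroom

print(normalized_gain(0.5, 0.75))  # 0.5
```

Note that the metric can be negative, which is exactly how a benchmark like this surfaces the regressions discussed below: a skill that lowers a task’s score produces a negative gain.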
The findings from SkillsBench challenge the presumption that skills universally improve performance.
Skills improve average performance by 13.2 percentage points. But 24 of 85 tasks got worse. Manufacturing tasks gained 32 points; software engineering tasks lost 5. The aggregate number hides variance that only domain-level evaluation reveals. This is precisely why soft forks need evaluation infrastructure. Unlike hard forks, where you commit fully, soft forks let you measure before you deploy widely. Organizations should segment evaluations by domain and by task and test for regressions, not just improvements. For example, what improves document processing might degrade code generation.
Compact skills outperform comprehensive ones by nearly 4x. Focused skills with dense guidance showed a +18.9 percentage point improvement; comprehensive skills covering every edge case showed +5.7 points. Two to three skills per task is optimal, with four or more showing diminishing returns. The temptation when building skills is to include everything: every caveat, every exception, every piece of relevant context. Resist it. Let the model’s intelligence do the work. Small, targeted behavioral changes outperform comprehensive rewrites. Skill builders should start with minimum viable guidance and add detail only when evaluation shows specific gaps.
Models cannot reliably self-generate effective skills. SkillsBench tested a “bring your own skill” condition where agents were prompted to generate their own procedural knowledge before attempting tasks. Performance stayed at baseline. Effective skills require human-curated domain expertise that models cannot reliably produce for themselves. AI can help with packaging and formatting, but the insight has to come from people who actually have the expertise. Human-labeled insight is the bottleneck of building effective skills, not the packaging or deployment.

Skills can partially substitute for model scale. Claude Haiku, a small model, with well-designed skills achieved a 25.2% pass rate. This slightly exceeded Claude Opus, the flagship model, without skills at 23.6%. Packaged expertise compensates for model intelligence on procedural tasks. This has cost implications: Smaller models with skills may outperform larger models without them at a fraction of the inference cost. Soft forks democratize capability. You do not need the biggest model if you have the right expertise packaged.

Open questions
Many challenges remain unresolved. What happens when multiple skills conflict with each other during a session? How should organizations govern skill portfolios when teams each deploy their own skills onto shared agents? How quickly does encoded expertise become outdated, and what refresh cadence keeps skills effective without creating maintenance burden? Skills inherit whatever biases exist in their authors’ expertise, so how do you audit that? And as the industry matures, how should evaluation infrastructure such as SkillsBench scale to keep pace with the growing complexity of skill-augmented systems?
These are not reasons to avoid skills. They are reasons to invest in evaluation infrastructure and governance practices alongside skill development. The capability to measure performance must evolve in lockstep with the technology itself.
The Agent Skills advantage
Fine-tuning models for a single use case is no longer the only path to specialization. It demands significant investment in talent, compute, and data, and it creates a permanent divergence that requires reevaluation and potential retraining every time the base model updates. Fine-tuning across a broad set of capabilities to improve a foundation model remains sound, but fine-tuning for one narrow workflow is exactly the kind of specialization that skills can now achieve at a fraction of the cost.
Skills are not maintenance free. Just as applications sometimes break when operating systems update, skills need reevaluation when the underlying agent harness or model changes. But the recovery path is lighter: update the skills package, rerun the evaluation harness, and redeploy rather than retrain from a new checkpoint.
Mainframes gave way to client-server. Monoliths gave way to microservices. Specialized fine-tuned models are now giving way to agents augmented by specialized expertise artifacts. Models provide intelligence, agent harnesses provide runtime, skills provide specialization, and evaluation tells you whether it all works together.