Agent Skills Work but the Research Shows Most Teams Are Building Them Wrong – O’Reilly

This post was originally published on The Nuanced Perspective and is being reposted here with the authors’ permission.

Agent skills are everywhere right now. Atlassian built them into Rovo so agents can automatically triage Jira tickets, draft Confluence pages, and route service requests without anyone typing a prompt. Canva and Figma use them so Claude can interact with design files directly. Stripe published skills for payment workflow automation. When Anthropic launched the Agent Skills open standard in December 2025, Microsoft adopted it in VS Code and GitHub within weeks.

The idea is elegantly simple. Instead of building a new specialized agent for every use case, you write a skill once, and any agent that understands the standard can use it. A code reviewer, a PR generator, a deployment checklist, a sprint planner. Each lives in a folder, triggers when relevant, and brings your team’s specific way of doing things into the agent’s context.

But the research on whether skills actually work, and what causes them to fail, is only now catching up to adoption. Four recent papers take the first systematic look at skills in practice: what the benchmarks show, how libraries break down as they grow, and what a more principled approach to orchestration looks like.

Three findings that will change how you think about skills:

  • Curated skills raised the rate at which agents successfully completed tasks by 16.2 percentage points on average across 84 tasks. Model-written skills showed no consistent benefit across any configuration tested.
  • As skill libraries grow, the agent’s ability to find the right skill on demand breaks down. When it scans every skill description in one pass, similar-sounding skills start colliding. The fix the research supports is organizing skills into a hierarchy rather than a flat list.
  • A large-scale security study of ~31K community skills found that more than one in four contain exploitable vulnerabilities, spanning prompt injection, data exfiltration, and privilege escalation.

This is what those papers found, and what it means for anyone building with skills today.

What a skill is

Your team has a specific way of reviewing PRs. Particular checks, a specific order, standards that go beyond what any generic reviewer would know. You’ve explained it to every new engineer who joined. A skill is how you stop explaining it and let the agent carry it instead. In practice it’s a folder with a SKILL.md file at the center: a description that acts as the trigger condition, a body with step-by-step instructions, and optionally scripts and reference documents that load only when needed. A scoped set of tools and instructions the agent can invoke.

At session startup, the agent reads only the name and description from each installed skill, which is about 100 tokens per skill. The full instructions load only when the skill activates, and scripts run without being read into context at all. A large skill library costs almost nothing at initialization. The context budget only gets spent when a skill is actually running.
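The loading pattern described above can be sketched in a few lines. This is a minimal illustration, not how any particular agent implements it: it assumes each skill folder holds a SKILL.md that opens with a `---`-delimited block of flat `key: value` lines including `name` and `description`, and the helper names (`SkillStub`, `split_frontmatter`, `index_skills`, `load_body`) are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillStub:
    """What stays in context after startup: name, description, file location."""
    name: str
    description: str
    path: Path


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md into (metadata, body).

    Assumes a leading block delimited by '---' lines with flat 'key: value'
    pairs. Real files may need a proper YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body


def index_skills(root: Path) -> list[SkillStub]:
    """Startup pass: keep only the name and description of each skill."""
    stubs = []
    for skill_file in sorted(root.glob("*/SKILL.md")):
        meta, _body = split_frontmatter(skill_file.read_text(encoding="utf-8"))
        stubs.append(SkillStub(
            name=meta.get("name", skill_file.parent.name),
            description=meta.get("description", ""),
            path=skill_file,
        ))
    return stubs


def load_body(stub: SkillStub) -> str:
    """Activation pass: read the full instructions only when the skill fires."""
    _meta, body = split_frontmatter(stub.path.read_text(encoding="utf-8"))
    return body
```

Only the stubs returned by `index_skills` would be shown to the model at startup. The body comes in through `load_body` when a skill is selected, which is where the context budget actually gets spent.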

That’s progressive disclosure, and it’s what makes skills different from system prompts, which load everything globally every session, and from tools, which are API calls that give the agent direct capabilities. The same distinction holds for MCP: it gives the agent abilities, such as a shell, an API connection, or access to a database, whereas skills encode the knowledge of how to use those abilities well for a specific workflow. Block’s engineering team put it well: skills are like GitHub Actions YAML, and MCP is the runner. One describes the workflow and the other makes it possible.

Some concrete examples of what this looks like in practice, from teams that have shipped skills in production:

  • A PR review skill that loads your org’s specific style guide, flagging violations and blockers according to your team’s standards rather than generic best practices
  • A deployment checklist skill that runs your team’s exact predeploy sequence, covering environment checks, rollback verification, and the three Slack channels to notify in order
  • A data reporting skill that knows your company’s metric definitions, so when someone asks for “revenue,” it pulls the right number rather than the closest approximation
  • A sprint planning skill that fetches the backlog, applies your team’s capacity rules, and proposes a plan structured the way your team runs standups

The value in each of these isn’t the task itself. Any agent can attempt a PR review or a sprint plan. The value is the organizational knowledge baked into how the skill executes it: your style rules, your deploy sequence, your metric definitions, your team’s way of running things. That specificity is also what makes skills hard to get right, as the benchmarks show.

What the benchmarks show

SkillsBench is the first benchmark built specifically to measure whether agent skills actually improve performance. It tested 84 tasks across 11 domains, running each task under three conditions: no skill, a curated skill, and a self-generated skill. The results are worth sitting with.

Curated skills raised average pass rates by 16.2 percentage points, but the gains were uneven across domains. Software engineering tasks improved by 4.5 points, while healthcare tasks improved by nearly 52. The domains where skills helped most were the ones with highly structured workflows and domain-specific conventions the base model doesn’t carry natively.

The less-cited result is that self-generated skills, where the model writes its own skill rather than a human curating one, provided no average benefit across configurations (“SkillsBench,” Table 3). Some model configurations saw small gains; others saw small losses. The paper’s conclusion was that models cannot reliably author the procedural knowledge they benefit from consuming. The trajectory analysis in the benchmark identified two failure modes:

  • Models generate imprecise procedures that lack specific API patterns.
  • Models fail to recognize what domain knowledge the task actually requires.

The benchmark’s self-generation condition has also drawn pushback from practitioners. One engineer writing on HackerNoon argues the test doesn’t reflect how skilled teams actually build skills. The benchmark prompted a fresh agent to write a skill and immediately use it, which is closer to asking a model to think harder before attempting a task than to building a skill from real execution experience. His own replication, using skills built from actual debugging sessions, showed much stronger results. The distinction matters because a skill captures what a fresh model wouldn’t know. If the model could have reasoned its way there anyway, the skill wasn’t needed.

This matters in practice because self-generation is the obvious shortcut. You finish a workflow, ask the agent to extract it as a skill, and move on. The benchmark says that without a human review step, you’re not getting the gains you’d expect. The skills look complete. They often cover the main path. What they miss are the edge cases, the exceptions, the three things your team does differently that the model has no way of knowing, and those are exactly the things that make a skill valuable.

One finding worth noting for anyone building with skills: focused skills with two to three modules consistently outperformed comprehensive documentation (“SkillsBench,” Section 4.2). More coverage in a single skill didn’t help; more focused, well-scoped skills did. The benchmark also found that smaller models running with curated skills could match larger models running without them, which is a meaningful cost implication for anyone running skills at scale (“SkillsBench,” Section 4.2.3, Finding 7).

Questions that come up when building with skills

These questions show up every time a team starts building a skill library.

When does something become a skill versus staying in a workflow or system prompt?
The cleanest test is whether this is a recurring task that your team has a specific, repeatable way of doing. If yes, it’s a skill candidate. If it’s a one-time flow or something where general reasoning is sufficient, it probably doesn’t need one. The key difference between a skill and a workflow tool like n8n is flexibility. A workflow executes a fixed sequence and breaks when inputs change, while a skill gives the agent procedural guidance it can apply to variations of the same task. Similarly, agentic workflows can chain multiple agents and tasks together, but each agent still benefits from skills that encode the org-specific knowledge for its part of the chain. When you want the what to be consistent but the agent to handle the how intelligently, that’s a skill.

How narrow or broad should a skill be?
The SkillsBench finding that focused skills with two to three modules outperform comprehensive ones is directly relevant here (“SkillsBench,” Section 4.2). A skill that tries to cover an entire domain tends to underperform one that handles a specific thing well. The more practical question is whether to put a full workflow (data fetch, format, generate PDF) into one skill or split it. Current research supports splitting, because each piece then becomes reusable, easier to update when something changes, and less likely to create unexpected behavior when one module’s scope drifts.

What about skills for noncoders or nonsoftware workflows?
Skills are format-agnostic. They’re structured instructions plus optional scripts, and the domain can be anything. A customer support team can encode their escalation criteria, tone guidelines, and the specific conditions where a human always takes over. A legal team can encode their document review checklist. A design team can encode component standards so reviews stay consistent across contributors. Atlassian’s Rovo agents are a useful reference outside the coding context. Their skills handle ticket triage, Confluence page creation, and service request routing, none of which is software engineering.

When should you deprecate a skill?
This is the question that gets skipped most often. The “SoK” paper argues for treating skills like any other maintained artifact through discovery, refinement, evaluation, update, and eventually deprecation (see Figure 2 in the paper). A skill that was compensating for a model capability gap six months ago may now be redundant, and worse than redundant if it’s overriding better native behavior. The practical test is to run the task with and without the skill and check if the skill still helps. If the gap has closed, retire it.
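The with-and-without test is easy to automate once you have a set of representative tasks and a pass/fail check for each. The sketch below is one way to frame it, under stated assumptions: the two runner callables stand in for however you invoke your agent, and the `min_gain` threshold and trial count are illustrative defaults, not values from the paper.

```python
from typing import Callable, Sequence

# A runner takes a task description and reports whether the output passed
# your own checks. How it calls the agent is up to you (hypothetical here).
Runner = Callable[[str], bool]


def pass_rate(run: Runner, tasks: Sequence[str], trials: int = 3) -> float:
    """Fraction of (task, trial) runs that pass; repeats smooth out agent randomness."""
    outcomes = [run(task) for task in tasks for _ in range(trials)]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def deprecation_check(
    with_skill: Runner,
    without_skill: Runner,
    tasks: Sequence[str],
    trials: int = 3,
    min_gain: float = 0.05,
) -> dict[str, object]:
    """Compare pass rates and flag the skill for retirement if the gap has closed.

    min_gain is an assumed threshold, not a number from the paper.
    """
    rate_with = pass_rate(with_skill, tasks, trials)
    rate_without = pass_rate(without_skill, tasks, trials)
    gain = rate_with - rate_without
    return {
        "with_skill": rate_with,
        "without_skill": rate_without,
        "gain": gain,
        "retire": gain < min_gain,
    }
```

A negative `gain` is the case worth watching for: it suggests the skill is now overriding better native behavior rather than merely being redundant.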

What breaks as the library grows

A single well-written skill works well. As libraries grow, flat retrieval breaks down, and the “AgentSkillOS” paper is the first to study this systematically across ecosystem scales from 200 to 200,000 skills.

Flat skill libraries don’t scale. When the agent scans a flat directory of, say, 80+ skills on every request, retrieval becomes unreliable. Two skills with similar descriptions start triggering interchangeably, and behavior becomes nondeterministic for the same input. At the extreme, the orchestrator falls into routing collapse, where it consistently invokes the wrong skill because the semantic embeddings of two similar skills are indistinguishable. The output looks reasonable, but the wrong skill ran.

The fix the paper proposes is capability trees: organize skills into a hierarchy rather than a flat list. Top-level domains like code, data, docs, with more specific skills as branches and leaves. The agent navigates from domain to branch to leaf instead of scanning everything. They also introduce a usage frequency queue, where skills that aren’t being invoked or aren’t improving outcomes get moved to a dormant index so they don’t pollute retrieval for active skills.
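To make the navigation idea concrete, here is a toy sketch of routing through a hierarchy. It is not the paper’s method: the word-overlap scorer is a stand-in for whatever matching the agent really does, the `Node` and `route_tree` names are hypothetical, and the dormant index is left out. The point it illustrates is that each decision compares only a handful of siblings instead of every skill in the library.

```python
from dataclasses import dataclass, field


def overlap(query: str, description: str) -> int:
    """Toy relevance score: number of shared lowercase words (an assumption)."""
    return len(set(query.lower().split()) & set(description.lower().split()))


@dataclass
class Node:
    """A domain, a branch, or (when it has no children) a skill."""
    name: str
    description: str
    children: list["Node"] = field(default_factory=list)


def route_tree(root: Node, query: str) -> list[str]:
    """Walk from the root to a leaf, scoring only the current node's children."""
    node = root
    path = [node.name]
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.description))
        path.append(node.name)
    return path


tree = Node("root", "", [
    Node("code", "code pull request review lint tests", [
        Node("pr-review", "review a pull request against the style guide"),
        Node("lint-fix", "fix lint errors in source files"),
    ]),
    Node("data", "data metrics report revenue sql", [
        Node("revenue-report", "build the revenue report from finance metrics"),
        Node("churn-report", "build the churn report from retention metrics"),
    ]),
])

print(route_tree(tree, "build the weekly revenue report"))
# ['root', 'data', 'revenue-report']
```

The two report skills still have near-identical descriptions, but they only compete with each other once the query has already been narrowed to the data domain.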

Testing this across ecosystems ranging from 200 to over 200,000 skills, the structured approach consistently outperformed flat invocation, and the gap widened as library size grew.

This pattern shows up in how production teams manage their libraries too. Atlassian recommends fewer than five skills per Rovo agent. OpenHands maintains a curated extensions repository with separate skill packages for discrete workflows rather than one monolithic skill set. The common thread is that scoped, purposeful skill sets outperform comprehensive ones. More skills doesn’t mean more capability. Past a point, it’s just more noise.

How orchestration can work differently

This section uses a different definition of skill than the rest of the article, so the distinction matters upfront.

In the “SkillOrchestra” paper, a skill isn’t a SKILL.md file. It’s a capability description used to match task requirements to individual agents in a multi-agent system (see Figure 3 in the paper). The concern isn’t procedural knowledge for one agent but figuring out which agent in a pool should handle a given task and why.

The problem it’s solving is that standard reinforcement learning approaches to multi-agent routing don’t hold up as systems grow. Adding a new agent or modifying a workflow means retraining from scratch. RL policies also tend to send everything to the highest-capability agent regardless of cost, which looks fine in evaluation but gets expensive when you’re running it in production.

SkillOrchestra’s alternative has each agent maintain a competence profile derived from its own execution history, specifically estimated success rates across different task types. The orchestrator routes incoming tasks to the agent whose profile best matches what the task actually demands, rather than the one with the highest raw capability. The routing logic stays current without retraining, and you can inspect why a task went where it went.
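A rough sketch of profile-based routing, assuming a much simpler setup than the paper’s framework: each agent keeps per-task-type success counts, and the router picks the agent with the best estimated success rate after subtracting a cost penalty. The scoring formula, the smoothing, the `cost_weight` parameter, and all names here are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Competence profile built from an agent's own execution history."""
    cost_per_task: float
    # task_type -> [successes, attempts]
    history: dict[str, list[int]] = field(default_factory=dict)

    def record(self, task_type: str, success: bool) -> None:
        stats = self.history.setdefault(task_type, [0, 0])
        stats[0] += int(success)
        stats[1] += 1

    def success_rate(self, task_type: str) -> float:
        successes, attempts = self.history.get(task_type, [0, 0])
        # Add-one smoothing (my choice, not the paper's): an unseen task
        # type starts at 0.5 instead of dividing by zero.
        return (successes + 1) / (attempts + 2)


def route_by_profile(
    profiles: dict[str, Profile], task_type: str, cost_weight: float = 0.1
) -> str:
    """Pick the agent whose estimated success, net of cost, is highest."""
    def score(name: str) -> float:
        profile = profiles[name]
        return profile.success_rate(task_type) - cost_weight * profile.cost_per_task
    return max(profiles, key=score)


profiles = {"small": Profile(cost_per_task=1.0), "large": Profile(cost_per_task=5.0)}
for i in range(10):
    profiles["small"].record("summarize", success=i < 9)   # 9 of 10
    profiles["large"].record("summarize", success=True)    # 10 of 10
    profiles["small"].record("proof", success=i < 1)       # 1 of 10
    profiles["large"].record("proof", success=i < 9)       # 9 of 10

print(route_by_profile(profiles, "summarize"))  # small
print(route_by_profile(profiles, "proof"))      # large
```

Because the profile is just counts, adding a new agent means adding a new entry rather than retraining anything, and the score for any routing decision can be printed and inspected.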

The same logic applies to SKILL.md-based systems. Tracking which skills actually improve outcomes for specific task types, and what they cost in tokens, gives you the foundation for better selection as your library grows. You don’t need SkillOrchestra’s full framework to benefit from the core idea.

The security problem

A large-scale security analysis of 31,132 community-sourced skills found that 26.1% contain at least one exploitable vulnerability, spanning prompt injection, data exfiltration, privilege escalation, and supply chain risks. More than one in four.

The attack patterns aren’t exotic: prompt injection hidden in skill descriptions that manipulates agent behavior once the skill loads, scripts that execute with filesystem permissions broader than the skill needs, and tool authorizations scoped to the entire workspace when the task only requires one directory.

The core issue is that an external skill isn’t a document you’re reading. It’s code running with your agent’s permissions. Importing a skill from a public repository without reviewing it is like running npm install on a package from an unknown author. You wouldn’t do that without at least checking what the package does. That framing changes what due diligence looks like. It means checking the scripts folder before installing, verifying that the permissions the skill requests match what the task actually requires, and sandboxing execution where your environment allows.
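A first pass at that manual review can be scripted. The sketch below walks a skill folder and flags lines that match a few patterns worth a closer look. It is a triage aid under loose assumptions, not a security scanner: the pattern list is hypothetical and deliberately small, it will miss obfuscated payloads, and a clean result does not mean the skill is safe.

```python
import re
from pathlib import Path

# Hypothetical, deliberately small pattern list. Treat hits as prompts for
# human review, and treat an empty result as "nothing obvious", not "safe".
SUSPICIOUS = {
    "possible prompt injection": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.I
    ),
    "network call": re.compile(r"\b(curl|wget|requests\.(get|post)|urllib)\b"),
    "destructive shell command": re.compile(r"rm\s+-rf"),
    "touches credentials": re.compile(r"(\.ssh|\.aws|\.env\b|API_KEY|SECRET)", re.I),
}


def audit_skill(skill_dir: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, label) for every flagged line."""
    findings = []
    for path in sorted(skill_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(skill_dir).as_posix()
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Unreadable files deserve a manual look too; line 0 marks "whole file".
            findings.append((rel, 0, "binary or non-UTF-8 file"))
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            for label, pattern in SUSPICIOUS.items():
                if pattern.search(line):
                    findings.append((rel, lineno, label))
    return findings
```

Running `audit_skill(Path("path/to/skill"))` before installing gives you a short list of lines to read first. The scripts folder still needs a full read regardless of what the patterns catch.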

The tooling for auditing skills at install time doesn’t exist at the level it should yet. Until it does, the due diligence is manual. OpenHands’ extensions repository and Atlassian’s open source skill package are reasonable references for how production-grade community skills scope permissions. Claude Code’s built-in skill creator also helps here, since it structures permission scoping explicitly from the start.

Three things to do differently

Across all four papers, three recommendations are consistent.

Write skills from real execution. Do the workflow manually with an agent, correct it as you go, then extract it as a skill. The agent has full context of what worked. Skills built from real runbooks, incident reports, and accumulated corrections outperform skills written from scratch. The org-specific edge cases are exactly what the base model doesn’t already know. The general workflow it can handle; the three exceptions your team deals with differently are what the skill needs to capture.

Treat the description as routing logic. The description isn’t a label. It’s how the skill gets triggered at all. Specific phrases, explicit activation conditions, context that distinguishes this skill from adjacent ones. If a skill isn’t firing when you expect it to, or fires when it shouldn’t, rewrite the description first. That’s almost always where the problem is.
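One cheap check before rewriting anything is to look for descriptions that are too close to each other. The sketch below uses word-level Jaccard similarity as a crude proxy. Agents match descriptions semantically, so this will miss paraphrases, and the 0.5 threshold and function names are assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Shared words divided by total distinct words across both descriptions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def colliding_pairs(
    descriptions: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """List skill pairs whose descriptions overlap at or above the threshold."""
    pairs = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(descriptions.items(), 2):
        score = jaccard(desc_a, desc_b)
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 2)))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


descriptions = {
    "weekly-report": "generate the weekly report for the team",
    "monthly-report": "generate the monthly report for the team",
    "predeploy": "run the predeploy checklist before shipping to production",
}
print(colliding_pairs(descriptions))
# [('weekly-report', 'monthly-report', 0.71)]
```

Pairs that show up here are candidates for sharper activation conditions, such as naming the cadence, the audience, or the input that distinguishes one skill from the other.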

Plan for the full lifecycle. Creation is the easy part. Skills drift out of relevance as models improve. A skill that compensated for something Claude couldn’t do eight months ago may now be actively overriding better native behavior. Skills need to be evaluated against actual task outcomes, updated when workflows change, and retired when they stop earning their place. The teams that treat their skill libraries the way good engineering teams treat their codebase, with reviews, with metrics, with a process for deprecation, are the ones whose libraries stay useful as they grow.

Where this is heading

The shift from prompt engineering to tool use to skill engineering has followed a pattern. Each era produces artifacts that persist longer than the last. Prompts lived in conversations. Tools live in configurations. Skills live in libraries, versioned, shared, maintained, and eventually retired. They behave like code.

Most teams aren’t treating them that way yet. Skills get written quickly, without evaluation criteria, without any plan for what happens when they stop being useful. That’s worked so far because most skill libraries are still small enough to hold in your head. It won’t hold as they become infrastructure.

The teams building durable agent systems won’t be the ones with the most skills. They’ll be the ones who figured out earlier that a skill library needs to be maintained, not just populated, and who started building the discipline to do that before it became urgent.


This article grew out of a live “Chai & AI” session conducted by Prahitha Movva where practitioners debated whether agent skills actually deliver on the hype, or just add another layer of complexity.
