Large language models have a fundamental limitation baked into their architecture: they’re frozen in time. Every LLM is trained at a specific point, and the moment that training completes, the model’s knowledge begins aging. For most domains, this gradual staleness is manageable. For software development? It’s a serious problem.
Google DeepMind recently published their findings on using agent skills to address this knowledge gap, and the results are worth examining—both for what worked and what didn’t.
The Knowledge Gap Problem
Software engineering moves fast. New libraries ship daily. Best practices evolve. SDK interfaces change. An LLM trained six months ago might confidently generate code using deprecated APIs, outdated patterns, or—in Google’s case—SDKs that no longer exist.
DeepMind sees this firsthand with its own models: Gemini doesn’t inherently know about itself at training time, and it isn’t necessarily aware of subtle changes like thought-circulation patterns or recent SDK updates. The model that’s supposed to help developers write Gemini code doesn’t actually know the current Gemini APIs.
Several solutions exist for bridging this gap—web search tools, dedicated MCP services, retrieval-augmented generation—but agent skills have emerged as a particularly lightweight approach that deserves attention.
What Google Built
To help coding agents work with the Gemini API, DeepMind built a skill that covers four key areas:
- Feature overview — High-level explanation of API capabilities
- Current models and SDKs — Up-to-date information for each supported language
- Sample code — Basic demonstrations for each SDK
- Documentation entry points — Pointers to official sources of truth
The design philosophy here is important: the skill doesn’t try to encode every API detail. Instead, it provides primitive instructions that guide agents toward current models and SDKs while actively encouraging them to retrieve fresh documentation. It’s a scaffold, not a replacement for actual docs.
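To make that design concrete, a skill of this kind is typically a small instruction file that an agent loads into context on demand. The layout below is an illustrative sketch of the four areas listed above, not the actual contents of gemini-api-dev; the field names and wording are invented:

```markdown
---
name: gemini-api-dev        # hypothetical front matter; fields are illustrative
description: Guidance for writing code against the current Gemini API
---

## Current SDKs
- Python: google-genai (not the deprecated google-generativeai package)
- TypeScript: @google/genai

## Instructions
1. Prefer the models and SDKs listed above over anything remembered from training.
2. Before generating non-trivial code, fetch the official documentation at
   https://ai.google.dev/gemini-api/docs and treat it as the source of truth.
```

Note how little the file tries to encode: the heavy lifting is delegated to the live docs, which is exactly the "scaffold, not replacement" philosophy.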
The skill is available on GitHub and can be installed directly:
```shell
# Install with Vercel skills
npx skills add google-gemini/gemini-skills --skill gemini-api-dev --global

# Install with Context7 skills
npx ctx7 skills install /google-gemini/gemini-skills gemini-api-dev
```
The Evaluation
DeepMind created a 117-prompt evaluation harness covering Python and TypeScript code generation tasks. The prompts span multiple categories: agentic coding, chatbot development, document processing, streaming content, and specific SDK features.
The test methodology compared “vanilla” mode (direct model prompting) against skill-enabled mode. For the skill tests, models received the same system instruction used by the Gemini CLI, plus two tools: activate_skill and fetch_url for downloading documentation.
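In outline, the skill-enabled harness routes model-issued tool calls to two functions. The sketch below is a simplified, hypothetical dispatcher: only the tool names (activate_skill and fetch_url) come from the post; the registry and dispatch logic are invented for illustration.

```python
# Hypothetical tool dispatcher for a skill-enabled agent harness.
# Only the tool names (activate_skill, fetch_url) match the post;
# everything else is illustrative.

SKILLS = {
    "gemini-api-dev": "Use the google-genai SDK; fetch current docs before coding.",
}

def activate_skill(name: str) -> str:
    """Return a skill's instruction text so it enters the model's context."""
    if name not in SKILLS:
        return f"Unknown skill: {name}"
    return SKILLS[name]

def fetch_url(url: str) -> str:
    """Download documentation for the agent (stubbed here; a real harness
    would issue an HTTP GET and return the page text)."""
    return f"<contents of {url}>"

def dispatch(tool_call: dict) -> str:
    """Route a model-issued tool call to the matching function."""
    tools = {"activate_skill": activate_skill, "fetch_url": fetch_url}
    fn = tools[tool_call["name"]]
    return fn(**tool_call["args"])
```

The key design point survives the simplification: the model decides when to call these tools, which is why reasoning ability matters so much in the results below.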
A prompt fails if the generated code uses deprecated SDKs.
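That failure criterion can be approximated with a simple static check. The snippet below is an illustrative sketch, not DeepMind's actual grader; the deprecated and current package names are the real ones for the Python and TypeScript Gemini SDKs.

```python
# Illustrative grader: flag generated code that references a deprecated Gemini SDK.
# Deprecated: google-generativeai (Python), @google/generative-ai (TypeScript).
# Current:    google-genai (Python),        @google/genai (TypeScript).

DEPRECATED_MARKERS = (
    "import google.generativeai",   # old Python SDK
    "from google.generativeai",
    "@google/generative-ai",        # old TypeScript SDK
)

def uses_deprecated_sdk(code: str) -> bool:
    """Return True if the generated code references a deprecated SDK."""
    return any(marker in code for marker in DEPRECATED_MARKERS)
```

A model answering from stale training data tends to write `import google.generativeai as genai`, while current code starts with `from google import genai`, which is exactly the distinction this check captures.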
The Results: Skills Work, But Reasoning Matters
The baseline numbers reveal how severe the knowledge gap is: without skills, both Gemini 3.0 Pro and Flash achieved only 6.8% accuracy on the evaluation. Gemini 3.1 Pro fared better at 28%, but that’s still failing more than two-thirds of the time.
With the skill enabled, the transformation is dramatic. The Gemini 3.x models achieved what DeepMind describes as “excellent results” across almost every evaluation category; even SDK Usage, the lowest-performing category, still hit a 95% pass rate. The remaining failures weren’t systematic: they spanned a range of tasks, including some that explicitly requested Gemini 2.0 models, where the skill correctly declined to override the user’s explicit intent.
The pattern is clear: modern models with strong reasoning capabilities benefit enormously from skills, while the older 2.5 series improved but saw nowhere near the same gains. This suggests that the ability to reason about when and how to use supplementary information is itself a capability that scales with model advancement.
The Honest Limitations
What makes this post valuable isn’t just the positive results—it’s DeepMind’s transparency about the limitations.
AGENTS.md can outperform skills. Vercel’s research found that direct instruction through AGENTS.md files can be more effective than skill-based approaches. Skills provide a standardized, portable format, but sometimes a well-crafted system prompt specific to your workflow beats a generic skill.
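For comparison, direct instruction might amount to a few project-specific lines in an AGENTS.md file. The fragment below is a hypothetical example of the approach, not content from Vercel's study:

```markdown
# AGENTS.md (illustrative fragment)

## Gemini API
- Use the `google-genai` Python package; never `google-generativeai` (deprecated).
- Default to the model this project has standardized on; do not substitute others.
- When unsure about an API surface, read https://ai.google.dev/gemini-api/docs first.
```

The trade-off is portability: these lines are tuned to one repository and travel nowhere, while a skill installs identically across projects and agents.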
The update story is weak. Skills don’t auto-update. Users must manually refresh them, which means workspaces can accumulate stale skill information over time. In the long run, outdated skills could cause more harm than good—the model might confidently follow obsolete guidance rather than admitting uncertainty.
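One lightweight mitigation is to treat installed skills like any other pinned dependency and audit their age. The sketch below is hypothetical: it assumes each skill records an install or refresh date, which the real skill format may or may not provide.

```python
from datetime import date, timedelta

# Hypothetical staleness audit for installed skills. Assumes each skill
# records the date it was installed or last refreshed; the threshold
# below is an arbitrary illustrative choice.

MAX_AGE = timedelta(days=90)

def stale_skills(installed: dict[str, date], today: date) -> list[str]:
    """Return the names of skills that haven't been refreshed within MAX_AGE."""
    return [name for name, refreshed_on in installed.items()
            if today - refreshed_on > MAX_AGE]
```

Running a check like this in CI would at least surface the problem the post warns about: a skill silently aging into misinformation.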
MCP might be the better path. DeepMind is exploring using MCPs directly for documentation retrieval, which could provide fresher information without the staleness problem inherent to packaged skills.
What This Means for Developers
If you’re building with coding agents, several takeaways emerge:
Model selection matters for skill utilization. Don’t expect older or smaller models to leverage skills as effectively. The reasoning capability to know when to activate a skill, what information to extract, and how to apply it appears to scale with model capability.
Skills are a starting point, not an endpoint. They’re lightweight, easy to install, and provide immediate value—but they’re not a complete solution to the knowledge gap problem. Combining skills with documentation access tools creates a more robust system.
Consider the maintenance burden. Before adopting skills across your workflow, think about how you’ll keep them updated. A skill that was accurate six months ago might be actively harmful today.
Direct instruction still has a place. If you have specific, stable requirements, encoding them directly in your system prompts or AGENTS.md might be more effective than relying on generic skills.
The Bigger Picture
Agent skills represent an interesting middle ground in the LLM ecosystem. They’re more structured than ad-hoc prompt engineering but lighter weight than full MCP implementations. They’re portable across agents but not automatically maintained. They bridge knowledge gaps but create new maintenance burdens.
What Google’s evaluation demonstrates is that the approach works—demonstrably, measurably works—when paired with capable models. The 6.8% to near-100% improvement for Gemini 3.x models is not marginal. For teams building with modern LLMs, skills provide a practical mechanism for keeping agents current with rapidly evolving APIs and best practices.
The long-term question is whether the skills ecosystem will mature to handle the update problem, or whether alternative approaches like live MCP services will prove more sustainable. For now, skills offer immediate, tangible benefits with known tradeoffs—exactly the kind of pragmatic tool that belongs in a serious development workflow.
If you’re working with the Gemini API specifically, the gemini-api-dev skill is worth testing. The barrier to entry is a single npx command, and the potential upside—especially if you’re using Gemini 3.x models—is substantial.