Rubber Duck in GitHub Copilot CLI is now available in experimental mode. Share your feedback with us in the discussion.
This feature addresses self-reflection bias in AI agents by using heterogeneous model families for peer review. It significantly improves accuracy in complex, multi-file coding tasks, helping engineers catch architectural flaws and silent bugs before they compound into major technical debt.
When you ask a coding agent to build a data pipeline, it may not use the best structure. But what if the agent got a second opinion before it executed the plan?
Today, in GitHub Copilot CLI, we’re introducing Rubber Duck in experimental mode. Rubber Duck leverages a second model from a different AI family to act as an independent reviewer, assessing the agent’s plans and work at the moments where feedback matters most.
To catch different kinds of errors, a different perspective matters. Our evaluations show that Claude Sonnet + Rubber Duck closes 74.7% of the performance gap between Sonnet and Opus alone, achieving better results on difficult multi-file and long-running tasks. Use /experimental in Copilot CLI to access Rubber Duck alongside our other experimental features.
Today’s coding agents follow a clear loop. First, the agent assesses the task, then drafts a plan, implements, tests, and iterates if necessary. It’s a powerful flow that works well, but it has blind spots. Any decision an agent makes early on, especially in the planning stage, is the foundation you’re building upon. Assumptions and inefficiencies become dependencies, and by the time you notice, you may have to fix more than just the small mistake at the start.
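The loop described above can be sketched in a few lines. This is a minimal illustration of the assess–plan–implement–test–iterate flow, not Copilot's implementation; the callables and `Result` type are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass
class Result:
    """Outcome of testing one attempt (hypothetical, for illustration)."""
    passed: bool
    feedback: str = ""

def run_agent(task, implement, test, max_iterations=5):
    """Assess/plan are folded into `implement` for brevity; iterate on failures."""
    attempt = None
    for step in range(max_iterations):
        attempt = implement(task, step)   # early decisions become the foundation
        result = test(attempt)
        if result.passed:
            return attempt, step + 1
        # Feedback folds back into the task, and any flawed early assumption
        # persists across iterations -- the blind spot described above.
        task = f"{task} (feedback: {result.feedback})"
    return attempt, max_iterations
```

Note that nothing in this loop questions the plan itself: each iteration refines the same foundation, which is exactly the gap a second opinion targets.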
Using self-reflection and having the agent review its own output before moving forward is a proven technique. However, a model reviewing its own work is still bounded by its own training biases: the same training data and techniques, the same blind spots.
Rubber Duck is a focused review agent, powered by a model from a complementary family to your primary Copilot session. When you’ve selected a Claude model from the model picker to use as your orchestrator, Rubber Duck will be GPT-5.4. As we experiment with Rubber Duck, we are exploring other model families for the orchestrator and for the Rubber Duck. The job of Rubber Duck is to check the agent’s work and surface a short, focused list of high-value concerns: details that the primary agent may have missed, assumptions worth questioning, and edge cases to consider.
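The cross-family pairing can be sketched as a simple routing rule. This is an assumption-laden illustration of the idea, not Copilot's internals; the family names, `COMPLEMENT` table, and `query_model` callable are all hypothetical:

```python
# Illustrative sketch of a cross-family second opinion. The family names and
# the `query_model` callable are placeholders, not Copilot's API.

COMPLEMENT = {"claude": "gpt", "gpt": "claude"}  # always review with the other family

def rubber_duck_review(primary_family, artifact, query_model):
    """Route the agent's work to a reviewer from a complementary model family."""
    reviewer = COMPLEMENT[primary_family]  # never the model that produced the work
    prompt = (
        "Review the following work. Return a short, focused list of "
        "high-value concerns: missed details, questionable assumptions, "
        "and edge cases.\n\n" + artifact
    )
    return reviewer, query_model(reviewer, prompt)
```

The key design point is the `COMPLEMENT` lookup: the reviewer is chosen specifically because it does not share the primary model's training data and biases.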
We evaluated Rubber Duck on SWE-Bench Pro, a benchmark of large, difficult, real-world coding problems drawn from open-source repositories. Here’s what we found:
Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus.
We noticed that Rubber Duck tends to help more with difficult problems, ones that span 3+ files and would normally take 70+ steps. On these problems, Sonnet + Rubber Duck scores 3.8% higher than the Sonnet baseline, and 4.8% higher on the hardest problems identified across three trials. Here are a few examples of what Rubber Duck finds:
One example: a dict key being overwritten on every iteration, silently dropping three of four Solr facet categories from every search query, with no error thrown.

GitHub Copilot can call Rubber Duck automatically, both proactively and reactively, and it can be triggered by a user at any time to critique and revise the agent's work.
For complex work, GitHub Copilot may seek a critique automatically at the checkpoints where feedback has the highest return.
The agent can also seek a critique reactively if it gets stuck in a loop or can’t make progress. Consulting Rubber Duck can break the logjam.
As a user, you can request a critique at any point. Copilot will query Rubber Duck, reason over the feedback, and show you what changed and why.
We made a key design choice: the agent invokes Rubber Duck sparingly, targeting the moments where the signal is highest, without getting in the way. For the technically curious: Rubber Duck is invoked through Copilot’s existing task tool—the same infrastructure used for other subagents.
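The "invoke sparingly" gating might look something like the sketch below. The checkpoint names and the budget are assumptions for illustration only, not Copilot's actual triggers; the real invocation happens through the task tool mentioned above:

```python
# Hedged sketch of when a critique might be requested: only at high-signal
# checkpoints, within a budget, so the reviewer never gets in the way.
# Checkpoint names and budget are hypothetical, not Copilot's actual triggers.

HIGH_SIGNAL = {"plan_drafted", "stuck_in_loop"}

def should_invoke_rubber_duck(checkpoint, invocations_so_far, budget=3):
    """Consult the reviewer sparingly, but honor explicit user requests."""
    if checkpoint == "user_requested":  # users can request a critique at any time
        return True
    return checkpoint in HIGH_SIGNAL and invocations_so_far < budget
```

The budget reflects the design choice in the text: the signal from a second opinion is highest at a few key moments, and constant review would slow the agent down without adding much.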
For now, we are enabling Rubber Duck for all Claude family models (Opus, Sonnet, and Haiku) used as orchestrators in the model picker. We are already exploring other model families for the Rubber Duck to pair with GPT-5.4 as the orchestrator.
Rubber Duck is available today in experimental mode.
To start using it, install GitHub Copilot CLI and run the /experimental slash command. Rubber Duck will be available when you select any Claude model from the model picker and have access to GPT-5.4 enabled. Critiques surface in two ways: proactively, when Copilot invokes Rubber Duck on its own, and on demand, when you request a critique yourself.
Where Rubber Duck helps most: difficult problems that span 3+ files and long-running tasks of 70+ steps.
The post GitHub Copilot CLI combines model families for a second opinion appeared first on The GitHub Blog.