Rubber Duck in GitHub Copilot CLI is now available in experimental mode. Share your feedback with us in the discussion.
This feature addresses self-reflection bias in AI agents by using heterogeneous model families for peer review. It significantly improves accuracy in complex, multi-file coding tasks, helping engineers catch architectural flaws and silent bugs before they compound into major technical debt.
When you ask a coding agent to build a data pipeline, it may not use the best structure. But what if the agent got a second opinion before it executed the plan?
Today, in GitHub Copilot CLI, we’re introducing Rubber Duck in experimental mode. Rubber Duck leverages a second model from a different AI family to act as an independent reviewer, assessing the agent’s plans and work at the moments where feedback matters most.
To catch different kinds of errors, a different perspective matters. Our evaluations show that Claude Sonnet + Rubber Duck closes 74.7% of the performance gap between Sonnet and Opus running alone, and achieves better results on difficult multi-file and long-running tasks. Use /experimental in Copilot CLI to access Rubber Duck alongside our other experimental features.
Today’s coding agents follow a clear loop. First, the agent assesses the task, then drafts a plan, implements, tests, and iterates if necessary. It’s a powerful flow that works well, but it has blind spots. Any decision an agent makes early on, especially in the planning stage, is the foundation you’re building upon. Assumptions and inefficiencies become dependencies, and by the time you notice, you may have to fix more than just the small mistake at the start.
Self-reflection, where the agent reviews its own output before moving forward, is a proven technique. However, a model reviewing its own work is still bounded by its own training biases: the same training data and techniques, the same blind spots.
Rubber Duck is a focused review agent, powered by a model from a family complementary to the one driving your primary Copilot session. When you’ve selected a Claude model from the model picker to use as your orchestrator, Rubber Duck will be GPT-5.4. As we experiment with Rubber Duck, we are exploring other model families for both the orchestrator and Rubber Duck. The job of Rubber Duck is to check the agent’s work and surface a short, focused list of high-value concerns: details that the primary agent may have missed, assumptions worth questioning, and edge cases to consider.
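To make the pairing idea concrete, here is a minimal sketch of choosing a reviewer from a different model family than the orchestrator. It is not how Copilot CLI implements this: the table, the `family_of` naming heuristic, and the `pick_reviewer` helper are all hypothetical, and only the Claude-to-GPT-5.4 pairing comes from the post.

```python
from typing import Optional

# Hypothetical pairing table: orchestrator family -> reviewer model.
# Only the Claude -> GPT-5.4 entry is stated in the post; the structure is assumed.
REVIEWER_FOR_FAMILY = {"claude": "gpt-5.4"}


def family_of(model_name: str) -> str:
    """Guess the family from the first token of the model name (assumed naming scheme)."""
    tokens = model_name.lower().replace("-", " ").split()
    return tokens[0] if tokens else ""


def pick_reviewer(orchestrator: str) -> Optional[str]:
    """Return a reviewer from a complementary family, or None if no pairing is defined."""
    return REVIEWER_FOR_FAMILY.get(family_of(orchestrator))


if __name__ == "__main__":
    for name in ("claude-sonnet-4.6", "Claude Opus 4.6", "gpt-5.4"):
        print(name, "->", pick_reviewer(name))
```

Returning `None` for an unpaired family mirrors the current state described here, where Rubber Duck is only available for Claude orchestrators.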
We evaluated Rubber Duck on SWE-Bench Pro, a benchmark of large, difficult, real-world coding problems drawn from open-source repositories. Here’s what we found:
Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus.
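For readers wondering what "closing a percentage of the gap" means arithmetically, the sketch below shows one common way to compute it: the assisted run's improvement over the baseline, divided by the distance between the baseline and the stronger model. The function name and the scores are placeholders for illustration, not the benchmark's actual resolution rates, and the post does not confirm this is the exact formula used.

```python
def gap_closed(baseline: float, assisted: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the assisted run."""
    gap = ceiling - baseline
    if gap <= 0:
        raise ValueError("ceiling must be higher than baseline")
    return (assisted - baseline) / gap


if __name__ == "__main__":
    # Placeholder scores only (NOT SWE-Bench Pro results):
    # baseline model 40, stronger model 50, baseline plus reviewer 45.
    print(f"{gap_closed(baseline=40.0, assisted=45.0, ceiling=50.0):.1%}")  # prints 50.0%
```

Under this reading, the 74.7% figure would mean the Sonnet + Rubber Duck score sits roughly three quarters of the way from Sonnet's score to Opus's.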
We noticed that Rubber Duck tends to help more with difficult problems, ones that span 3+ files and would normally take 70+ steps. On these problems, Sonnet + Rubber Duck scores 3.8% higher than the Sonnet baseline, and 4.8% higher on the hardest problems identified across three trials. Here are a few examples of what Rubber Duck finds:
dict key on every iteration. Three of four Solr facet categories were being dropped from every search query, with no error thrown.
GitHub Copilot can call Rubber Duck automatically, both proactively and reactively, and it can be triggered by a user at any time to critique and revise its work.
For complex work, GitHub Copilot may seek a critique automatically at the checkpoints where feedback has the highest return:
The agent can also seek a critique reactively if it gets stuck in a loop or can’t make progress. Consulting Rubber Duck can break the logjam.
As a user, you can request a critique at any point. Copilot will query Rubber Duck, reason over the feedback, and show you what changed and why.
We made a key design choice: the agent invokes Rubber Duck sparingly, targeting the moments where the signal is highest, without getting in the way. For the technically curious: Rubber Duck is invoked through Copilot’s existing task tool, the same infrastructure used for other subagents.
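The three triggers described above (proactive checkpoints, reactive help when stuck, and explicit user requests) can be sketched as a small decision function. This is an illustration under stated assumptions, not Copilot's actual logic: `AgentState`, `looks_stuck`, the repetition threshold, and the priority order are all hypothetical, and which moments count as checkpoints is left as a plain flag.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentState:
    """Hypothetical snapshot of the agent's situation."""
    at_checkpoint: bool = False   # assumed flag, e.g. set right after a plan is drafted
    user_requested: bool = False  # the user explicitly asked for a critique
    recent_actions: List[str] = field(default_factory=list)


def looks_stuck(actions: List[str], window: int = 6, repeats: int = 3) -> bool:
    """True if any single action occurs at least `repeats` times in the last `window` actions."""
    recent = actions[-window:]
    return bool(recent) and max(Counter(recent).values()) >= repeats


def should_request_critique(state: AgentState) -> Optional[str]:
    """Return the trigger kind, or None so the reviewer is invoked sparingly."""
    if state.user_requested:
        return "user"
    if looks_stuck(state.recent_actions):
        return "reactive"
    if state.at_checkpoint:
        return "proactive"
    return None


if __name__ == "__main__":
    print(should_request_critique(AgentState()))
    print(should_request_critique(AgentState(recent_actions=["run tests"] * 3)))
```

Defaulting to `None` reflects the design choice of calling the reviewer only when one of the triggers fires, rather than on every step.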
For now, we are enabling Rubber Duck for all Claude family models (Opus, Sonnet, and Haiku) used as orchestrators in the model picker. We are already exploring other model families for the Rubber Duck to pair with GPT-5.4 as the orchestrator.
Rubber Duck is available today in experimental mode.
To start using it, install GitHub Copilot CLI and run the /experimental slash command. Rubber Duck will be available when you select any Claude model from the model picker and have access to GPT-5.4 enabled. You’ll see critiques surface in two ways:
Where Rubber Duck helps most:
The post GitHub Copilot CLI combines model families for a second opinion appeared first on The GitHub Blog.