Lessons From Anthropic's Multi-Agent Research System
I found myself reading Anthropic’s post on their multi-agent research system this week. I’ve been using Claude 4 for personal projects and so far the results have been beyond amazing.
Here’s what stood out for me.
✨ Highlights
- Multi-agent systems consist of several LLM-based agents autonomously using tools in a loop to collaboratively solve tasks (a minimal loop sketch follows this list).
- Research workflows require flexibility, enabling agents to pivot or follow tangents as investigations evolve.
- The essence of search is compression—distilling relevant insights from large corpora into actionable summaries.
- These systems excel at breadth-first queries, where many independent directions must be explored in parallel.
- A multi-agent setup with Claude Sonnet 4 subagents (led by Claude Opus 4) outperformed single-agent Claude Opus 4 by 90.2% on internal research evaluations.
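To make the "tools in a loop" idea concrete, here's a minimal sketch of an agent loop. It's my own illustration, not Anthropic's code: the model and the tools are passed in as plain Python callables you'd swap for a real LLM API and real tools.

```python
from typing import Callable

def agent_loop(task: str,
               tools: dict[str, Callable[[str], str]],
               model: Callable[[list[dict]], dict],
               max_steps: int = 10) -> str:
    """Run an LLM-driven loop: the model either answers or requests a tool."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = model(messages)  # e.g. {"action": "tool", "tool": "search", "input": "..."}
        if decision["action"] == "answer":
            return decision["content"]
        # The model asked for a tool: run it and feed the observation back.
        observation = tools[decision["tool"]](decision["input"])
        messages.append({"role": "assistant", "content": str(decision)})
        messages.append({"role": "tool", "content": observation})
    return "Stopped: step budget exhausted."
```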
📊 Key Findings
- Token usage explains ~80% of performance variance; the number of tool calls and model choice also matter significantly.
- Upgrading to Claude Sonnet 4 yields a larger performance gain than simply doubling the token budget on Claude Sonnet 3.7.
- On average:
- Agents use 4× more tokens than chat-based interactions.
- Multi-agent systems use 15× more tokens than typical chats.
🤖 Agent Design & Behavior
- Agents are steered by prompts, and prompt engineering was the primary lever to improve performance.
- Without detailed task descriptions, agents tend to duplicate work, leave gaps, or miss critical information.
- Prompt templates included rules to scale effort with task complexity, helping agents better allocate focus.
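Here's a rough sketch of what an effort-scaling rule could look like when baked into the lead agent's prompt. The tiers and budgets below are placeholders I made up, not Anthropic's actual guidance.

```python
# Illustrative prompt fragment; the tiers and numbers are invented placeholders.
EFFORT_SCALING_RULES = """
Estimate task complexity before delegating, and scale effort accordingly:
- Simple fact lookup: answer directly or use one subagent with a few tool calls.
- Comparison across a handful of sources: spawn 2-4 subagents with narrow briefs.
- Open-ended research: spawn more subagents, giving each a distinct subtopic,
  an explicit output format, and clear boundaries to avoid duplicated work.
"""

def lead_agent_prompt(task: str) -> str:
    # The rules ride along in the system prompt so the lead agent allocates
    # subagents and tool calls in proportion to the task's complexity.
    return f"You are the lead research agent.\n{EFFORT_SCALING_RULES}\nTask: {task}"
```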
🧪 Evaluation and Feedback
- The team built LLM judges to evaluate agent outputs using a rubric:
- Factual accuracy
- Citation accuracy
- Completeness
- Source quality
- Tool efficiency
- A single LLM prompt producing a 0.0–1.0 score and a pass/fail verdict aligned best with human judgment (a sketch of such a judge follows this list).
- Tool ergonomics testing (e.g., dozens of dry runs) led to 40% faster task completion by improving descriptions and minimizing errors.
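Below is a minimal sketch of what that judge could look like. The prompt wording and JSON format are my assumptions, and `call_model` is a hypothetical stand-in for whatever LLM API you use; only the rubric criteria and the 0.0–1.0 score plus pass/fail output come from the post.

```python
import json

# Rubric criteria from the post; everything else here is illustrative.
RUBRIC = ["factual accuracy", "citation accuracy", "completeness",
          "source quality", "tool efficiency"]

def judge(question: str, answer: str, call_model) -> dict:
    """Single-prompt LLM judge returning a 0.0-1.0 score and a pass/fail verdict."""
    prompt = (
        "Grade the research answer against each criterion: "
        + ", ".join(RUBRIC) + ".\n"
        'Return JSON: {"score": <float 0.0-1.0>, "verdict": "pass" or "fail", '
        '"notes": "<short explanation>"}.\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    return json.loads(call_model(prompt))  # call_model: hypothetical LLM wrapper
```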
🧱 System Design Innovations
Context Limit Management:
When hitting token limits, agents spawn fresh subagents with clean contexts while handing off prior progress.
Artifact-Based Outputs:
Subagents write outputs directly to a filesystem instead of funneling everything through the lead agent, avoiding the "game of telephone" and reducing token bloat (a small sketch follows this section).
Rainbow Deployments:
They used rainbow deployments to update agent infrastructure without disrupting active runs, gradually shifting traffic across versions.
Privacy-Preserving Monitoring:
Agent decision patterns and interaction structures are monitored without inspecting content, preserving user privacy.
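Here's a small sketch of the artifact-based output pattern: the subagent writes its full findings to a shared workspace, and only a lightweight pointer travels back through the lead agent. The paths and return format are my own illustration.

```python
from pathlib import Path

WORKSPACE = Path("research_workspace")  # assumed shared filesystem location

def save_artifact(subagent_id: str, findings: str) -> str:
    """Persist a subagent's full findings; return only a short reference."""
    WORKSPACE.mkdir(exist_ok=True)
    artifact = WORKSPACE / f"{subagent_id}.md"
    artifact.write_text(findings, encoding="utf-8")
    # The lead agent sees just this one-liner instead of the full output,
    # avoiding the "game of telephone" and keeping its context small.
    return f"Findings written to {artifact} ({len(findings)} chars)."
```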
🔁 Interleaved Thinking
- Each subagent performs web searches and tool queries, then uses interleaved thinking to evaluate the results before passing its findings to the LeadResearcher.
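Roughly, a subagent's turn might look like the sketch below. `web_search` and `think` are hypothetical stand-ins; `think` represents the interleaved reasoning step that judges result quality and decides whether to refine the query before reporting back.

```python
def subagent_turn(query: str, web_search, think, max_rounds: int = 3) -> str:
    """Search, evaluate with interleaved thinking, refine, then report back."""
    notes = []
    for _ in range(max_rounds):
        results = web_search(query)
        assessment = think(query, results)   # judge quality, spot gaps
        notes.append(assessment["summary"])
        if assessment["good_enough"]:
            break
        query = assessment["refined_query"]  # pivot based on what was found
    return "\n".join(notes)                  # handed back to the LeadResearcher
```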
Overall, there’s a strong case for using multiple agents when the benefits outweigh the total cost of compute. I’m going to explore how MCP tool descriptions can improve AI agent performance in my projects.
Here’s a link to the blog. https://lnkd.in/gqvEaaWp