AI Coding Agent Architecture: Simple Beats Complex (The Benchmark Data)
A simple baseline agent achieves 93.2% on HumanEval at $2.45. LATS, a state-of-the-art complex architecture, achieves 88% at $134.50: roughly 55x the cost for worse accuracy.
The research paper "AI Agents That Matter" revealed what many suspected: complex agent architectures often don't outperform simple baselines. The architecture you choose should be driven by data, not hype.
This guide covers the six core design patterns with real benchmark numbers, so you can pick the right architecture for your use case.
The Benchmark Reality
Before diving into patterns, look at the data:
| Architecture | HumanEval Accuracy | Cost | Relative Cost |
|---|---|---|---|
| Warming Baseline | 93.2% | $2.45 | 1x |
| LDB (GPT-4/3.5) | 91.0% | $2.19 | 0.9x |
| Zero-shot GPT-4 | 89.6% | $1.93 | 0.8x |
| Reflexion | 87.8% | $3.90 | 1.6x |
| LATS | 88.0% | $134.50 | 55x |
The "warming" baseline—gradually increasing temperature across retries—beats all tested complex architectures while costing 50x less than LATS.
Complex architectures claiming state-of-the-art results may simply benefit from higher computational budgets rather than genuinely superior designs.
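To make the comparison concrete, here is a minimal sketch of a warming-style retry loop, assuming a test-based verifier; the helper names and the exact temperature schedule are illustrative, not taken from the paper:

```python
from anthropic import Anthropic

def passes_tests(candidate: str, task: dict) -> bool:
    """Hypothetical verifier: run the task's unit tests against the candidate."""
    ...

def warming_baseline(task: dict, temperatures=(0.0, 0.25, 0.5, 0.75, 1.0)) -> str | None:
    """Retry generation, gradually raising temperature until a candidate passes."""
    client = Anthropic()
    for temp in temperatures:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            temperature=temp,
            messages=[{"role": "user", "content": task["prompt"]}],
        )
        candidate = response.content[0].text
        if passes_tests(candidate, task):
            return candidate
    return None  # every retry failed
```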
Pattern 1: Single Agent + Tools (Start Here)
The simplest architecture. One agent, one loop, external tools.
User Request → Agent → [Tool Calls] → Response
Google's Recommendation
Google Cloud's architecture guidance explicitly recommends starting with single-agent systems:
"If you're early in your agent development, we recommend that you start with a single agent. Focus on refining the core logic, prompt, and tool definitions before adding more complex architectural components."
When It Works
- Tasks requiring multiple steps and external data access
- Well-defined problem spaces
- Moderate tool counts (under 10)
- Latency-sensitive applications
When It Breaks
Effectiveness decreases with more tools and task complexity. Google notes that single agents struggle when:
- Tool count exceeds 10-15
- Tasks require domain specialization
- Complex orchestration is needed
Code Example
from anthropic import Anthropic

def single_agent(request: str, tools: list[dict]) -> str:
    client = Anthropic()
    messages = [{"role": "user", "content": request}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            messages=messages,
            tools=tools,
            max_tokens=4096,
        )
        if response.stop_reason == "end_turn":
            return response.content[0].text
        # Record the assistant turn once, then execute every tool call it contains
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)  # your own tool dispatcher
                tool_results.append(
                    {"type": "tool_result", "tool_use_id": block.id, "content": result}
                )
        # Return all tool results to the model in a single user turn
        messages.append({"role": "user", "content": tool_results})
Pattern 2: ReAct (Reasoning + Acting)
The foundational agentic pattern. The agent explicitly reasons about what to do, takes an action, observes the result, and repeats.
Think → Act → Observe → Think → Act → Observe → ...
How It Works
ReAct alternates between:
- Reasoning: "I need to find the function definition first"
- Acting: Execute a tool (search, read file, edit)
- Observing: Process the tool's output
- Reasoning: "Now I see the function. I should modify line 42..."
The Research
From "AI Agents That Matter": ReAct cuts down on hallucinations by forcing the model to "show its work" before executing. But it can get stuck and often needs human feedback to recover.
Google's Assessment
"ReAct pattern enables dynamic adaptation through iterative thought-action-observation loops. Trade-off: Higher latency than single queries; effectiveness depends on model reasoning quality."
When to Use ReAct
- Tasks requiring exploration and discovery
- Unknown codebases where context must be discovered
- Complex edits spanning multiple files
- When auditability matters (reasoning is visible)
Code Example
def react_agent(goal: str, max_iterations: int = 10) -> str:
    observations = []
    for _ in range(max_iterations):
        # Reason about current state
        thought = reason(goal, observations)
        if thought.is_complete:
            return thought.final_answer
        # Take action based on reasoning
        action = thought.next_action
        observation = execute(action)
        observations.append({
            "thought": thought.content,
            "action": action,
            "observation": observation,
        })
    return synthesize(observations)
The Tradeoff
ReAct's explicit reasoning chain means higher latency and token usage. But for complex tasks requiring exploration, the visibility into agent thinking is invaluable for debugging.
Pattern 3: Planning First
The agent creates a complete plan before executing any actions. Good for user visibility and approval workflows.
Input → Plan → [Execute Step 1] → [Execute Step 2] → ... → Output
Google's Hierarchical Task Decomposition
For complex problems, Google recommends breaking down into:
- High-level strategy
- Task decomposition
- Step-by-step execution
- Validation and synthesis
"Use for complex, ambiguous problems requiring extensive multi-level planning. Trade-off: Significant architectural complexity and high model call volumes."
When to Use Planning
- Complex multi-step tasks
- When users need to see/approve what will happen
- Tasks where step order matters critically
- High-stakes operations requiring review
Code Example
async def planning_agent(request: str) -> Result:
    # Phase 1: Create the plan
    plan = create_plan(request)
    # plan.steps == ["Find auth module", "Add validation", "Update imports", "Add tests"]

    # Optional: get user approval before executing anything
    if requires_approval(plan):
        if not await user_approval(plan):
            return abort()

    # Phase 2: Execute steps, re-planning when a step demands it
    results = []
    while plan.steps:
        step = plan.steps.pop(0)
        result = execute_step(step, context=results)
        results.append(result)
        if result.requires_replanning:
            plan = replan(plan, results)

    return combine_results(results)
The Tradeoff
Planning adds initial latency. And plans often need revision mid-execution. But for user-facing workflows, the visibility is worth it.
Pattern 4: Self-Reflection
The agent evaluates its own outputs and generates improvement feedback. A quality gate before delivery.
Generate → Evaluate → Reflect → Improve → Output
How It Works
Three components:
- Actor: Generates the initial output
- Evaluator: Scores output quality against criteria
- Reflector: Produces specific, actionable feedback
Benchmark Data: Reflexion
From the research:
- Reflexion accuracy: 87.8%
- Cost: $3.90 per task
- Comparison: Simple baseline achieves 93.2% at $2.45
Reflexion is a self-reflection architecture. It doesn't beat simple baselines on HumanEval—but for tasks requiring correctness validation (security review, compliance checking), the explicit evaluation step adds value.
Code Example
def reflective_agent(request: str, quality_threshold: float = 0.9) -> str:
    output = generate(request)
    for _ in range(3):
        evaluation = evaluate(output, request)
        if evaluation.score >= quality_threshold:
            return output
        reflection = reflect(output, evaluation)
        # e.g. "The edit doesn't handle the null case on line 15.
        #       Add a guard clause before accessing the property."
        output = improve(output, reflection)
    return output
When to Use Reflection
- High-stakes outputs requiring validation
- Code generation with security implications
- Tasks where the model can meaningfully evaluate its own work
- When you need explainable quality gates
Pattern 5: Review and Critique
Google's recommended pattern for high-accuracy outputs. A separate agent validates work before delivery.
Generate → Review Agent → [Approve/Reject] → Refine → Output
Google's Description
"Use for high-accuracy outputs requiring validation before use, such as code generation with security checks. Improves reliability; increases latency and cost per iteration."
Different from Self-Reflection
Review and Critique uses a separate agent (potentially a different model) for validation. This avoids the self-serving bias of self-reflection.
Code Example
def review_critique_agent(request: str) -> str:
    generator = Agent(model="claude-sonnet-4-20250514")
    reviewer = Agent(model="claude-opus-4-20250514")  # Different model reviews the work
    output = generator.generate(request)
    for _ in range(3):
        review = reviewer.review(output, criteria=[
            "correctness",
            "security",
            "performance",
            "style",
        ])
        if review.approved:
            return output
        output = generator.refine(output, review.feedback)
    return output
When to Use Review and Critique
- Security-critical code generation
- Compliance-sensitive outputs
- When you can afford the latency/cost overhead
- High-stakes decisions requiring independent validation
Pattern 6: Multi-Agent Systems
Multiple specialized agents work together. The most complex pattern—use only when specialization clearly helps.
Router → [Agent A] ←→ [Agent B] ←→ [Agent C] → Aggregator
The Cost Reality
From the benchmarks:
- LATS (Language Agent Tree Search, the most expensive architecture tested): 88% accuracy at $134.50
- Simple baseline: 93.2% accuracy at $2.45
Architectures built on heavy search and coordination cost roughly 55x more while producing worse results. The research is clear: "State-of-the-art agent architectures do not outperform simple baselines" on standard benchmarks.
When Multi-Agent Actually Helps
Multi-agent systems win when:
- Domain specialization genuinely improves quality (security expert, testing expert)
- Parallel execution reduces total latency
- Compliance requirements mandate separation of concerns
- Scale requires distributing work across services
Google's Patterns
Sequential: Agents process in order, each building on previous work
Researcher → Writer → Reviewer → Publisher
Parallel: Agents work simultaneously on independent aspects
          ┌→ Security Agent ─┐
Request ──┼→ Testing Agent ──┼→ Aggregator
          └→ Coding Agent ───┘
Coordinator: Central agent routes to specialists
Coordinator → [Route by type] → Specialist → Response
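The code example below illustrates the parallel pattern. For contrast, a minimal coordinator sketch might look like this (the `classify` helper, the `.handle` method, and the specialist registry are assumptions, not part of any specific framework):

```python
SPECIALISTS = {
    "security": security_agent,
    "testing": testing_agent,
    "coding": coding_agent,
}

def coordinator(request: str) -> str:
    """Route each request to one specialist based on a cheap classification call."""
    task_type = classify(request)                           # e.g. "security", "testing", "coding"
    specialist = SPECIALISTS.get(task_type, coding_agent)   # fall back to the coding agent
    return specialist.handle(request)
```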
Code Example
import asyncio

async def multi_agent_edit(request: str) -> Result:
    # Parallel specialist execution
    tasks = []
    if needs_security_review(request):
        tasks.append(security_agent.review(request))
    if needs_test_updates(request):
        tasks.append(testing_agent.generate(request))
    # Core edit (always)
    tasks.append(coding_agent.edit(request))
    # Execute in parallel
    results = await asyncio.gather(*tasks)
    # Aggregate results
    return aggregate(results)
The Debugging Tax
Debugging multi-agent systems means analyzing 10-50+ LLM calls across agents. Every agent you add multiplies debugging complexity.
Google warns:
"Swarm pattern (collaborative multi-agent) is the most complex and costly pattern with unproductive loop risks."
| Pattern | Latency | Cost | Debug Complexity |
|---|---|---|---|
| Single Agent | Low | Low | Simple |
| ReAct | Medium | Medium | Moderate |
| Planning | Medium | Medium | Moderate |
| Reflection | High | Medium-High | Moderate |
| Review/Critique | High | High | Complex |
| Multi-Agent | Variable | Very High | Very Complex |
The Morphcode Approach
For code editing, we've found specialized simplicity beats general-purpose complexity.
Parallel Single-Agent
Instead of multi-agent coordination, we run the same optimized agent on multiple independent tasks:
          ┌→ [Task 1 Agent] ─┐
Request ──┼→ [Task 2 Agent] ─┼→ Merge
          └→ [Task 3 Agent] ─┘
No coordination overhead. No inter-agent communication latency. Each agent is identical but tasks are independent.
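A minimal sketch of the idea, not Morph's production code; `edit_task` (an assumed async wrapper around the single-agent loop) and `merge` are illustrative helpers:

```python
import asyncio

async def parallel_single_agent(task_descriptions: list[str]):
    """Run the same single-agent loop concurrently over independent tasks, then merge."""
    results = await asyncio.gather(*(edit_task(t) for t in task_descriptions))
    return merge(results)

# Illustrative usage:
# asyncio.run(parallel_single_agent(["fix auth bug", "add retry logic", "update the docs"]))
```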
Speculative Execution
Predict likely edits and pre-compute before user confirms:
User types... → Predict intent → Pre-compute top 3 → User confirms → Instant apply
This is how we achieve 10,500 tok/s—we're not waiting for the user.
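A simplified sketch of the speculation loop; the prediction, confirmation, and apply helpers are assumptions for illustration, not Morph's pipeline:

```python
import asyncio

async def speculative_apply(partial_input: str, context: str) -> None:
    """Pre-compute edits for the most likely intents while the user is still typing."""
    candidates = predict_intents(partial_input, context, top_k=3)  # assumed intent model
    # Start computing every candidate edit before the user confirms anything;
    # fast_apply is assumed synchronous here, so run it in worker threads.
    pending = {
        c: asyncio.create_task(asyncio.to_thread(fast_apply, context, c))
        for c in candidates
    }
    chosen = await wait_for_user_confirmation(candidates)  # assumed UI hook
    for candidate, task in pending.items():
        if candidate != chosen:
            task.cancel()  # best-effort: drop speculative work we didn't need
    if chosen in pending:
        apply_edit(await pending[chosen])  # usually already finished, so it feels instant
```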
Direct Execution
No orchestration framework overhead:
# Not this:
chain = LangChain() | Parser() | Editor() | Validator()
# This:
edit = fast_apply(context, request) # Single optimized call
Choosing Your Architecture
Decision Framework
1. Start with Single Agent + Tools
   - 93.2% accuracy at $2.45 on HumanEval
   - Add complexity only when it clearly fails
2. Add ReAct for Exploration
   - Unknown codebases, discovery tasks
   - When auditability matters
3. Add Planning for User Visibility
   - Approval workflows, high-stakes operations
   - When users need to see what will happen
4. Add Reflection/Review for Quality Gates
   - Security-critical code
   - Compliance requirements
   - When you can afford 2-3x latency
5. Go Multi-Agent Only When:
   - Domain specialization clearly improves quality
   - Parallel execution reduces total latency
   - Compliance requires separation
   - Simple approaches have demonstrably failed
The Hard Truth
The research shows complex agent architectures often just burn more tokens without improving results. The agents that matter are the ones that solve problems efficiently—not the ones with the most sophisticated designs.
Start simple. Measure everything. Add complexity only when the data demands it.
Sources: "AI Agents That Matter" (arXiv:2407.01502), Google Cloud Architecture Center, Berkeley Function Calling Leaderboard.