AI Coding Agent Architecture: Simple Beats Complex (The Benchmark Data)
A simple baseline agent achieves 93.2% on HumanEval at $2.45. LATS, a state-of-the-art complex architecture, achieves 88% at $134.50: roughly 55x the cost for worse accuracy.
The research paper "AI Agents That Matter" revealed what many suspected: complex agent architectures often don't outperform simple baselines. The architecture you choose should be driven by data, not hype.
This guide covers the six core design patterns with real benchmark numbers, so you can pick the right architecture for your use case.
The Benchmark Reality
Before diving into patterns, look at the data:
| Architecture | HumanEval Accuracy | Cost | Relative Cost |
|---|---|---|---|
| Warming Baseline | 93.2% | $2.45 | 1x |
| LDB (GPT-4/3.5) | 91.0% | $2.19 | 0.9x |
| Zero-shot GPT-4 | 89.6% | $1.93 | 0.8x |
| Reflexion | 87.8% | $3.90 | 1.6x |
| LATS | 88.0% | $134.50 | 55x |
The "warming" baseline—gradually increasing temperature across retries—beats all tested complex architectures while costing 50x less than LATS.
Complex architectures claiming state-of-the-art results may simply benefit from higher computational budgets rather than genuinely superior designs.
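To make the comparison concrete, here is a minimal sketch of a warming-style retry loop, assuming a test-based verifier; the helper names and the exact temperature schedule are illustrative, not taken from the paper:

```python
from anthropic import Anthropic

def passes_tests(candidate: str, task: dict) -> bool:
    """Hypothetical verifier: run the task's unit tests against the candidate."""
    ...

def warming_baseline(task: dict, temperatures=(0.0, 0.25, 0.5, 0.75, 1.0)) -> str | None:
    """Retry generation, gradually raising temperature until a candidate passes."""
    client = Anthropic()
    for temp in temperatures:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            temperature=temp,
            messages=[{"role": "user", "content": task["prompt"]}],
        )
        candidate = response.content[0].text
        if passes_tests(candidate, task):
            return candidate
    return None  # every retry failed
```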
Pattern 1: Single Agent + Tools (Start Here)
The simplest architecture. One agent, one loop, external tools.
User Request → Agent → [Tool Calls] → Response
Google's Recommendation
Google Cloud's architecture guidance explicitly recommends starting with single-agent systems:
"If you're early in your agent development, we recommend that you start with a single agent. Focus on refining the core logic, prompt, and tool definitions before adding more complex architectural components."
When It Works
- Tasks requiring multiple steps and external data access
- Well-defined problem spaces
- Moderate tool counts (under 10)
- Latency-sensitive applications
When It Breaks
Effectiveness decreases with more tools and task complexity. Google notes that single agents struggle when:
- Tool count exceeds 10-15
- Tasks require domain specialization
- Complex orchestration is needed
Code Example
from anthropic import Anthropic

def single_agent(request: str, tools: list[dict]) -> str:
    client = Anthropic()
    messages = [{"role": "user", "content": request}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            messages=messages,
            tools=tools,
            max_tokens=4096,
        )
        if response.stop_reason == "end_turn":
            return response.content[0].text
        # Record the assistant turn once, then execute every tool call it contains
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)  # your own tool dispatcher
                tool_results.append(
                    {"type": "tool_result", "tool_use_id": block.id, "content": result}
                )
        # Return all tool results to the model in a single user turn
        messages.append({"role": "user", "content": tool_results})
Pattern 2: ReAct (Reasoning + Acting)
The foundational agentic pattern. The agent explicitly reasons about what to do, takes an action, observes the result, and repeats.
Think → Act → Observe → Think → Act → Observe → ...
How It Works
ReAct alternates between:
- Reasoning: "I need to find the function definition first"
- Acting: Execute a tool (search, read file, edit)
- Observing: Process the tool's output
- Reasoning: "Now I see the function. I should modify line 42..."
The Research
From "AI Agents That Matter": ReAct cuts down on hallucinations by forcing the model to "show its work" before executing. But it can get stuck and often needs human feedback to recover.
Google's Assessment
"ReAct pattern enables dynamic adaptation through iterative thought-action-observation loops. Trade-off: Higher latency than single queries; effectiveness depends on model reasoning quality."
When to Use ReAct
- Tasks requiring exploration and discovery
- Unknown codebases where context must be discovered
- Complex edits spanning multiple files
- When auditability matters (reasoning is visible)
Code Example
def react_agent(goal: str, max_iterations: int = 10) -> str:
    observations = []
    for _ in range(max_iterations):
        # Reason about current state
        thought = reason(goal, observations)
        if thought.is_complete:
            return thought.final_answer
        # Take action based on reasoning
        action = thought.next_action
        observation = execute(action)
        observations.append({
            "thought": thought.content,
            "action": action,
            "observation": observation,
        })
    return synthesize(observations)
The Tradeoff
ReAct's explicit reasoning chain means higher latency and token usage. But for complex tasks requiring exploration, the visibility into agent thinking is invaluable for debugging.
Pattern 3: Planning First
The agent creates a complete plan before executing any actions. Good for user visibility and approval workflows.
Input → Plan → [Execute Step 1] → [Execute Step 2] → ... → Output
Google's Hierarchical Task Decomposition
For complex problems, Google recommends breaking down into:
- High-level strategy
- Task decomposition
- Step-by-step execution
- Validation and synthesis
"Use for complex, ambiguous problems requiring extensive multi-level planning. Trade-off: Significant architectural complexity and high model call volumes."
When to Use Planning
- Complex multi-step tasks
- When users need to see/approve what will happen
- Tasks where step order matters critically
- High-stakes operations requiring review
Code Example
async def planning_agent(request: str) -> Result:
    # Phase 1: Create the plan
    plan = create_plan(request)
    # plan.steps == ["Find auth module", "Add validation", "Update imports", "Add tests"]

    # Optional: get user approval before executing anything
    if requires_approval(plan):
        if not await user_approval(plan):
            return abort()

    # Phase 2: Execute steps, re-planning when a step demands it
    results = []
    while plan.steps:
        step = plan.steps.pop(0)
        result = execute_step(step, context=results)
        results.append(result)
        if result.requires_replanning:
            plan = replan(plan, results)

    return combine_results(results)
The Tradeoff
Planning adds initial latency. And plans often need revision mid-execution. But for user-facing workflows, the visibility is worth it.
Pattern 4: Self-Reflection
The agent evaluates its own outputs and generates improvement feedback. A quality gate before delivery.
Generate → Evaluate → Reflect → Improve → Output
How It Works
Three components:
- Actor: Generates the initial output
- Evaluator: Scores output quality against criteria
- Reflector: Produces specific, actionable feedback
Benchmark Data: Reflexion
From the research:
- Reflexion accuracy: 87.8%
- Cost: $3.90 per task
- Comparison: Simple baseline achieves 93.2% at $2.45
Reflexion is a self-reflection architecture. It doesn't beat simple baselines on HumanEval—but for tasks requiring correctness validation (security review, compliance checking), the explicit evaluation step adds value.
Code Example
def reflective_agent(request: str, quality_threshold: float = 0.9) -> str:
    output = generate(request)
    for _ in range(3):
        evaluation = evaluate(output, request)
        if evaluation.score >= quality_threshold:
            return output
        reflection = reflect(output, evaluation)
        # e.g. "The edit doesn't handle the null case on line 15.
        #       Add a guard clause before accessing the property."
        output = improve(output, reflection)
    return output
When to Use Reflection
- High-stakes outputs requiring validation
- Code generation with security implications
- Tasks where the model can meaningfully evaluate its own work
- When you need explainable quality gates
Pattern 5: Review and Critique
Google's recommended pattern for high-accuracy outputs. A separate agent validates work before delivery.
Generate → Review Agent → [Approve/Reject] → Refine → Output
Google's Description
"Use for high-accuracy outputs requiring validation before use, such as code generation with security checks. Improves reliability; increases latency and cost per iteration."
Different from Self-Reflection
Review and Critique uses a separate agent (potentially a different model) for validation. This avoids the self-serving bias of self-reflection.
Code Example
def review_critique_agent(request: str) -> str:
    generator = Agent(model="claude-sonnet-4-20250514")
    reviewer = Agent(model="claude-opus-4-20250514")  # Different model reviews the work
    output = generator.generate(request)
    for _ in range(3):
        review = reviewer.review(output, criteria=[
            "correctness",
            "security",
            "performance",
            "style",
        ])
        if review.approved:
            return output
        output = generator.refine(output, review.feedback)
    return output
When to Use Review and Critique
- Security-critical code generation
- Compliance-sensitive outputs
- When you can afford the latency/cost overhead
- High-stakes decisions requiring independent validation
Pattern 6: Multi-Agent Systems
Multiple specialized agents work together. The most complex pattern—use only when specialization clearly helps.
Router → [Agent A] ←→ [Agent B] ←→ [Agent C] → Aggregator
The Cost Reality
From the benchmarks:
- LATS (Language Agent Tree Search, the most expensive architecture tested): 88% accuracy at $134.50
- Simple baseline: 93.2% accuracy at $2.45
Architectures built on heavy search and coordination cost roughly 55x more while producing worse results. The research is clear: "State-of-the-art agent architectures do not outperform simple baselines" on standard benchmarks.
When Multi-Agent Actually Helps
Multi-agent systems win when:
- Domain specialization genuinely improves quality (security expert, testing expert)
- Parallel execution reduces total latency
- Compliance requirements mandate separation of concerns
- Scale requires distributing work across services
Google's Patterns
Sequential: Agents process in order, each building on previous work
Researcher → Writer → Reviewer → Publisher
Parallel: Agents work simultaneously on independent aspects
          ┌→ Security Agent ─┐
Request ──┼→ Testing Agent ──┼→ Aggregator
          └→ Coding Agent ───┘
Coordinator: Central agent routes to specialists
Coordinator → [Route by type] → Specialist → Response
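The code example below illustrates the parallel pattern. For contrast, a minimal coordinator sketch might look like this (the `classify` helper, the `.handle` method, and the specialist registry are assumptions, not part of any specific framework):

```python
SPECIALISTS = {
    "security": security_agent,
    "testing": testing_agent,
    "coding": coding_agent,
}

def coordinator(request: str) -> str:
    """Route each request to one specialist based on a cheap classification call."""
    task_type = classify(request)                           # e.g. "security", "testing", "coding"
    specialist = SPECIALISTS.get(task_type, coding_agent)   # fall back to the coding agent
    return specialist.handle(request)
```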
Code Example
import asyncio

async def multi_agent_edit(request: str) -> Result:
    # Parallel specialist execution
    tasks = []
    if needs_security_review(request):
        tasks.append(security_agent.review(request))
    if needs_test_updates(request):
        tasks.append(testing_agent.generate(request))
    # Core edit (always)
    tasks.append(coding_agent.edit(request))
    # Execute in parallel
    results = await asyncio.gather(*tasks)
    # Aggregate results
    return aggregate(results)
The Debugging Tax
Debugging multi-agent systems means analyzing 10-50+ LLM calls across agents. Every agent you add multiplies debugging complexity.
Google warns:
"Swarm pattern (collaborative multi-agent) is the most complex and costly pattern with unproductive loop risks."
| Pattern | Latency | Cost | Debug Complexity |
|---|---|---|---|
| Single Agent | Low | Low | Simple |
| ReAct | Medium | Medium | Moderate |
| Planning | Medium | Medium | Moderate |
| Reflection | High | Medium-High | Moderate |
| Review/Critique | High | High | Complex |
| Multi-Agent | Variable | Very High | Very Complex |
The Morphcode Approach
For code editing, we've found specialized simplicity beats general-purpose complexity.
Parallel Single-Agent
Instead of multi-agent coordination, we run the same optimized agent on multiple independent tasks:
          ┌→ [Task 1 Agent] ─┐
Request ──┼→ [Task 2 Agent] ─┼→ Merge
          └→ [Task 3 Agent] ─┘
No coordination overhead. No inter-agent communication latency. Each agent is identical but tasks are independent.
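A minimal sketch of the idea, not Morph's production code; `edit_task` (an assumed async wrapper around the single-agent loop) and `merge` are illustrative helpers:

```python
import asyncio

async def parallel_single_agent(task_descriptions: list[str]):
    """Run the same single-agent loop concurrently over independent tasks, then merge."""
    results = await asyncio.gather(*(edit_task(t) for t in task_descriptions))
    return merge(results)

# Illustrative usage:
# asyncio.run(parallel_single_agent(["fix auth bug", "add retry logic", "update the docs"]))
```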
Speculative Execution
Predict likely edits and pre-compute before user confirms:
User types... → Predict intent → Pre-compute top 3 → User confirms → Instant apply
This is how we achieve 10,500 tok/s—we're not waiting for the user.
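A simplified sketch of the speculation loop; the prediction, confirmation, and apply helpers are assumptions for illustration, not Morph's pipeline:

```python
import asyncio

async def speculative_apply(partial_input: str, context: str) -> None:
    """Pre-compute edits for the most likely intents while the user is still typing."""
    candidates = predict_intents(partial_input, context, top_k=3)  # assumed intent model
    # Start computing every candidate edit before the user confirms anything;
    # fast_apply is assumed synchronous here, so run it in worker threads.
    pending = {
        c: asyncio.create_task(asyncio.to_thread(fast_apply, context, c))
        for c in candidates
    }
    chosen = await wait_for_user_confirmation(candidates)  # assumed UI hook
    for candidate, task in pending.items():
        if candidate != chosen:
            task.cancel()  # best-effort: drop speculative work we didn't need
    if chosen in pending:
        apply_edit(await pending[chosen])  # usually already finished, so it feels instant
```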
Direct Execution
No orchestration framework overhead:
# Not this:
chain = LangChain() | Parser() | Editor() | Validator()
# This:
edit = fast_apply(context, request) # Single optimized call
Choosing Your Architecture
Decision Framework
1. Start with Single Agent + Tools
   - 93.2% accuracy at $2.45 on HumanEval
   - Add complexity only when it clearly fails
2. Add ReAct for Exploration
   - Unknown codebases, discovery tasks
   - When auditability matters
3. Add Planning for User Visibility
   - Approval workflows, high-stakes operations
   - When users need to see what will happen
4. Add Reflection/Review for Quality Gates
   - Security-critical code
   - Compliance requirements
   - When you can afford 2-3x latency
5. Go Multi-Agent Only When:
   - Domain specialization clearly improves quality
   - Parallel execution reduces total latency
   - Compliance requires separation
   - Simple approaches have demonstrably failed
The Hard Truth
The research shows complex agent architectures often just burn more tokens without improving results. The agents that matter are the ones that solve problems efficiently—not the ones with the most sophisticated designs.
Start simple. Measure everything. Add complexity only when the data demands it.
Sources: "AI Agents That Matter" (arXiv:2407.01502), Google Cloud Architecture Center, Berkeley Function Calling Leaderboard.