CMA vs Messages API

Spike Codebase research workload — Sonnet 4.6 — 3 prompts 2026-05-08
Cost reduction
19×
$9.72 → $0.51 across 3 prompts
Wall time
−18%
avg 172.3s → 140.7s per prompt
Tool calls
−58%
avg 64 → 27 per prompt
Output quality
indistinguishable on rubric
Validated claim: for codebase research as a workload, CMA is the right tool. The cost gap is dominated by automatic prompt caching in CMA sessions — a Messages-API path that opted into cache_control would narrow the gap to roughly 3–5×, but CMA gets caching for free where Messages API has to opt in deliberately. Spike does not validate CMA for writing code, running tests, or opening PRs. Those remain untested.

Aggregate comparison 3 prompts, totals

Total cost USD across 3 prompts
Messages API
$9.72 baseline
 
CMA
$0.51 −95% (19× cheaper)
Wall time total seconds
Messages API
517s baseline
 
CMA
422s −18% faster
Tool calls avg per prompt
Messages API
64 baseline
 
CMA
27 −58% fewer

Spike setup

Messages API path
  • Runtimedirect client.messages.create loop
  • Toolssearch_code (ripgrep), read_file, list_files
  • Repo accesslocal kendo clone on disk
  • Cachingnone — no cache_control markers
  • Stopstop_reason != "tool_use"
CMA path
  • RuntimeSession on a pre-created Agent + Environment
  • Toolsagent_toolset_20260401 (bash, read, write, edit, glob, grep)
  • Repo accessgithub_repository resource (read-only PAT)
  • Cachingautomatic, server-side
  • Stopsession.status_idle + end_turn
Both paths use Sonnet 4.6, share an output cap of MAX_TOKENS_PER_RUN=4096, and use aligned system prompts. The comparison is on the runtime, not the framing.

Per-prompt verbatim numbers 3 prompts

Prompt MA wall CMA wall MA tool calls CMA tool calls MA cost CMA cost Cost ratio
01-find-branch-linker-usages 135.3s 100.2s 42 11 $1.95 $0.07 28×
03-webhook-intake-pattern 154.4s 182.3s 57 32 $2.77 $0.24 12×
06-feature-flag-pattern 227.2s 139.6s 95 37 $5.01 $0.20 25×
Average 172.3s 140.7s 64 27 $3.24 $0.17 19×
On token columns: CMA reports ~10–20 input tokens per run because system prompt, tool definitions, and accumulated history all hit the prompt cache (cache reads ranged 105k–513k per prompt). Messages-API input tokens are cumulative — at 95 tool calls on prompt 06, conversation history alone reached 1.6M tokens of replayed input. That cumulative replay is the dominant cost driver, and it's exactly what cache markers would shrink.

Prompt 01 · branch linker

Narrow component-tracing question. CMA: 100s, 11 tool calls. MA: 135s, 42 tool calls.

CMA's tool-call efficiency comes from one-shot bash compositions (find ... | xargs grep ...) replacing what MA does as 5–7 sequential round-trips.

Prompt 03 · webhook intake

Multi-file pattern across routes, middleware, controllers, jobs, audit. The only prompt where CMA was slower (182s vs 154s).

CMA went deeper — 32 tool calls covering audit logger and queued job retry config. MA stopped earlier with a thinner answer. Wall-time loss paired with a richer report.

Prompt 06 · feature flag

Pennant flag enumeration. CMA: 140s, 37 tool calls. MA: 227s, 95 tool calls.

MA looped extensively — five separate searches for the #[Name] attribute, repeated reads of AppServiceProvider. Cumulative input hit 1.6M tokens. Cost gap most extreme here: 25×.

Findings

1
Cost — CMA wins decisively (~19×)
Caveat: includes the unfair-caching effect. A fair-fight version with cache_control on system prompts and tool defs would narrow to 3–5×. The rule generalises: CMA gets prompt caching for free; Messages API gets it only when you remember to opt in.
2
Speed — CMA wins on average (~18%), high variance
Prompt 03 was 28s slower on CMA. Mean tells the story; individual prompts swing both ways depending on how exploratory the agent gets. We didn't measure repeat-run variance.
3
Tool calls — CMA uses ~58% fewer
Structural, not caching-related. In-container bash composes (find | xargs grep, cat | head) replace what tool-calling Messages API needs multiple round-trips for. Effect persists in a fair-fight comparison.
4
Output quality — indistinguishable
Both paths produced clean structured reports with file:line refs. Minor differences in cited line numbers (off-by-2 here and there) on both sides. Neither wins on quality.

Honest limitations

  • Sample size of 3. Directional signal, not a tight measurement. Variance per prompt is real (prompt 03 reversed the speed result).
  • Codebase research only. Validates the read side. Writing code, running tests, opening PRs — all untested.
  • Sonnet 4.6 only. Opus would shift absolute numbers but probably keep the ratio.
  • Cold-start tax hidden. Every CMA run started cold (new session per prompt). Production might reuse sessions for warm calls.
  • MA didn't use prompt caching. Deliberate (matches how ResearchAction is wired today) but the dominant cause of the cost gap. The "real" gap is smaller.
  • Single-turn from the user's POV. Multi-turn iterative refinement was not tested.

Strategic read · what this spike justifies

Migrate

ResearchAction → CMA

The multi-turn codebase-exploration phase of app/Actions/Agent/StoryGenerationHarnessAction.php. It is the closest existing kendo precedent for the workload this spike validated.
  • Cost on the research phase drops 5–15× (conservative, after accounting for cache markers being available either runtime)
  • Latency on the research phase ties or improves
  • Tool-call count drops by half — fewer round-trips, less load on kendo backend's MCP tools
  • Production CMA infra (Agent + Environment + github_repository resource + webhook handler + audit logging) gets stood up on a real workload before betting the autonomous-execution feature on it
Keep

ValidateInput · DuplicateCheck · Classify · Write — stay on Messages API

The other 4 phases of the harness are essentially structured-output calls. They don't benefit from CMA — session-creation overhead would only add cost without speedup. CMA is the wrong shape for non-tool-heavy phases.
Don't ship yet

"Hand to Claude" autonomous-PR feature

We've only validated the read side. The write side — editing code, running tests, opening PRs from CMA — is still entirely untested in our context. Building the full autonomous-PR feature next would skip a layer of validation we haven't earned.

Reproduce

Spike folder: ~/Code/cma-spike/
# .env needs ANTHROPIC_API_KEY + GITHUB_TOKEN
# (read-only PAT scoped to script-development/kendo)
cd ~/Code/cma-spike
source .venv/bin/activate

python3 cma/setup.py                          # create Agent + Env, cache IDs
python3 messages_api/run.py prompts/<file>.md  # one MA run
python3 cma/run.py prompts/<file>.md           # one CMA run
python3 compare.py                            # results/comparison.md

# When done, stop the meter:
python3 cma/teardown.py                        # archive Agent + Env
Cost ceiling per run on Sonnet at the prompt sizes tested: $1–5 for MA, $0.05–0.50 for CMA. Drop more prompt files in prompts/ to extend the sample.

References

  • Capability survey written before this spike: research/managed-agents-kendo-evaluation.md — covers all 14 Anthropic doc pages, three plausible kendo applications, decision framework
  • Full spike write-up: research/2026-05-08-managed-agents-spike.md
  • Anthropic Managed Agents docs: platform.claude.com/docs/en/managed-agents/overview
  • Spike scaffold (local): ~/Code/cma-spike/
  • Spike Agent/Env IDs (cached, archive when done): ~/Code/cma-spike/.cache.json
  • Closest existing kendo precedent: backend/app/Actions/Agent/StoryGenerationHarnessAction.php