SpikeCodebase research workload — Sonnet 4.6 — 3 prompts2026-05-08
Cost reduction
19×
$9.72 → $0.51 across 3 prompts
Wall time
−18%
avg 172.3s → 140.7s per prompt
Tool calls
−58%
avg 64 → 27 per prompt
Output quality
≈
indistinguishable on rubric
Validated claim: for codebase research as a workload, CMA is the right tool. The cost gap is dominated
by automatic prompt caching in CMA sessions — a Messages-API path that opted into cache_control would narrow
the gap to roughly 3–5×, but CMA gets caching for free where Messages API has to opt in deliberately. Spike does not
validate CMA for writing code, running tests, or opening PRs. Those remain untested.
Aggregate comparison 3 prompts, totals
Total cost
USD across 3 prompts
Messages API
$9.72
baseline
CMA
$0.51
−95% (19× cheaper)
Wall time
total seconds
Messages API
517s
baseline
CMA
422s
−18% faster
Tool calls
avg per prompt
Messages API
64
baseline
CMA
27
−58% fewer
Spike setup
Messages API path
Runtimedirect client.messages.create loop
Toolssearch_code (ripgrep), read_file, list_files
Repo accesslocal kendo clone on disk
Cachingnone — no cache_control markers
Stopstop_reason != "tool_use"
CMA path
RuntimeSession on a pre-created Agent + Environment
Both paths use Sonnet 4.6, share an output cap of MAX_TOKENS_PER_RUN=4096,
and use aligned system prompts. The comparison is on the runtime, not the framing.
Per-prompt verbatim numbers 3 prompts
Prompt
MA wall
CMA wall
MA tool calls
CMA tool calls
MA cost
CMA cost
Cost ratio
01-find-branch-linker-usages
135.3s
100.2s
42
11
$1.95
$0.07
28×
03-webhook-intake-pattern
154.4s
182.3s
57
32
$2.77
$0.24
12×
06-feature-flag-pattern
227.2s
139.6s
95
37
$5.01
$0.20
25×
Average
172.3s
140.7s
64
27
$3.24
$0.17
19×
On token columns: CMA reports ~10–20 input tokens per run because system prompt, tool definitions, and accumulated history all hit the prompt cache (cache reads ranged 105k–513k per prompt). Messages-API input tokens are cumulative — at 95 tool calls on prompt 06, conversation history alone reached 1.6M tokens of replayed input. That cumulative replay is the dominant cost driver, and it's exactly what cache markers would shrink.
CMA's tool-call efficiency comes from one-shot bash compositions
(find ... | xargs grep ...)
replacing what MA does as 5–7 sequential round-trips.
Prompt 03 · webhook intake
Multi-file pattern across routes, middleware, controllers, jobs, audit. The only prompt where CMA was slower (182s vs 154s).
CMA went deeper — 32 tool calls covering audit logger and queued job retry config. MA stopped earlier with a thinner answer. Wall-time loss paired with a richer report.
MA looped extensively — five separate searches for the #[Name] attribute, repeated reads of AppServiceProvider. Cumulative input hit 1.6M tokens. Cost gap most extreme here: 25×.
Findings
1
Cost — CMA wins decisively (~19×)
Caveat: includes the unfair-caching effect. A fair-fight version with cache_control on system prompts and tool defs would narrow to 3–5×. The rule generalises: CMA gets prompt caching for free; Messages API gets it only when you remember to opt in.
2
Speed — CMA wins on average (~18%), high variance
Prompt 03 was 28s slower on CMA. Mean tells the story; individual prompts swing both ways depending on how exploratory the agent gets. We didn't measure repeat-run variance.
3
Tool calls — CMA uses ~58% fewer
Structural, not caching-related. In-container bash composes (find | xargs grep, cat | head) replace what tool-calling Messages API needs multiple round-trips for. Effect persists in a fair-fight comparison.
4
Output quality — indistinguishable
Both paths produced clean structured reports with file:line refs. Minor differences in cited line numbers (off-by-2 here and there) on both sides. Neither wins on quality.
Honest limitations
Sample size of 3. Directional signal, not a tight measurement. Variance per prompt is real (prompt 03 reversed the speed result).
Codebase research only. Validates the read side. Writing code, running tests, opening PRs — all untested.
Sonnet 4.6 only. Opus would shift absolute numbers but probably keep the ratio.
Cold-start tax hidden. Every CMA run started cold (new session per prompt). Production might reuse sessions for warm calls.
MA didn't use prompt caching. Deliberate (matches how ResearchAction is wired today) but the dominant cause of the cost gap. The "real" gap is smaller.
Single-turn from the user's POV. Multi-turn iterative refinement was not tested.
Strategic read · what this spike justifies
Migrate
ResearchAction → CMA
The multi-turn codebase-exploration phase of app/Actions/Agent/StoryGenerationHarnessAction.php.
It is the closest existing kendo precedent for the workload this spike validated.
Cost on the research phase drops 5–15× (conservative, after accounting for cache markers being available either runtime)
Latency on the research phase ties or improves
Tool-call count drops by half — fewer round-trips, less load on kendo backend's MCP tools
Production CMA infra (Agent + Environment + github_repository resource + webhook handler + audit logging) gets stood up on a real workload before betting the autonomous-execution feature on it
Keep
ValidateInput · DuplicateCheck · Classify · Write — stay on Messages API
The other 4 phases of the harness are essentially structured-output calls. They don't benefit from CMA — session-creation overhead would only add cost without speedup. CMA is the wrong shape for non-tool-heavy phases.
Don't ship yet
"Hand to Claude" autonomous-PR feature
We've only validated the read side. The write side — editing code, running tests, opening PRs from CMA — is still entirely untested in our context. Building the full autonomous-PR feature next would skip a layer of validation we haven't earned.
Reproduce
Spike folder: ~/Code/cma-spike/
# .env needs ANTHROPIC_API_KEY + GITHUB_TOKEN
# (read-only PAT scoped to script-development/kendo)cd ~/Code/cma-spike
source .venv/bin/activate
python3 cma/setup.py # create Agent + Env, cache IDspython3 messages_api/run.py prompts/<file>.md # one MA runpython3 cma/run.py prompts/<file>.md # one CMA runpython3 compare.py # results/comparison.md# When done, stop the meter:python3 cma/teardown.py # archive Agent + Env
Cost ceiling per run on Sonnet at the prompt sizes tested: $1–5 for MA, $0.05–0.50 for CMA. Drop more prompt files in prompts/ to extend the sample.
References
Capability survey written before this spike: research/managed-agents-kendo-evaluation.md — covers all 14 Anthropic doc pages, three plausible kendo applications, decision framework
Full spike write-up: research/2026-05-08-managed-agents-spike.md