Benchmark snapshot

How compact agent context compares to simpler baselines. This is early evidence, not a final verdict.

Aggregate stats from reviewed benchmark runs comparing compact agent context to simpler baselines. Token savings alone never count as a pass.

Methodology & raw data

How we benchmark agent context

Each task runs in three modes: task-only (no-cache), full catalogue records (raw-record), and selected compact packs (compact-pack). For every run we track tokens, rubric scores, and review status. Token savings alone never count as a pass.
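As a rough illustration of why token savings and passing are kept separate, here is a minimal Python sketch. The mode names come from the text above; the ModeResult type, the 0.0 to 1.0 rubric scale, and the 0.8 threshold are assumptions made for this example, not the project's actual schema or pass rule.

```python
from dataclasses import dataclass

# Mode names as given in the methodology text.
MODES = ("no-cache", "raw-record", "compact-pack")


@dataclass(frozen=True)
class ModeResult:
    """Hypothetical per-mode result for one task and one model."""
    mode: str
    total_tokens: int
    rubric_score: float  # assumed scale: 0.0 to 1.0
    reviewed: bool


def token_savings(baseline: ModeResult, candidate: ModeResult) -> float:
    """Fraction of tokens saved by candidate relative to baseline."""
    if baseline.total_tokens <= 0:
        raise ValueError("baseline must have a positive token count")
    return 1 - candidate.total_tokens / baseline.total_tokens


def passes(result: ModeResult, min_score: float = 0.8) -> bool:
    """Assumed pass rule: reviewed AND rubric score at or above a threshold.

    Token counts are deliberately not an input, so a cheaper run cannot
    pass on savings alone.
    """
    return result.reviewed and result.rubric_score >= min_score


raw = ModeResult("raw-record", total_tokens=1000, rubric_score=0.9, reviewed=True)
compact = ModeResult("compact-pack", total_tokens=250, rubric_score=0.5, reviewed=True)

savings = token_savings(raw, compact)  # large savings...
compact_ok = passes(compact)           # ...but the low rubric score still fails
```

In this sketch the compact-pack run uses far fewer tokens than the raw-record run, yet it does not pass because its rubric score is below the assumed threshold.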

What counts as proof?

Every run records model, mode, task, token usage, selected context IDs, prompt hash, wall time, and output text.
Unreviewed model completions are marked partial, not passed.
A strong claim needs task coverage, multiple models, complete mode comparisons, and reviewed rubric scores.
Read how benchmark evidence is collected and reviewed before citing numbers externally.
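The rules above can be sketched in Python as follows. Everything here is illustrative: the field names mirror the list above, but the RunRecord type, the "failed" status, the 0.8 threshold, the SHA-256 prompt hash, and the two-model minimum are assumptions for the example, not the project's actual format or criteria.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

MODES = {"no-cache", "raw-record", "compact-pack"}
PASS_THRESHOLD = 0.8  # assumption: rubric scores on a 0.0 to 1.0 scale


@dataclass
class RunRecord:
    """Hypothetical record holding the fields each run is said to capture."""
    model: str
    mode: str
    task: str
    token_usage: int
    selected_context_ids: list
    prompt_hash: str
    wall_time_s: float
    output_text: str
    reviewed: bool = False
    rubric_score: Optional[float] = None  # stays None until a reviewer scores it


def hash_prompt(prompt: str) -> str:
    """Assumed hashing scheme: SHA-256 hex digest of the UTF-8 prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def status(run: RunRecord) -> str:
    """Unreviewed completions are 'partial', never 'passed'."""
    if not run.reviewed or run.rubric_score is None:
        return "partial"
    # "failed" is a label invented for this sketch.
    return "passed" if run.rubric_score >= PASS_THRESHOLD else "failed"


def supports_strong_claim(runs, required_tasks, min_models=2) -> bool:
    """Check task coverage, multiple models, complete modes, reviewed scores."""
    reviewed = [r for r in runs if status(r) != "partial"]
    if not set(required_tasks) <= {r.task for r in reviewed}:
        return False  # some required task has no reviewed run
    if len({r.model for r in reviewed}) < min_models:
        return False  # not enough distinct models
    modes_seen = {}
    for r in reviewed:
        modes_seen.setdefault((r.model, r.task), set()).add(r.mode)
    # Every (model, task) pair needs all three modes to be comparable.
    return all(modes >= MODES for modes in modes_seen.values())
```

Only reviewed runs count toward any of the checks, so a batch of unreviewed completions can never support a strong claim regardless of how many models or tasks it covers.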

Machine-readable artifacts

JavaScript required for live stats

Methodology and raw data links above remain available without JavaScript. For the latest aggregate numbers, open /api/benchmark-summary.json or read benchmark-report.md.