# CacheSphere benchmark methodology

This page documents how CacheSphere collects, reviews, and reports benchmark evidence. Read it before interpreting the numbers on [the proof page](/proof) or in `public/api/benchmark-summary.json`.

## What we are trying to measure

CacheSphere's core claim is that **selected, compact context can produce equal or better agent decisions with smaller prompts than either no context or full raw records**. The benchmark loop tests that claim by running the same task under three modes:

- **no-cache** — task prompt only.
- **raw-record** — task prompt plus the full relevant CacheSphere records the task would touch.
- **compact-pack** — task prompt plus a small set of context packs chosen by the task-pack-map and Decision Brief logic.

The loop records **token usage**, **wall time**, **repair iterations**, and **rubric scores**. Token savings alone do not make a run pass; the output must also hold up on engineering quality.
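To make the three modes concrete, here is a minimal sketch of how a prompt could be assembled per mode. The function `buildPrompt` and the `rawRecords` / `packs` fields are hypothetical names chosen for illustration; the actual assembly logic in CacheSphere is not described on this page and may differ.

```javascript
// Hypothetical sketch: names and shapes are illustrative only,
// not taken from the CacheSphere codebase.
const MODES = ["no-cache", "raw-record", "compact-pack"];

function buildPrompt(mode, taskPrompt, context = {}) {
  const { rawRecords = [], packs = [] } = context;
  switch (mode) {
    case "no-cache":
      // Task prompt only.
      return taskPrompt;
    case "raw-record":
      // Task prompt plus every full record the task would touch.
      return [taskPrompt, ...rawRecords].join("\n\n");
    case "compact-pack":
      // Task prompt plus the small set of selected context packs.
      return [taskPrompt, ...packs].join("\n\n");
    default:
      throw new Error(`Unknown mode: ${mode}`);
  }
}

// Usage: compare prompt sizes for the same task under each mode.
const exampleContext = {
  rawRecords: ["<full record A>", "<full record B>"],
  packs: ["<compact pack summary>"],
};
for (const mode of MODES) {
  const prompt = buildPrompt(mode, "Add a --filter flag to the export CLI.", exampleContext);
  console.log(mode, prompt.length);
}
```

Keeping the task prompt identical across modes is what makes the comparison meaningful: only the attached context varies.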

## Task selection

A task enters the dogfood set only if it:

1. Is representative of real CacheSphere operator work (CLI tool, API implementation, type-safety hardening, database transaction, profiling, security review, etc.).
2. Has a clear, bounded acceptance checklist in `benchmarks/fixtures/tasks.json`.
3. Has a stable task-pack-map entry so the `compact-pack` mode is deterministic.
4. Produces artifacts that a human can inspect (code, tests, review notes).

We deliberately bias toward **small, repeatable tasks** rather than heroic multi-hour projects. Small tasks make it cheap to run many model×mode combinations and easy to review outputs.
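Criteria 2 and 3 are mechanical enough to check in code. The sketch below assumes a task object with an `id` and an `acceptance` array, and a task-pack-map shaped as a plain object from task id to pack ids. Both shapes are assumptions made for illustration; the real schema of `benchmarks/fixtures/tasks.json` is not shown on this page.

```javascript
// Hypothetical shapes: `task.acceptance` and the task-pack-map layout
// are assumptions, not the documented fixture schema.
function taskEligibilityProblems(task, taskPackMap) {
  const problems = [];

  // Criterion 2: a clear, bounded acceptance checklist.
  if (!Array.isArray(task.acceptance) || task.acceptance.length === 0) {
    problems.push("missing acceptance checklist");
  }

  // Criterion 3: a stable task-pack-map entry, so compact-pack is deterministic.
  const packIds = taskPackMap[task.id];
  if (!Array.isArray(packIds) || packIds.length === 0) {
    problems.push("missing task-pack-map entry");
  }

  return problems;
}

// Usage with the task and pack ids from the capture example on this page.
const exampleTask = {
  id: "cli-export-filter",
  acceptance: ["<acceptance check 1>", "<acceptance check 2>"],
};
const exampleMap = {
  "cli-export-filter": ["agent-memory-compact", "solo-agentic-coding"],
};
console.log(taskEligibilityProblems(exampleTask, exampleMap));
```

Criteria 1 and 4 (representativeness and human-inspectable artifacts) remain judgment calls and are not covered by this check.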

## Model and provider constraints

Benchmark model costs are funded out of operator time and budget, not investor runway. To keep the loop reproducible at low cost, runs use only these providers:

- **OpenAI Codex** — via `chatgpt.com` backend or `/v1` API when available.
- **LLM Gateway** — internal Nous gateway when reachable.
- **Ollama Cloud** — hosted models via `https://ollama.com/v1`.
- **OpenRouter free-tier models** — `:free` suffix models only.

We do not benchmark paid endpoints as the primary signal. If a paid run is captured for comparison, it is labeled `paid_baseline` in the notes and kept separate from the main free-tier evidence set.
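The provider rule can be expressed as a small predicate. This is a sketch under assumptions: only the `openrouter` provider id appears on this page, so the other ids below are hypothetical labels, and the real scripts may enforce the rule differently or not at all.

```javascript
// Provider ids other than "openrouter" are hypothetical labels for the
// providers listed above; adjust them to whatever the run files actually use.
const ALLOWED_PROVIDERS = new Set([
  "openai-codex",
  "llm-gateway",
  "ollama-cloud",
  "openrouter",
]);

// Returns true when a run may count toward the main free-tier evidence set.
function isPrimaryEvidence(provider, model) {
  if (!ALLOWED_PROVIDERS.has(provider)) return false;
  // OpenRouter runs qualify only for models carrying the `:free` suffix.
  if (provider === "openrouter") return model.endsWith(":free");
  return true;
}

// Usage
console.log(isPrimaryEvidence("openrouter", "google/gemma-4-31b-it:free"));
console.log(isPrimaryEvidence("openrouter", "google/gemma-4-31b-it"));
```

A run that fails this check would be a candidate for the `paid_baseline` label rather than for the main evidence set.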

## Capturing a run

Two scripts produce run files:

### Manual / reviewed capture

Use `capture-run.mjs` when a human has reviewed the model output and can assign rubric scores:

```bash
node benchmarks/scripts/capture-run.mjs \
  --suite phase7-first-pilot \
  --task cli-export-filter \
  --mode compact-pack \
  --provider openrouter \
  --model google/gemma-4-31b-it:free \
  --input-tokens 2163 \
  --output-tokens 0 \
  --status passed \
  --rubric-decision 4 \
  --rubric-implementation 4 \
  --rubric-constraint 4 \
  --rubric-efficiency 5 \
  --context-pack-ids agent-memory-compact,solo-agentic-coding
```

### Automatic calibration

Use `calibrate-model.mjs` to call a live API and record usage without human review:

```bash
node benchmarks/scripts/calibrate-model.mjs \
  --task cli-export-filter \
  --provider openrouter \
  --model google/gemma-4-31b-it:free \
  --modes no-cache,raw-record,compact-pack
```

Calibration outputs are always marked `status: partial` and tagged `unreviewed_model_output`. They are evidence of token behavior, not proof of engineering quality.
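As an illustration of that rule, the following sketch stamps a run record the way a calibration output is described above. The record shape (a `status` string and a `notes` array of strings) is an assumption; `calibrate-model.mjs` may store these values differently.

```javascript
// Assumed record shape: { status: string, notes?: string[], ... }.
// Only the `partial` status and the `unreviewed_model_output` tag come
// from this page; everything else is illustrative.
function asCalibrationRecord(run) {
  const notes = new Set(run.notes ?? []);
  notes.add("unreviewed_model_output"); // a Set avoids duplicate tags
  return { ...run, status: "partial", notes: [...notes] };
}

// Usage
const calibrated = asCalibrationRecord({
  task: "cli-export-filter",
  mode: "compact-pack",
  inputTokens: 2163,
});
console.log(calibrated.status, calibrated.notes);
```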

## Review and rubric

Every run must be reviewed before it can count as `passed`. Reviewers score four dimensions on a 1–5 scale:

| Dimension | Question |
|-----------|----------|
| **decisionQuality** | Did the model make the right high-level choices (stack, file shape, approach)? |
| **implementationQuality** | Is the produced code/test/review clear, correct, and idiomatic? |
| **constraintAdherence** | Did the output satisfy the task acceptance checks and avoid obvious oversights? |
| **contextEfficiency** | Did the attached context look well-matched to the task (relevant, minimal, no bloat)? |

A run with `rubricScores` and no `unreviewed_model_output` tag counts as reviewed. A run without scores, or one that still carries that tag, counts as `partial`.
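The reviewed-versus-partial rule can be sketched as a predicate over a run record. The four dimension names and the 1–5 range come from the table above; storing the tag in a `notes` array is an assumption about the run file format.

```javascript
const RUBRIC_DIMENSIONS = [
  "decisionQuality",
  "implementationQuality",
  "constraintAdherence",
  "contextEfficiency",
];

// Assumes `run.rubricScores` is an object keyed by dimension name and
// `run.notes` is an optional array of string tags (hypothetical layout).
function isReviewed(run) {
  const scores = run.rubricScores;
  const notes = run.notes ?? [];
  if (!scores || notes.includes("unreviewed_model_output")) return false;
  // Every dimension must carry a score within the 1-5 scale.
  return RUBRIC_DIMENSIONS.every((dimension) => {
    const score = scores[dimension];
    return typeof score === "number" && score >= 1 && score <= 5;
  });
}

// Usage with the scores from the capture example on this page.
const reviewedRun = {
  rubricScores: {
    decisionQuality: 4,
    implementationQuality: 4,
    constraintAdherence: 4,
    contextEfficiency: 5,
  },
};
console.log(isReviewed(reviewedRun));
```

Requiring all four dimensions is a stricter reading than "has `rubricScores`"; relax the final check if partially scored runs should count.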

## Normalization and summary

1. `benchmarks/scripts/calibrate-model.mjs` or `benchmarks/scripts/capture-run.mjs` writes individual `benchmarks/runs/*.json` files.
2. `benchmarks/scripts/write-results.mjs` merges raw runs into the normalized `benchmarks/runs/phase7-first-pilot.mixed.results.json`.
3. `benchmarks/scripts/generate-proof-summary.mjs` produces `public/api/benchmark-summary.json` for the proof page.
4. `npm run build` regenerates the summary and bakes it into the static site.

The proof page renders:

- evidence verdict (`partial reviewed evidence`, `early evidence`, etc.);
- coverage counters (runs, models, complete task×mode triples);
- per-mode token usage;
- per-model coverage and median tokens;
- links to raw result files.
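For readers who want to sanity-check the token figures, a per-mode median can be recomputed from the run files along these lines. This is a sketch, not the logic of `generate-proof-summary.mjs`: the `mode` and `inputTokens` field names are assumptions, and the real summary may aggregate differently.

```javascript
// Median of a numeric array; returns null for an empty array.
function median(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Groups runs by mode and reports the median input-token count per mode.
// `run.mode` and `run.inputTokens` are hypothetical field names.
function medianInputTokensByMode(runs) {
  const tokensByMode = {};
  for (const run of runs) {
    if (!tokensByMode[run.mode]) tokensByMode[run.mode] = [];
    tokensByMode[run.mode].push(run.inputTokens);
  }
  const result = {};
  for (const [mode, tokens] of Object.entries(tokensByMode)) {
    result[mode] = median(tokens);
  }
  return result;
}

// Usage with made-up token counts (not real benchmark data).
console.log(
  medianInputTokensByMode([
    { mode: "no-cache", inputTokens: 100 },
    { mode: "no-cache", inputTokens: 300 },
    { mode: "compact-pack", inputTokens: 200 },
  ])
);
```

A median is less sensitive than a mean to a single unusually long prompt, which matters while the sample is small.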

## What counts as credible evidence

A useful claim requires:

- **Task coverage** — the same task run across all three modes.
- **Multiple models** — at least two distinct providers or model families.
- **Complete triples** — a task with no-cache, raw-record, and compact-pack all captured.
- **Reviewed outputs** — rubric scores, not just token counts.
- **Raw artifacts** — output text and pack ids preserved so the evidence can be challenged.

Until all of these are true, the proof page shows a `partial` verdict and the summary caveats the data.
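The checklist above can be approximated in code to see which criteria a set of runs still fails. The sketch makes several assumptions that are not confirmed by this page: runs carry `task`, `mode`, `model`, `rubricScores`, and `outputText` fields; distinct `model` strings stand in for "providers or model families"; and triples are grouped by task only.

```javascript
const MODES = ["no-cache", "raw-record", "compact-pack"];

// Returns the names of unmet criteria; an empty array means none were found.
// All field names are hypothetical stand-ins for the real run schema.
function evidenceGaps(runs) {
  const gaps = [];

  // Complete triples: at least one task captured in all three modes.
  const modesByTask = new Map();
  for (const run of runs) {
    if (!modesByTask.has(run.task)) modesByTask.set(run.task, new Set());
    modesByTask.get(run.task).add(run.mode);
  }
  const hasCompleteTriple = [...modesByTask.values()].some((modes) =>
    MODES.every((mode) => modes.has(mode))
  );
  if (!hasCompleteTriple) gaps.push("complete-triples");

  // Multiple models: distinct model strings as a rough proxy for families.
  if (new Set(runs.map((run) => run.model)).size < 2) gaps.push("multiple-models");

  // Reviewed outputs: every run carries rubric scores.
  if (!runs.every((run) => run.rubricScores)) gaps.push("reviewed-outputs");

  // Raw artifacts: output text preserved for every run.
  const hasText = (run) =>
    typeof run.outputText === "string" && run.outputText.length > 0;
  if (!runs.every(hasText)) gaps.push("raw-artifacts");

  return gaps;
}

// Usage
console.log(evidenceGaps([]));
```

A stricter variant would group triples by task and model together, so that modes are only compared within the same model.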

## Known limitations and caveats

- **Small sample.** The phase-7 first-pilot suite currently covers a handful of tasks. Do not generalize to all agent work.
- **Free-tier variability.** OpenRouter free models can rate-limit, change shape, or disappear. Results are snapshots, not longitudinal guarantees.
- **No automatic artifact execution.** We do not currently compile and run generated code in CI. Engineering quality is human-reviewed.
- **Selection bias.** Tasks were chosen by the CacheSphere team because they exercise CacheSphere's strengths. Independent task selection would be stronger.
- **Rubric subjectivity.** Scores are human judgments. Cross-reviewer calibration is future work.
- **Token-count estimates.** Token counts come from API usage fields where available; a rough estimate of character count ÷ 4 is used only when the API does not report usage.
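The token-count fallback in the last bullet might look like the sketch below. It assumes an OpenAI-compatible `usage` object with a `prompt_tokens` field, and it rounds the character estimate up; both choices are assumptions, and the benchmark scripts may read different fields or round differently.

```javascript
// Prefers the API-reported count; falls back to characters / 4.
// `usage.prompt_tokens` follows the OpenAI-compatible response shape,
// which is assumed here rather than confirmed for every provider.
function inputTokenCount(usage, promptText) {
  if (usage && Number.isFinite(usage.prompt_tokens)) {
    return { tokens: usage.prompt_tokens, estimated: false };
  }
  // String length counts UTF-16 code units, so this is a rough figure only.
  return { tokens: Math.ceil(promptText.length / 4), estimated: true };
}

// Usage
console.log(inputTokenCount({ prompt_tokens: 2163 }, "ignored when usage exists"));
console.log(inputTokenCount(undefined, "abcdefgh"));
```

Carrying the `estimated` flag alongside the number lets a summary keep measured and estimated counts apart.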

## Operator run status (2026-06-23)

The catalogue now lists **37 benchmark tasks**, but reviewed runs still cover a **subset** (see `public/api/benchmark-summary.json`). Expanding reviewed coverage requires **real** agent executions — not synthetic JSON.

**Current blocker for autonomous expansion:** live model runs need provider credentials (`OPENAI_API_KEY`, OpenRouter/Ollama endpoints, etc.) and human review of outputs before rubric scores are recorded. The Cloud Agent / CI environment has **no `.env` credentials**, and automated runs there **must not** fabricate pass/fail results or token counts.

**When unblocked, the minimal path is:**

1. Pick 2–3 new task ids from `benchmarks/fixtures/tasks.json` that lack complete mode triples.
2. Run each task in `no-cache`, `raw-record`, and `compact-pack` via the normal agent workflow.
3. Record reviewed results with `benchmarks/scripts/capture-run.mjs` (or calibrate first, then review).
4. Regenerate with `node benchmarks/scripts/generate-proof-summary.mjs` and verify public copy still separates catalogue size from reviewed-run count.

Until then, treat the proof page as **honest partial evidence**, not full catalogue coverage.
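Step 1 of the path above amounts to finding task ids whose mode triple is incomplete. A minimal sketch, assuming the task ids have already been read from `benchmarks/fixtures/tasks.json` and that each run record has `task` and `mode` fields (both field names are assumptions):

```javascript
const MODES = ["no-cache", "raw-record", "compact-pack"];

// Maps each task id with an incomplete triple to the modes still missing.
// Runs for task ids not in `taskIds` are ignored.
function tasksMissingModes(taskIds, runs) {
  const captured = new Map(taskIds.map((id) => [id, new Set()]));
  for (const run of runs) {
    captured.get(run.task)?.add(run.mode);
  }
  const missingByTask = {};
  for (const [id, modes] of captured) {
    const missing = MODES.filter((mode) => !modes.has(mode));
    if (missing.length > 0) missingByTask[id] = missing;
  }
  return missingByTask;
}

// Usage with placeholder task ids (not real catalogue entries).
console.log(
  tasksMissingModes(
    ["task-a", "task-b"],
    [
      { task: "task-a", mode: "no-cache" },
      { task: "task-a", mode: "raw-record" },
      { task: "task-a", mode: "compact-pack" },
      { task: "task-b", mode: "no-cache" },
    ]
  )
);
```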

## Reproducing the evidence

```bash
git clone https://github.com/Idaluna-Labs/cachesphere.git
cd cachesphere
npm install
npm run build

# Inspect the generated summary
cat public/api/benchmark-summary.json

# Inspect the normalized results
cat benchmarks/runs/phase7-first-pilot.mixed.results.json

# Serve the proof page locally
python3 -m http.server 8765 --directory dist
# open http://127.0.0.1:8765/proof
```

## How to challenge or extend

- Add a new task in `benchmarks/fixtures/tasks.json`, add its task-pack-map entry, and run all three modes.
- Review an existing `partial` run and update it to `passed` with rubric scores.
- Capture a run with a different allowed model and regenerate the summary.
- Open an issue with the run id and the rubric score you disagree with.

The methodology itself is versioned informally by the "Last updated" date at the end of this file. Major changes to task selection, rubric, or provider rules will be noted in a changelog section added to this page.

---

*Last updated: 2026-06-23.*