cc-bench: benchmarking the config, not the vibes

Built in one session: a complete claude -p benchmark pipeline at ~/git/ccChat-general/cc-bench/ — config variants, matrix runner, blind judge, report.
First experiment is an ablation (current / no-floor / no-style / ungoverned config), not a model race.
23 real prompts mined from 8 months of Claude Code history, curated for replayability and diversity.
Key finding along the way: CLAUDE_CONFIG_DIR and HOME overrides both leak the real global CLAUDE.md — the runner swaps the actual file instead.
Ready to run on weekend subscription leftovers; fully resumable mid-batch.

01 Why

The earlier claude.ai A/B/C test of the Object Floor prompt wasn't fair: it was not blind, covered one task type, and was single-turn. Meanwhile the measured Opus 4.6→4.8 regression (from the blind corpus analysis) is one of volume and hedging: 44% more words per turn and 6–14× more hedging phrases. It also grows over a session, which one-shot tests can't see.

cc-bench replays identical real prompts against config variants and scores them two ways, kept strictly separate: scripted objective metrics against Opus 4.6's measured setpoints, and a blind per-axis judge calibrated against Mike's own blind rankings.

02 What was built

Prompts curated: 23, from 52 mined / 7,015 raw

Config variants: 4 (current · no-floor · no-style · ungoverned)

Runs per batch: 80 (20 singles × 4 variants × 1 sample)

cc-bench/
- plan.md: checkbox plan, the source of truth
- setpoints.json: Opus 4.6 measured targets
- prompts/
  - set.json: curated core set (20 + 3 multi-turn)
  - candidates.json: 49-candidate reserve, mined from history
- scripts/
  - build-configs.mjs: generates CLAUDE.md variants from live global
  - run-bench.mjs: matrix runner (swap, cap, resume, restore)
  - judge.mjs: blind per-axis scoring, shuffled + label-stripped
  - report.mjs: variants vs setpoints + judge axes in one table
  - metrics.mjs: rebuilt corpus extractor (originals were wiped)
- research/
  - sonnet5-notes.md: video synthesis, Sonnet 5 traits
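
The "shuffled + label-stripped" step noted for judge.mjs can be pictured with a minimal sketch like the one below. It is not the actual judge.mjs: the function names, the seeded PRNG, and the A/B/C labels are assumptions made for illustration. The point is that the judge receives only neutral labels and text, while the label-to-variant key stays with the runner until scoring is done.

```javascript
// Illustrative sketch only; names and shapes are assumptions, not judge.mjs.

// Small seeded PRNG (mulberry32) so a shuffle can be reproduced later.
function mulberry32(seed) {
  let state = seed >>> 0;
  return function () {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// responses: [{ variant: "current", text: "..." }, ...]
// Returns what the judge may see (items) and what it may not (key).
function blind(responses, seed) {
  const rand = mulberry32(seed);
  const order = responses.map((_, i) => i);

  // Fisher-Yates shuffle so position carries no information about the variant.
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }

  const items = [];
  const key = {};
  order.forEach((srcIndex, pos) => {
    const label = String.fromCharCode(65 + pos); // "A", "B", "C", ...
    items.push({ label, text: responses[srcIndex].text }); // variant name stripped
    key[label] = responses[srcIndex].variant;               // kept out of the judge prompt
  });
  return { items, key };
}
```

Seeding the shuffle is a design choice for auditability: the same seed reproduces the same ordering, so a judge pass can be re-examined without storing the shuffled prompt itself.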

Judge axes (scored 1–5 per axis, no composite yet): object contact, non-concession, prose quality, no-sycophancy, proportionality. Objective metrics: words/turn, bold/turn, and hedging counts. The targets are Opus 4.6's measured numbers: 63.5 words/turn, 1.13 bold/turn, ~9 hedges per 402 turns.
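
As a rough illustration of how the objective metrics could be computed and compared with the setpoints, here is a minimal sketch. The setpoint values are the ones quoted above; everything else is an assumption: the object shape is not necessarily that of setpoints.json, the hedge phrase list is invented for the example, and bold is counted as Markdown `**...**` spans.

```javascript
// Illustrative sketch only; not the actual metrics.mjs or setpoints.json.

// Targets quoted in the text, expressed per turn.
const SETPOINTS = {
  wordsPerTurn: 63.5,
  boldPerTurn: 1.13,
  hedgesPerTurn: 9 / 402,
};

// Hypothetical hedge list; the real phrase list is not given in the text.
const HEDGES = ["i think", "it seems", "perhaps", "might be", "arguably"];

// turns: array of assistant turn texts from one run.
function measure(turns) {
  let words = 0;
  let bold = 0;
  let hedges = 0;
  for (const text of turns) {
    // Whitespace-delimited tokens; markup such as **x** counts as one word.
    words += (text.match(/\S+/g) || []).length;
    bold += (text.match(/\*\*[^*]+\*\*/g) || []).length;
    const lower = text.toLowerCase();
    for (const phrase of HEDGES) {
      hedges += lower.split(phrase).length - 1; // substring occurrences
    }
  }
  const n = turns.length || 1; // avoid dividing by zero on an empty run
  return {
    wordsPerTurn: words / n,
    boldPerTurn: bold / n,
    hedgesPerTurn: hedges / n,
  };
}

// Ratio > 1 means the variant overshoots the Opus 4.6 target on that metric.
function compareToSetpoints(measured, setpoints = SETPOINTS) {
  const out = {};
  for (const metric of Object.keys(setpoints)) {
    out[metric] = {
      measured: measured[metric],
      target: setpoints[metric],
      ratio: measured[metric] / setpoints[metric],
    };
  }
  return out;
}
```

Keeping this scripted pass free of any model call is what lets it stay strictly separate from the judge scores.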

03 The isolation gotcha

Warning

A marker test showed headless runs receiving both the variant CLAUDE.md and the real global one. Even HOME=/fake didn't stop it — the binary resolves the actual home for user memory (while auth follows $HOME). An "ungoverned" run would have been silently governed, invalidating the whole ablation while looking fine.

The fix: run-bench.mjs swaps the real ~/.claude/CLAUDE.md per variant batch — timestamped backup to ~/.claude/backups/ first, restore guaranteed via finally plus signal handlers. Consequence: no interactive Claude Code sessions while a batch runs.
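
The backup, swap, and restore sequence described above might be sketched as follows. This is an assumption-laden illustration rather than the code in run-bench.mjs: the function name, the option names, and the backup file naming are hypothetical, and paths are passed in by the caller instead of being hard-coded.

```javascript
// Illustrative sketch only; not the actual run-bench.mjs.
async function withSwappedConfig({ livePath, variantPath, backupDir }, runBatch) {
  // Dynamic import keeps the sketch loadable from ES modules and CommonJS alike.
  const fs = await import("node:fs");
  const path = await import("node:path");

  // 1. Timestamped backup of the live file before anything is touched.
  fs.mkdirSync(backupDir, { recursive: true });
  const stamp = new Date().toISOString().replace(/[:.]/g, "-");
  const backupPath = path.join(backupDir, `CLAUDE.md.${stamp}.bak`); // hypothetical naming
  fs.copyFileSync(livePath, backupPath);

  // 2. Restore is synchronous and idempotent so a signal handler can call it.
  let restored = false;
  const restore = () => {
    if (restored) return;
    fs.copyFileSync(backupPath, livePath);
    restored = true;
  };
  const onSignal = () => {
    restore();
    process.exit(1);
  };
  process.once("SIGINT", onSignal);
  process.once("SIGTERM", onSignal);

  try {
    // 3. Put the variant in place, then run the batch against it.
    fs.copyFileSync(variantPath, livePath);
    return await runBatch();
  } finally {
    // 4. Runs on normal completion and on thrown errors.
    restore();
    process.removeListener("SIGINT", onSignal);
    process.removeListener("SIGTERM", onSignal);
  }
}
```

The `finally` block covers normal completion and exceptions, and the handlers cover Ctrl+C and SIGTERM. A signal that cannot be caught (SIGKILL, for example) would bypass both, which is where the timestamped backup matters.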

04 How to run it

Bash

cd ~/git/ccChat-general/cc-bench
node scripts/build-configs.mjs            # regenerate variants from live config
node scripts/run-bench.mjs                # 80 runs — Ctrl+C anytime, resumable
node scripts/judge.mjs results/<runId>    # blind scoring pass
node scripts/report.mjs results/<runId>   # the table

Budget guards are baked in: each run is capped at 15 turns and 10 minutes, and results are written one file per run, so a batch can stop and resume across weekends without losing work.
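
One way per-file results can make a batch resumable is sketched below: enumerate the full matrix, then skip any run whose result file already exists. The file naming scheme and function names are assumptions for illustration, not taken from run-bench.mjs.

```javascript
// Illustrative sketch only; not the actual run-bench.mjs.

// Hypothetical naming scheme: one result file per (prompt, variant, sample).
function resultFileName(promptId, variant, sample) {
  return `${promptId}__${variant}__s${sample}.json`;
}

// Variant-major order, so each config swap covers one contiguous batch.
function buildMatrix(promptIds, variants, samples = 1) {
  const runs = [];
  for (const variant of variants) {
    for (const promptId of promptIds) {
      for (let sample = 1; sample <= samples; sample++) {
        runs.push({
          promptId,
          variant,
          sample,
          file: resultFileName(promptId, variant, sample),
        });
      }
    }
  }
  return runs;
}

// existingFiles: file names already present in the results directory.
function pendingRuns(matrix, existingFiles) {
  const done = new Set(existingFiles);
  return matrix.filter((run) => !done.has(run.file));
}
```

With 20 prompts, 4 variants, and 1 sample, `buildMatrix` yields the 80 runs of a batch. After an interruption, passing the directory listing to `pendingRuns` leaves only the runs that still need to execute.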

05 What's left

Done:
- Workspace, prompt set, all four pipeline scripts, config variants
- Isolation mechanism verified live (and its leak worked around)
- Sonnet 5 research: the Fable-like trait is orchestration, not prose; the model is token-hungry, and only the Low/Medium tiers are decent value

Remaining:
- Pilot batch on weekend subscription leftovers, to measure real burn per run
- Mike blind-ranks ~10 pairs to calibrate the judge
- Multi-turn drift runner (3 scripted sequences, --resume chaining)
- Later: model comparison (Opus 4.8 vs Sonnet 5) on the winning config