Session readout
cc-bench: benchmarking the config, not the vibes
A personal benchmark suite that replaces the claude.ai A/B with a blind, replayable ablation — is the Object Floor actually earning its place in CLAUDE.md?
- Built in one session: a complete claude -p benchmark pipeline at ~/git/ccChat-general/cc-bench/ (config variants, matrix runner, blind judge, report).
- First experiment is an ablation (current / no-floor / no-style / ungoverned config), not a model race.
- 23 real prompts mined from 8 months of Claude Code history, curated for replayability and diversity.
- Key finding along the way: CLAUDE_CONFIG_DIR and HOME overrides both leak the real global CLAUDE.md; the runner swaps the actual file instead.
- Ready to run on weekend subscription leftovers; fully resumable mid-batch.
01 Why
The earlier claude.ai A/B/C test of the Object Floor prompt wasn't fair: not blind, one task type, single-turn. Meanwhile the measured Opus 4.6→4.8 regression (from the blind corpus analysis) is volume + hedging — +44% words/turn, 6–14× more hedging phrases — and grows over a session, which one-shot tests can't see.
cc-bench replays identical real prompts against config variants and scores them two ways, kept strictly separate: scripted objective metrics against Opus 4.6's measured setpoints, and a blind per-axis judge calibrated against Mike's own blind rankings.
02 What was built
- cc-bench
  - plan.md: checkbox plan, the source of truth
  - setpoints.json: Opus 4.6 measured targets
  - prompts
    - set.json: curated core set (20 + 3 multi-turn)
    - candidates.json: 49-candidate reserve, mined from history
  - scripts
    - build-configs.mjs: generates CLAUDE.md variants from the live global config
    - run-bench.mjs: matrix runner (swap, cap, resume, restore)
    - judge.mjs: blind per-axis scoring, shuffled + label-stripped
    - report.mjs: variants vs setpoints + judge axes in one table
    - metrics.mjs: rebuilt corpus extractor (originals were wiped)
  - research
    - sonnet5-notes.md: video synthesis of Sonnet 5 traits
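The "shuffled + label-stripped" step in judge.mjs can be pictured with a minimal sketch. This is not the actual script: buildBlindPacket, the seed parameter, and the A/B/C/D labels are assumptions made for illustration. The idea is that the judge sees only neutral labels, while the label-to-variant key is kept aside for the report.

```javascript
// Sketch only: function names, seed handling and label scheme are hypothetical.

// Small seeded PRNG (mulberry32-style) so a shuffle can be replayed exactly.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// responses: [{ variant: 'current', text: '...' }, ...] for ONE prompt.
// Returns the judge-facing items (no variant names) plus a private key.
function buildBlindPacket(responses, seed) {
  const rand = seededRandom(seed);
  const order = responses.map((_, i) => i);
  // Fisher-Yates shuffle of the presentation order.
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  const items = [];
  const key = {};
  order.forEach((srcIndex, pos) => {
    const label = String.fromCharCode(65 + pos); // 'A', 'B', 'C', ...
    items.push({ label, text: responses[srcIndex].text });
    key[label] = responses[srcIndex].variant; // never shown to the judge
  });
  return { items, key };
}
```

Only items would go into the judge prompt; key stays on disk until scoring is finished, and the seed makes any single judging pass reproducible.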
Judge axes (1–5, per axis, no composite yet): object contact, non-concession, prose quality, no-sycophancy, proportionality. Objective metrics: words/turn, bold/turn, hedging counts — targets are 4.6's numbers (63.5 w/t, 1.13 bold/t, ~9 hedges per 402 turns).
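The objective side can be sketched as a small pure function. The three setpoints below are the numbers quoted above; the hedge phrase list, the bold regex, and the function names are assumptions, since the real extractor in metrics.mjs is not shown here.

```javascript
// Setpoints quoted from this readout (Opus 4.6). Hedges are converted to a per-turn rate.
const SETPOINTS = { wordsPerTurn: 63.5, boldPerTurn: 1.13, hedgesPerTurn: 9 / 402 };

// Hypothetical phrase list: the real hedging list is not given in this readout.
const HEDGE_PHRASES = ['it depends', 'i might be wrong', 'to be fair'];

function countOccurrences(text, phrase) {
  return text.toLowerCase().split(phrase).length - 1;
}

// turns: array of assistant turn strings from one run.
function measure(turns) {
  let words = 0;
  let bold = 0;
  let hedges = 0;
  for (const turn of turns) {
    words += turn.trim().split(/\s+/).filter(Boolean).length;
    bold += (turn.match(/\*\*[^*]+\*\*/g) || []).length; // markdown **bold** spans
    for (const phrase of HEDGE_PHRASES) hedges += countOccurrences(turn, phrase);
  }
  const n = turns.length || 1; // avoid dividing by zero on an empty run
  return { wordsPerTurn: words / n, boldPerTurn: bold / n, hedgesPerTurn: hedges / n };
}

// Ratio of each measured value to its setpoint: 1.0 means "on target".
function ratioToSetpoints(measured) {
  return Object.fromEntries(
    Object.keys(SETPOINTS).map((k) => [k, measured[k] / SETPOINTS[k]])
  );
}
```

Reporting ratios rather than raw numbers keeps the three metrics comparable in one table, and a ratio of 1.44 on wordsPerTurn would correspond to the +44% regression described earlier.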
03 The isolation gotcha
A marker test showed headless runs receiving both the variant CLAUDE.md and the real global one. Even HOME=/fake didn't stop it — the binary resolves the actual home for user memory (while auth follows $HOME). An "ungoverned" run would have been silently governed, invalidating the whole ablation while looking fine.
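A marker test of this kind can be reduced to a string check. The sketch below assumes each CLAUDE.md carries a unique token plus an instruction to echo it; the helper names and token format are hypothetical, and the step that actually invokes the headless run is left out.

```javascript
// Sketch only: names and token format are illustrative, not the real test.
function makeMarkers(runTag) {
  return {
    variant: `CCB-MARK-VARIANT-${runTag}`, // planted in the variant CLAUDE.md
    global: `CCB-MARK-GLOBAL-${runTag}`, // planted in the real global CLAUDE.md
  };
}

// Which config sources demonstrably reached the model, judged by echoed tokens.
function sourcesSeen(output, markers) {
  return Object.keys(markers).filter((source) => output.includes(markers[source]));
}

// Isolated means: the variant was loaded AND the global file was not.
function isIsolated(output, markers) {
  const seen = sourcesSeen(output, markers);
  return seen.includes('variant') && !seen.includes('global');
}
```

Under this framing, the failure described above is an output where sourcesSeen returns both 'variant' and 'global'.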
The fix: run-bench.mjs swaps the real ~/.claude/CLAUDE.md per variant batch — timestamped backup to ~/.claude/backups/ first, restore guaranteed via finally plus signal handlers. Consequence: no interactive Claude Code sessions while a batch runs.
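The swap-and-restore pattern might look roughly like this. It is a sketch under assumptions, not the code in run-bench.mjs: withSwappedFile, its parameters, and the backup file naming are hypothetical, and it assumes the live file already exists.

```javascript
// Sketch only. Dynamic import() keeps the snippet usable from both
// CommonJS and ES module files.
async function withSwappedFile(livePath, variantPath, backupDir, runBatch) {
  const fs = await import('node:fs');
  const path = await import('node:path');

  // 1. Timestamped backup before touching anything.
  fs.mkdirSync(backupDir, { recursive: true });
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  const backupPath = path.join(backupDir, `CLAUDE.md.${stamp}`);
  fs.copyFileSync(livePath, backupPath);

  // 2. Idempotent restore, shared by the finally block and signal handlers.
  let restored = false;
  const restore = () => {
    if (restored) return;
    fs.copyFileSync(backupPath, livePath);
    restored = true;
  };
  const onSignal = () => {
    restore();
    process.exit(130); // conventional exit code after Ctrl+C
  };
  process.once('SIGINT', onSignal);
  process.once('SIGTERM', onSignal);

  try {
    // 3. Install the variant, then run the whole batch against it.
    fs.copyFileSync(variantPath, livePath);
    return await runBatch();
  } finally {
    restore();
    process.removeListener('SIGINT', onSignal);
    process.removeListener('SIGTERM', onSignal);
  }
}
```

The restore uses synchronous copies on purpose, so it completes inside a signal handler before the process exits. A hard kill (SIGKILL) or power loss would still skip it, which is where the timestamped backup earns its keep.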
04 How to run it
cd ~/git/ccChat-general/cc-bench
node scripts/build-configs.mjs # regenerate variants from live config
node scripts/run-bench.mjs # 80 runs — Ctrl+C anytime, resumable
node scripts/judge.mjs results/<runId> # blind scoring pass
node scripts/report.mjs results/<runId> # the table

Budget guards baked in: 15-turn and 10-minute caps per run, per-file results so a batch stops and resumes across weekends without losing work.
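Resumability from per-file results can be sketched as "build the full matrix, subtract what already has a result file". The variant names come from the ablation described above; pendingRuns, the run id format, and the idea that completed ids are read from result filenames are assumptions for illustration.

```javascript
// Sketch only: id format and function name are hypothetical.
const VARIANTS = ['current', 'no-floor', 'no-style', 'ungoverned'];

// promptIds: ids of the single-turn core prompts.
// completedIds: run ids that already have a result file on disk.
function pendingRuns(promptIds, variants, completedIds) {
  const done = new Set(completedIds);
  const runs = [];
  // Variant-major order, so each variant needs only one config swap per batch.
  for (const variant of variants) {
    for (const promptId of promptIds) {
      const id = `${variant}__${promptId}`;
      if (!done.has(id)) runs.push({ id, variant, promptId });
    }
  }
  return runs;
}
```

With 20 single-turn prompts and 4 variants this yields the 80 runs quoted in the command comment, and a batch interrupted after any number of runs simply sees a shorter list next time.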
05 What's left
- Workspace, prompt set, all four pipeline scripts, config variants
- Isolation mechanism verified live (and its leak worked around)
- Sonnet 5 research — Fable-like trait is orchestration, not prose; token-hungry, only Low/Medium tiers decent value
- Pilot batch on weekend subscription leftovers → measure real burn per run
- Mike blind-ranks ~10 pairs to calibrate the judge
- Multi-turn drift runner (3 scripted sequences, --resume chaining)
- Later: model comparison (Opus 4.8 vs Sonnet 5) on the winning config