Session readout

cc-bench:
benchmarking the config, not the vibes

Updated: 2026-07-02· Version 1

A personal benchmark suite that replaces the claude.ai A/B with a blind, replayable ablation — is the Object Floor actually earning its place in CLAUDE.md?

01Why

The earlier claude.ai A/B/C test of the Object Floor prompt wasn't fair: not blind, one task type, single-turn. Meanwhile the measured Opus 4.6→4.8 regression (from the blind corpus analysis) is volume + hedging — +44% words/turn, 6–14× more hedging phrases — and grows over a session, which one-shot tests can't see.

cc-bench replays identical real prompts against config variants and scores them two ways, kept strictly separate: scripted objective metrics against Opus 4.6's measured setpoints, and a blind per-axis judge calibrated against Mike's own blind rankings.

02What was built

23
Prompts curated
from 52 mined / 7,015 raw
4
Config variants
current · no-floor · no-style · ungoverned
80
Runs per batch
20 singles × 4 variants × 1 sample
  • cc-bench
    • plan.mdcheckbox plan — the source of truth
    • setpoints.jsonOpus 4.6 measured targets
    • prompts
      • set.jsoncurated core set (20 + 3 multi-turn)
      • candidates.json49-candidate reserve, mined from history
    • scripts
      • build-configs.mjsgenerates CLAUDE.md variants from live global
      • run-bench.mjsmatrix runner — swap, cap, resume, restore
      • judge.mjsblind per-axis scoring, shuffled + label-stripped
      • report.mjsvariants vs setpoints + judge axes in one table
      • metrics.mjsrebuilt corpus extractor (originals were wiped)
    • research
      • sonnet5-notes.mdvideo synthesis — Sonnet 5 traits

Judge axes (1–5, per axis, no composite yet): object contact, non-concession, prose quality, no-sycophancy, proportionality. Objective metrics: words/turn, bold/turn, hedging counts — targets are 4.6's numbers (63.5 w/t, 1.13 bold/t, ~9 hedges per 402 turns).

03The isolation gotcha

Warning

A marker test showed headless runs receiving both the variant CLAUDE.md and the real global one. Even HOME=/fake didn't stop it — the binary resolves the actual home for user memory (while auth follows $HOME). An "ungoverned" run would have been silently governed, invalidating the whole ablation while looking fine.

The fix: run-bench.mjs swaps the real ~/.claude/CLAUDE.md per variant batch — timestamped backup to ~/.claude/backups/ first, restore guaranteed via finally plus signal handlers. Consequence: no interactive Claude Code sessions while a batch runs.

04How to run it

Bash
cd ~/git/ccChat-general/cc-bench
node scripts/build-configs.mjs            # regenerate variants from live config
node scripts/run-bench.mjs                # 80 runs — Ctrl+C anytime, resumable
node scripts/judge.mjs results/<runId>    # blind scoring pass
node scripts/report.mjs results/<runId>   # the table

Budget guards baked in: 15-turn and 10-minute caps per run, per-file results so a batch stops and resumes across weekends without losing work.

05What's left