ADR-046 — Golden dataset versioning (immutable, append-only)

Status · Accepted
Date · 2026-04-28
TA anchor · /components/eval
Related RFC · RFC-017
Inspired by · ADR-026 in chipi/podcast_scraper

Context

The eval harness (per RFC-017) runs the autonomous Mode B agent against a curated set of canonical scenarios — the "golden dataset." Each scenario is a (brief, taste fragments, raw, expected outcome) tuple. The harness produces metrics per run; comparing metrics across runs is what makes auto-research possible.
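
Concretely, the harness could represent one scenario as a small record along these lines (a minimal sketch; the class and field names are illustrative assumptions, not a fixed schema):

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class GoldenScenario:
    """One golden scenario: the (brief, taste fragments, raw, expected outcome) tuple."""
    scenario_id: str                 # e.g. "001_iguana_warm"
    brief: str                       # the curated brief, verbatim
    taste_fragments: list[str]       # taste fragments supplied to the agent
    raw_path: Path                   # path to the raw file, resolved at run time
    expected_primitives: list[str]   # the expected outcome, recorded as expected primitives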

Cross-run comparison is meaningful only if the inputs stay constant. If scenario 3's brief gets edited between run A and run B, "did the new prompt produce better metrics on scenario 3?" becomes unanswerable — better than what? The brief changed. The same applies to taste fragments, expected primitives, and the raw file itself.

The pattern in chipi/podcast_scraper's ADR-026 (explicit golden dataset versioning) addresses this: golden datasets are versioned, and once shipped, a version is frozen forever. Improvements ship as the next version.

Decision

Golden datasets are versioned (golden_v1, golden_v2, …) and immutable once shipped.

Layout:

data/eval/
├── golden_v1/                 # frozen forever once shipped
│   ├── manifest.json          # which scenarios + metric weights for this version
│   ├── scenarios/
│   │   ├── 001_iguana_warm/
│   │   ├── 002_manta_blue/
│   │   └── ...
│   └── README.md              # what's in this version + curation notes
├── golden_v2/                 # next version when v1's coverage proves insufficient
└── ...

Once golden_v1/ is shipped (a tagged release of the dataset, declared in manifest.json), no file under golden_v1/ may be edited. Improvements — better scenarios, corrected briefs, tighter expected-primitives lists, additional scenarios — go into golden_v2/.

Eval runs declare which golden version they ran against (per ADR-047's run manifests). Comparisons are valid only across runs against the same golden version.
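
A comparison step in the harness can guard this invariant directly. A minimal sketch, assuming each run's manifest (per ADR-047) records the golden version under a golden_version key (the key name is an assumption):

import json
from pathlib import Path

def assert_comparable(run_a: Path, run_b: Path) -> None:
    """Refuse to compare two eval runs unless they targeted the same golden version."""
    version_a = json.loads((run_a / "manifest.json").read_text())["golden_version"]
    version_b = json.loads((run_b / "manifest.json").read_text())["golden_version"]
    if version_a != version_b:
        raise ValueError(
            f"Not comparable: {run_a.name} ran against {version_a}, "
            f"{run_b.name} against {version_b}; metrics only compare "
            "across runs on the same golden version."
        )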

Rationale

  • Immutability is what makes auto-research work. "Did prompt v3 beat v2?" requires the same inputs. Mutable scenarios destroy the comparison.
  • Append-only matches ADR-style discipline already used elsewhere in the project. Same mental model as ADRs themselves (once accepted, frozen) and prompt versions (per ADR-043).
  • Versioning by directory (not by git tag) keeps the comparison-relevant identifier visible. A scenario can be referenced unambiguously as golden_v1/scenarios/003_evening_pelagic; a git-tag-based version would require checking out historical state to inspect it.
  • Multiple golden versions can coexist. New ML iterations might benefit from older scenarios (regression tests) AND newer ones (broader coverage). Both are accessible.
  • Curation effort is real and shouldn't be lost. Scenarios represent careful work — selecting representative cases, capturing photographer intent, recording expected primitives. Versioning preserves that work even when the dataset evolves.

Alternatives considered

  • Mutable golden dataset, git tag the release points: comparing runs requires checking out historical state. Awkward; a manifest in the run output should be self-contained.
  • Single golden dataset, never updated: loses coverage as the project grows. Doesn't scale.
  • No golden dataset (per-run scenarios): no comparability between runs. Defeats RFC-017's auto-research goal.
  • Versioning per scenario (each scenario carries its own version): makes sense if scenarios evolve independently, but in practice the dataset evolves as a unit (a new release adds 3 scenarios and tightens 2 existing ones, which is one version bump). Simpler to version the set.

Consequences

Positive:

  • Cross-run metric comparison is sound
  • Curation effort is preserved across iterations
  • Multiple golden versions coexist for different testing needs (regression vs new coverage)
  • Reproducibility is honest — a months-old run manifest references a specific golden version that still exists in the tree

Negative:

  • Repository size grows with each golden version (mitigation: raws may live elsewhere via content-hash pointers; the markdown/TOML scenarios are tiny)
  • "Which version should I run against?" becomes a small choice for the eval harness user (mitigation: default to the latest, allow override)
  • Authoring overhead — improving a scenario means creating a new version, not editing in place (the discipline is the point)
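
One way the harness might resolve that default (a sketch; the function and its signature are hypothetical, the data/eval layout is from this ADR):

from pathlib import Path

def resolve_golden_version(root: Path = Path("data/eval"), override: str | None = None) -> Path:
    """Pick the golden dataset to run against: an explicit override if given, else the latest version."""
    if override is not None:
        return root / override  # e.g. "golden_v1" for a regression comparison
    versions = sorted(
        root.glob("golden_v*"),
        key=lambda p: int(p.name.removeprefix("golden_v")),
    )
    if not versions:
        raise FileNotFoundError(f"no golden_v* directories under {root}")
    return versions[-1]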

Implementation notes

  • data/eval/golden_v1/manifest.json declares the version's scope:
{
  "version": "v1",
  "shipped_at": "2026-09-15",
  "shipped_at_git_sha": "...",
  "scenarios": ["001_iguana_warm", "002_manta_blue", ...],
  "metric_weights": {
    "vocab_purity": 1.0,
    "expected_primitives_used": 0.8,
    "brief_alignment": 0.6
  },
  "notes": "Phase 5 launch dataset. 8 scenarios spanning underwater wildlife, cold pelagic, mixed light."
}
  • A CI check (scripts/verify-golden-immutable.sh) compares each golden_vN/ directory against its shipped state (via the git tag matching the version) and fails if any file under a shipped version has changed (sketched after these notes).
  • Raw files (.NEF, .ARW, …) are too large to commit. Each scenario's scenario.toml records a SHA-256 of the raw; the actual file lives at data/eval/golden_v1/scenarios/<id>/raw.NEF for users who have a local copy, OR at a configurable CHEMIGRAM_GOLDEN_RAWS_DIR for users with raws stored elsewhere. If neither is present, scenarios skip gracefully with an explicit warning (see the resolution sketch after these notes).
  • The convention is: golden_v1 is shipped at the close of Phase 5's first auto-research milestone. Phase 1 ships the eval harness design only (RFC-017, this ADR, and ADR-047), no scenarios.
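
The immutability check itself can be little more than a git diff against the release tag. A sketch of the logic, written in Python for consistency with the other sketches in this ADR and assuming the dataset tag is named after the version (e.g. golden_v1); unshipped versions, which have no tag yet, are skipped:

import subprocess
import sys
from pathlib import Path

def tag_exists(tag: str) -> bool:
    """True if the dataset release tag exists; unshipped versions have no tag yet."""
    result = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", f"refs/tags/{tag}"],
        capture_output=True,
    )
    return result.returncode == 0

def is_unmodified(version_dir: Path, tag: str) -> bool:
    """True if no tracked file under the version differs from its tagged state."""
    # `git diff --quiet <tag> -- <dir>` exits non-zero when the working tree differs from the tag.
    result = subprocess.run(["git", "diff", "--quiet", tag, "--", str(version_dir)])
    return result.returncode == 0

if __name__ == "__main__":
    dirty = []
    for version_dir in sorted(Path("data/eval").glob("golden_v*")):
        tag = version_dir.name  # assumption: the release tag shares the version's name
        if tag_exists(tag) and not is_unmodified(version_dir, tag):
            dirty.append(version_dir.name)
    if dirty:
        print("shipped golden versions modified: " + ", ".join(dirty), file=sys.stderr)
        sys.exit(1)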
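
The raw-resolution order described above might look like this in the harness (a sketch; the scenario.toml keys and the exact skip behaviour are assumptions, while CHEMIGRAM_GOLDEN_RAWS_DIR is from this ADR):

import hashlib
import os
import tomllib
import warnings
from pathlib import Path

def resolve_raw(scenario_dir: Path) -> Path | None:
    """Locate a scenario's raw: local copy first, then CHEMIGRAM_GOLDEN_RAWS_DIR, else skip with a warning."""
    meta = tomllib.loads((scenario_dir / "scenario.toml").read_text())
    raw_name = meta["raw"]["filename"]    # assumed key, e.g. "raw.NEF"
    expected_sha = meta["raw"]["sha256"]  # SHA-256 recorded at curation time

    candidates = [scenario_dir / raw_name]
    if raws_dir := os.environ.get("CHEMIGRAM_GOLDEN_RAWS_DIR"):
        candidates.append(Path(raws_dir) / raw_name)

    for candidate in candidates:
        if candidate.exists():
            actual_sha = hashlib.sha256(candidate.read_bytes()).hexdigest()
            if actual_sha != expected_sha:
                raise ValueError(f"{candidate} does not match the SHA-256 recorded in scenario.toml")
            return candidate

    warnings.warn(f"raw for {scenario_dir.name} not found locally or via CHEMIGRAM_GOLDEN_RAWS_DIR; skipping scenario")
    return None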