ADR-047 — Run manifests for eval reproducibility¶
Status · Accepted Date · 2026-04-28 TA anchor · /components/eval Related RFC · RFC-017 Inspired by · Issue #379 in chipi/podcast_scraper
Context¶
The eval harness (per RFC-017) produces metrics per run. Without capturing the full system context that produced those metrics, comparing runs across time becomes guesswork — "did prompt v3 win because the prompt was better, or because we upgraded the model that day, or because darktable shipped a new release?" The auto-research workflow is honest only if every run carries enough context to be re-explained later.
The reference pattern is chipi/podcast_scraper's Issue #379 work: every run writes a manifest capturing system state (Python version, OS, GPU info, model versions, git commit SHA, config hash) plus per-stage timings and metrics. Comparison across runs is then a matter of inspecting manifests, not reconstructing missing context.
Decision¶
Every eval run writes a JSON manifest capturing the full set of identifiers, configurations, and outputs that produced the metrics. The manifest is the unit of comparison — running one manifest against another (e.g., "v2 run vs v3 run") is what auto-research is.
Manifest schema:
{
"run_id": "2026-09-20T14:30:00Z",
"timestamp": "2026-09-20T14:30:00Z",
"git_sha": "abc123def456...",
"git_dirty": false,
"golden_version": "v1",
"prompt_versions": {
"mode_b/system": "v1",
"mode_b/plan": "v3",
"mode_b/evaluate": "v1",
"mode_b/refine": "v1"
},
"model_config": {
"provider": "anthropic",
"model": "claude-opus-4-7",
"temperature": 0.7,
"max_tokens": 4096
},
"system": {
"python_version": "3.12.3",
"platform": "Darwin 24.0.0 arm64",
"darktable_version": "5.4.1",
"chemigram_version": "0.5.0",
"config_hash": "sha256:..."
},
"scenarios_run": [
"001_iguana_warm",
"002_manta_blue",
"003_evening_pelagic"
],
"metrics": {
"001_iguana_warm": {
"read_context_first": true,
"vocab_purity": 1.0,
"expected_primitives_used": 0.67,
"forbidden_primitives_used": 0,
"candidate_count": 3,
"tool_call_count": 18,
"duration_seconds": 47.2
},
"002_manta_blue": { ... },
"003_evening_pelagic": { ... }
},
"metric_summary": {
"vocab_purity_mean": 0.92,
"expected_primitives_used_mean": 0.71,
"forbidden_primitives_used_total": 0,
"tool_call_count_mean": 19.3
},
"duration_seconds": 287.3,
"total_tool_calls": 142,
"errors": []
}
Manifests are written to metrics/runs/<run_id>.json and appended (one line each) to metrics/eval_history.jsonl for time-series analysis.
Rationale¶
- The manifest IS the run. Without it, a metrics number is a number with no provenance. With it, six months later we can answer "was that 0.92 vocab_purity from prompt v2 or v3?"
- Captures everything that varied. Git SHA, prompt versions, model config, system info, golden version. If one of these changed between runs, the manifest shows it.
config_hashcovers what the explicit fields don't. A run with a custom config (different vocabulary scope, different masker, different threshold) hashes the config; comparison runs can verify they used the same config without comparing every field by hand.- Per-scenario metrics + summary. Per-scenario lets you see "scenario 4 regressed badly even though the mean is fine"; summary gives the headline number.
- Append-only
eval_history.jsonlenables trends. A scatter-plot ofvocab_purity_meanover time, colored byprompt_versions["mode_b/plan"], tells the story of the prompt's evolution. - Errors as a top-level field. Failed scenarios stay visible in the manifest rather than disappearing as silent gaps.
Alternatives considered¶
- No manifest (just metrics): loses the ability to explain runs after the fact. The metrics-without-context problem.
- Manifest as YAML or TOML: JSON is the standard for machine-read structured data; toolchain (
jq, etc.) is rich. JSON wins. - Manifest in a database: considered (sqlite for the history). Overkill for v1; flat files are fine until query patterns demand otherwise. JSONL gives append-friendly + grep-friendly + jq-friendly.
- Embedded in the metrics file (no separate manifest): considered. Splitting "what produced this" from "what was measured" is cleaner; per-scenario metrics nest naturally under the run-level manifest.
- Less detail (just prompt versions and metrics): loses the ability to debug "did the model change?" Auto-research depends on knowing all the inputs, not just the named ones.
Consequences¶
Positive:
- Auto-research is honest: comparisons reference real manifests, not vague memory
- Time-series analysis of metric evolution is possible (eval_history.jsonl)
- Months-old metrics are still explainable
- Failed runs stay visible, not silent
- Standard tooling (
jq, etc.) works directly
Negative:
- Per-run disk overhead (small — a manifest is ~5–20 KB)
- Discipline required: developers must run eval through the harness, not ad-hoc, to get a manifest. Mitigation: the harness IS the canonical interface; ad-hoc running shouldn't happen in the auto-research loop.
Implementation notes¶
- Manifest writing happens in
chemigram.eval.manifest. The eval runner (per RFC-017) calls this at the end of every run. metrics/runs/andmetrics/eval_history.jsonlare gitignored by default — they're outputs, not source. Specific runs that establish baselines or document major decisions can be committed explicitly (e.g.,metrics/baselines/golden_v1_first_run.json).- Config hashing uses SHA-256 of the canonicalized config (sorted keys, no whitespace). Implementation lives alongside
chemigram.coreconfig loading. - The
systemsection is captured at run start using stdlib (platform,sys) plus adarktable-cli --versionsubprocess call. - Run IDs are ISO 8601 UTC timestamps with seconds precision (
2026-09-20T14:30:00Z). Collision is unlikely; if it happens, the second run gets a_2suffix. - Eval comparison utilities (
chemigram.eval.reports) load two manifests and produce a diff report — what changed, what stayed the same, which metrics moved.