ADR-047 — Run manifests for eval reproducibility

Status · Accepted
Date · 2026-04-28
TA anchor · /components/eval
Related RFC · RFC-017
Inspired by · Issue #379 in chipi/podcast_scraper

Context

The eval harness (per RFC-017) produces metrics per run. Without capturing the full system context that produced those metrics, comparing runs across time becomes guesswork — "did prompt v3 win because the prompt was better, or because we upgraded the model that day, or because darktable shipped a new release?" The auto-research workflow is honest only if every run carries enough context to be re-explained later.

The reference pattern is chipi/podcast_scraper's Issue #379 work: every run writes a manifest capturing system state (Python version, OS, GPU info, model versions, git commit SHA, config hash) plus per-stage timings and metrics. Comparison across runs is then a matter of inspecting manifests, not reconstructing missing context.

Decision

Every eval run writes a JSON manifest capturing the full set of identifiers, configurations, and outputs that produced the metrics. The manifest is the unit of comparison — weighing one run's manifest against another's (e.g., "v2 run vs v3 run") is what auto-research is.

Manifest schema:

{
  "run_id": "2026-09-20T14:30:00Z",
  "timestamp": "2026-09-20T14:30:00Z",
  "git_sha": "abc123def456...",
  "git_dirty": false,
  "golden_version": "v1",
  "prompt_versions": {
    "mode_b/system": "v1",
    "mode_b/plan": "v3",
    "mode_b/evaluate": "v1",
    "mode_b/refine": "v1"
  },
  "model_config": {
    "provider": "anthropic",
    "model": "claude-opus-4-7",
    "temperature": 0.7,
    "max_tokens": 4096
  },
  "system": {
    "python_version": "3.12.3",
    "platform": "Darwin 24.0.0 arm64",
    "darktable_version": "5.4.1",
    "chemigram_version": "0.5.0",
    "config_hash": "sha256:..."
  },
  "scenarios_run": [
    "001_iguana_warm",
    "002_manta_blue",
    "003_evening_pelagic"
  ],
  "metrics": {
    "001_iguana_warm": {
      "read_context_first": true,
      "vocab_purity": 1.0,
      "expected_primitives_used": 0.67,
      "forbidden_primitives_used": 0,
      "candidate_count": 3,
      "tool_call_count": 18,
      "duration_seconds": 47.2
    },
    "002_manta_blue": { ... },
    "003_evening_pelagic": { ... }
  },
  "metric_summary": {
    "vocab_purity_mean": 0.92,
    "expected_primitives_used_mean": 0.71,
    "forbidden_primitives_used_total": 0,
    "tool_call_count_mean": 19.3
  },
  "duration_seconds": 287.3,
  "total_tool_calls": 58,
  "errors": []
}

Manifests are written to metrics/runs/<run_id>.json and appended (one line each) to metrics/eval_history.jsonl for time-series analysis.

Rationale

  • The manifest IS the run. Without it, a metrics number is a number with no provenance. With it, six months later we can answer "was that 0.92 vocab_purity from prompt v2 or v3?"
  • Captures everything that varied. Git SHA, prompt versions, model config, system info, golden version. If one of these changed between runs, the manifest shows it.
  • config_hash covers what the explicit fields don't. A run with a custom config (different vocabulary scope, different masker, different threshold) hashes the config; comparison runs can verify they used the same config without comparing every field by hand.
  • Per-scenario metrics + summary. Per-scenario lets you see "scenario 4 regressed badly even though the mean is fine"; summary gives the headline number.
  • Append-only eval_history.jsonl enables trends. A scatter-plot of vocab_purity_mean over time, colored by prompt_versions["mode_b/plan"], tells the story of the prompt's evolution; a minimal loader is sketched after this list.
  • Errors as a top-level field. Failed scenarios stay visible in the manifest rather than disappearing as silent gaps.

Alternatives considered

  • No manifest (just metrics): loses the ability to explain runs after the fact. The metrics-without-context problem.
  • Manifest as YAML or TOML: JSON is the standard for machine-readable structured data, and the toolchain (jq, etc.) is rich. JSON wins.
  • Manifest in a database: considered (sqlite for the history). Overkill for v1; flat files are fine until query patterns demand otherwise. JSONL is append-friendly, grep-friendly, and jq-friendly.
  • Embedded in the metrics file (no separate manifest): considered. Splitting "what produced this" from "what was measured" is cleaner; per-scenario metrics nest naturally under the run-level manifest.
  • Less detail (just prompt versions and metrics): loses the ability to debug "did the model change?" Auto-research depends on knowing all the inputs, not just the named ones.

Consequences

Positive:

  • Auto-research is honest: comparisons reference real manifests, not vague memory
  • Time-series analysis of metric evolution is possible (eval_history.jsonl)
  • Months-old metrics are still explainable
  • Failed runs stay visible, not silent
  • Standard tooling (jq, etc.) works directly

Negative:

  • Per-run disk overhead (small — a manifest is ~5–20 KB)
  • Discipline required: developers must run eval through the harness, not ad-hoc, to get a manifest. Mitigation: the harness IS the canonical interface; ad-hoc running shouldn't happen in the auto-research loop.

Implementation notes

  • Manifest writing happens in chemigram.eval.manifest. The eval runner (per RFC-017) calls this at the end of every run.
  • metrics/runs/ and metrics/eval_history.jsonl are gitignored by default — they're outputs, not source. Specific runs that establish baselines or document major decisions can be committed explicitly (e.g., metrics/baselines/golden_v1_first_run.json).
  • Config hashing uses SHA-256 of the canonicalized config (sorted keys, no whitespace). Implementation lives alongside chemigram.core config loading.
  • The system section is captured at run start using stdlib (platform, sys) plus a darktable-cli --version subprocess call. Both helpers are sketched after this list.
  • Run IDs are ISO 8601 UTC timestamps with seconds precision (2026-09-20T14:30:00Z). Collision is unlikely; if it happens, the second run gets a _2 suffix.
  • Eval comparison utilities (chemigram.eval.reports) load two manifests and produce a diff report — what changed, what stayed the same, which metrics moved. A bare-bones version is included in the sketch below.