ADR-087: Autoresearch Track A v2 — Dev or Held-Out Split and Judging¶
- Status: Accepted
- Date: 2026-05-08
- Authors: Podcast Scraper Team
- Related RFCs: RFC-073 (Autoresearch v2), RFC-057
- Related ADRs: ADR-073 (v1 closure and champions)
Context & Problem Statement¶
RFC-057 Track A (prompt tuning) shipped a working ratchet, but v1 had structural issues: smoke ⊂
benchmark contaminated validation; binary-OR judge contestation flipped entire runs on a
single noisy episode; temperature=0 on OpenAI was not determinism; bundled JSON formatting
biased judges. Champions could look strong on dev and fail on slightly different held data.
Decision¶
Track A adopts a v2 framework (thin harness changes + new datasets or silvers), with these immutable rules:
-
Disjoint dev vs held-out episodes — Iteration scores on
curated_5feeds_dev_v1(e01+e02 per feed); champion validation oncurated_5feeds_benchmark_v2(e03 per feed only). Zero overlap between tuning and final validation sets. v1 datasets remain frozen for history (ADR-073). -
Fraction-based contestation — A run is contested only when more than 40% of scored episodes exceed the judge delta threshold (replacing v1 any-episode OR).
-
Efficiency rubric dimension — Replaces Conciseness so longer, information-dense summaries are not penalised by default.
-
JSON prose extraction before judging — Bundled outputs are flattened to plain prose (title, summary, bullets) before judge calls so formatting does not dominate scores.
-
Seed plumbing (partial mitigation) — OpenAI
seedis wired through config and factory for summarization experiments; documented as variance reduction, not byte-identical reproducibility. -
Non-goals unchanged — Track B (ML parameter sweep) stays on RFC-057 machinery; v2 does not add multi-run averaging or statistical intervals in the first shipping slice.
Rationale¶
- Honest generalisation signal — Held-out catches overfitting that shared-episode validation hides.
- Stable promotion — Fraction contestation reduces judge-noise false rejects.
- Preserves v1 artifacts — ADR-073 remains the audit trail for shipped v1 champions; v2 runs alongside without rewriting history.
Alternatives Considered¶
- Keep v1 split and only add seeds — Insufficient; subset contamination remained.
- Larger held-out only, no rubric change — Improves signal but leaves length bias and JSON judge artifacts.
- Replace ROUGE entirely with LLM judges — Rejected for v2 scope; rubric + ROUGE hybrid stays per RFC-073.
Consequences¶
- Positive: New champion work must cite dev + held-out scores with the v2 config family.
- Negative: More datasets and silver references to maintain; operators must use the v2 YAML matrix from the RFC.
- Neutral: Future statistical tightening is explicitly deferred in the RFC.
Implementation Notes¶
- Code:
autoresearch/(orchestrator, scoring),data/eval/configs/**/autoresearch_*_v2.yaml, providerseedwiring as documented in RFC-073 - Ops:
autoresearch/README.md,autoresearch/JUDGING.md