ADR-040: Explicit Golden Dataset Versioning
- Status: Accepted
- Date: 2026-01-11
- Authors: Podcast Scraper Team
- Related RFCs: RFC-041
- Related PRDs: PRD-007
Context & Problem Statement
Golden data (human-verified transcripts or high-quality summaries) is the source of truth for evaluation. If golden data is regenerated or modified silently, scores recorded against the earlier data are no longer comparable with new ones, so all historical benchmarks become invalid.
Decision
We enforce Explicit Golden Dataset Versioning.
- Golden datasets are stored in versioned folders (e.g., data/eval/golden/indicator_v1/).
- Creation or updates to a golden dataset require a separate, manual-approval pipeline (make golden).
- Once a version is tagged and used in a baseline, it is "frozen" and never modified.
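One way the "frozen" rule could be checked mechanically is with a checksum manifest written when a version is frozen. The sketch below is illustrative only: the manifest file name and the functions build_manifest, freeze, and verify_frozen are hypothetical and are not part of the make golden pipeline described in this ADR.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical manifest file name; the ADR does not specify one.
MANIFEST_NAME = "manifest.json"


def build_manifest(version_dir: Path) -> dict:
    """Map each file's path (relative to version_dir) to the SHA-256 of its bytes."""
    manifest = {}
    for path in sorted(version_dir.rglob("*")):
        if path.is_file() and path.name != MANIFEST_NAME:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(version_dir).as_posix()] = digest
    return manifest


def freeze(version_dir: Path) -> None:
    """Write the manifest once; refuse to overwrite an existing one."""
    manifest_path = version_dir / MANIFEST_NAME
    if manifest_path.exists():
        raise FileExistsError(f"{version_dir} is already frozen")
    manifest = build_manifest(version_dir)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))


def verify_frozen(version_dir: Path) -> list:
    """Return relative paths that were added, removed, or changed since freezing."""
    recorded = json.loads((version_dir / MANIFEST_NAME).read_text())
    current = build_manifest(version_dir)
    changed = {
        path
        for path in recorded.keys() | current.keys()
        if recorded.get(path) != current.get(path)
    }
    return sorted(changed)
```

Under these assumptions, a CI job could call verify_frozen(Path("data/eval/golden/indicator_v1")) and fail the build whenever the returned list is non-empty, so any change to ground truth has to land as a new versioned folder.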
Rationale
- Stability: Guarantees that benchmark scores are comparable over months of development.
- Rigor: Prevents "gaming the metrics" by accidentally updating the golden reference to match a specific model's output.
- Auditability: We always know exactly which version of ground truth was used for a specific project release.
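The stability and auditability points can be made concrete by storing the golden version alongside every benchmark score and refusing to compare scores across versions. This is a minimal sketch under assumed names: BenchmarkResult, score_delta, the release tags, and the score values are all hypothetical and do not come from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """One benchmark score together with the ground truth it was measured against."""
    release: str          # project release tag (hypothetical values below)
    golden_version: str   # name of the versioned golden folder, e.g. "indicator_v1"
    score: float


def score_delta(baseline: BenchmarkResult, candidate: BenchmarkResult) -> float:
    """Return candidate minus baseline, but only if both used the same golden version."""
    if baseline.golden_version != candidate.golden_version:
        raise ValueError(
            f"Scores are not comparable: {baseline.golden_version!r} "
            f"vs {candidate.golden_version!r}"
        )
    return candidate.score - baseline.score


# Illustrative values only.
old = BenchmarkResult(release="v1.0.0", golden_version="indicator_v1", score=0.80)
new = BenchmarkResult(release="v1.1.0", golden_version="indicator_v1", score=0.85)
delta = score_delta(old, new)
```

Keeping the version in the result record means an audit only needs the stored result, not the state of the repository at the time of the run.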
Alternatives Considered
- Live Golden Data: Rejected as it invalidates historical comparisons whenever a typo is fixed in a ground-truth transcript.
Consequences
- Positive: Stable evaluation foundation, since benchmark scores stay comparable across releases; clear version history for ground truth.
- Negative: Requires managing multiple versions of evaluation data.