# ADR-018: Codified Comparison Baselines
- Status: Accepted
- Date: 2026-01-11
- Updated: 2026-01-16
- Authors: Podcast Scraper Team
- Related RFCs: RFC-015, RFC-041
- Related PRDs: PRD-007
## Context & Problem Statement
When evaluating a new model or prompt, "better" is often subjective. Without a stable, codified baseline, it is impossible to determine if a change is a genuine improvement or a regression in disguise.
## Decision
We mandate Codified Comparison Baselines.
- Every experiment and benchmark MUST reference a specific `baseline_id` (e.g., `bart_led_baseline_v2`).
- A baseline is a frozen artifact directory containing metadata, predictions, and calculated metrics.
- The system prevents comparisons between experiments using different datasets or mismatched baselines.
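The mismatch guard in the last bullet can be sketched as a small validation step. This is an illustrative sketch, not the project's actual implementation: the `ExperimentRun` dataclass and `assert_comparable` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentRun:
    """Minimal experiment descriptor (hypothetical shape for illustration)."""
    run_id: str
    baseline_id: str
    dataset_id: str


def assert_comparable(a: ExperimentRun, b: ExperimentRun) -> None:
    """Refuse to compare runs that reference different baselines or datasets."""
    if a.baseline_id != b.baseline_id:
        raise ValueError(f"Baseline mismatch: {a.baseline_id!r} vs {b.baseline_id!r}")
    if a.dataset_id != b.dataset_id:
        raise ValueError(f"Dataset mismatch: {a.dataset_id!r} vs {b.dataset_id!r}")
```

Calling `assert_comparable` before any metric diff makes the "apples to apples" rule a hard error rather than a convention.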
Implementation: Baselines are stored in `data/eval/baselines/{baseline_id}/` with:

- `predictions.jsonl` - All episode outputs in structured format
- `metrics.json` - Aggregated performance and quality metrics
- `fingerprint.json` - Complete system fingerprint for reproducibility
- `baseline.json` - Baseline metadata and statistics
- `config.yaml` - Experiment configuration used
- `README.md` - Baseline purpose and replacement policy
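Reading such a directory back is straightforward. The following is a minimal sketch assuming only the layout above; the `load_baseline` function name is hypothetical.

```python
import json
from pathlib import Path


def load_baseline(baseline_id: str, root: Path = Path("data/eval/baselines")):
    """Load the frozen metrics and per-episode predictions for a baseline."""
    base = root / baseline_id
    # metrics.json is a single JSON object of aggregated scores.
    metrics = json.loads((base / "metrics.json").read_text())
    # predictions.jsonl holds one JSON record per episode.
    predictions = [
        json.loads(line)
        for line in (base / "predictions.jsonl").read_text().splitlines()
        if line.strip()
    ]
    return metrics, predictions
```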
Baselines are created via a promotion workflow (`make run-promote --as baseline`) from temporary runs, ensuring immutability and explicit intent.
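The promotion step can be pictured as copying a run directory into the baselines tree and stripping write permissions. This is a sketch of the idea only, under the assumption that promotion is a copy-then-freeze operation; the `promote_run_to_baseline` helper is hypothetical, not the target of the `make` command above.

```python
import shutil
import stat
from pathlib import Path


def promote_run_to_baseline(
    run_dir: Path,
    baseline_id: str,
    baselines_root: Path = Path("data/eval/baselines"),
) -> Path:
    """Copy a temporary run into the baselines tree and freeze it read-only."""
    target = baselines_root / baseline_id
    if target.exists():
        # Baselines are immutable: refuse to overwrite an existing one.
        raise FileExistsError(f"Baseline {baseline_id!r} already exists")
    shutil.copytree(run_dir, target)
    for path in target.rglob("*"):
        if path.is_file():
            # Clear all write bits so the artifact cannot drift after promotion.
            mode = path.stat().st_mode
            path.chmod(mode & ~stat.S_IWUSR & ~stat.S_IWGRP & ~stat.S_IWOTH)
    return target
```

Refusing to overwrite an existing `baseline_id` forces a new, explicit identifier (e.g., a `_v3` suffix) for every replacement, which keeps historical comparisons reproducible.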
## Rationale
- Objectivity: Moves from "this looks better" to "this improved ROUGE-L by 5% and reduced latency by 20% vs. the baseline."
- Integrity: Enforcing `baseline_id` ensures that developers are always comparing "apples to apples."
- Regression Detection: Enables automated CI checks that fail if a PR drops below the established baseline.
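A CI regression gate of the kind described above could look like the following sketch. It assumes higher-is-better metrics and a tolerance ratio; the `check_regression` function and the 98% threshold are illustrative choices, not values specified by this ADR.

```python
def check_regression(
    candidate: dict[str, float],
    baseline: dict[str, float],
    min_ratio: float = 0.98,
) -> list[str]:
    """Return failure messages for any metric that falls below
    min_ratio * baseline. An empty list means the gate passes."""
    failures: list[str] = []
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            failures.append(f"missing metric {name!r}")
        elif cand_value < min_ratio * base_value:
            failures.append(
                f"{name}: {cand_value:.4f} is below {min_ratio:.0%} "
                f"of baseline {base_value:.4f}"
            )
    return failures
```

A CI job would load `metrics.json` from both the candidate run and the referenced baseline, call `check_regression`, and fail the build if the returned list is non-empty.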
## Alternatives Considered
- Ad-hoc Comparison: Rejected as it leads to "baseline drift" where improvements are measured against outdated or unknown states.
## Consequences
- Positive: Clear regression signals; data-driven decision making; stable project quality targets.
- Negative: Requires a one-time effort to create and "freeze" baseline artifacts.