Skip to content

ADR-018: Codified Comparison Baselines

  • Status: Accepted
  • Date: 2026-01-11
  • Updated: 2026-01-16
  • Authors: Podcast Scraper Team
  • Related RFCs: RFC-015, RFC-041
  • Related PRDs: PRD-007

Context & Problem Statement

When evaluating a new model or prompt, "better" is often subjective. Without a stable, codified baseline, it is impossible to determine if a change is a genuine improvement or a regression in disguise.

Decision

We mandate Codified Comparison Baselines.

  • Every experiment and benchmark MUST reference a specific baseline_id (e.g., bart_led_baseline_v2).
  • A baseline is a frozen artifact directory containing metadata, predictions, and calculated metrics.
  • The system prevents comparisons between experiments using different datasets or mismatched baselines.

Implementation: Baselines are stored in data/eval/baselines/{baseline_id}/ with:

  • predictions.jsonl - All episode outputs in structured format
  • metrics.json - Aggregated performance and quality metrics
  • fingerprint.json - Complete system fingerprint for reproducibility
  • baseline.json - Baseline metadata and statistics
  • config.yaml - Experiment configuration used
  • README.md - Baseline purpose and replacement policy

Baselines are created via promotion workflow (make run-promote --as baseline) from temporary runs, ensuring immutability and explicit intent.

Rationale

  • Objectivity: Moves from "this looks better" to "this improved ROUGE-L by 5% and reduced latency by 20% vs. the baseline."
  • Integrity: Enforcing baseline_id ensures that developers are always comparing "apples to apples."
  • Regression Detection: Enables automated CI checks that fail if a PR drops below the established baseline.

Alternatives Considered

  1. Ad-hoc Comparison: Rejected as it leads to "baseline drift" where improvements are measured against outdated or unknown states.

Consequences

  • Positive: Clear regression signals; data-driven decision making; stable project quality targets.
  • Negative: Requires a one-time effort to create and "freeze" baseline artifacts.

References