ADR-041: Multi-Tiered Benchmarking Strategy¶
- Status: Accepted
- Date: 2026-01-11
- Authors: Podcast Scraper Team
- Related RFCs: RFC-041
Context & Problem Statement¶
Running a comprehensive benchmark suite (20+ episodes, multiple models) takes 30-60 minutes and is too slow for PR validation. Conversely, running only unit tests doesn't catch quality regressions in the AI pipeline.
Decision¶
We adopt a Multi-Tiered Benchmarking Strategy:
- Smoke Tests (PR Tier): Runs on every Pull Request. Uses a tiny, representative subset (3 episodes) and a single baseline config. Goal: Catch total pipeline breakages in <5 minutes.
- Full Benchmarks (Nightly Tier): Runs nightly on main. Uses the full dataset (20+ episodes) and multiple "stress case" configurations. Goal: Detect subtle quality or latency regressions.
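As a rough illustration, the two tiers above could be expressed as data, so that the same benchmark runner serves both. This is a minimal sketch, not the project's actual code: `BenchmarkTier`, `plan_runs`, the trigger names, and the stress configuration names are all hypothetical, and taking the first three episode IDs stands in for a hand-picked representative subset.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BenchmarkTier:
    """Hypothetical description of one benchmark tier."""
    name: str
    trigger: str                   # CI event that starts the tier (assumed names)
    max_episodes: Optional[int]    # None means "use the full dataset"
    configs: Tuple[str, ...]       # configuration names to run


# PR tier: 3 episodes, a single baseline config.
SMOKE = BenchmarkTier("smoke", "pull_request", 3, ("baseline",))

# Nightly tier: full dataset, baseline plus stress cases.
# The stress config names below are placeholders, not real project configs.
NIGHTLY = BenchmarkTier(
    "nightly", "schedule", None,
    ("baseline", "stress_case_a", "stress_case_b"),
)


def plan_runs(tier: BenchmarkTier, episodes: List[str]) -> List[Tuple[str, str]]:
    """Return the (config, episode) pairs a tier would execute."""
    # Assumption: a real smoke subset would be curated by hand;
    # slicing the sorted list is only a stand-in for that choice.
    ordered = sorted(episodes)
    subset = ordered if tier.max_episodes is None else ordered[:tier.max_episodes]
    return [(config, episode) for config in tier.configs for episode in subset]
```

With a 20-episode dataset, `plan_runs(SMOKE, episodes)` yields 3 runs while `plan_runs(NIGHTLY, episodes)` yields 60, which makes the cost gap between the tiers explicit.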
Rationale¶
- Developer Velocity: Smoke tests return feedback in under 5 minutes, so they do not bottleneck the PR queue.
- Thoroughness: Nightly runs ensure we don't miss long-term drift that only appears across a larger data sample.
- Cost Control: Avoids expensive API calls or massive GPU usage on every minor commit push.
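One way the nightly tier might flag drift is to compare each run's metrics against a stored baseline and report anything that moved past a tolerance. The sketch below is an assumption about how that could look: the function name `find_regressions`, the metric names, and the 5% tolerance are hypothetical and do not come from this ADR.

```python
from typing import Dict, Iterable


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,                       # assumed 5% relative tolerance
    lower_is_better: Iterable[str] = ("latency_s",),
) -> Dict[str, float]:
    """Return {metric: relative_change} for metrics that got worse than tolerance."""
    lower = set(lower_is_better)
    regressions: Dict[str, float] = {}
    for metric, base in baseline.items():
        # Skip metrics missing from the current run or with an unusable baseline.
        if metric not in current or base == 0:
            continue
        change = (current[metric] - base) / abs(base)
        # For latency-like metrics an increase is bad; for quality scores a drop is bad.
        worse = change > tolerance if metric in lower else change < -tolerance
        if worse:
            regressions[metric] = change
    return regressions
```

A nightly job could fail, or open an issue, whenever the returned dictionary is non-empty. A relative tolerance is used here so that metrics on different scales can share one threshold.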
Alternatives Considered¶
- Full Benchmarks on PR: Rejected because a 30-60 minute run on every PR would stall the review and merge cycle.
- No Benchmarks in CI: Rejected as it allows silent quality regressions to reach production.
Consequences¶
- Positive: High CI stability; fast feedback loop; comprehensive nightly coverage.
- Negative: Requires maintaining two separate CI workflows.