ADR-051: Single Code Path for Evaluation and Application
- Status: Accepted
- Date: 2026-04-03
- Authors: Podcast Scraper Team
- Related RFCs: RFC-048
Context & Problem Statement
Evaluation runs and application (production/dev) runs used the same models but could diverge in behavior due to implicit defaults, separate eval-only logic, and hidden parameter drift. When evaluation metrics looked good but production behavior differed, root-cause analysis was difficult because the two paths were not guaranteed to share the same execution logic.
The core question: should evaluation exercise a separate code path optimized for measurement, or the exact same code path that ships to users?
Decision
Evaluation and application share a single execution path. No separate "eval summarizer" or "prod summarizer" exists.
- One code path: Eval runs call the same providers, pipeline stages, and preprocessing logic as the application. Shared provider initialization, shared generation logic, shared dynamic safeguards.
- Explicit parameters only: All behavioral ML parameters (`max_new_tokens`, `min_new_tokens`, `early_stopping`, `no_repeat_ngram_size`, chunking strategy) must appear explicitly in config. No silent defaults for parameters that affect output.
- Scorers are read-only observers: Scorers never mutate behavior. They operate on predictions after generation and never change generation parameters, filter the pipeline's predictions, or affect chunking/reduce decisions. Any filtering a metric requires (e.g. NER scope filtering) happens inside the scorer.
- Preprocessing is part of the model contract: The preprocessing profile is included in the fingerprint and treated as a behavioral parameter, not an incidental setting.
- Dynamic safeguards run everywhere: Runtime safety logic (capping `max_new_tokens` based on input size, forcing `min_new_tokens=0` to prevent expansion) runs in both eval and app. These are provider responsibilities, not eval-only features.
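The "explicit parameters only" and "dynamic safeguards run everywhere" rules can be illustrated with a minimal sketch. The names `GenerationParams`, `load_params`, and `apply_safeguards` are hypothetical and are not taken from the codebase. The cap rule shown here (output may not exceed input length) is an assumption standing in for whatever rule the providers actually apply.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class GenerationParams:
    # No field has a default on purpose: every behavioral parameter
    # must be supplied by the config.
    max_new_tokens: int
    min_new_tokens: int
    early_stopping: bool
    no_repeat_ngram_size: int
    chunking_strategy: str


def load_params(config: Mapping[str, Any]) -> GenerationParams:
    """Build params from config, failing loudly on missing or unknown keys."""
    expected = {f.name for f in fields(GenerationParams)}
    provided = set(config)
    missing = expected - provided
    unknown = provided - expected
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    return GenerationParams(**config)


def apply_safeguards(params: GenerationParams, input_tokens: int) -> GenerationParams:
    """Provider-side safeguard, called the same way by eval and app runs.

    Assumption: the cap is the input length in tokens. The real rule may differ.
    """
    capped = min(params.max_new_tokens, max(1, input_tokens))
    # min_new_tokens is forced to 0 so short inputs are never expanded.
    return replace(params, max_new_tokens=capped, min_new_tokens=0)
```

Because the safeguard returns a new frozen object and lives with the provider, neither an eval harness nor the application can apply a different version of it by accident.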
Rationale
- If eval and app diverge, metrics are misleading. "What you evaluate is what you ship" is the non-negotiable principle.
- Explicit parameters eliminate hidden drift. When behavior changes, it is traceable to a config change, not an implicit default shift.
- Read-only scorers guarantee that adding or modifying evaluation logic never introduces production regressions.
- Including preprocessing in the fingerprint means two runs with different cleaning profiles are correctly identified as different, not silently compared.
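To make the fingerprint point concrete, here is a minimal sketch of a fingerprint that covers the preprocessing profile. The function name, the payload layout, and the choice of SHA-256 over canonical JSON are assumptions for illustration. The ADR does not specify how `fingerprint.json` is computed.

```python
import hashlib
import json
from typing import Any, Mapping


def fingerprint(model_id: str, params: Mapping[str, Any], preprocessing_profile: str) -> str:
    """Hash everything that can change output, preprocessing included."""
    payload = {
        "model_id": model_id,
        "params": dict(params),
        "preprocessing_profile": preprocessing_profile,
    }
    # sort_keys gives a canonical serialization, so key order cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical model and profile names; only the profile differs between the runs.
params = {"max_new_tokens": 150, "min_new_tokens": 0}
run_a = fingerprint("example-summarizer", params, "cleaning_a")
run_b = fingerprint("example-summarizer", params, "cleaning_b")
same_fingerprint = run_a == run_b  # False: the runs are not directly comparable
```

Under this scheme, a comparison tool can refuse to compare metrics across runs whose fingerprints differ, or at least flag the difference.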
Alternatives Considered
- Separate eval mode with stricter constraints: Rejected; creates two paths that can diverge. "Stricter" eval-only policies make eval results non-representative of production.
- Shared code with eval-only parameter overrides: Rejected; overrides reintroduce hidden divergence. If an override matters for eval, it matters for production too.
- Allow implicit defaults for non-critical parameters: Rejected; it is difficult to determine what is "non-critical." The v7 experiment showed that preprocessing, which seemed minor, had an 80% impact.
Consequences
- Positive: Eval results are trustworthy indicators of production behavior. All behavior is explainable from logs + config. No hidden drift.
- Negative: Cannot add eval-only optimizations (e.g. eval-fast presets) without ensuring they also work in production. Slightly more verbose configs (every parameter explicit).
- Neutral: Fingerprints become the source of truth for explaining and comparing runs.
Implementation Notes
- Module: `src/podcast_scraper/evaluation/`, `src/podcast_scraper/providers/ml/`
- Pattern: Scorers implement a read-only observer interface; providers own dynamic safeguards
- Artifact contract: Every run (eval or app) produces `predictions.jsonl`, `metrics.json`, `metrics_report.md`, `fingerprint.json`, and `run.log`
- Relationship to ADR-040: Materialization (ADR-040) ensures inputs are identical; this ADR ensures execution is identical
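The read-only observer pattern for scorers might look roughly like the sketch below. The `Prediction` fields, the `Scorer` protocol, and `MeanLengthScorer` are all hypothetical. The actual interface in `src/podcast_scraper/evaluation/` is not shown in this ADR and may differ.

```python
from dataclasses import dataclass
from typing import Mapping, Protocol, Sequence


@dataclass(frozen=True)
class Prediction:
    episode_id: str
    text: str
    scope: str  # hypothetical field used for scorer-side filtering


class Scorer(Protocol):
    def score(self, predictions: Sequence[Prediction]) -> Mapping[str, float]: ...


class MeanLengthScorer:
    """Toy scorer: mean word count over the predictions in one scope."""

    def __init__(self, scope: str) -> None:
        self.scope = scope

    def score(self, predictions: Sequence[Prediction]) -> Mapping[str, float]:
        # Filtering happens here, on a local list; the input is never modified.
        in_scope = [p for p in predictions if p.scope == self.scope]
        if not in_scope:
            return {"mean_words": 0.0, "n_scored": 0.0}
        total = sum(len(p.text.split()) for p in in_scope)
        return {"mean_words": total / len(in_scope), "n_scored": float(len(in_scope))}


def run_scorers(predictions: Sequence[Prediction], scorers: Sequence[Scorer]) -> dict[str, float]:
    # Scorers receive a tuple of frozen records, so they cannot add, drop,
    # or edit predictions; they can only return metrics.
    frozen = tuple(predictions)
    metrics: dict[str, float] = {}
    for scorer in scorers:
        metrics.update(scorer.score(frozen))
    return metrics
```

The design choice in this sketch is that scorers only ever see generation output and return numbers. They hold no reference to the provider or its config, so there is nothing in reach for them to mutate.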