ADR-051: Single Code Path for Evaluation and Application¶

Status: Accepted
Date: 2026-04-03
Authors: Podcast Scraper Team
Related RFCs: RFC-048

Context & Problem Statement¶

Evaluation runs and application (production/dev) runs used the same models but could diverge in behavior due to implicit defaults, separate eval-only logic, and hidden parameter drift. When evaluation metrics looked good but production behavior differed, root-cause analysis was difficult because the two paths were not guaranteed to share the same execution logic.

The core question: should evaluation exercise a separate code path optimized for measurement, or the exact same code path that ships to users?

Decision¶

Evaluation and application share a single execution path. No separate "eval summarizer" or "prod summarizer" exists.

One code path: Eval runs call the same providers, pipeline stages, and preprocessing logic as the application. Shared provider initialization, shared generation logic, shared dynamic safeguards.
Explicit parameters only: All behavioral ML parameters (max_new_tokens, min_new_tokens, early_stopping, no_repeat_ngram_size, chunking strategy) must appear explicitly in config. No silent defaults for parameters that affect output.
Scorers are read-only observers: Scorers never mutate behavior. They operate on predictions after generation and never change generation parameters, filter predictions, or affect chunking/reduce decisions. All filtering (e.g. NER scope filtering) happens inside the scorer.
Preprocessing is part of the model contract: The preprocessing profile is included in the fingerprint and treated as a behavioral parameter, not an incidental setting.
Dynamic safeguards run everywhere: Runtime safety logic (capping max_new_tokens based on input size, forcing min_new_tokens=0 to prevent expansion) runs in both eval and app. These are provider responsibilities, not eval-only features.

Rationale¶

If eval and app diverge, metrics are misleading — "what you evaluate is what you ship" is the non-negotiable principle.
Explicit parameters eliminate hidden drift. When behavior changes, it is traceable to a config change, not an implicit default shift.
Read-only scorers guarantee that adding or modifying evaluation logic never introduces production regressions.
Including preprocessing in the fingerprint means two runs with different cleaning profiles are correctly identified as different, not silently compared.

Alternatives Considered¶

Separate eval mode with stricter constraints: Rejected; creates two paths that can diverge. "Stricter" eval-only policies make eval results non-representative of production.
Shared code with eval-only parameter overrides: Rejected; overrides reintroduce hidden divergence. If an override matters for eval, it matters for production too.
Allow implicit defaults for non-critical parameters: Rejected; difficult to determine what is "non-critical." The v7 experiment showed preprocessing — seemingly minor — had an 80% impact.

Consequences¶

Positive: Eval results are trustworthy indicators of production behavior. All behavior is explainable from logs + config. No hidden drift.
Negative: Cannot add eval-only optimizations (e.g. eval-fast presets) without ensuring they also work in production. Slightly more verbose configs (every parameter explicit).
Neutral: Fingerprints become the source of truth for explaining and comparing runs.

Implementation Notes¶

Module: src/podcast_scraper/evaluation/, src/podcast_scraper/providers/ml/
Pattern: Scorers implement a read-only observer interface; providers own dynamic safeguards
Artifact contract: Every run (eval or app) produces predictions.jsonl, metrics.json, metrics_report.md, fingerprint.json, and run.log
Relationship to ADR-040: Materialization (ADR-040) ensures inputs are identical; this ADR ensures execution is identical