# ADR-040: Materialization Boundary for Evaluation Inputs
- Status: Accepted
- Date: 2026-04-03
- Authors: Podcast Scraper Team
- Related RFCs: RFC-046
## Context & Problem Statement
Preprocessing profiles (e.g. `cleaning_v3` vs. `cleaning_v4`) were specified in run
configs alongside model parameters. As a result, two runs with different preprocessing
but the same `dataset_id` appeared comparable when they were not: the v7 experiment
showed an 80 percentage-point improvement from preprocessing alone. This made
evaluation comparisons ambiguous and left input contracts implicit.
The boundary between "what is the dataset?" and "what does this run configure?" needed a principled definition.
## Decision
Preprocessing is moved from a run-time parameter to a dataset materialization parameter. The materialization boundary is defined as follows:
- Materialization owns preprocessing: The combination of
  (`dataset_id` + `canonical_profile` + `adapter`) produces a `materialization_id`.
  Runs reference a `materialization_id`, not a raw `dataset_id` with a separate
  `preprocessing_profile`.
- Two-layer preprocessing model: Layer A (canonical cleanup) is shared across all
  providers. Layer B (adapter) is provider-specific (e.g. speaker anonymization for ML
  models). Both are applied at materialization time, not run time.
- Chunking stays in run config: Chunking is model-dependent (BART needs 1024 tokens,
  LED handles 4096). It belongs in run config because materializing chunks would lock
  inputs to one model's requirements.
- Semantic versioning in materialization ID: Materialization configs carry semver
  (`version: "1.0.0"`) so materialized inputs can evolve while preserving historical
  datasets.
- Materialized inputs are frozen: Once generated, materialized text files are
  immutable. Changes produce a new version.
## Rationale
- The v7 experiment proved that preprocessing is not a minor parameter — it changes what the input is, not how the model processes it. It belongs in the dataset definition.
- Separating canonical from adapter preprocessing allows fair cross-provider comparison (canonical only) while enabling provider-specific optimization (with adapter).
- Chunking is the one preprocessing step that genuinely depends on the model's context window, so it stays in run config.
- Semver enables reproducibility: a specific `materialization_id` always produces
  identical inputs.
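A materialization definition of the kind described above might look like the following sketch. All field names and values here are assumptions for illustration; the ADR does not spell out the schema beyond `version` carrying semver:

```yaml
# data/eval/materializations/podcasts_2026q1_v4.yaml (hypothetical example)
materialization:
  dataset_id: podcasts_2026q1      # which raw dataset this is built from
  canonical_profile: cleaning_v4   # Layer A: shared canonical cleanup
  adapter: anonymize_speakers      # Layer B: provider-specific (optional)
  version: "1.0.0"                 # semver; any change to inputs bumps this
```

Note that no chunking parameters appear here: chunking stays in run config, per the decision above.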
## Alternatives Considered
- Keep preprocessing as run parameter (status quo): Rejected; ambiguous comparisons and hidden input differences (proven by v7 experiment).
- Include chunking in materialization: Rejected; model-dependent — would require separate materializations per chunk size, defeating the purpose of a shared input contract.
- Apply adapters at runtime instead of materialization time: Rejected; makes adapters a hidden run parameter, which is the problem this ADR solves.
## Consequences
- Positive: Comparisons are honest; the same `materialization_id` guarantees the same
  inputs. Materialized datasets are frozen and auditable. Provider-specific
  optimization is explicit.
- Negative: Adds a materialization generation step before experiments. Requires
  migration from `preprocessing_profile` in existing configs.
- Neutral: `data/eval/materializations/` and `data/eval/materialized/` directories are
  added to the project structure.
## Implementation Notes
- Module: `podcast_scraper/evaluation/` (materialization config loading and generation)
- Config: `data/eval/materializations/*.yaml` (materialization definitions)
- Output: `data/eval/materialized/<materialization_id>/` (frozen text files)
- Migration: `preprocessing_profile` in run configs is deprecated; a
  backward-compatibility layer logs a warning during the transition.
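The backward-compatibility layer could be as small as a resolver that prefers the new field and warns on the deprecated one. This is a minimal sketch under assumed field names and a made-up legacy ID format; it is not the project's actual shim:

```python
import warnings


def resolve_inputs(run_config: dict) -> str:
    """Resolve a run config to a materialization_id (hypothetical shim).

    New-style configs reference a materialization_id directly. Old-style
    configs that still carry preprocessing_profile keep working during the
    transition, but emit a DeprecationWarning pointing at ADR-040.
    """
    if "materialization_id" in run_config:
        return run_config["materialization_id"]
    if "preprocessing_profile" in run_config:
        warnings.warn(
            "preprocessing_profile is deprecated (ADR-040); "
            "reference a materialization_id instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Map the legacy pair onto an assumed legacy-style ID so old
        # configs keep running while they are migrated.
        return f"{run_config['dataset_id']}-{run_config['preprocessing_profile']}-legacy"
    raise ValueError("run config must reference a materialization_id")
```

Warning rather than failing keeps existing experiment configs runnable while making every deprecated use visible in logs, which is the usual shape of a transition layer.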