ADR-040: Materialization Boundary for Evaluation Inputs¶

Status: Accepted
Date: 2026-04-03
Authors: Podcast Scraper Team
Related RFCs: RFC-046

Context & Problem Statement¶

Preprocessing profiles (e.g. cleaning_v3 vs cleaning_v4) were specified in run configs alongside model parameters. Two runs with different preprocessing but the same dataset_id appeared comparable when they were not — the v7 experiment showed an 80 percentage-point improvement from preprocessing alone. This made evaluation comparisons ambiguous and input contracts implicit.

The boundary between "what is the dataset?" and "what does this run configure?" needed a principled definition.

Decision¶

Preprocessing is moved from a run-time parameter to a dataset materialization parameter. The materialization boundary is defined as follows:

Materialization owns preprocessing: The combination of (dataset_id + canonical_profile + adapter) produces a materialization_id. Runs reference a materialization_id, not a raw dataset_id with a separate preprocessing_profile.
Two-layer preprocessing model: Layer A (canonical cleanup) is shared across all providers. Layer B (adapter) is provider-specific (e.g. speaker anonymization for ML models). Both are applied at materialization time, not run time.
Chunking stays in run config: Chunking is model-dependent (BART needs 1024 tokens, LED handles 4096). It belongs in run config because materializing chunks would lock inputs to one model's requirements.
Semantic versioning in materialization ID: Materialization configs carry semver (version: "1.0.0") so materialized inputs can evolve while preserving historical datasets.
Materialized inputs are frozen: Once generated, materialized text files are immutable. Changes produce a new version.

Rationale¶

The v7 experiment proved that preprocessing is not a minor parameter — it changes what the input is, not how the model processes it. It belongs in the dataset definition.
Separating canonical from adapter preprocessing allows fair cross-provider comparison (canonical only) while enabling provider-specific optimization (with adapter).
Chunking is the one preprocessing step that genuinely depends on the model's context window, so it stays in run config.
Semver enables reproducibility: a specific materialization_id always produces identical inputs.

Alternatives Considered¶

Keep preprocessing as run parameter (status quo): Rejected; ambiguous comparisons and hidden input differences (proven by v7 experiment).
Include chunking in materialization: Rejected; model-dependent — would require separate materializations per chunk size, defeating the purpose of a shared input contract.
Apply adapters at runtime instead of materialization time: Rejected; makes adapters a hidden run parameter, which is the problem this ADR solves.

Consequences¶

Positive: Comparisons are honest — same materialization_id guarantees same inputs. Materialized datasets are frozen and auditable. Provider-specific optimization is explicit.
Negative: Adds a materialization generation step before experiments. Requires migration from preprocessing_profile in existing configs.
Neutral: data/eval/materializations/ and data/eval/materialized/ directories are added to the project structure.

Implementation Notes¶

Module: podcast_scraper/evaluation/ — materialization config loading and generation
Config: data/eval/materializations/*.yaml — materialization definitions
Output: data/eval/materialized/<materialization_id>/ — frozen text files
Migration: preprocessing_profile in run configs is deprecated; backward compatibility layer logs a warning during transition