ADR-033: Registered Preprocessing Profiles
- Status: Accepted
- Date: 2026-01-11
- Authors: Podcast Scraper Team
- Related RFCs: RFC-016
Context & Problem Statement
Transcript cleaning (removing ads, timestamps, or fillers) can affect summary quality as much as the model itself. However, cleaning logic was often buried inside individual functions, which made it hard to track how changes to regex patterns affected overall metrics.
Decision
We move cleaning logic into Registered Preprocessing Profiles.
- Profiles are defined as versioned objects (e.g., cleaning_v3).
- Each profile specifies exactly which steps are active (e.g., remove_sponsors=True).
- The profile_id is recorded in the output fingerprint (ADR-024).
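
The three points above can be sketched in Python as follows. This is a minimal illustration, not the project's actual implementation: only profile_id, cleaning_v3, and remove_sponsors come from this ADR. The other step flags, the regex patterns, the registry helpers, and the fingerprint layout are assumptions made for the example (the real fingerprint format is defined in ADR-024).

```python
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessingProfile:
    """A versioned, immutable description of which cleaning steps are active."""
    profile_id: str
    remove_sponsors: bool = False
    remove_timestamps: bool = False  # hypothetical step name
    remove_fillers: bool = False     # hypothetical step name


# Hypothetical in-memory registry keyed by profile_id.
PROFILE_REGISTRY: dict[str, PreprocessingProfile] = {}


def register_profile(profile: PreprocessingProfile) -> PreprocessingProfile:
    """Add a profile; refuse to overwrite so an ID always means the same steps."""
    if profile.profile_id in PROFILE_REGISTRY:
        raise ValueError(f"profile already registered: {profile.profile_id}")
    PROFILE_REGISTRY[profile.profile_id] = profile
    return profile


# Illustrative patterns only; real cleaning rules would be more thorough.
SPONSOR_RE = re.compile(r"^.*\bsponsored by\b.*$\n?", re.IGNORECASE | re.MULTILINE)
TIMESTAMP_RE = re.compile(r"\[?\b\d{1,2}:\d{2}(?::\d{2})?\b\]?\s*")
FILLER_RE = re.compile(r"\b(?:um|uh)\b,?\s*", re.IGNORECASE)


def clean_transcript(text: str, profile_id: str) -> str:
    """Apply only the steps that the registered profile enables."""
    profile = PROFILE_REGISTRY[profile_id]
    if profile.remove_sponsors:
        text = SPONSOR_RE.sub("", text)
    if profile.remove_timestamps:
        text = TIMESTAMP_RE.sub("", text)
    if profile.remove_fillers:
        text = FILLER_RE.sub("", text)
    return text.strip()


def build_fingerprint(model_name: str, profile_id: str) -> dict:
    """Record the profile next to the model (assumed layout, see ADR-024)."""
    profile = PROFILE_REGISTRY[profile_id]
    return {
        "model": model_name,
        "profile_id": profile.profile_id,
        "profile_steps": asdict(profile),
    }


register_profile(
    PreprocessingProfile("cleaning_v3", remove_sponsors=True, remove_timestamps=True)
)
```

Freezing the dataclass and rejecting duplicate IDs are design choices for this sketch: together they mean a change to the cleaning rules requires a new profile ID (for example cleaning_v4) rather than a silent edit to an existing one.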
Rationale
- Isolating Variables: Allows researchers to test Model A vs. Model B while keeping the Preprocessing Profile identical.
- Traceability: If ROUGE scores improve, we can tell whether the gain came from a better model or a better cleaning profile.
- Reusability: Standardizes cleaning across all episodes in a dataset.
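
One way to picture the isolation and traceability points is to diff the fingerprints of two runs and check that exactly one variable changed. The sketch below assumes a fingerprint is a flat dictionary with model and profile_id keys; that shape and the helper name changed_variables are hypothetical, and the model names are placeholders.

```python
def changed_variables(baseline: dict, candidate: dict) -> list[str]:
    """Return the fingerprint keys whose values differ between two runs."""
    keys = sorted(set(baseline) | set(candidate))
    return [key for key in keys if baseline.get(key) != candidate.get(key)]


# Two hypothetical runs that share a preprocessing profile.
run_a = {"model": "model_a", "profile_id": "cleaning_v3"}
run_b = {"model": "model_b", "profile_id": "cleaning_v3"}

print(changed_variables(run_a, run_b))  # ['model']
```

If the returned list contains both model and profile_id, the comparison is confounded and a score difference cannot be attributed to either one alone.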
Alternatives Considered
- Ad-hoc Cleaning: Rejected because it makes benchmark results impossible to reproduce or explain.
Consequences
- Positive: Clearer insight into "data vs. model" performance; easy to roll back cleaning regressions.
- Negative: Requires maintaining a registry of profiles.