ADR-020: Provider-Agnostic Preprocessing¶

Status: Accepted
Date: 2026-01-11
Authors: Podcast Scraper Team
Related RFCs: RFC-013

Context & Problem Statement¶

Transcript cleaning (timestamp removal, sponsor block filtering) was originally implemented inside the summarization logic. Adding a second provider (OpenAI) would have meant duplicating this complex regex logic or risk inconsistent inputs.

Decision¶

We move all cleaning and sanitization into a Shared Preprocessing Module.

Cleaning happens once in the metadata or workflow layer.
The "clean" text is then passed to the selected provider (MLProvider or OpenAIProvider).

Rationale¶

Single Source of Truth: Changes to how we detect "Sponsor Blocks" only need to be made in one place.
Efficiency: Avoids redundant processing if multiple providers are used sequentially.
Comparability: Ensures that when we compare a local model to OpenAI, they are both looking at the exact same cleaned text.

Alternatives Considered¶

Provider-Specific Cleaning: Rejected as it makes benchmarking and cross-provider comparison invalid.

Consequences¶

Positive: Guarantees consistent input quality; simplifies provider implementation (they only handle inference).
Negative: The preprocessing logic must remain generic enough to not over-clean text for specific models.

References¶

RFC-013: OpenAI Provider Implementation