RFC-040: Audio Pre-Processing Pipeline for Podcast Ingestion
- Status: Completed
- Authors: Architecture Review
- Stakeholders: Core Pipeline, Transcription Providers
- Related PRDs:
  - N/A (new capability)
- Related ADRs:
  - ADR-036: Standardized Pre-Provider Audio Stage
  - ADR-037: Content-Hash Based Audio Caching
  - ADR-038: FFmpeg-First Audio Manipulation
  - ADR-039: Speech-Optimized Codec (Opus)
- Related RFCs:
  - docs/rfc/RFC-005-whisper-integration.md (transcription pipeline)
  - docs/rfc/RFC-013-openai-provider-implementation.md (provider pattern)
  - docs/rfc/RFC-029-provider-refactoring-consolidation.md (unified provider architecture)
- Related Documents:
  - docs/architecture/ARCHITECTURE.md (pipeline flow)
  - docs/guides/PROVIDER_IMPLEMENTATION_GUIDE.md (provider patterns)
Abstract
This RFC proposes a dedicated audio pre-processing stage in the podcast ingestion pipeline to optimize raw audio files before they are sent to any transcription provider (local Whisper or OpenAI API). The pre-processor will convert audio to mono, resample to 16 kHz, remove silence using Voice Activity Detection (VAD), and normalize loudness. This is critical for API-based transcription where file size limits and per-minute costs make optimization essential. It also improves performance for local Whisper transcription.
Architecture Alignment: This RFC introduces preprocessing as a pipeline stage that occurs before provider selection, not as part of the provider itself. Audio is preprocessed in episode_processor.py after download but before passing to any transcription provider. This ensures all providers (Whisper, OpenAI, future providers) benefit from optimized audio, maintains separation of concerns, and addresses API upload size limits.
Problem Statement
The current podcast ingestion pipeline downloads audio files and passes them directly to transcription providers (Whisper or OpenAI Whisper API). Podcast audio often contains:
- Long silence periods (intros, outros, pauses)
- Music segments (intro/outro themes, mid-roll music)
- Inconsistent loudness levels (varying recording quality across episodes)
- Unnecessarily high fidelity (48kHz, stereo, high bitrates)
This leads to:
- API upload size limits — Many podcast episodes exceed provider file size limits
- Longer transcription runtimes — More audio data to process (both local and API)
- Higher API costs — Providers charge per audio minute, including non-speech segments
- Higher compute costs — Local Whisper processes silence and music unnecessarily
- Larger transcripts — Non-speech segments may produce "[MUSIC]", "[SILENCE]", or filler text
- Inconsistent quality — Varying audio levels affect transcription accuracy
API Provider File Size Limits (Official Documentation)
Different transcription providers impose various file size constraints:
| Provider | API Endpoint | Max File Size | Max Duration | Documentation | Status |
|---|---|---|---|---|---|
| OpenAI Whisper | Audio API | 25 MB | N/A | Official Docs | ✅ Supported |
| Mistral Voxtral | Audio API | TBD (likely similar to OpenAI) | TBD | API Docs | 🔄 Planned (PRD-010) |
| Google Gemini | Multimodal API | TBD (native audio) | TBD | AI Studio | 🔄 Planned (PRD-012) |
| Grok | N/A | ❌ No transcription | N/A | API Docs | ✅ Implemented (PRD-013, LLMs only) |
| Google Cloud (sync) | Speech-to-Text | 10 MB | ~1 minute | Quotas | ❌ Not planned |
| Azure Speech | Fast/Batch | 300 MB - 1 GB | 120-240 min | Service Limits | ❌ Not planned |
Transcription Provider Support Matrix (from PRDs):
| Provider | Transcription | Speaker Detection | Summarization | Notes |
|---|---|---|---|---|
| Local (Whisper) | ✅ | ✅ (spaCy) | ✅ (Transformers) | No size limits |
| OpenAI | ✅ (25 MB limit) | ✅ | ✅ | Currently supported |
| Mistral | ✅ Voxtral (limit TBD) | ✅ | ✅ | Planned (PRD-010) |
| Gemini | ✅ Native audio (limit TBD) | ✅ | ✅ | Planned (PRD-012) |
| Anthropic | ❌ | ✅ | ✅ | No audio API (PRD-009) |
| DeepSeek | ❌ | ✅ | ✅ | No audio API (PRD-011) |
| Grok | ❌ | ✅ | ✅ | LLMs only (PRD-013) |
| Ollama | ❌ | ✅ | ✅ | Local LLMs (PRD-014) |
Critical Constraints:
- OpenAI: 25 MB hard limit (most restrictive known, currently supported)
- Mistral & Gemini: Limits unknown, assumed similar or more generous
- Conservative Design: Target <25 MB to ensure compatibility with all current and planned providers
Common Denominator Approach:
To support both current (OpenAI) and planned (Mistral, Gemini) API transcription providers, we adopt a conservative <25 MB target:
- Known constraint: OpenAI = 25 MB (confirmed)
- Unknown constraints: Mistral Voxtral and Gemini audio limits not yet documented
- Safe assumption: Designing for 25 MB, with typical outputs far below it, leaves headroom if future providers have similar or slightly more restrictive limits
- Benefit: If Mistral/Gemini have more generous limits (>25 MB), preprocessed files will still be well-optimized
Real-World Impact:
- Raw podcast episodes (30-60 minutes) typically range from 30-100 MB in MP3 format
- An estimated 70-80% of podcast episodes exceed the 25 MB limit and cannot be uploaded to OpenAI without preprocessing
- Preprocessing is therefore effectively required for API-based transcription of typical podcast content, even though the feature itself stays opt-in
- Target output: 2-5 MB for typical episodes (10-25× reduction, well below any likely provider limit)
We need a pipeline stage that preprocesses audio before it reaches any provider, ensuring files fit within the 25 MB constraint while optimizing for both local and API-based transcription.
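The size figures in this section can be sanity-checked with bitrate arithmetic. The sketch below assumes constant 24 kbps output (the Opus bitrate this RFC selects), ignores container overhead, and treats 25 MB as 25 × 1024 × 1024 bytes, which is an assumption since a provider may count decimal megabytes. The helper names are illustrative only.

```python
# Back-of-the-envelope sizing for constant-bitrate audio.
# Assumptions: constant bitrate, no container overhead, 1 MB = 1024 * 1024 bytes.

UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024  # 25 MB provider limit (binary MB assumed)


def estimated_size_bytes(duration_s: float, bitrate_kbps: float = 24.0) -> float:
    """Approximate encoded size for `duration_s` seconds of audio."""
    return duration_s * bitrate_kbps * 1000 / 8


def max_duration_s(limit_bytes: float, bitrate_kbps: float = 24.0) -> float:
    """Longest duration whose encoded size stays within `limit_bytes`."""
    return limit_bytes * 8 / (bitrate_kbps * 1000)


if __name__ == "__main__":
    for minutes in (30, 60, 120):
        size_mb = estimated_size_bytes(minutes * 60) / (1024 * 1024)
        print(f"{minutes} min at 24 kbps: about {size_mb:.1f} MB before silence removal")
    limit_min = max_duration_s(UPLOAD_LIMIT_BYTES) / 60
    print(f"Longest audio under the limit: about {limit_min:.0f} min")
```

Under these assumptions a 60-minute episode encodes to roughly 10 MB before any silence is removed, and a little under two and a half hours of retained audio fits below the limit, so unusually long episodes may still exceed it.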
Use Cases:
- API Upload Size Compliance: Reduce audio files to fit within the 25 MB limit (OpenAI)
- Cost Optimization: Reduce API costs by removing silence and music before API calls
- Performance Optimization: Speed up local Whisper transcription by processing smaller audio files
- Quality Standardization: Normalize audio levels across different podcast sources for consistent transcription quality
Goals
- Ensure API compatibility — All preprocessed audio must be <25 MB to support OpenAI
- Reduce audio size for API upload limits — Target 10-25× reduction (typical 50 MB → 2-5 MB)
- Lower transcription costs — Reduce API usage by processing less audio (30-60% cost savings)
- Improve transcription performance — Faster processing for both local and API providers
- Improve transcription accuracy — Consistent audio levels and speech-only content
- Standardize audio inputs — All podcasts processed with consistent quality
- Maintain backward compatibility — Preprocessing is optional and configurable
- Cache preprocessed audio — Avoid reprocessing with content-based hashing
- Provider-agnostic — All transcription providers receive optimized audio
Constraints & Assumptions
Constraints:
- Must not break existing transcription workflows (backward compatibility)
- Must work before any provider is invoked (not provider-specific)
- Must ensure preprocessed files are <25 MB (OpenAI constraint, the most restrictive known)
- Must handle various input formats (MP3, M4A, WAV, etc.)
- Must be performant enough not to negate transcription time savings
- Must preserve audio segment boundaries for accurate transcription timing
- Must be optional and configurable (disabled by default for initial rollout)
- Must not require provider-specific implementations
Assumptions:
- ffmpeg is available on the system (already a common dependency for audio processing)
- VAD preprocessing adds 10-30% overhead but saves 30-60% on transcription time (net win)
- Content-based hashing is sufficient for cache keys (no security sensitivity)
- Preprocessed audio can be cached in the .cache/ directory structure
- Removing silence throughout the file is acceptable as long as thresholds are conservative (meaningful pauses are preserved)
Design & Implementation
1. Module Structure
Create a new preprocessing module following the provider pattern:
src/podcast_scraper/
├── preprocessing/
│ ├── __init__.py
│ ├── base.py # AudioPreprocessor protocol
│ ├── factory.py # Factory function
│ ├── ffmpeg_processor.py # FFmpeg-based implementation
│ └── cache.py # Preprocessed audio caching
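The tree lists a factory.py, but the RFC does not spell out the factory itself. One possible shape is sketched below. The function name `create_audio_preprocessor`, the `ffmpeg_available` helper, and the stub class are all hypothetical, and returning None when ffmpeg is missing follows the "warn and disable" proposal under Open Questions rather than a settled decision.

```python
import logging
import shutil
from types import SimpleNamespace
from typing import Optional

logger = logging.getLogger(__name__)


class FFmpegAudioPreprocessor:
    """Stand-in for the real class in ffmpeg_processor.py (stub for this sketch)."""

    def __init__(self, cfg) -> None:
        self.cfg = cfg


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def create_audio_preprocessor(cfg) -> Optional[FFmpegAudioPreprocessor]:
    """Hypothetical factory: return a preprocessor, or None when disabled or unavailable."""
    if not getattr(cfg, "preprocessing_enabled", False):
        return None
    if not ffmpeg_available():
        logger.warning("ffmpeg not found; audio preprocessing disabled")
        return None
    return FFmpegAudioPreprocessor(cfg)


if __name__ == "__main__":
    cfg = SimpleNamespace(preprocessing_enabled=True)
    print("preprocessor:", create_audio_preprocessor(cfg))
```

Returning None keeps the call site simple: the pipeline already treats a missing preprocessor as "use the original audio".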
2. AudioPreprocessor Protocol
Define a protocol for audio preprocessing:
# preprocessing/base.py
from typing import Protocol, Tuple


class AudioPreprocessor(Protocol):
    """Protocol for audio preprocessing providers."""

    def preprocess(
        self,
        input_path: str,
        output_path: str,
    ) -> Tuple[bool, float]:
        """Preprocess audio file for transcription.

        Args:
            input_path: Path to raw audio file
            output_path: Path to save preprocessed audio

        Returns:
            Tuple of (success: bool, elapsed_time: float)
        """
        ...

    def get_cache_key(self, input_path: str) -> str:
        """Generate content-based cache key for input audio.

        Args:
            input_path: Path to audio file

        Returns:
            Cache key (hash of audio content + preprocessing settings)
        """
        ...
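Because `AudioPreprocessor` is a `typing.Protocol`, any class with matching methods satisfies it without inheriting from it. As an illustration (not part of the proposed module), a hypothetical `PassthroughPreprocessor` that simply copies the file could stand in for the FFmpeg implementation in tests. The protocol is redeclared here, with `runtime_checkable` added as an assumption, so the snippet is self-contained.

```python
import hashlib
import shutil
import time
from typing import Protocol, Tuple, runtime_checkable


@runtime_checkable  # assumption: the RFC's protocol is not declared runtime-checkable
class AudioPreprocessor(Protocol):
    def preprocess(self, input_path: str, output_path: str) -> Tuple[bool, float]: ...

    def get_cache_key(self, input_path: str) -> str: ...


class PassthroughPreprocessor:
    """Hypothetical test double: copies the input unchanged."""

    def preprocess(self, input_path: str, output_path: str) -> Tuple[bool, float]:
        start = time.monotonic()
        try:
            shutil.copyfile(input_path, output_path)
        except OSError:
            return False, time.monotonic() - start
        return True, time.monotonic() - start

    def get_cache_key(self, input_path: str) -> str:
        # Keyed on the path only: fine for a test double, not for production use.
        return hashlib.sha256(input_path.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    # Structural check: no inheritance is needed to satisfy the protocol.
    print(isinstance(PassthroughPreprocessor(), AudioPreprocessor))
```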
3. FFmpeg Implementation
Implement audio preprocessing using ffmpeg:
# preprocessing/ffmpeg_processor.py
import hashlib
import logging
import subprocess
import time
from typing import Tuple

logger = logging.getLogger(__name__)


class FFmpegAudioPreprocessor:
    """Audio preprocessor using ffmpeg for format conversion and the silenceremove filter."""

    def __init__(self, cfg):
        """Initialize preprocessor with configuration.

        Args:
            cfg: Configuration object with preprocessing settings
        """
        self.cfg = cfg
        self.sample_rate = cfg.preprocessing_sample_rate  # Default: 16000
        self.silence_threshold = cfg.preprocessing_silence_threshold  # Default: -50dB
        self.silence_duration = cfg.preprocessing_silence_duration  # Default: 2.0s
        self.target_loudness = cfg.preprocessing_target_loudness  # Default: -16 LUFS

    def preprocess(self, input_path: str, output_path: str) -> Tuple[bool, float]:
        """Preprocess audio using ffmpeg pipeline.

        Pipeline stages:
        1. Convert to mono
        2. Resample to 16 kHz
        3. Remove silence (VAD)
        4. Normalize loudness to -16 LUFS
        5. Compress using speech-optimized codec

        Args:
            input_path: Path to raw audio file
            output_path: Path to save preprocessed audio; the extension must name
                a container that can hold Opus (for example .ogg), not .mp3

        Returns:
            Tuple of (success: bool, elapsed_time: float)
        """
        start_time = time.time()
        # Build ffmpeg command
        # -ac 1: convert to mono
        # -ar 16000: resample to 16 kHz
        # -af silenceremove: remove silence (conservative thresholds)
        # -af loudnorm: normalize loudness to -16 LUFS
        # -c:a libopus: speech-optimized codec
        # -b:a 24k: 24 kbps bitrate
        cmd = [
            "ffmpeg",
            "-i", input_path,
            "-ac", "1",  # Mono
            "-ar", str(self.sample_rate),  # 16 kHz
            "-af", (
                # start_* trims leading silence; stop_periods=-1 removes
                # silence throughout the rest of the file
                "silenceremove="
                "start_periods=1:"
                f"start_threshold={self.silence_threshold}:"
                f"start_duration={self.silence_duration}:"
                "stop_periods=-1:"
                f"stop_threshold={self.silence_threshold}:"
                f"stop_duration={self.silence_duration},"
                f"loudnorm=I={self.target_loudness}:LRA=11:TP=-1.5"
            ),
            "-c:a", "libopus",  # Speech-optimized codec
            "-b:a", "24k",  # 24 kbps
            "-y",  # Overwrite output
            output_path,
        ]
        try:
            subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=300,  # 5 minute timeout
                check=True,
            )
            elapsed = time.time() - start_time
            logger.debug("Audio preprocessing completed in %.1fs", elapsed)
            return True, elapsed
        except subprocess.CalledProcessError as exc:
            logger.error("FFmpeg preprocessing failed: %s", exc.stderr)
            return False, time.time() - start_time
        except subprocess.TimeoutExpired:
            logger.error("FFmpeg preprocessing timed out after 300s")
            return False, time.time() - start_time
        except FileNotFoundError:
            logger.error("FFmpeg not found. Install ffmpeg to use audio preprocessing.")
            return False, 0.0

    def get_cache_key(self, input_path: str) -> str:
        """Generate cache key from file content hash + preprocessing config.

        Args:
            input_path: Path to audio file

        Returns:
            Cache key (truncated SHA256 hash of content + config)
        """
        hasher = hashlib.sha256()
        # Hash file content (first 1MB for performance)
        try:
            with open(input_path, "rb") as f:
                hasher.update(f.read(1024 * 1024))
        except OSError as exc:
            logger.warning("Failed to hash audio file: %s", exc)
            # Use file path as fallback
            hasher.update(input_path.encode("utf-8"))
        # Hash preprocessing config to invalidate cache when settings change
        config_str = (
            f"{self.sample_rate}|"
            f"{self.silence_threshold}|"
            f"{self.silence_duration}|"
            f"{self.target_loudness}"
        )
        hasher.update(config_str.encode("utf-8"))
        return hasher.hexdigest()[:16]  # 16 hex chars (64 bits)
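The `get_cache_key` method above hashes only the first 1 MB, so two files that share an identical first megabyte but differ later would map to the same key. If that risk matters, one alternative is to stream the whole file through the hash in chunks. This is a minimal sketch: the function name `full_content_cache_key` and the chunk size are illustrative choices, not part of the RFC.

```python
import hashlib


def full_content_cache_key(
    input_path: str,
    config_str: str,
    chunk_size: int = 1024 * 1024,
) -> str:
    """Hash the entire file in fixed-size chunks, then mix in the settings string."""
    hasher = hashlib.sha256()
    with open(input_path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    hasher.update(config_str.encode("utf-8"))
    return hasher.hexdigest()[:16]  # same 16-hex-char key length as the RFC's version
```

A call such as `full_content_cache_key("episode.mp3", "16000|-50dB|2.0|-16")` returns a 16-character key. The trade-off is one full read of the file per lookup, which is usually small next to transcription time.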
4. Caching Strategy
Implement caching to avoid reprocessing:
# preprocessing/cache.py
import os
import shutil
from typing import Optional

PREPROCESSING_CACHE_DIR = ".cache/preprocessing"
# Opus audio cannot be stored in an MP3 container; Ogg is one container that can hold it.
CACHE_EXTENSION = ".ogg"


def get_cached_audio_path(
    cache_key: str,
    cache_dir: str = PREPROCESSING_CACHE_DIR,
) -> Optional[str]:
    """Check if preprocessed audio exists in cache.

    Args:
        cache_key: Content-based cache key
        cache_dir: Cache directory path

    Returns:
        Path to cached audio if exists, None otherwise
    """
    cache_path = os.path.join(cache_dir, f"{cache_key}{CACHE_EXTENSION}")
    if os.path.exists(cache_path):
        return cache_path
    return None


def save_to_cache(
    source_path: str,
    cache_key: str,
    cache_dir: str = PREPROCESSING_CACHE_DIR,
) -> str:
    """Save preprocessed audio to cache.

    Args:
        source_path: Path to preprocessed audio
        cache_key: Content-based cache key
        cache_dir: Cache directory path

    Returns:
        Path to cached audio
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{cache_key}{CACHE_EXTENSION}")
    # Copy to cache (or move if source is temp)
    shutil.copy2(source_path, cache_path)
    return cache_path
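`save_to_cache` copies straight to the final path, so a crash or a second process reading mid-copy could see a truncated file that looks like a valid cache hit. A common remedy is to copy to a temporary file in the same directory and rename it into place. The sketch below assumes a hypothetical `save_to_cache_atomic` helper; the extension is a parameter, and the `.ogg` default is an assumption based on Opus needing a container such as Ogg.

```python
import os
import shutil
import tempfile


def save_to_cache_atomic(
    source_path: str,
    cache_key: str,
    cache_dir: str,
    ext: str = ".ogg",
) -> str:
    """Copy into the cache via a temp file so readers never see a partial entry."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{cache_key}{ext}")
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(source_path, tmp_path)
        os.replace(tmp_path, cache_path)  # rename into place (atomic on POSIX)
    except OSError:
        # Clean up the temp file on failure, then let the caller handle the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return cache_path
```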
5. Integration with Transcription Pipeline
Modify episode_processor.py to preprocess audio before passing to any provider:
# episode_processor.py (modified)
def transcribe_media_to_text(
    job: models.TranscriptionJob,
    cfg: config.Config,
    whisper_model,
    run_suffix: Optional[str],
    effective_output_dir: str,
    transcription_provider=None,
    audio_preprocessor=None,  # NEW: Optional AudioPreprocessor
    pipeline_metrics=None,
) -> tuple[bool, Optional[str], int]:
    """Transcribe media file using transcription provider and save result.

    Audio preprocessing happens BEFORE provider selection - all providers
    receive optimized audio. This is critical for API providers with upload
    size limits (e.g., OpenAI 25 MB limit) and reduces costs for metered APIs.
    """
    # ... existing validation code ...
    temp_media = job.temp_media
    media_for_transcription = temp_media

    # NEW: Preprocess audio BEFORE passing to any provider
    # This happens at the pipeline level, not within providers
    if audio_preprocessor and cfg.preprocessing_enabled:
        cache_key = audio_preprocessor.get_cache_key(temp_media)
        # Check cache first
        cached_path = preprocessing.cache.get_cached_audio_path(cache_key)
        if cached_path:
            logger.debug("Using cached preprocessed audio: %s", cache_key)
            media_for_transcription = cached_path
        else:
            # Preprocess audio. The output is Opus, which needs a container
            # that can hold it (Ogg here); an .mp3 path would make ffmpeg fail.
            preprocessed_path = f"{temp_media}.preprocessed.ogg"
            success, preprocess_elapsed = audio_preprocessor.preprocess(
                temp_media, preprocessed_path
            )
            if success:
                # Save to cache
                cached_path = preprocessing.cache.save_to_cache(
                    preprocessed_path, cache_key
                )
                media_for_transcription = cached_path
                # Log size reduction (guard against an empty source file)
                original_size = os.path.getsize(temp_media)
                preprocessed_size = os.path.getsize(cached_path)
                reduction = (
                    (1 - preprocessed_size / original_size) * 100
                    if original_size
                    else 0.0
                )
                logger.info(
                    " preprocessed audio: %.1f%% smaller (%.1fMB -> %.1fMB) in %.1fs",
                    reduction,
                    original_size / (1024 * 1024),
                    preprocessed_size / (1024 * 1024),
                    preprocess_elapsed,
                )
                # Record preprocessing time
                if pipeline_metrics:
                    pipeline_metrics.record_preprocessing_time(preprocess_elapsed)
            else:
                # Preprocessing failed, use original audio
                logger.warning("Audio preprocessing failed, using original audio")
                media_for_transcription = temp_media

    # Pass preprocessed (or original) audio to provider
    # Provider is agnostic to whether audio was preprocessed
    try:
        result, tc_elapsed = transcription_provider.transcribe_with_segments(
            media_for_transcription,  # Preprocessed or original - provider doesn't know/care
            language=cfg.language,
        )
        # ... rest of transcription code ...
Key Architecture Points:
- Preprocessing happens at pipeline level — Not inside providers
- All providers receive optimized audio — Whisper, OpenAI, future providers all benefit
- Providers are agnostic — They don't know if audio was preprocessed
- Cache is shared — Same preprocessed audio reused across provider switches
- Separation of concerns — Preprocessing is orthogonal to transcription
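The integration code logs the size reduction but does not check the result against the provider limit. A guard along the following lines could run just before an API provider is invoked. This is a sketch: the function name, the binary-megabyte reading of 25 MB, and returning a boolean rather than raising are all assumptions.

```python
import os

# Assumption: the 25 MB limit is counted in binary megabytes.
OPENAI_UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024


def fits_upload_limit(path: str, limit_bytes: int = OPENAI_UPLOAD_LIMIT_BYTES) -> bool:
    """Return True if the file exists and is strictly smaller than the limit."""
    try:
        return os.path.getsize(path) < limit_bytes
    except OSError:
        # Missing or unreadable files cannot be uploaded either.
        return False


if __name__ == "__main__":
    candidate = "episode.preprocessed.ogg"  # hypothetical path
    if not fits_upload_limit(candidate):
        print(f"{candidate} is missing or too large for upload")
```

What the pipeline should do when the check fails (skip the episode, fall back to local Whisper, or something else) is left open here.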
6. Configuration Fields
Add preprocessing configuration to `Config`:
```python
# config.py (additions)
from typing import Optional

from pydantic import BaseModel, Field


class Config(BaseModel):
    # ... existing fields ...

    # Audio Preprocessing
    preprocessing_enabled: bool = Field(
        default=False,
        description="Enable audio preprocessing before transcription (default: False)",
    )
    preprocessing_sample_rate: int = Field(
        default=16000,
        description="Target sample rate for preprocessing (default: 16000 Hz)",
    )
    preprocessing_silence_threshold: str = Field(
        default="-50dB",
        description="Silence detection threshold (default: -50dB)",
    )
    preprocessing_silence_duration: float = Field(
        default=2.0,
        description="Minimum silence duration to remove in seconds (default: 2.0)",
    )
    preprocessing_target_loudness: int = Field(
        default=-16,
        description="Target loudness in LUFS for normalization (default: -16)",
    )
    preprocessing_cache_dir: Optional[str] = Field(
        default=None,
        description="Custom cache directory for preprocessed audio (default: .cache/preprocessing)",
    )
```
7. CLI Arguments
Add CLI flags for preprocessing:
# cli.py (additions)
import argparse


def _add_preprocessing_arguments(parser: argparse.ArgumentParser) -> None:
    """Add audio preprocessing arguments to parser."""
    preprocessing_group = parser.add_argument_group("Audio Preprocessing")
    preprocessing_group.add_argument(
        "--enable-preprocessing",
        action="store_true",
        dest="preprocessing_enabled",
        help="Enable audio preprocessing before transcription (experimental)",
    )
    preprocessing_group.add_argument(
        "--preprocessing-silence-threshold",
        type=str,
        default="-50dB",
        help="Silence detection threshold (default: -50dB)",
    )
    # ... other preprocessing flags ...
Key Decisions
- Pipeline-Level Preprocessing, Not Provider-Level
  - Decision: Preprocessing happens in episode_processor.py before provider invocation, not within providers
  - Rationale: API providers have upload size limits (OpenAI 25 MB). Preprocessing must happen before upload, not after. All providers benefit equally from optimized audio. Separation of concerns: preprocessing is orthogonal to transcription method.
- FFmpeg as Implementation
  - Decision: Use ffmpeg for audio preprocessing
  - Rationale: Industry-standard tool, widely available, supports all needed operations (format conversion, VAD, loudness normalization). Alternative libraries (pydub, librosa) would add Python dependencies and may be slower.
- Optional and Disabled by Default
  - Decision: Preprocessing is opt-in via preprocessing_enabled=True
  - Rationale: Maintains backward compatibility. Users can test preprocessing on their workloads before enabling permanently. Allows gradual rollout.
- Content-Based Caching
  - Decision: Cache preprocessed audio using content hash + config hash
  - Rationale: Avoids reprocessing the same audio multiple times. Common in development/testing when reprocessing episodes. Cache invalidates automatically when preprocessing settings change.
- Conservative VAD Thresholds
  - Decision: Use conservative silence detection thresholds by default
  - Rationale: Preserves meaningful pauses and prevents removing legitimate speech. Users can tune thresholds if needed.
- Opus Codec for Preprocessed Audio
  - Decision: Use Opus codec at 24 kbps for preprocessed audio
  - Rationale: Opus is optimized for speech, provides excellent quality at low bitrates, and is widely supported. 24 kbps provides good speech quality while reducing file size significantly.
Alternatives Considered
- No Preprocessing
  - Description: Continue processing raw audio directly
  - Pros: Simpler pipeline, no additional dependencies
  - Cons: Higher costs, longer processing times, inconsistent quality
  - Why Rejected: Benefits of preprocessing outweigh the added complexity
- Text-Only Cleanup After Transcription
  - Description: Remove silence markers ("[MUSIC]", "[SILENCE]") from transcripts post-transcription
  - Pros: Easier to implement, no audio processing needed
  - Cons: Does not reduce transcription time or costs, does not improve transcription quality
  - Why Rejected: Does not address the root cause (processing unnecessary audio)
- Python-Based VAD (webrtcvad, silero-vad)
  - Description: Use Python VAD libraries instead of ffmpeg
  - Pros: No external binary dependency, potentially more control
  - Cons: Adds Python dependencies, may be slower, requires manual audio format handling
  - Why Rejected: FFmpeg provides a complete solution (VAD + format conversion + normalization) in one tool
- Mandatory Preprocessing
  - Description: Make preprocessing mandatory for all transcription
  - Pros: Ensures consistent quality across all workflows
  - Cons: Breaking change, may not be desirable for all use cases
  - Why Rejected: Too disruptive, opt-in approach allows users to test and validate benefits first
Testing Strategy
Test Coverage:
- Unit Tests: FFmpeg command generation, cache key generation, error handling
- Integration Tests: Full preprocessing pipeline with real audio files, cache hit/miss scenarios
- E2E Tests: Preprocessing + transcription workflow with both Whisper and OpenAI providers
Test Organization:
- tests/unit/test_preprocessing.py: Unit tests for preprocessor logic
- tests/integration/test_preprocessing_integration.py: Integration tests with real audio
- tests/e2e/test_preprocessing_e2e.py: E2E tests with full pipeline
- Use tests/fixtures/audio/ for test audio files (small samples)
Test Execution:
- Unit tests: Run in CI (fast, no external dependencies except ffmpeg availability check)
- Integration tests: Run in CI with @pytest.mark.integration and @pytest.mark.skipif(not has_ffmpeg(), reason="ffmpeg not installed")
- E2E tests: Run in nightly CI or on-demand
Test Data:
- Create small test audio files (~10s) with known characteristics:
  - silence_test.mp3: Audio with long silence periods
  - music_test.mp3: Audio with music intro/outro
  - quiet_test.mp3: Low-volume audio
  - loud_test.mp3: High-volume audio
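Fixtures with known characteristics can be generated rather than recorded. The sketch below writes a 16 kHz mono WAV containing a tone, a stretch of silence, and the tone again, using only the standard library. Using WAV instead of MP3 and the helper name are assumptions made to keep the example dependency-free, since an MP3 fixture would need an encoder such as ffmpeg.

```python
import math
import struct
import wave


def write_tone_silence_tone(
    path: str,
    tone_s: float = 1.0,
    silence_s: float = 3.0,
    sample_rate: int = 16000,
    freq_hz: float = 440.0,
) -> float:
    """Write tone + silence + tone as 16-bit mono PCM; return the duration in seconds."""
    tone_n = int(tone_s * sample_rate)
    silence_n = int(silence_s * sample_rate)
    # Half-scale sine wave so the tone sits well above any silence threshold.
    tone = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(tone_n)
    ]
    samples = tone + [0] * silence_n + tone
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return len(samples) / sample_rate
```

A test can then assert that the preprocessed output is noticeably shorter than the generated input, because the silent stretch exceeds the default 2.0 s silence duration.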
Performance Validation:
- Measure preprocessing overhead vs. transcription time savings
- Target: Preprocessing adds <30% overhead but saves >30% transcription time (net positive)
- Track metrics: preprocessing time, audio size reduction, transcription time, total pipeline time
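The net-positive target can be checked with a one-line model. Assuming, for illustration, that both percentages are expressed relative to the original transcription time (the RFC does not state the baseline), the relative change in total time is simply overhead minus savings. The function name is hypothetical.

```python
def net_time_change(overhead: float, savings: float) -> float:
    """Relative change in total time; negative means the pipeline got faster.

    Both arguments are fractions of the original transcription time
    (an assumed baseline, not one stated in the RFC).
    """
    return overhead - savings


if __name__ == "__main__":
    for overhead, savings in [(0.10, 0.60), (0.30, 0.30), (0.30, 0.60)]:
        change = net_time_change(overhead, savings)
        print(f"overhead {overhead:.0%}, savings {savings:.0%}: net {change:+.0%}")
```

Under this model, 30% overhead against 30% savings is exactly the break-even point, which is why the target asks for overhead below and savings above that figure.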
Rollout & Monitoring
Rollout Plan:
- Phase 1: Experimental (v2.5.0) — Release as opt-in feature, document as experimental, gather user feedback
- Phase 2: Refinement (v2.6.0) — Refine thresholds based on feedback, improve caching, add more configuration options
- Phase 3: Stable (v2.7.0) — Mark as stable, consider enabling by default for new users
Monitoring:
- Track preprocessing metrics:
  - preprocessing_time_seconds: Time spent preprocessing
  - preprocessing_audio_size_reduction: Size reduction ratio
  - preprocessing_cache_hit_rate: Cache effectiveness
- Track transcription metrics impact:
  - transcription_time_seconds (before/after preprocessing)
  - transcription_cost_usd (for OpenAI provider)
  - transcript_word_count (may decrease with silence removal)
- Log warnings for:
  - FFmpeg not available
  - Preprocessing timeouts
  - Preprocessing failures (fallback to original audio)
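The preprocessing metrics listed above are simple ratios. The following sketch shows how they might be accumulated. The `PreprocessingStats` class and its method names are hypothetical and are not the existing `pipeline_metrics` API.

```python
from dataclasses import dataclass


@dataclass
class PreprocessingStats:
    """Hypothetical accumulator for the preprocessing metrics."""

    cache_hits: int = 0
    cache_misses: int = 0
    original_bytes: int = 0
    preprocessed_bytes: int = 0
    seconds: float = 0.0

    def record_hit(self) -> None:
        """A cached file was reused; no preprocessing ran."""
        self.cache_hits += 1

    def record_run(self, original_bytes: int, preprocessed_bytes: int, elapsed_s: float) -> None:
        """Preprocessing ran (a cache miss) and produced a new file."""
        self.cache_misses += 1
        self.original_bytes += original_bytes
        self.preprocessed_bytes += preprocessed_bytes
        self.seconds += elapsed_s

    @property
    def cache_hit_rate(self) -> float:
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0

    @property
    def size_reduction(self) -> float:
        """Fraction of bytes removed across all runs (0.9 means 90% smaller)."""
        if not self.original_bytes:
            return 0.0
        return 1 - self.preprocessed_bytes / self.original_bytes
```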
Success Criteria:
- ✅ Preprocessing reduces transcription time by ≥30% on average
- ✅ Preprocessing reduces OpenAI API costs by ≥30% on average
- ✅ Preprocessing maintains or improves transcription quality (manual review)
- ✅ Cache hit rate ≥80% in development/testing scenarios
- ✅ Zero breaking changes to existing workflows
Expected Impact
| Area | Expected Improvement | Measurement |
|---|---|---|
| Audio size | 10–25× smaller | File size comparison (MB) |
| API compatibility | 100% of podcasts fit <25 MB | Upload success rate |
| Preprocessing overhead | +10–30% time | Time measurement (seconds) |
| Transcription runtime | 30–60% faster | Time measurement (seconds) |
| OpenAI API cost | 30–60% reduction | API usage tracking ($) |
| Transcript size | 15–40% fewer tokens | Word count comparison |
| Transcription quality | Improved consistency | Manual QA review |
| Total pipeline time | 20–40% faster | End-to-end time measurement |
Net Impact: Despite preprocessing overhead, total pipeline time should decrease due to larger transcription time savings.
File Size Target: Preprocessed audio should average 2-5 MB for typical 30-60 minute podcast episodes, well below the 25 MB limit with safety margin.
Benefits
- Cost Reduction: Significant reduction in OpenAI Whisper API costs by processing less audio
- Performance Improvement: Faster transcription for both local Whisper and OpenAI API
- Quality Standardization: Consistent audio levels and speech-only content improve transcription accuracy
- Flexibility: Optional and configurable, users can tune for their specific needs
- Caching: Preprocessed audio is cached, avoiding reprocessing in development/testing workflows
- Backward Compatibility: Existing workflows continue to work unchanged
Migration Path
No migration needed — preprocessing is opt-in:
- Existing Users: No changes required, workflows continue as-is
- New Users: Can optionally enable preprocessing in config or via CLI flag
- Gradual Adoption: Users can test preprocessing on a subset of episodes before enabling globally
Enabling Preprocessing:
# config.yaml
preprocessing_enabled: true # Enable preprocessing
Or via CLI:
python3 -m podcast_scraper.cli https://feed.xml --enable-preprocessing
Open Questions
- FFmpeg Availability Check: Should we fail fast if ffmpeg is not available, or silently disable preprocessing?
  - Proposed: Warn and disable preprocessing if ffmpeg is not found, and log a clear error message
- Codec Selection: Should we support multiple output codecs (Opus, AAC, MP3) or standardize on Opus?
  - Proposed: Start with Opus only, add codec selection in future if users request it
- VAD Threshold Tuning: Should we provide presets (conservative, balanced, aggressive) or expect users to tune manually?
  - Proposed: Start with single conservative default, add presets in Phase 2 based on feedback
- Preprocessing for OpenAI Only: Should preprocessing be automatic for OpenAI provider (to save costs) but optional for Whisper?
  - Proposed: Keep consistent (manual opt-in for all providers), document cost savings for OpenAI users
- Cache Cleanup: Should we provide a tool to clean preprocessing cache (similar to ML model cache cleanup)?
  - Proposed: Yes, add a cache --clean preprocessing subcommand and include it in clean-all
References
- Related RFC: docs/rfc/RFC-005-whisper-integration.md (transcription pipeline)
- Related RFC: docs/rfc/RFC-029-provider-refactoring-consolidation.md (unified providers)
- Source Code: podcast_scraper/episode_processor.py (transcription orchestration)
- External Tools: FFmpeg documentation (https://ffmpeg.org/)
- VAD Reference: FFmpeg silenceremove filter (https://ffmpeg.org/ffmpeg-filters.html#silenceremove)