# Performance Guide
This guide covers performance considerations, optimization opportunities, and performance-related troubleshooting for podcast_scraper.
## Audio Preprocessing Performance (Issue #392)

### Cache Miss Cost
Observation: Audio preprocessing (especially Voice Activity Detection) is computationally expensive. The first time an episode is processed, preprocessing can take 30-60 seconds for a typical 30-60 minute podcast episode. However, subsequent runs with the same audio file benefit from caching.
Performance Impact:
- Cache Hit: Near-instant (milliseconds) - preprocessed audio is reused
- Cache Miss: 30-60 seconds for typical episodes (depends on audio length and preprocessing settings)
- Net Benefit: Despite preprocessing overhead, total pipeline time typically decreases by 20-40% due to:
  - Faster transcription (smaller audio files)
  - Reduced API costs (30-60% reduction for API providers)
  - Better API compatibility (files fit within size limits)
Cache Strategy:
- Cache key: SHA256 hash of the first 1MB of audio + preprocessing configuration (sketched below)
- Cache location: `.cache/preprocessing/`
- Cache invalidation: Automatic when preprocessing settings change
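The key computation is internal to podcast_scraper, but the scheme above is easy to picture. A minimal sketch (the function name and settings dict are illustrative, not the project's actual API):

```python
import hashlib
import json

def preprocessing_cache_key(audio_path: str, settings: dict) -> str:
    """Sketch of the scheme above: SHA256 over the first 1MB of
    audio plus a deterministic dump of the preprocessing config."""
    h = hashlib.sha256()
    with open(audio_path, "rb") as f:
        h.update(f.read(1024 * 1024))  # first 1MB of audio
    # sort_keys makes the config serialization deterministic, so the
    # same settings always yield the same key (and any change
    # invalidates the cache entry).
    h.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return h.hexdigest()
```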
Optimization Opportunities:

- Pre-warm cache for frequently processed episodes:

  ```bash
  # Process episodes once to populate cache
  python3 -m podcast_scraper.cli feed.xml --enable-preprocessing
  ```

- Adjust preprocessing settings to balance quality vs. performance:
  - Lower `preprocessing_silence_threshold` = more aggressive silence removal = faster transcription
  - Higher `preprocessing_silence_duration` = fewer processing steps = faster preprocessing
- Monitor cache hit rate in metrics:
  - Check `preprocessing_cache_hit_rate` in pipeline metrics
  - Target: ≥80% cache hit rate in development/testing scenarios
When Cache Misses Are Expensive:
- First-time processing of new episodes
- Changing preprocessing configuration (invalidates cache)
- Processing episodes from different sources (different audio = different hash)
Mitigation Strategies:
- Use consistent preprocessing settings across runs
- Pre-process episodes in batch to populate cache (see the loop below)
- Consider preprocessing as a separate step for large batches
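For example, a plain shell loop can populate the cache ahead of a large run (the `feeds/` layout here is hypothetical; `--enable-preprocessing` is the flag shown earlier):

```bash
# One-off batch pass: run every feed through the pipeline once so
# later runs get preprocessing cache hits.
for feed in feeds/*.xml; do
    python3 -m podcast_scraper.cli "$feed" --enable-preprocessing
done
```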
## Transcription Performance

### Provider Selection
Local Whisper:
- Pros: No API costs, no rate limits, complete privacy
- Cons: Slower (CPU-bound on most machines; a GPU is needed for reasonable speed)
- Best for: Small batches, privacy-sensitive content, offline processing
API Providers (OpenAI, Gemini):
- Pros: Fast (parallel processing), high quality, no local compute
- Cons: API costs, rate limits, file size limits (25MB for OpenAI)
- Best for: Large batches, production workloads, when speed matters
### Model Selection
Whisper Models:
- `tiny`: Fastest, lowest quality
- `base`: Fast, acceptable quality
- `small`: Balanced (recommended default)
- `medium`: Slower, higher quality
- `large`: Slowest, highest quality
Recommendation: Use `small` for most use cases. Upgrade to `medium` or `large` only if quality is critical.
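As a point of reference, this is what the size/quality trade-off looks like with the openai-whisper package used directly (a standalone sketch, not podcast_scraper's internal code):

```python
import whisper  # pip install openai-whisper

# Each checkpoint trades speed for accuracy: "tiny" and "base" are
# fast but rough, "small" is a solid default, "medium"/"large" are
# slower but more accurate.
model = whisper.load_model("small")
result = model.transcribe("episode.mp3")
print(result["text"])
```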
## Summarization Performance

### Summarization Model Selection
Local Models (BART, LED):
- Pros: No API costs, complete privacy
- Cons: Slower, lower quality, requires GPU for reasonable speed
- Best for: Small batches, privacy-sensitive content
API Providers (OpenAI, Anthropic, Gemini):
- Pros: Fast, high quality, parallel processing
- Cons: API costs, rate limits, context window limits
- Best for: Large batches, production workloads
### Context Window Management
- Long transcripts: May exceed model context windows
- Solution: Transcripts are automatically chunked when needed (sketched below)
- Performance impact: Chunking adds overhead but enables processing of any length transcript
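The chunking itself is handled internally; conceptually it looks like the sketch below (the chunk size and overlap values are illustrative, not the project's actual defaults):

```python
def chunk_transcript(text: str, max_chars: int = 12_000, overlap: int = 500) -> list[str]:
    """Split a long transcript into overlapping windows so each one
    fits a model's context limit; the overlap keeps sentences at
    chunk boundaries from losing their surrounding context."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + max_chars])
        start += max_chars - overlap
    return chunks
```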
## Memory Usage

### ML Model Loading
Memory per model:
- Whisper `small`: ~500MB
- Whisper `medium`: ~1.5GB
- Whisper `large`: ~3GB
- BART/LED summarization: ~1-2GB
Recommendations:
- Preload models once: `make preload-ml-models`
- Use smaller models if memory-constrained
- Process episodes sequentially if memory is limited
### Parallel Processing
Default behavior:
- Downloads: Parallel (limited by `max_workers`)
- Transcription: Sequential (provider-dependent)
- Summarization: Sequential (provider-dependent)
Memory-aware worker calculation (sketched below):
- Automatically calculates optimal workers based on available RAM
- Reserves 4GB for system operations
- Caps at 8 workers for CPU efficiency
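A sketch of that calculation (the 4GB reserve and 8-worker cap come from this guide; the per-worker memory estimate and the use of psutil are assumptions for illustration):

```python
import psutil  # third-party: pip install psutil

def optimal_workers(per_worker_gb: float = 2.0) -> int:
    """Derive a worker count from available RAM: reserve 4GB for the
    system, divide the rest by an estimated per-worker footprint,
    and cap at 8 workers (never dropping below 1)."""
    available_gb = psutil.virtual_memory().available / 1024**3
    usable_gb = max(available_gb - 4.0, 0.0)  # keep 4GB for the OS
    return max(1, min(8, int(usable_gb // per_worker_gb)))
```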
## Network Performance

### RSS Feed Fetching
- Caching: RSS feeds are cached to avoid redundant fetches
- Timeout: Default 60s timeout for feed fetching
- Retry: Automatic retry with exponential backoff (illustrated below)
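The retry behavior is the standard exponential-backoff pattern; a generic sketch (attempt counts and delays here are illustrative, not the project's exact values):

```python
import time
import requests

def fetch_feed(url: str, retries: int = 3, timeout: int = 60) -> bytes:
    """Fetch an RSS feed with the 60s default timeout, retrying on
    transient network errors with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```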
### Media Download
- Parallel downloads: Multiple episodes downloaded concurrently
- Resume support: Partial downloads can be resumed
- Timeout: Configurable per-episode download timeout
## Disk I/O Performance

### Cache Directories
Cache locations:
- `.cache/whisper/` - Whisper models
- `.cache/preprocessing/` - Preprocessed audio
- `.cache/transcripts/` - Cached transcripts
- `.cache/huggingface/` - Transformers models
Performance tips:
- Use fast storage (SSD) for cache directories
- Monitor cache directory size (it can grow large; see the command below)
- Periodic cleanup recommended for long-running systems
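A quick way to check sizes before cleaning up:

```bash
# Per-directory cache sizes, largest first
du -sh .cache/* | sort -rh
```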
### Output Directory
- Write performance: Multiple episodes writing concurrently
- Recommendation: Use fast storage (SSD) for output directory
- Network storage: Supported but may be slower
## Profiling and Monitoring

### Metrics Collection
Pipeline metrics track:
- Processing times per stage
- Cache hit/miss rates
- API call counts and costs
- Token usage
Access metrics:

```bash
# JSON output (end of run)
cat output/run_*.json

# JSONL streaming (during run)
tail -f output/run_*.jsonl
```
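If you have jq installed, individual metrics are easy to pull out. The flat field name below is an assumption based on the metric mentioned earlier; check your actual run file for the exact JSON layout:

```bash
# Inspect the cache hit rate from the run metrics
jq '.preprocessing_cache_hit_rate' output/run_*.json
```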
### Debug Logging
Enable debug logging to see detailed performance information:
```bash
export LOG_LEVEL=DEBUG
python3 -m podcast_scraper.cli feed.xml
```
What to look for:
- Preprocessing times
- Cache hit/miss messages
- API call durations
- Memory usage warnings
## Optimization Checklist
- [ ] Use preprocessing caching (enable preprocessing)
- [ ] Preload ML models (`make preload-ml-models`)
- [ ] Use appropriate model sizes for your hardware
- [ ] Monitor cache hit rates
- [ ] Use fast storage (SSD) for cache and output
- [ ] Adjust worker counts based on available resources
- [ ] Use API providers for large batches (if cost is acceptable)
- [ ] Monitor memory usage and adjust batch sizes
## Related Documentation
- Troubleshooting Guide - Common issues and solutions
- Development Guide - Environment setup
- RFC-040: Audio Preprocessing Pipeline - Preprocessing details
- RFC-028: ML Model Preloading - Model caching