Podcast Scraper
Download and process podcast transcripts with ease.
Podcast Scraper is a Python tool that downloads transcripts for every episode in a podcast RSS feed. It understands Podcasting 2.0 transcript tags, resolves relative URLs, and can fall back to Whisper transcription when episodes lack published transcripts.
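The transcript discovery described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function name `find_transcripts` and the sample feed are hypothetical, and it assumes the standard Podcasting 2.0 namespace with `<podcast:transcript>` tags carrying `url` and `type` attributes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Namespace used by Podcasting 2.0 tags such as <podcast:transcript>.
PODCAST_NS = "https://podcastindex.org/namespace/1.0"


def find_transcripts(rss_xml: str, feed_url: str) -> list[dict]:
    """Return one entry per <podcast:transcript> tag, with absolute URLs.

    Hypothetical helper for illustration; not part of podcast_scraper's API.
    """
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        for tag in item.findall(f"{{{PODCAST_NS}}}transcript"):
            url = tag.get("url")
            if not url:
                continue
            results.append({
                "episode": title,
                # urljoin keeps absolute URLs as-is and resolves relative ones
                # against the feed URL.
                "url": urljoin(feed_url, url),
                "type": tag.get("type", ""),
            })
    return results


# Made-up feed: one relative URL, one absolute URL, one episode without a transcript.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <podcast:transcript url="/transcripts/ep1.vtt" type="text/vtt" />
    </item>
    <item>
      <title>Episode 2</title>
      <podcast:transcript url="https://cdn.example.com/ep2.srt" type="application/srt" />
    </item>
    <item>
      <title>Episode 3</title>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for entry in find_transcripts(SAMPLE_FEED, "https://example.com/feed.xml"):
        print(entry["episode"], entry["type"], entry["url"])
```

Episodes that yield no entry here (Episode 3 in the sample) are the ones that would need the Whisper fallback.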
⚠️ Important: This project is intended for personal, non-commercial use only. All downloaded content must remain local and not be shared or redistributed. See Legal Notice & Appropriate Use for details.
✨ Key Features
- Transcript Downloads — Automatic detection and download of podcast transcripts from RSS feeds.
- Whisper Fallback — Generate transcripts using OpenAI Whisper when none exist.
- Speaker Detection — Automatic speaker name detection using Named Entity Recognition (NER).
- Screenplay Formatting — Format Whisper transcripts as dialogue with speaker labels.
- Episode Summarization — Generate concise summaries using local transformer models (BART + LED).
- Metadata Generation — Create database-friendly JSON/YAML metadata documents per episode.
- Multi-threaded Downloads — Concurrent processing with configurable worker pools.
- Resumable Operations — Skip existing files, reuse media, and handle interruptions gracefully.
- Configuration Files — JSON/YAML config support for repeatable workflows.
- Service Mode — Non-interactive daemon mode for automation and process managers.
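To make the multi-threaded and resumable behaviour in the list above concrete, here is a small sketch of a worker pool that skips files already on disk. It assumes nothing about the project's internals: `download_all`, the `fetch` callable, and the `.part` temporary suffix are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def download_all(
    jobs: dict[str, str],
    output_dir: Path,
    fetch: Callable[[str], bytes],
    workers: int = 4,
    skip_existing: bool = True,
) -> dict[str, str]:
    """Fetch each URL into output_dir/<filename> and return a status per filename.

    Hypothetical helper for illustration; `jobs` maps filename -> URL.
    """
    output_dir.mkdir(parents=True, exist_ok=True)

    def run_one(name: str, url: str) -> str:
        target = output_dir / name
        if skip_existing and target.exists():
            return "skipped"
        # Write to a temporary file first so an interrupted run never leaves
        # a partial file that a later run would mistake for a finished one.
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(fetch(url))
        partial.replace(target)
        return "downloaded"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(run_one, name, url) for name, url in jobs.items()}
        return {name: future.result() for name, future in futures.items()}
```

Passing the fetch function in as a parameter keeps the sketch independent of any HTTP library; in practice it could wrap something like `urllib.request.urlopen(url).read()`.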
🚀 Quick Start
Installation
Option 1: pip (from source)
# Clone the repository
git clone https://github.com/chipi/podcast_scraper.git
cd podcast_scraper
# Install the package in editable mode
pip install -e .
# For Whisper transcription, ensure ffmpeg is installed
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# (Optional) To regenerate architecture diagrams, install Graphviz, run make visualize, then commit docs/architecture/diagrams/*.svg
# Note: make ci and make ci-fast fail if the diagrams are stale (regenerate them with make visualize)
# macOS: brew install graphviz
# Ubuntu: sudo apt install graphviz
Option 2: pipx (isolated environment)
# Install from PyPI (when published) or from a local path
pipx install podcast_scraper
# Or from local clone:
pipx install /path/to/podcast_scraper
Option 3: uv
# With uv, install from source or PyPI
uv pip install -e .
# Or: uv tool install podcast_scraper (when published as a tool)
Requirements: Python 3.10+ and, for audio processing such as Whisper transcription, ffmpeg. Run podcast-scraper doctor to verify your environment.
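If you want to check the two stated requirements by hand, a few lines of standard-library Python are enough. This is only a sketch of such a check; it is not what `podcast-scraper doctor` does internally, and `check_environment` is a hypothetical name.

```python
import shutil
import sys


def check_environment(min_python: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether the documented requirements appear to be met.

    Hypothetical helper for illustration only.
    """
    return {
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```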
Basic Usage
# Download transcripts from a podcast
python3 -m podcast_scraper.cli https://example.com/feed.xml \
--max-episodes 10 \
--output-dir ./my_transcripts
# Use Whisper when transcripts are missing (the default as of v2.4.0)
python3 -m podcast_scraper.cli https://example.com/feed.xml \
--whisper-model base.en
# Generate metadata and summaries
python3 -m podcast_scraper.cli https://example.com/feed.xml \
--generate-metadata \
--generate-summaries
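The same invocations can be driven from a Python script, for example in a scheduled job. The sketch below only assembles the command line from the flags shown in the examples above and hands it to `subprocess`. The helpers `build_command` and `run_scraper` are hypothetical and are not the project's Python API, which is documented separately.

```python
import subprocess
import sys
from typing import Optional


def build_command(
    feed_url: str,
    max_episodes: Optional[int] = None,
    output_dir: Optional[str] = None,
    whisper_model: Optional[str] = None,
    generate_metadata: bool = False,
    generate_summaries: bool = False,
) -> list[str]:
    """Assemble the argument list for `python -m podcast_scraper.cli`.

    Hypothetical helper; it uses only the flags shown in the usage examples.
    """
    cmd = [sys.executable, "-m", "podcast_scraper.cli", feed_url]
    if max_episodes is not None:
        cmd += ["--max-episodes", str(max_episodes)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    if generate_metadata:
        cmd.append("--generate-metadata")
    if generate_summaries:
        cmd.append("--generate-summaries")
    return cmd


def run_scraper(cmd: list[str]) -> int:
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    command = build_command(
        "https://example.com/feed.xml",
        max_episodes=10,
        output_dir="./my_transcripts",
    )
    # Print instead of running, so the sketch works without the package installed.
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with feed URLs that contain characters such as `&` or `?`.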
Configuration File
# config.yaml
rss: https://example.com/feed.xml
output_dir: ./transcripts
max_episodes: 50
transcribe_missing: true
whisper_model: base.en
screenplay: true
num_speakers: 2
auto_speakers: true
generate_metadata: true
generate_summaries: true
workers: 4
skip_existing: true
Example configs:
- config.example.json — JSON format
- config.example.yaml — YAML format
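Since both JSON and YAML configs are supported, the YAML example above can also be produced as JSON from a script. The sketch below assumes the JSON format accepts the same key names as the YAML example; check config.example.json to confirm. `write_json_config` is a hypothetical helper.

```python
import json
from pathlib import Path

# Same settings as the YAML example; the key names are assumed to carry over to JSON.
SETTINGS = {
    "rss": "https://example.com/feed.xml",
    "output_dir": "./transcripts",
    "max_episodes": 50,
    "transcribe_missing": True,
    "whisper_model": "base.en",
    "screenplay": True,
    "num_speakers": 2,
    "auto_speakers": True,
    "generate_metadata": True,
    "generate_summaries": True,
    "workers": 4,
    "skip_existing": True,
}


def write_json_config(path: Path, settings: dict) -> None:
    """Serialize settings as indented JSON (hypothetical helper)."""
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    write_json_config(Path("config.json"), SETTINGS)
    print(Path("config.json").read_text(encoding="utf-8"))
```

Generating the file from a dictionary makes it easy to keep one set of defaults and vary only the feed URL or output directory per podcast.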
📚 Documentation
Getting Started
| Resource | Description |
|---|---|
| Quick Start | Installation, basic usage, and first commands |
| Configuration Guide | Complete configuration options and examples |
| CLI Reference | Command-line interface documentation |
| Python API | Public API for programmatic usage |
| Legal Notice | ⚠️ Important usage restrictions and fair use |
User Guides
| Guide | Description |
|---|---|
| Service Mode | Running as daemon or service (systemd, supervisor, cron) |
| ML Provider Reference | ML implementation details, models, and tuning |
| Troubleshooting | Common issues and solutions |
| Glossary | Key terms and concepts |
For Developers
| Resource | Description |
|---|---|
| Quick Reference | ⭐ One-page cheat sheet for common commands |
| Architecture Overview | High-level system design and module responsibilities |
| Testing Strategy | Test coverage, quality assurance, and testing guidelines |
| Testing Guide | Detailed test execution, fixtures, and coverage information |
| Experiment Guide | Complete guide: datasets, baselines, experiments, and evaluation |
| CI/CD Overview | CI/CD pipeline documentation |
| Engineering Process | The "Triad of Truth": PRDs, RFCs, and ADRs |
| Development Guide | Development environment setup and tooling |
| Dependencies Guide | Third-party dependencies, rationale, and management |
| API Boundaries | API design principles and stability guarantees |
| API Migration Guide | Upgrading between versions |
| API Versioning | Versioning strategy and compatibility |
Provider System (v2.4.0+)
| Guide | Description |
|---|---|
| AI Provider Comparison Guide | Detailed comparison of all 8 supported AI providers |
| ML Model Comparison Guide | Compare ML models: Whisper, spaCy, Transformers (BART/LED) |
| Provider Configuration Quick Reference | Quick guide for configuring providers via CLI, config files, and programmatically |
| Provider Implementation Guide | Complete guide for implementing new providers |
| Protocol Extension Guide | Extending protocols and adding new methods to providers |
⚖️ Legal & Fair Use
This project is intended for personal, non-commercial use only. All downloaded content must remain local to your device and must not be shared, uploaded, or redistributed.
You are responsible for ensuring compliance with:
- Copyright law
- RSS feed terms of service
- Podcast platform policies
This software is provided for educational and personal-use purposes only. It is not intended to power a public dataset, index, or any commercial service without explicit permission from rights holders.
📄 License
MIT License — See LICENSE for details.
Important: The MIT license applies only to the source code. It does not grant any rights to redistribute third-party podcast content.
🔗 Quick Links
- Repository: github.com/chipi/podcast_scraper
- Issues: Report bugs or request features
- Documentation: chipi.github.io/podcast_scraper