Podcast Scraper

Download and process podcast transcripts with ease.

Podcast Scraper is a Python tool that downloads transcripts for every episode in a podcast RSS feed. It understands Podcasting 2.0 transcript tags, resolves relative URLs, and can fall back to Whisper transcription when episodes lack published transcripts.
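To make the transcript detection step concrete, here is a minimal sketch of how a `podcast:transcript` tag could be located in a feed and its relative URL resolved against the feed URL. It uses only the Python standard library. The function name `find_transcripts` and the sample feed are illustrative assumptions, not the project's actual implementation or API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Namespace URI used by Podcasting 2.0 tags such as <podcast:transcript>.
PODCAST_NS = "https://podcastindex.org/namespace/1.0"


def find_transcripts(rss_xml: str, feed_url: str) -> list[dict]:
    """Return one dict per <podcast:transcript> tag found in the feed.

    Hypothetical helper for illustration; not part of podcast_scraper.
    """
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        for tag in item.findall(f"{{{PODCAST_NS}}}transcript"):
            url = tag.get("url")
            if not url:
                continue  # a transcript tag without a URL is unusable
            results.append({
                "episode": title,
                # urljoin leaves absolute URLs alone and resolves relative ones.
                "url": urljoin(feed_url, url),
                "type": tag.get("type", ""),
            })
    return results


# Made-up feed: one episode with a relative transcript URL, one without a transcript.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <item>
      <title>Episode 1</title>
      <podcast:transcript url="/transcripts/ep1.vtt" type="text/vtt"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for entry in find_transcripts(SAMPLE_FEED, "https://example.com/feed.xml"):
        print(entry["episode"], entry["url"], entry["type"])
```

In this sketch, episodes without a transcript tag simply produce no entry. Those are the episodes for which a fallback such as Whisper transcription would be needed.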

Important: This project is intended for personal, non-commercial use only. All downloaded content must remain local and not be shared or redistributed. See Legal Notice & Appropriate Use for details.

Key Features

Transcript Downloads — Automatic detection and download of podcast transcripts from RSS feeds.
Whisper Fallback — Generate transcripts using OpenAI Whisper when none exist.
Speaker Detection — Automatic speaker name detection using Named Entity Recognition (NER).
Screenplay Formatting — Format Whisper transcripts as dialogue with speaker labels.
Episode Summarization — Generate concise summaries using local transformer models (BART + LED).
Metadata Generation — Create database-friendly JSON/YAML metadata documents per episode.
Multi-threaded Downloads — Concurrent processing with configurable worker pools.
Resumable Operations — Skip existing files, reuse media, and handle interruptions gracefully.
Configuration Files — JSON/YAML config support for repeatable workflows.
Service Mode — Non-interactive daemon mode for automation and process managers.
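As a rough illustration of how concurrent downloads and resumable operation can fit together, the sketch below runs jobs on a thread pool and skips files that already exist. It is a generic pattern written for this page, not the project's own code: `download_all`, the `jobs` mapping, and the injectable `fetch` callable are all hypothetical names. The `workers` and `skip_existing` parameters are named after the config keys shown later on this page.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def download_all(
    jobs: dict[str, str],
    output_dir: Path,
    fetch: Callable[[str], bytes],
    workers: int = 4,
    skip_existing: bool = True,
) -> dict[str, str]:
    """Fetch each job concurrently and return a status per file name.

    jobs maps a target file name to its source URL. fetch is any callable
    that returns the bytes for a URL (hypothetical; supply your own).
    """
    output_dir.mkdir(parents=True, exist_ok=True)

    def run(name: str, url: str) -> str:
        target = output_dir / name
        if skip_existing and target.exists():
            return "skipped"
        # Write to a temporary file first so that an interrupted run never
        # leaves a partial file that a later run would treat as complete.
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(fetch(url))
        partial.replace(target)
        return "downloaded"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(run, name, url) for name, url in jobs.items()}
        return {name: future.result() for name, future in futures.items()}
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to exercise with a fake fetcher.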

Quick Start

Installation

Option 1: pip (from source)

# Clone the repository
git clone https://github.com/chipi/podcast_scraper.git
cd podcast_scraper

# Install the package in editable mode
pip install -e .

# For Whisper transcription, ensure ffmpeg is installed
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg

# (Optional) To regenerate architecture diagrams, install Graphviz, run `make visualize`,
# and commit docs/architecture/diagrams/*.svg.
# `make ci` and `make ci-fast` fail if the diagrams are stale.
# macOS: brew install graphviz
# Ubuntu: sudo apt install graphviz

Option 2: pipx (isolated environment)

# Install from PyPI (when published) or from a local path
pipx install podcast_scraper
# Or from local clone:
pipx install /path/to/podcast_scraper

Option 3: uv

# With uv, install from source or PyPI
uv pip install -e .
# Or: uv tool install podcast_scraper  (when published as a tool)

Requirements: Python 3.10+, ffmpeg for audio processing. Run podcast-scraper doctor to verify your environment.
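If you want to check the two stated requirements by hand, a minimal sketch along these lines may help. It is not the implementation of the `podcast-scraper doctor` command, which may check more than this. The function name `check_environment` is an assumption made for illustration.

```python
import shutil
import sys


def check_environment(min_python: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether the documented requirements appear to be met.

    Hypothetical helper; only checks the Python version and whether an
    ffmpeg executable can be found on PATH.
    """
    return {
        "python": sys.version_info[:2] >= min_python,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```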

Basic Usage

# Download transcripts from a podcast
python3 -m podcast_scraper.cli https://example.com/feed.xml \
  --max-episodes 10 \
  --output-dir ./my_transcripts

# Optional: oldest-first or publish-date window — see CONFIGURATION.md (Episode selection)
# --episode-order oldest --max-episodes 50
# --since 2024-01-01 --until 2024-12-31

# Use Whisper when transcripts are missing (the default as of v2.4.0)
python3 -m podcast_scraper.cli https://example.com/feed.xml \
  --whisper-model base.en

# Generate metadata and summaries
python3 -m podcast_scraper.cli https://example.com/feed.xml \
  --generate-metadata \
  --generate-summaries

Configuration File

# config.yaml
rss: https://example.com/feed.xml
output_dir: ./transcripts
max_episodes: 50
# episode_order: newest  # or oldest — see docs/api/CONFIGURATION.md#episode-selection-github-521
# episode_since: "2024-01-01"
# episode_until: "2024-12-31"
transcribe_missing: true
whisper_model: base.en
screenplay: true
num_speakers: 2
auto_speakers: true
generate_metadata: true
generate_summaries: true
workers: 4
skip_existing: true

Example configs:

config.example.json — JSON format
config.example.yaml — YAML format
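For workflows that generate configs from a script, the sketch below writes the same settings as the YAML example above to a JSON file using only the standard library. It assumes the JSON format accepts the same key names as the YAML format; compare against config.example.json before relying on it. The helper name `write_config` and the output file name are illustrative.

```python
import json
from pathlib import Path

# Keys and values copied from the YAML example above. That the JSON format
# uses identical key names is an assumption; verify with config.example.json.
CONFIG = {
    "rss": "https://example.com/feed.xml",
    "output_dir": "./transcripts",
    "max_episodes": 50,
    "transcribe_missing": True,
    "whisper_model": "base.en",
    "screenplay": True,
    "num_speakers": 2,
    "auto_speakers": True,
    "generate_metadata": True,
    "generate_summaries": True,
    "workers": 4,
    "skip_existing": True,
}


def write_config(path: Path, config: dict) -> None:
    """Serialize the config dict as indented JSON (hypothetical helper)."""
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    write_config(Path("config.json"), CONFIG)
```

See the Configuration Guide for how to pass a config file to the tool and for the full list of supported keys.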

Documentation

Getting Started

Resource	Description
Quick Start	Installation, basic usage, and first commands
Configuration Guide	Complete configuration options and examples
CLI Reference	Command-line interface documentation
Python API	Public API for programmatic usage
Legal Notice	Important usage restrictions and fair use

User Guides

Guide	Description
Service Mode	Running as daemon or service (systemd, supervisor, cron)
ML Provider Reference	ML implementation details, models, and tuning
Troubleshooting	Common issues and solutions
Glossary	Key terms and concepts

For Developers

Resource	Description
Quick Reference	One-page cheat sheet for common commands
Architecture Overview	High-level system design and module responsibilities
Testing Strategy	Test coverage, quality assurance, and testing guidelines
Testing Guide	Detailed test execution, fixtures, and coverage information
Experiment Guide	Complete guide: datasets, baselines, experiments, and evaluation
CI/CD Overview	CI/CD pipeline documentation
Engineering Process	The "Triad of Truth": PRDs, RFCs, and ADRs
Development Guide	Development environment setup and tooling
Dependencies Guide	Third-party dependencies, rationale, and management
API Boundaries	API design principles and stability guarantees
API Migration Guide	Upgrading between versions
API Versioning	Versioning strategy and compatibility

Provider System (v2.4.0+)

Guide	Description
AI Provider Comparison Guide	Detailed comparison of all 8 supported AI providers
ML Model Comparison Guide	Compare ML models: Whisper, spaCy, Transformers (BART/LED)
Provider Configuration Quick Reference	Quick guide for configuring providers via CLI, config files, and programmatically
Provider Implementation Guide	Complete guide for implementing new providers
Protocol Extension Guide	Extending protocols and adding new methods to providers

Legal & Fair Use

This project is intended for personal, non-commercial use only. All downloaded content must remain local to your device and must not be shared, uploaded, or redistributed.

You are responsible for ensuring compliance with:

Copyright law
RSS feed terms of service
Podcast platform policies

This software is provided for educational and personal-use purposes only. It is not intended to power a public dataset, index, or any commercial service without explicit permission from rights holders.

Read full legal notice →

License

MIT License — See LICENSE for details.

Important: The MIT license applies only to the source code. It does not grant any rights to redistribute third-party podcast content.

Quick Links

Repository: github.com/chipi/podcast_scraper
Issues: Report bugs or request features
Documentation: chipi.github.io/podcast_scraper