Podcast Scraper

Download and process podcast transcripts with ease.

Podcast Scraper is a Python tool that downloads transcripts for every episode in a podcast RSS feed. It understands Podcasting 2.0 transcript tags, resolves relative URLs, and can fall back to Whisper transcription when episodes lack published transcripts.
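To illustrate what transcript-tag detection and relative-URL resolution involve, here is a minimal sketch using only the standard library. It is not the project's actual implementation: the function name `find_transcripts` and the sample feed are assumptions made for this example. The namespace URI is the one the Podcasting 2.0 specification assigns to `podcast:` tags.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Namespace URI used by Podcasting 2.0 "podcast:" tags.
PODCAST_NS = "https://podcastindex.org/namespace/1.0"

# Hypothetical feed: one episode with a relative transcript URL, one without.
SAMPLE_FEED = """<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <podcast:transcript url="/transcripts/ep1.vtt" type="text/vtt"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>
"""


def find_transcripts(feed_xml, feed_url):
    """Return (episode title, absolute transcript URL, MIME type) tuples."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        for tag in item.findall(f"{{{PODCAST_NS}}}transcript"):
            url = tag.get("url")
            if url:
                # urljoin resolves relative URLs against the feed URL.
                results.append((title, urljoin(feed_url, url), tag.get("type")))
    return results


for row in find_transcripts(SAMPLE_FEED, "https://example.com/feed.xml"):
    print(row)
```

Episodes without a `podcast:transcript` tag (Episode 2 here) produce no entry; those are the ones a Whisper fallback would have to cover.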

⚠️ Important: This project is intended for personal, non-commercial use only. All downloaded content must remain local and not be shared or redistributed. See Legal Notice & Appropriate Use for details.


✨ Key Features

  • Transcript Downloads — Automatic detection and download of podcast transcripts from RSS feeds.
  • Whisper Fallback — Generate transcripts using OpenAI Whisper when none exist.
  • Speaker Detection — Automatic speaker name detection using Named Entity Recognition (NER).
  • Screenplay Formatting — Format Whisper transcripts as dialogue with speaker labels.
  • Episode Summarization — Generate concise summaries using local transformer models (BART + LED).
  • Metadata Generation — Create database-friendly JSON/YAML metadata documents per episode.
  • Multi-threaded Downloads — Concurrent processing with configurable worker pools.
  • Resumable Operations — Skip existing files, reuse media, and handle interruptions gracefully.
  • Configuration Files — JSON/YAML config support for repeatable workflows.
  • Service Mode — Non-interactive daemon mode for automation and process managers.
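The combination of a configurable worker pool and skipping existing files can be sketched as follows. This is an illustration of the general pattern only, not the project's code: `process_episodes` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever actually downloads a transcript.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_episodes(episodes, output_dir, fetch, workers=4, skip_existing=True):
    """Download episodes concurrently, skipping files that already exist.

    episodes: iterable of (filename, url) pairs.
    fetch: hypothetical callable taking a URL and returning transcript text.
    Returns a dict mapping filename to "downloaded" or "skipped".
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    def handle(episode):
        filename, url = episode
        target = out / filename
        if skip_existing and target.exists():
            # A previous (possibly interrupted) run already produced this file.
            return filename, "skipped"
        target.write_text(fetch(url), encoding="utf-8")
        return filename, "downloaded"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(handle, episodes))
```

Running the function twice against the same directory downloads everything the first time and skips everything the second time, which is what makes an interrupted run cheap to resume.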

🚀 Quick Start

Installation

Option 1: pip (from source)

# Clone the repository
git clone https://github.com/chipi/podcast_scraper.git
cd podcast_scraper

# Install the package in editable mode
pip install -e .

# For Whisper transcription, ensure ffmpeg is installed
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg

# (Optional) To regenerate the architecture diagrams, install Graphviz, run make visualize,
# and commit docs/architecture/diagrams/*.svg.
# Both make ci and make ci-fast fail if the diagrams are stale.
# macOS: brew install graphviz
# Ubuntu: sudo apt install graphviz

Option 2: pipx (isolated environment)

# Install from PyPI (when published) or from a local path
pipx install podcast_scraper
# Or from local clone:
pipx install /path/to/podcast_scraper

Option 3: uv

# With uv, install from source or PyPI
uv pip install -e .
# Or: uv tool install podcast_scraper  (when published as a tool)

Requirements: Python 3.10+ and ffmpeg (for audio processing). Run podcast-scraper doctor to verify your environment.
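If you want to check the two stated requirements by hand, a rough sketch looks like this. It is not what podcast-scraper doctor does internally (that command may check more or differently); `check_environment` is a hypothetical helper written for this example.

```python
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (needed for audio processing)")
    return problems


for problem in check_environment():
    print("Problem:", problem)
```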

Basic Usage

# Download transcripts from a podcast
python3 -m podcast_scraper.cli https://example.com/feed.xml \
  --max-episodes 10 \
  --output-dir ./my_transcripts

# Use Whisper when transcripts are missing (the default as of v2.4.0)
python3 -m podcast_scraper.cli https://example.com/feed.xml \
  --whisper-model base.en

# Generate metadata and summaries
python3 -m podcast_scraper.cli https://example.com/feed.xml \
  --generate-metadata \
  --generate-summaries
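
The same invocations can be driven from a script. The sketch below only assembles the argument list from the flags shown above and leaves execution to `subprocess.run`; `build_command` is a hypothetical helper, not part of the package, and the project's own Python API may be a better fit for programmatic use.

```python
import subprocess
import sys


def build_command(feed_url, max_episodes=None, output_dir=None,
                  whisper_model=None, metadata=False, summaries=False):
    """Assemble a podcast_scraper CLI call using flags from the examples above."""
    cmd = [sys.executable, "-m", "podcast_scraper.cli", feed_url]
    if max_episodes is not None:
        cmd += ["--max-episodes", str(max_episodes)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    if metadata:
        cmd.append("--generate-metadata")
    if summaries:
        cmd.append("--generate-summaries")
    return cmd


command = build_command("https://example.com/feed.xml",
                        max_episodes=10, output_dir="./my_transcripts")
print(" ".join(command))
# To actually run it (requires the package to be installed):
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with feed URLs that contain characters such as & or ?.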

Configuration File

# config.yaml
rss: https://example.com/feed.xml
output_dir: ./transcripts
max_episodes: 50
transcribe_missing: true
whisper_model: base.en
screenplay: true
num_speakers: 2
auto_speakers: true
generate_metadata: true
generate_summaries: true
workers: 4
skip_existing: true
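
Since JSON configs are also supported, the same settings can be generated from a script. This sketch assumes the JSON form uses the same key names as the YAML example above, which should be confirmed against the Configuration Guide; `to_json` is a hypothetical helper.

```python
import json

# Same settings as the YAML example; key names are assumed to carry over to JSON.
CONFIG = {
    "rss": "https://example.com/feed.xml",
    "output_dir": "./transcripts",
    "max_episodes": 50,
    "transcribe_missing": True,
    "whisper_model": "base.en",
    "screenplay": True,
    "num_speakers": 2,
    "auto_speakers": True,
    "generate_metadata": True,
    "generate_summaries": True,
    "workers": 4,
    "skip_existing": True,
}


def to_json(config):
    """Serialize a config dict as indented JSON text."""
    return json.dumps(config, indent=2)


print(to_json(CONFIG))
```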

Example configs:


📚 Documentation

Getting Started

| Resource | Description |
|---|---|
| Quick Start | Installation, basic usage, and first commands |
| Configuration Guide | Complete configuration options and examples |
| CLI Reference | Command-line interface documentation |
| Python API | Public API for programmatic usage |
| Legal Notice | ⚠️ Important usage restrictions and fair use |

User Guides

| Guide | Description |
|---|---|
| Service Mode | Running as daemon or service (systemd, supervisor, cron) |
| ML Provider Reference | ML implementation details, models, and tuning |
| Troubleshooting | Common issues and solutions |
| Glossary | Key terms and concepts |

For Developers

| Resource | Description |
|---|---|
| Quick Reference | ⭐ One-page cheat sheet for common commands |
| Architecture Overview | High-level system design and module responsibilities |
| Testing Strategy | Test coverage, quality assurance, and testing guidelines |
| Testing Guide | Detailed test execution, fixtures, and coverage information |
| Experiment Guide | Complete guide: datasets, baselines, experiments, and evaluation |
| CI/CD Overview | CI/CD pipeline documentation |
| Engineering Process | The "Triad of Truth": PRDs, RFCs, and ADRs |
| Development Guide | Development environment setup and tooling |
| Dependencies Guide | Third-party dependencies, rationale, and management |
| API Boundaries | API design principles and stability guarantees |
| API Migration Guide | Upgrading between versions |
| API Versioning | Versioning strategy and compatibility |

Provider System (v2.4.0+)

| Guide | Description |
|---|---|
| AI Provider Comparison Guide | Detailed comparison of all 8 supported AI providers |
| ML Model Comparison Guide | Compare ML models: Whisper, spaCy, Transformers (BART/LED) |
| Provider Configuration Quick Reference | Quick guide for configuring providers via CLI, config files, and programmatically |
| Provider Implementation Guide | Complete guide for implementing new providers |
| Protocol Extension Guide | Extending protocols and adding new methods to providers |

Legal Notice & Appropriate Use

This project is intended for personal, non-commercial use only. All downloaded content must remain local to your device and must not be shared, uploaded, or redistributed.

You are responsible for ensuring compliance with:

  • Copyright law
  • RSS feed terms of service
  • Podcast platform policies

This software is provided for educational and personal-use purposes only. It is not intended to power a public dataset, index, or any commercial service without explicit permission from rights holders.

Read full legal notice →


📄 License

MIT License — See LICENSE for details.

Important: The MIT license applies only to the source code. It does not grant any rights to redistribute third-party podcast content.