API Migration Guide

Documentation for migrating between major and minor versions of the podcast_scraper API.


v2.5.0 to v2.6.0 (Current)

v2.6.0 adds viewer and HTTP capabilities; the core library API is unchanged.

Additive: Corpus Library and index operations

  • New FastAPI routes under /api/corpus/ (feeds, episodes, detail, similar episodes). See Server Guide and RFC-067.
  • POST /api/index/rebuild — background vector index rebuild; poll GET /api/index/stats for rebuild_in_progress and errors.
  • Vue viewer — Library tab and dashboard hooks aligned with the above.
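
The rebuild-then-poll flow described above can be sketched as follows. This is a minimal illustration, not the project's own client: the base URL and port, the helper names fetch_json and wait_for_rebuild, and the polling budget are all assumptions. Only the two routes and the rebuild_in_progress field come from the notes above.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: the API is served locally on this port; adjust to your deployment.
BASE_URL = "http://127.0.0.1:8000"


def fetch_json(url: str, method: str = "GET") -> dict:
    """Send a body-less request and decode the JSON response (empty body -> {})."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    return json.loads(body) if body else {}


def wait_for_rebuild(
    fetch: Callable[..., dict] = fetch_json,
    base_url: str = BASE_URL,
    poll_seconds: float = 2.0,
    max_polls: int = 150,
) -> dict:
    """Start a background rebuild, then poll stats until it is no longer running."""
    fetch(f"{base_url}/api/index/rebuild", method="POST")
    for _ in range(max_polls):
        stats = fetch(f"{base_url}/api/index/stats")
        if not stats.get("rebuild_in_progress"):
            # The caller should inspect the returned stats for reported errors.
            return stats
        time.sleep(poll_seconds)
    raise TimeoutError("index rebuild did not finish within the polling budget")
```

Passing fetch as a parameter keeps the polling loop testable without a running server; in normal use, wait_for_rebuild() with no arguments talks to the assumed local address.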

Migration: Run pip install -e '.[dev]' if you call the HTTP API or run podcast serve. Callers of run_pipeline / service.run need no changes.

Feeds API and jobs: structured feeds.spec.yaml (RFC-077 / #626)

  • Canonical corpus feeds file is feeds.spec.yaml at the corpus root (root object with feeds array). The viewer GET/PUT /api/feeds contract uses JSON { "feeds": [...] } (not urls).
  • The pipeline jobs subprocess passes --config and, when the file exists, --feeds-spec to python -m podcast_scraper.cli, matching the main CLI flags. Manual runs often add --profile <name> alongside --config for the same merge semantics; see CLI.md — Quick Start.
  • Migration from rss_urls.list.txt: convert one URL per line to feeds: ["https://...", ...] in YAML or JSON, save as feeds.spec.yaml, or use config/examples/feeds.spec.example.* as a template. --rss-file remains supported on the CLI for line lists but is not what the Feeds API or job runner use.
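
The line-list conversion above could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, skipping lines that start with "#" is an assumption about how your list file is written, and the YAML is emitted by hand (using JSON string quoting, which is also valid for YAML double-quoted scalars) to avoid a third-party dependency.

```python
import json
from pathlib import Path


def urls_from_line_list(text: str) -> list[str]:
    """Return non-empty lines, skipping '#' comment lines (an assumption)."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def render_feeds_spec(urls: list[str]) -> str:
    """Render a root object with a 'feeds' array as YAML text."""
    if not urls:
        return "feeds: []\n"
    # json.dumps produces a double-quoted string that YAML also accepts.
    items = "".join(f"  - {json.dumps(url)}\n" for url in urls)
    return f"feeds:\n{items}"


def convert(list_path: Path, spec_path: Path) -> int:
    """Read a one-URL-per-line file and write the feeds spec; return the URL count."""
    urls = urls_from_line_list(list_path.read_text(encoding="utf-8"))
    spec_path.write_text(render_feeds_spec(urls), encoding="utf-8")
    return len(urls)
```

For example, convert(Path("rss_urls.list.txt"), Path("feeds.spec.yaml")) run from the corpus root writes the new file next to the old list and leaves the original untouched.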

Additive: Corpus Digest and health discovery (RFC-068)

  • GET /api/corpus/digest — rolling-window digest of recent episodes (feed-diverse) plus optional semantic topic bands when a vector index exists. See Server Guide and RFC-068.
  • GET /api/health may include corpus_digest_api (bool) in its response. A value of false disables Digest / glance in the viewer. If the field is omitted but corpus_library_api is true, the GI/KG viewer assumes the digest is available (legacy health JSON). If the running process is too old to mount GET /api/corpus/digest, the digest request fails; upgrade and restart the API from a current [dev] install.
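
A client that wants to mirror the viewer's discovery rule could apply it to the parsed health JSON as in the sketch below. The function name digest_available is hypothetical; the two field names and the fallback rule come from the description above.

```python
def digest_available(health: dict) -> bool:
    """Decide whether to offer the digest, given the parsed /api/health JSON."""
    if "corpus_digest_api" in health:
        # An explicit flag always wins; false disables the digest.
        return bool(health["corpus_digest_api"])
    # Legacy health JSON: infer digest support from the library flag.
    return bool(health.get("corpus_library_api"))
```

A true result from the legacy fallback is only an inference, so the digest request itself can still fail against an older process and should be handled as such.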

v2.3.2 to v2.4.0

v2.4.0 introduces a multi-provider ecosystem and changes several defaults.

Breaking Behavior Changes

These changes do not break calling code, but they alter the pipeline's default behavior:

  1. Automatic Transcription: transcribe_missing now defaults to true.
     Migration: If you want to only download existing transcripts, explicitly set transcribe_missing: false in your config.
  2. Whisper Model: The default whisper_model changed from base to base.en.
     Migration: For non-English podcasts, you must now explicitly set whisper_model: base (or another multilingual model).
  3. Output Structure: Transcripts and metadata are now placed in subdirectories.
     Migration: Update any scripts that assume all files are in the root run directory.
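
To keep the earlier behavior for the two changed defaults, one option is to pin them explicitly wherever the config mapping is built. The sketch below works on a plain dict; the helper name pin_legacy_defaults is hypothetical, and how the resulting mapping reaches the library (config file or Config arguments) is left to your setup.

```python
# Defaults in effect before v2.4.0, per the list above.
PRE_2_4_DEFAULTS = {
    "transcribe_missing": False,  # only download existing transcripts
    "whisper_model": "base",      # multilingual model rather than base.en
}


def pin_legacy_defaults(config: dict) -> dict:
    """Return a copy of config with pre-2.4.0 defaults filled in.

    Values already present in config take precedence over the pinned defaults.
    """
    return {**PRE_2_4_DEFAULTS, **config}
```

For instance, pin_legacy_defaults({"rss_url": "https://example.com/feed.xml"}) adds both keys, while a config that already sets whisper_model keeps its own value.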

Multi-Provider Configuration

v2.4.0 replaces specific provider flags with a unified provider system:

  • New fields: transcription_provider, speaker_detector_provider, summary_provider.
  • Supported providers: whisper, spacy, transformers (local), and cloud providers like openai, anthropic, mistral, etc.
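
As an illustration of the unified fields, a fully local setup and a variant with a cloud summarizer might look like the mappings below. The field names and provider names come from the list above; which provider is valid for which field is an assumption here, so check the provider documentation before relying on a particular pairing.

```python
# Assumed pairing of local providers to fields (illustrative only).
local_providers = {
    "transcription_provider": "whisper",
    "speaker_detector_provider": "spacy",
    "summary_provider": "transformers",
}

# Same setup, but with summaries delegated to a cloud provider (assumed valid).
cloud_summary = {**local_providers, "summary_provider": "openai"}
```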

v1.0 to v2.0

Version 2.0 refactored the monolithic v1.0 into a clean modular architecture.

Modular Architecture

Before (v1.0): Monolithic podcast_scraper.py file with no formal public API.

After (v2.0): Focused modules with four primary public exports:

from podcast_scraper import Config, load_config_file, run_pipeline, cli

New Usage Pattern

import podcast_scraper

# Configuration
config = podcast_scraper.Config(
    rss_url="https://example.com/feed.xml",
    output_dir="./transcripts",
    max_episodes=10,
)

# Run pipeline
count, summary = podcast_scraper.run_pipeline(config)

Version History

Version  Date     Highlights
v2.6.0   2026-04  Corpus Library /api/corpus/*, index rebuild API, viewer Library UX, RFC-064 profile tooling.
v2.5.0   2026-02  LLM provider expansion, production hardening, MPS exclusive mode, LLM metrics.
v2.4.0   2026-01  Multi-provider ecosystem, production defaults, cache CLI.
v2.3.0   2025-11  Added service API and episode summarization.
v2.2.0   2025-11  Metadata generation (JSON/YAML).
v2.1.0   2025-11  Automatic speaker detection (NER).
v2.0.0   2025-11  Modular architecture foundation.
v1.0.0   2025-11  Initial monolithic release.

Checking API Version

import podcast_scraper

# Both print the same version string, e.g., "2.6.0"
print(podcast_scraper.__version__)
print(podcast_scraper.__api_version__)
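
Scripts that must run against several installed versions can gate behavior on the version string. The sketch below is one way to do that with the standard library only; the helper names are hypothetical, and the 2.4.0 threshold reflects the release that introduced the provider fields. Versions are compared as integer tuples because plain string comparison would order "2.10.0" before "2.4.0".

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Parse the leading numeric release part, e.g. '2.6.0' -> (2, 6, 0).

    Any pre-release or local suffix (such as 'rc1') is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def has_provider_fields(version: str) -> bool:
    """True when the version is at least 2.4.0 (unified provider fields)."""
    return parse_version(version) >= (2, 4, 0)
```

In practice the argument would be podcast_scraper.__version__, for example has_provider_fields(podcast_scraper.__version__).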