
CLI Interface

The command-line interface provides an interactive way to use podcast_scraper.

Overview

The CLI is designed for:

  • Interactive command-line use
  • Quick one-off transcript downloads
  • Testing and experimentation
  • Integration with shell scripts

For non-interactive use (daemons, services), see the Service API instead.

Quick Start

# Basic usage
python -m podcast_scraper.cli https://example.com/feed.xml

# With options
python -m podcast_scraper.cli https://example.com/feed.xml \
  --max-episodes 50 \
  --transcribe-missing \
  --workers 8

# With config file
python -m podcast_scraper.cli --config config.yaml

Typical run: profile + operator config + feed list

After install, the usual pattern is preset defaults (--profile), your operator YAML (--config), and where feeds live (--feeds-spec, --rss-file, or a URL on argv). You do not need to merge profile YAML into the operator file by hand.

python -m podcast_scraper.cli \
  --profile cloud_balanced \
  --config config/manual/operator_defaults.yaml \
  --feeds-spec config/manual/feeds.spec.registry_10.yaml

Merge order for Config: packaged config/profiles/<profile>.yaml first, then operator YAML and CLI flags override. The --profile flag is injected into the payload even when --config points at a file that omits profile: (same effect as profile: <name> in YAML).
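The layering above can be pictured as successive dictionary updates. The sketch below is illustrative only: merge_config_layers and the sample keys are hypothetical, it assumes a shallow key-by-key merge, and it assumes CLI flags are applied after the operator YAML. The real Config merge may differ in detail.

```python
from typing import Any, Dict, Optional


def merge_config_layers(
    profile_defaults: Dict[str, Any],
    operator_yaml: Dict[str, Any],
    cli_flags: Dict[str, Any],
    profile_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: later layers override earlier ones (shallow merge)."""
    merged: Dict[str, Any] = dict(profile_defaults)  # 1. packaged profile first
    merged.update(operator_yaml)  # 2. operator file overrides profile defaults
    # 3. explicit CLI flags; None means "flag not given" (assumed to win over the file)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    if profile_name is not None:
        # --profile is injected even when the operator file omits `profile:`
        merged["profile"] = profile_name
    return merged


payload = merge_config_layers(
    profile_defaults={"workers": 4, "summary_provider": "openai"},
    operator_yaml={"output_dir": "./corpus", "workers": 8},
    cli_flags={"max_episodes": 10, "workers": None},
    profile_name="cloud_balanced",
)
print(payload)
```

Here the operator file's workers value replaces the profile default, the unset CLI flag is ignored, and the profile name ends up in the payload as if profile: cloud_balanced had been written in the YAML.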

The same content appears in the root README.md. See also the Multi-feed compose section of the Configuration guide.

API Reference

cli

Command-line interface helpers for podcast_scraper.

Low MI (radon): see docs/ci/CODE_QUALITY_TRENDS.md § Low-MI modules.

Functions

main
main(argv: Optional[Sequence[str]] = None, *, apply_log_level_fn: Optional[Callable[[str, Optional[str], bool], None]] = None, run_pipeline_fn: Optional[Callable[[Config], Tuple[int, str]]] = None, logger: Optional[Logger] = None) -> int

Entry point for the CLI; returns an exit status code.

Common Options

Basic Options

  • RSS_URL - RSS feed URL (omit when the feed list is supplied another way: --feeds-spec, --rss-file, --rss, or rss / feeds / rss_urls inside the --config YAML)
  • --output-dir PATH - Output directory
  • --max-episodes N - Maximum episodes to process after other episode-selection steps
  • --episode-order {newest,oldest} - Feed item order before date filter and offset (newest = XML document order, default; oldest = reversed). GitHub #521.
  • --episode-offset N - Skip this many items after order and optional date filter (non-negative integer, default 0). GitHub #521.
  • --since YYYY-MM-DD - Keep items whose pubDate maps to a calendar date on or after this day (inclusive). GitHub #521.
  • --until YYYY-MM-DD - Keep items whose pubDate maps to a calendar date on or before this day (inclusive). GitHub #521.
  • --workers N - Number of concurrent download workers

Selection order: --episode-order → optional --since / --until → --episode-offset → --max-episodes. Items without a parseable pubDate are kept when date filters are set; the run logs a warning with a count. See the Episode selection section of CONFIGURATION.md.
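A minimal sketch of that selection order, assuming each feed item has already been reduced to a title and an optional parsed date. The function select_episodes and the tuple layout are hypothetical and only mirror the documented steps; they are not the project's internal API.

```python
from datetime import date
from typing import List, Optional, Tuple

Episode = Tuple[str, Optional[date]]  # (title, parsed pubDate or None)


def select_episodes(
    items: List[Episode],
    order: str = "newest",
    since: Optional[date] = None,
    until: Optional[date] = None,
    offset: int = 0,
    max_episodes: Optional[int] = None,
) -> List[Episode]:
    # 1. Order: "newest" keeps XML document order, "oldest" reverses it.
    ordered = list(items) if order == "newest" else list(reversed(items))

    # 2. Optional inclusive date window; items without a parseable date are kept.
    def in_window(pub: Optional[date]) -> bool:
        if pub is None:
            return True
        if since is not None and pub < since:
            return False
        if until is not None and pub > until:
            return False
        return True

    filtered = [ep for ep in ordered if in_window(ep[1])]

    # 3. Skip `offset` items, then 4. cap at `max_episodes`.
    remaining = filtered[offset:]
    return remaining if max_episodes is None else remaining[:max_episodes]


feed = [
    ("ep4", date(2024, 6, 1)),
    ("ep3", None),  # unparseable pubDate
    ("ep2", date(2024, 2, 1)),
    ("ep1", date(2023, 12, 1)),
]
picked = select_episodes(
    feed, order="oldest", since=date(2024, 1, 1), offset=1, max_episodes=2
)
print([title for title, _ in picked])
```

Because the cap is applied last, --max-episodes counts only items that survive ordering, date filtering, and the offset.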

RSS and multi-feed (GitHub #440)

  • --rss URL — Repeatable. Each occurrence adds another feed URL (merged with config feeds / rss_urls and optional --rss-file). Mutually exclusive with --feeds-spec.
  • --rss-file PATH — Text file with one RSS URL per line (comments and blank lines allowed); URLs are merged with other sources. Mutually exclusive with --feeds-spec.
  • --feeds-spec PATH — Structured feeds document (YAML or JSON): root object with feeds array of URL strings or objects with url plus optional per-feed overrides (same allowlist as podcast_scraper.rss.feeds_spec; RFC-077 / #626). Mutually exclusive with --rss-file and with explicit RSS URL arguments (positional / --rss). When set, --output-dir is required if the spec lists two or more feeds.
  • Multi-feed — When two or more distinct feed URLs are in effect, --output-dir is required (corpus parent). Outputs are written under <output_dir>/feeds/<stable_feed_id>/ per feed. Single-feed runs do not add the feeds/ segment.
  • --multi-feed-strict / --no-multi-feed-strict — Multi-feed only (GitHub #559). Default: lenient (exit 0 if every failed feed is a soft failure: RSS fetch/parse ValueError, FFmpeg stderr UTF-8 decode, Whisper 413 / payload size). --multi-feed-strict turns on strict mode for CI: exit 1 on any feed failure even when soft-classified. Hard failures (for example RuntimeError, invalid per-feed path) always exit 1. Same behavior is on Config as multi_feed_strict for service.run().
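The lenient and strict rules for multi-feed runs reduce to a small decision function. This is a sketch under stated assumptions: multi_feed_exit_code is a hypothetical name, and each failure is assumed to arrive already classified as soft or hard (the classification itself is not shown).

```python
from typing import Iterable, Tuple

# (feed_url, is_soft): classification into soft/hard is assumed to happen elsewhere.
FeedFailure = Tuple[str, bool]


def multi_feed_exit_code(failures: Iterable[FeedFailure], strict: bool) -> int:
    """Hypothetical helper mirroring the documented multi-feed exit policy."""
    failed = list(failures)
    if not failed:
        return 0  # every feed succeeded
    if strict:
        return 1  # strict mode: any feed failure fails the run
    # Lenient mode: only soft failures are tolerated; one hard failure exits 1.
    return 0 if all(is_soft for _, is_soft in failed) else 1


soft_only = [("https://example.com/a.xml", True)]
mixed = soft_only + [("https://example.com/b.xml", False)]
print(multi_feed_exit_code(soft_only, strict=False))
print(multi_feed_exit_code(soft_only, strict=True))
print(multi_feed_exit_code(mixed, strict=False))
```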

Config file equivalent: YAML feeds: or rss_urls: list (entries may be URL strings or mappings with url plus optional per-feed overrides), or use --rss-file / --feeds-spec (see CONFIGURATION.md — RSS and multi-feed and Multi-feed compose). Typical multi-feed setup: --feeds-spec pointing at a feeds document (start from config/examples/feeds.spec.example.yaml / .json) plus operator YAML with output_dir and either profile: or the CLI flag --profile <name> (equivalent merge — see Quick Start — Typical run). You may still pass any URL-per-line list with --rss-file. Acceptance fast matrix: config/acceptance/MAIN_ACCEPTANCE_CONFIG.yaml (see config/acceptance/README.md). For append / resume, set append: true on your operator YAML (or pass --append) per CONFIGURATION.md — Append / resume.
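To make the URL-per-line format concrete, here is a rough sketch of how such a list could be read. It assumes comment lines start with "#" (the comment syntax is not specified above), and both function names are hypothetical.

```python
from typing import Iterable, List


def read_feed_urls(lines: Iterable[str]) -> List[str]:
    """Hypothetical reader: one URL per line, blank lines and '#' comments skipped."""
    urls: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        if line not in urls:  # keep distinct URLs, preserving first-seen order
            urls.append(line)
    return urls


def output_dir_required(urls: List[str]) -> bool:
    # Two or more distinct feed URLs make --output-dir mandatory.
    return len(set(urls)) >= 2


sample = """
# weekly shows
https://example.com/feed.xml

https://example.org/podcast.rss
https://example.com/feed.xml
""".splitlines()

urls = read_feed_urls(sample)
print(urls, output_dir_required(urls))
```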

Provider Selection (v2.4.0+)

  • --transcription-provider PROVIDER - Provider for transcription (whisper, openai, gemini, mistral)
  • --speaker-detector-provider PROVIDER - Provider for speaker detection (spacy, openai, gemini, anthropic, mistral, deepseek, grok, ollama)
  • --summary-provider PROVIDER - Provider for summarization (transformers, hybrid_ml, openai, gemini, anthropic, mistral, deepseek, grok, ollama)

Transcription Options

  • --transcribe-missing - Use Whisper for episodes without transcripts (now default)
  • --no-transcribe-missing - Disable automatic transcription
  • --whisper-model MODEL - Whisper model to use (tiny, base, small, medium, large)
  • --whisper-device DEVICE - Device for Whisper (cuda/mps/cpu/auto, default: auto-detect)
  • --mps-exclusive - Serialize GPU work on MPS to prevent memory contention (default: enabled)
  • --no-mps-exclusive - Allow concurrent GPU operations on MPS (for systems with sufficient GPU memory)
  • --screenplay - Request screenplay-style transcript layout (only honored for --transcription-provider whisper; for OpenAI / Gemini / Mistral audio transcription truthy values are coerced off at validation with one INFO per gate cycle — GitHub #562). Set PODCAST_SCRAPER_SCREENPLAY_STRICT=1 to error instead of coerce when screenplay is requested with a non-Whisper transcription provider.
  • --num-speakers N - Number of speakers (default: 2)
  • --speaker-names NAMES - Comma-separated speaker names
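The --screenplay gating described in the list above can be sketched as follows. This is an illustration, not the project's validator: resolve_screenplay is a hypothetical name, the exception type is an assumption, and the environment is passed in explicitly so the behavior is easy to exercise.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("screenplay_sketch")


def resolve_screenplay(
    screenplay: bool,
    transcription_provider: str,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical helper: return the effective screenplay setting."""
    env = os.environ if env is None else env
    if not screenplay or transcription_provider == "whisper":
        return screenplay  # nothing to coerce
    if env.get("PODCAST_SCRAPER_SCREENPLAY_STRICT") == "1":
        # Assumption: the real validator may raise a different exception type.
        raise ValueError(
            f"screenplay is not supported with provider {transcription_provider!r}"
        )
    logger.info("screenplay ignored for provider %s", transcription_provider)
    return False  # coerced off for non-Whisper providers


print(resolve_screenplay(True, "whisper", env={}))
print(resolve_screenplay(True, "openai", env={}))
```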

Audio Preprocessing Options (RFC-040)

  • --enable-preprocessing - Enable audio preprocessing before transcription (experimental)
  • --preprocessing-cache-dir DIR - Custom cache directory for preprocessed audio (default: .cache/preprocessing)
  • --preprocessing-sample-rate RATE - Target sample rate in Hz before MP3 encode (default: 16000; use rates FFmpeg accepts for -ar, e.g. 8000, 16000, 22050, 44100, 48000)
  • --preprocessing-silence-threshold THRESHOLD - Silence detection threshold (default: -50dB)
  • --preprocessing-silence-duration DURATION - Minimum silence duration to remove in seconds (default: 2.0)
  • --preprocessing-target-loudness LOUDNESS - Target loudness in LUFS for normalization (default: -16)

Note: Preprocessing requires ffmpeg to be installed. If ffmpeg is not available, preprocessing is automatically disabled with a warning.

Transcript cleaning and hybrid ML preprocessing (Issue #419)

  • --transcript-cleaning-strategy {pattern,llm,hybrid} - How transcripts are cleaned before summarization (pattern = regex/rules; llm = LLM-only; hybrid = pattern then conditional LLM when using LLM-oriented cleaners). Applies to LLM summarization providers and hybrid_ml (same cleaning_processor wiring as API providers).
  • --hybrid-internal-preprocessing-after-pattern PROFILE_ID - When --summary-provider hybrid_ml and --transcript-cleaning-strategy pattern, selects the registered preprocessing profile applied inside HybridMLProvider.summarize after workflow pattern cleaning (default in config: cleaning_hybrid_after_pattern). Omit to use the Config default; YAML/config file field: hybrid_internal_preprocessing_after_pattern.

See RFC-042 — Layered transcript cleaning, CONFIGURATION.md, CONFIGURATION.md — LLM cleaning length guard, and Preprocessing Profiles Guide.

OpenAI Provider Options

  • --openai-api-key KEY - OpenAI API key (can also use OPENAI_API_KEY env var)
  • --openai-api-base URL - Custom OpenAI API base URL (for E2E testing or custom endpoints)
  • --openai-transcription-model MODEL - OpenAI model for transcription (default: whisper-1)
  • --openai-speaker-model MODEL - OpenAI model for speaker detection (default: gpt-4o-mini)
  • --openai-summary-model MODEL - OpenAI model for summarization (default: gpt-4o-mini)
  • --openai-insight-model MODEL - OpenAI model for GIL generate_insights when gi_insight_source=provider (default: same as summary model)
  • --openai-temperature TEMP - Temperature for OpenAI generation (0.0-2.0, default: 0.3)
  • --openai-max-tokens N - Maximum tokens for OpenAI responses (default: model-specific)
  • --openai-cleaning-model MODEL - OpenAI model for transcript cleaning (default: gpt-4o-mini, cheaper than summary model)
  • --openai-cleaning-temperature TEMP - Temperature for OpenAI cleaning (0.0-2.0, default: 0.2, lower = more deterministic)

Gemini Provider Options

  • --gemini-api-key KEY - Gemini API key (can also use GEMINI_API_KEY env var)
  • --gemini-api-base URL - Custom Gemini API base URL (for E2E testing or custom endpoints)
  • --gemini-transcription-model MODEL - Gemini model for transcription (default: environment-based, gemini-1.5-flash for test, gemini-1.5-pro for prod)
  • --gemini-speaker-model MODEL - Gemini model for speaker detection (default: environment-based)
  • --gemini-summary-model MODEL - Gemini model for summarization (default: environment-based)
  • --gemini-temperature TEMP - Temperature for Gemini generation (0.0-2.0, default: 0.3)
  • --gemini-max-tokens N - Max tokens for Gemini generation (default: model default)
  • --gemini-cleaning-model MODEL - Gemini model for transcript cleaning (default: gemini-1.5-flash, cheaper than summary model)
  • --gemini-cleaning-temperature TEMP - Temperature for Gemini cleaning (0.0-2.0, default: 0.2, lower = more deterministic)

Anthropic Provider Options

  • --anthropic-api-key KEY - Anthropic API key (can also use ANTHROPIC_API_KEY env var)
  • --anthropic-api-base URL - Custom Anthropic API base URL (for E2E testing or custom endpoints)
  • --anthropic-speaker-model MODEL - Anthropic model for speaker detection (default: environment-based)
  • --anthropic-summary-model MODEL - Anthropic model for summarization (default: environment-based)
  • --anthropic-temperature TEMP - Temperature for Anthropic generation (0.0-1.0, default: 0.3)
  • --anthropic-max-tokens N - Max tokens for Anthropic generation (default: model default)
  • --anthropic-cleaning-model MODEL - Anthropic model for transcript cleaning (default: claude-haiku-4-5, cheaper than summary model)
  • --anthropic-cleaning-temperature TEMP - Temperature for Anthropic cleaning (0.0-1.0, default: 0.2, lower = more deterministic)

Mistral Provider Options

  • --mistral-api-key KEY - Mistral API key (can also use MISTRAL_API_KEY env var)
  • --mistral-api-base URL - Custom Mistral API base URL (for E2E testing or custom endpoints)
  • --mistral-transcription-model MODEL - Mistral model for transcription (default: voxtral-mini-latest)
  • --mistral-speaker-model MODEL - Mistral model for speaker detection (default: environment-based)
  • --mistral-summary-model MODEL - Mistral model for summarization (default: environment-based)
  • --mistral-temperature TEMP - Temperature for Mistral generation (0.0-1.0, default: 0.3)
  • --mistral-max-tokens N - Max tokens for Mistral generation (default: model default)
  • --mistral-cleaning-model MODEL - Mistral model for transcript cleaning (default: mistral-small-latest, cheaper than summary model)
  • --mistral-cleaning-temperature TEMP - Temperature for Mistral cleaning (0.0-1.0, default: 0.2, lower = more deterministic)

DeepSeek Provider Options

  • --deepseek-api-key KEY - DeepSeek API key (can also use DEEPSEEK_API_KEY env var)
  • --deepseek-api-base URL - Custom DeepSeek API base URL (for E2E testing or custom endpoints)
  • --deepseek-speaker-model MODEL - DeepSeek model for speaker detection (default: deepseek-chat)
  • --deepseek-summary-model MODEL - DeepSeek model for summarization (default: deepseek-chat)
  • --deepseek-temperature TEMP - Temperature for DeepSeek generation (0.0-2.0, default: 0.3)
  • --deepseek-max-tokens N - Max tokens for DeepSeek generation (default: model default)
  • --deepseek-cleaning-model MODEL - DeepSeek model for transcript cleaning (default: deepseek-chat, cheaper than summary model)
  • --deepseek-cleaning-temperature TEMP - Temperature for DeepSeek cleaning (0.0-2.0, default: 0.2, lower = more deterministic)

Grok Provider Options

  • --grok-api-key KEY - Grok API key (can also use GROK_API_KEY env var)
  • --grok-api-base URL - Custom Grok API base URL (for E2E testing or custom endpoints)
  • --grok-speaker-model MODEL - Grok model for speaker detection (default: grok-2)
  • --grok-summary-model MODEL - Grok model for summarization (default: grok-2)
  • --grok-temperature TEMP - Temperature for Grok generation (0.0-2.0, default: 0.3)
  • --grok-max-tokens N - Max tokens for Grok generation (default: model default)
  • --grok-cleaning-model MODEL - Grok model for transcript cleaning (default: grok-3-mini, cheaper than summary model)
  • --grok-cleaning-temperature TEMP - Temperature for Grok cleaning (0.0-2.0, default: 0.2, lower = more deterministic)

Ollama Provider Options

  • --ollama-api-base URL - Custom Ollama API base URL (for E2E testing or custom endpoints, default: http://localhost:11434/v1)
  • --ollama-speaker-model MODEL - Ollama model for speaker detection (default: environment-based)
  • --ollama-summary-model MODEL - Ollama model for summarization (default: environment-based)
  • --ollama-temperature TEMP - Temperature for Ollama generation (0.0-2.0, default: 0.3)
  • --ollama-max-tokens N - Max tokens for Ollama generation (default: model default)
  • --ollama-timeout SECONDS - Timeout in seconds for Ollama API calls (default: 120, local inference can be slow)
  • --ollama-cleaning-model MODEL - Ollama model for transcript cleaning (default: llama3.2:latest, smaller/faster than summary model)
  • --ollama-cleaning-temperature TEMP - Temperature for Ollama cleaning (0.0-2.0, default: 0.2, lower = more deterministic)

Metadata & Summarization

  • --generate-metadata - Generate metadata documents
  • --metadata-format FORMAT - Format (json or yaml)
  • --generate-summaries - Generate episode summaries
  • --summary-model MODEL - Summary model to use (MAP-phase, transformers provider)
  • --summary-reduce-model MODEL - Summary reduce model to use (REDUCE-phase, transformers provider)
  • --hybrid-map-model MODEL - Hybrid MAP model when --summary-provider hybrid_ml (e.g. longt5-base)
  • --hybrid-reduce-model MODEL - Hybrid REDUCE model (HF id, Ollama tag, or GGUF path for llama_cpp)
  • --hybrid-reduce-backend {transformers,ollama,llama_cpp} - Hybrid REDUCE backend
  • --hybrid-map-device / --hybrid-reduce-device - Devices for hybrid MAP/REDUCE (cuda, mps, cpu, auto)
  • --save-cleaned-transcript / --no-save-cleaned-transcript - Persist .cleaned transcript alongside source (default: save)

Cache Management (v2.4.0+)

  • cache --status - View cache status for all ML models
  • cache --clean [TYPE] - Clean ML caches (types: whisper, transformers, spacy, all)

Control Options

  • --dry-run - Preview without writing files (includes LLM cost projection when API providers are configured)
  • --skip-existing - Skip episodes with existing output
  • --append - Resume into a stable run_append_* directory and skip episodes whose on-disk metadata already matches the RSS episode_id and required artifacts (GitHub #444). Incompatible with --clean-output. Unlike --skip-existing (path/exists heuristics in metadata generation), append skips before download using validated metadata + transcript (and optional summary / GI / KG files when those features are enabled).
  • --clean-output - Remove output directory before processing
  • --fail-fast - Stop on first episode failure (Issue #379)
  • --max-failures N - Stop after N episode failures (Issue #379)
  • --http-retry-total N - Max urllib3 retries per media/transcript HTTP request (0-20)
  • --http-backoff-factor F - Exponential backoff factor for those retries (0.0-10.0)
  • --rss-retry-total N - Max urllib3 retries per RSS feed fetch (0-20)
  • --rss-backoff-factor F - Backoff factor for RSS retries (0.0-10.0)
  • --episode-retry-max N - Application-level retries per episode after urllib3 exhaustion (0-10; 0 disables)
  • --episode-retry-delay-sec SEC - Initial seconds between episode-level retries (0.0-120.0)
  • --host-request-interval-ms MS - Minimum milliseconds between requests to the same host (0-600000; 0 = off; Issue #522)
  • --host-max-concurrent N - Max concurrent downloads per host (0-64; 0 = unlimited; Issue #522)
  • --circuit-breaker / --no-circuit-breaker - Enable or disable HTTP circuit breaker (Issue #522; mutually exclusive)
  • --circuit-breaker-failure-threshold N - Failures in rolling window before open (1-100)
  • --circuit-breaker-window-seconds SEC - Rolling window (1-86400)
  • --circuit-breaker-cooldown-seconds SEC - Cooldown while open (1-86400)
  • --circuit-breaker-scope {feed,host} - Breaker key: RSS URL vs request host
  • --rss-conditional-get / --no-rss-conditional-get - Enable or disable RSS conditional GET (Issue #522; mutually exclusive)
  • --rss-cache-dir DIR - Directory for RSS conditional validators/body cache

Defaults, semantics, recommended presets, and full config/CLI parity: CONFIGURATION.md — Download resilience (canonical — do not rely on example YAML under config/examples/ as a second source of truth).
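For readers unfamiliar with the circuit-breaker options, the following is a generic rolling-window breaker showing how a failure threshold, window, and cooldown interact. It is a sketch of the general technique, not the project's implementation; RollingWindowBreaker and its methods are hypothetical, and one instance would correspond to one breaker key (feed or host).

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class RollingWindowBreaker:
    """Generic sketch: open after N failures inside a rolling window."""

    def __init__(
        self,
        failure_threshold: int,
        window_seconds: float,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have aged out of the rolling window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now  # open the breaker
            self._failures.clear()

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            self._opened_at = None  # cooldown elapsed: close again
            return True
        return False


fake_now = [0.0]
breaker = RollingWindowBreaker(2, 10.0, 5.0, clock=lambda: fake_now[0])
breaker.record_failure()
fake_now[0] = 1.0
breaker.record_failure()
print(breaker.allow_request())
```

The injected clock keeps the example deterministic; with the default time.monotonic the same class works against real elapsed time.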

Logging Options

  • --json-logs - Output structured JSON logs for monitoring/alerting (Issue #379)

Live pipeline monitor (RFC-065, #512)

  • --monitor — After the output directory is known, spawn a child process that reads <output_dir>/.pipeline_status.json, samples RSS and CPU% for the main pipeline PID (psutil), and renders a rich.Live panel on stderr (or appends plain lines to .monitor.log when stderr is not a TTY) until the pipeline exits. Core deps only. With pip install -e ".[monitor]": press f in the parent TTY to write debug/flamegraph_*.svg (py-spy).
  • --memray / --memray-output PATH — Re-exec the CLI under memray for heap captures (optional .[monitor]; see Live Pipeline Monitor).

Guide: Live Pipeline Monitor · RFC: RFC-065

Cost Projection in Dry-Run Mode

When --dry-run is used with billable API providers configured (OpenAI, Gemini, Mistral, and so on), the pipeline prints a cost projection. This helps you estimate API costs before running expensive operations.

The cost projection includes:

  • Transcription costs - Based on estimated audio duration (from RSS feed metadata or 30-minute fallback)
  • Speaker detection costs - Based on estimated token usage (transcript length + prompt overhead)
  • Summarization costs - Based on estimated token usage (transcript length + prompt overhead)

Example output:

Dry run complete. transcripts_planned=5
  - Direct downloads planned: 3
  - Whisper transcriptions planned: 2
  - Output directory: /path/to/output

Cost Projection (Dry Run):
==============================
Transcription (whisper-1):
  - Episodes: 5
  - Estimated audio: 150.0 minutes
  - Estimated cost: $0.9000

Speaker Detection (gpt-4o-mini):
  - Episodes: 5
  - Estimated tokens: ~29,750 input + ~250 output
  - Estimated cost: $0.0045

Summarization (gpt-4o):
  - Episodes: 5
  - Estimated tokens: ~29,950 input + ~750 output
  - Estimated cost: $0.1625

Total Estimated Cost: $1.0670

Note: Estimates are approximate and based on average episode duration. Real costs vary with audio length and transcript complexity.

Note: Cost projection appears when billable LLM providers are configured (OpenAI, Gemini, Mistral, etc.). Estimates use episode durations from RSS feed metadata when available, or a conservative 30-minute average per episode as a fallback.
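The transcription line in the sample output can be reproduced with simple arithmetic. In this sketch the per-minute rate is not an official price: it is just the value implied by the sample (0.9000 / 150.0), and the helper names are hypothetical. The speaker-detection and summarization figures are copied from the sample output rather than recomputed.

```python
from typing import Optional, Sequence

FALLBACK_MINUTES = 30.0  # per-episode fallback when the feed has no duration
RATE_PER_MINUTE = 0.9000 / 150.0  # implied by the sample output, not a price list


def estimated_audio_minutes(durations_sec: Sequence[Optional[float]]) -> float:
    """Sum episode durations, using the 30-minute fallback for unknown ones."""
    return sum(FALLBACK_MINUTES if d is None else d / 60.0 for d in durations_sec)


def transcription_cost(minutes: float, rate: float = RATE_PER_MINUTE) -> float:
    return minutes * rate


# Five episodes with no duration metadata, as in the sample output.
minutes = estimated_audio_minutes([None] * 5)
cost = transcription_cost(minutes)
total = cost + 0.0045 + 0.1625  # other two lines taken from the sample
print(f"{minutes:.1f} minutes, ${cost:.4f} transcription, ${total:.4f} total")
```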

Optional YAML pricing overrides (pricing_assumptions_file / PRICING_ASSUMPTIONS_FILE) apply to the same formulas; see LLM cost estimate assumptions in the configuration guide.

Configuration Files

The CLI supports JSON and YAML configuration files:

python -m podcast_scraper.cli --config config.json

  • --config PATH - Operator YAML/JSON: corpus paths, episode limits, provider flags, etc. Keys in this file override packaged profile defaults when --profile or profile: is in effect; see the opening section of CONFIGURATION.md (YAML preset (profile:)).
  • --profile NAME - Shorthand for “use config/profiles/<NAME>.yaml as the default bundle.” May be combined with --config: the profile merges first, then the operator file and explicit CLI flags win. Equivalent to putting profile: <NAME> in the operator YAML except you keep the preset name on the command line.

Feed URLs can live in the operator file (rss, feeds, rss_urls) or outside it via --feeds-spec / --rss-file / argv (see RSS and multi-feed).
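As an illustration of the feeds document shape (a root object with a feeds array of URL strings or objects carrying url plus overrides), the sketch below normalizes both entry forms. normalize_feeds is a hypothetical helper, and max_episodes is used only as an example override key; the real allowlist lives in podcast_scraper.rss.feeds_spec.

```python
import json
from typing import Any, Dict, List, Tuple


def normalize_feeds(spec: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: turn each feeds entry into (url, overrides)."""
    normalized: List[Tuple[str, Dict[str, Any]]] = []
    for entry in spec.get("feeds", []):
        if isinstance(entry, str):
            normalized.append((entry, {}))  # bare URL string
        elif isinstance(entry, dict) and "url" in entry:
            overrides = {k: v for k, v in entry.items() if k != "url"}
            normalized.append((entry["url"], overrides))
        else:
            raise ValueError(f"Unsupported feeds entry: {entry!r}")
    return normalized


# Example document; "max_episodes" as a per-feed override is an assumption.
SPEC_JSON = """
{
  "feeds": [
    "https://example.com/feed.xml",
    {"url": "https://example.org/podcast.rss", "max_episodes": 5}
  ]
}
"""

feeds = normalize_feeds(json.loads(SPEC_JSON))
for url, overrides in feeds:
    print(url, overrides)
```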

Diagnostic Commands (Issue #379, #429)

Subcommands: The first argument can be doctor, cache, or pricing-assumptions (e.g. podcast-scraper doctor). Startup checks (Python 3.10+, ffmpeg) run only for the main pipeline; they are skipped for these subcommands so you can run doctor even when ffmpeg is missing.

pricing-assumptions Command

Reports whether the pricing YAML resolves, prints schema_version and metadata (last_reviewed, pricing_effective_date, stale_review_after_days, source_urls). Use this after editing rates or on a schedule to see if assumptions need a human refresh.

python -m podcast_scraper.cli pricing-assumptions
python -m podcast_scraper.cli pricing-assumptions --file config/pricing_assumptions.yaml
make check-pricing-assumptions

Staleness in the report

The command loads metadata from the YAML and evaluates staleness only when both of the following are set and valid:

  • last_reviewed — ISO date (YYYY-MM-DD)
  • stale_review_after_days — positive integer

Rule: let age_days be the number of whole calendar days from last_reviewed to today (local date of the machine running the CLI). If age_days > stale_review_after_days, the file is stale. The command then appends a Staleness: section with a short explanation (verify vendor sites and update the YAML).

If either field is missing, unparsable, or stale_review_after_days is zero or negative, no staleness line is printed (the tool does not guess).
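The staleness rule can be written down in a few lines. This is a sketch of the rule as described, with is_stale as a hypothetical name; it returns None for the cases where the tool does not guess.

```python
from datetime import date
from typing import Any, Optional


def is_stale(
    last_reviewed: Any,
    stale_review_after_days: Any,
    today: Optional[date] = None,
) -> Optional[bool]:
    """Return True/False when staleness can be computed, otherwise None."""
    today = today or date.today()  # local date of the machine running the check
    try:
        reviewed = date.fromisoformat(str(last_reviewed))
    except ValueError:
        return None  # missing or unparsable last_reviewed
    if isinstance(stale_review_after_days, bool) or not isinstance(
        stale_review_after_days, int
    ):
        return None  # missing or not an integer
    if stale_review_after_days <= 0:
        return None  # zero or negative: no staleness evaluation
    age_days = (today - reviewed).days  # whole calendar days
    return age_days > stale_review_after_days


print(is_stale("2024-01-01", 90, today=date(2024, 3, 31)))
print(is_stale("2024-01-01", 90, today=date(2024, 4, 1)))
print(is_stale(None, 90))
```

Note the strict inequality: a file reviewed exactly stale_review_after_days ago is not yet stale.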

--strict exit code

With --strict, the command still prints the full report, but exits with code 1 when the stale condition above is true. Exit code 0 means either not stale, or staleness could not be computed (same cases as “no staleness line”). Use this in CI or release scripts so an expired review date fails the step until someone bumps last_reviewed after verifying prices.
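The resulting exit code is a small truth table. The sketch below assumes the staleness result is represented as True, False, or None (could not be computed); pricing_report_exit_code is a hypothetical name.

```python
from typing import Optional


def pricing_report_exit_code(stale: Optional[bool], strict: bool) -> int:
    """Exit 1 only when --strict is set and the file is known to be stale."""
    return 1 if (strict and stale is True) else 0


# Print the full table: stale state x strict flag.
for stale in (True, False, None):
    for strict in (False, True):
        code = pricing_report_exit_code(stale, strict)
        print(f"stale={stale!s:5} strict={strict!s:5} exit={code}")
```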

python -m podcast_scraper.cli pricing-assumptions --strict
make check-pricing-assumptions ARGS='--strict'

Full field semantics and merge behavior: Configuration: LLM cost estimate assumptions.

doctor Command

The doctor command validates your environment and dependencies:

python -m podcast_scraper.cli doctor

Checks performed:

  • Python version (must be 3.10+)
  • ffmpeg availability (required for Whisper transcription)
  • Write permissions (output directory)
  • ML model cache status (Whisper, Transformers, spaCy)
  • Network connectivity (optional, with --verbose)

Example output:

[ok] Checking Python version...
  [ok] Python 3.11.9 (required: 3.10+)

[ok] Checking ffmpeg...
  [ok] ffmpeg version 6.0

[ok] Checking write permissions...
  [ok] Output directory is writable

[ok] Checking ML model caches...
  [ok] Whisper: 2 models cached (245 MB)
  [ok] Transformers: 3 models cached (1.2 GB)
  [ok] spaCy: 1 model cached (45 MB)

[ok] All checks passed!

Exit codes:

  • 0 - All checks passed
  • 1 - Some checks failed
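A rough standard-library approximation of the first three checks is shown below for readers who want a similar preflight in their own scripts. It is not the doctor implementation: the function names are hypothetical, and the ML cache and network checks are omitted.

```python
import os
import shutil
import sys
import tempfile
from typing import Dict


def run_basic_checks(output_dir: str) -> Dict[str, bool]:
    """Hypothetical preflight covering Python version, ffmpeg, and write access."""
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        "output dir writable": os.path.isdir(output_dir)
        and os.access(output_dir, os.W_OK),
    }


def exit_code(results: Dict[str, bool]) -> int:
    # 0 when every check passed, 1 when any check failed.
    return 0 if all(results.values()) else 1


results = run_basic_checks(tempfile.gettempdir())
for name, passed in results.items():
    print(f"[{'ok' if passed else 'fail'}] {name}")
print("exit code:", exit_code(results))
```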

Exit Codes (Issue #379)

The CLI uses standard exit codes for automation and scripting:

  • 0: Success (pipeline completed successfully, even if some episodes failed)
  • 1: Failure (run-level error: invalid config, missing dependencies, fatal errors)

Exit Code Policy:

  • Exit code 0 is returned when the pipeline completes, even if individual episodes fail.
  • Exit code 1 is only returned for run-level failures (invalid configuration, missing dependencies, fatal errors).
  • Episode-level failures are tracked in metrics and in index.json (per-episode status, error_type, error_message, error_stage). They do not change the exit code: the run still exits 0 when it finishes. Use --fail-fast or --max-failures to stop processing earlier; those flags do not make the exit code 1.

Examples:

# Success: pipeline completed, 5 episodes processed, 2 failed
podcast_scraper --rss https://example.com/feed.xml
echo $?  # 0

# Failure: invalid RSS URL
podcast_scraper --rss invalid-url
echo $?  # 1

# Failure: missing ffmpeg
podcast_scraper --rss https://example.com/feed.xml
echo $?  # 1 (dependency check failed)
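From a Python automation script, the same policy can be checked with subprocess. This is a sketch: run_and_classify is a hypothetical wrapper, and the command executed here is a stand-in Python one-liner rather than the real CLI, so the example runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def run_and_classify(cmd: Sequence[str]) -> str:
    """Hypothetical wrapper: map the process exit status to the documented meaning."""
    completed = subprocess.run(list(cmd), check=False)
    if completed.returncode == 0:
        # The run finished; individual episodes may still have failed,
        # so inspect index.json for per-episode status if that matters.
        return "completed"
    return "run-level failure"


# Stand-in commands; replace with the real CLI invocation in practice.
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
bad_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
print(run_and_classify(ok_cmd))
print(run_and_classify(bad_cmd))
```

Because episode-level failures do not change the exit code, a script that must react to them needs to read index.json in addition to checking the status.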

Examples

Download Transcripts

# Download all available transcripts
python -m podcast_scraper.cli https://example.com/feed.xml

# Limit to 10 episodes
python -m podcast_scraper.cli https://example.com/feed.xml --max-episodes 10

# Oldest-first batching (e.g. back-catalog): process 50 oldest items in the feed
python -m podcast_scraper.cli https://example.com/feed.xml \
  --episode-order oldest \
  --max-episodes 50

# Publish-date window (calendar days; timezone-aware pubDate uses UTC date)
python -m podcast_scraper.cli https://example.com/feed.xml \
  --since 2024-01-01 \
  --until 2024-12-31

Advanced Usage

# Transcribe missing episodes with specific model
python -m podcast_scraper.cli https://example.com/feed.xml \
  --whisper-model small.en

# With screenplay formatting and metadata
python -m podcast_scraper.cli https://example.com/feed.xml \
  --screenplay \
  --num-speakers 2 \
  --speaker-names "Host,Guest" \
  --generate-metadata

Multi-Provider Examples

# Use OpenAI for everything
python -m podcast_scraper.cli https://example.com/feed.xml \
  --transcription-provider openai \
  --speaker-detector-provider openai \
  --summary-provider openai

# Use Gemini for everything
python -m podcast_scraper.cli https://example.com/feed.xml \
  --transcription-provider gemini \
  --speaker-detector-provider gemini \
  --summary-provider gemini

# Mix and match providers
python -m podcast_scraper.cli https://example.com/feed.xml \
  --transcription-provider whisper \
  --speaker-detector-provider gemini \
  --summary-provider openai

Grounded insights (gi) subcommands

Inspect and explore Grounded Insight Layer artifacts (*.gi.json) after a run with generate_gi enabled (see Grounded Insights Guide, RFC-050). Shallow v1 scope (ML vs stub vs bullets, topic explore semantics, deterministic gi query, Postgres deferral): Recorded product decisions (v1, issue 460). With generate_kg too: KG shallow v1 record.

# Validate artifacts (symmetric with `kg validate`; use --strict for full JSON Schema)
python -m podcast_scraper.cli gi validate ./output/metadata --strict

# Export corpus: NDJSON (one artifact per line) or merged bundle (symmetric with `kg export`)
python -m podcast_scraper.cli gi export --output-dir ./output --format ndjson --out gi.ndjson
python -m podcast_scraper.cli gi export --output-dir ./output --format merged --out gi-bundle.json

# One episode: stats, optional full text and quotes (--show)
python -m podcast_scraper.cli gi inspect --episode-path ./output/metadata/ep1.gi.json
python -m podcast_scraper.cli gi inspect --output-dir ./output --episode-id 'sha256:...'
python -m podcast_scraper.cli gi inspect --output-dir ./mycorpus --episode-id 'sha256:...' --feed-id 'rss_example.com_abc123'

# One insight by id (with evidence spans)
python -m podcast_scraper.cli gi show-insight --id 'insight:<id-from-gi.json>' --episode-path ./output/metadata/ep1.gi.json

# Cross-episode: topic / speaker filters, sort, RFC-style JSON
# When ./output/search/vectors.faiss exists, --topic uses semantic ranking first (RFC-061).
python -m podcast_scraper.cli gi explore --output-dir ./output --topic 'AI regulation' --format json
python -m podcast_scraper.cli gi explore --output-dir ./output --speaker Host --sort time --strict

# UC4: fixed English question patterns → explore JSON or topic leaderboard (RFC-050)
python -m podcast_scraper.cli gi query --output-dir ./output --question 'What insights about inflation?'
python -m podcast_scraper.cli gi query --output-dir ./output --question 'What insights are there about trade?'
python -m podcast_scraper.cli gi query --output-dir ./output --question 'What did Sam say?' --limit 10
python -m podcast_scraper.cli gi query --output-dir ./output --question 'What did Sam say about inflation?'
python -m podcast_scraper.cli gi query --output-dir ./output --question 'Which topics have the most insights?'
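To give a feel for what "fixed English question patterns" means, here is a purely illustrative matcher built from the five example questions above. The regular expressions, the kind labels, and parse_question are all hypothetical; the actual patterns accepted by gi query are defined by the project and may differ.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns derived only from the example questions above.
_PATTERNS = [
    (
        "insights_about",
        re.compile(r"^what insights (?:are there )?about (?P<topic>.+?)\?*$", re.I),
    ),
    (
        "speaker_said",
        re.compile(
            r"^what did (?P<speaker>.+?) say(?: about (?P<topic>.+?))?\?*$", re.I
        ),
    ),
    (
        "topic_leaderboard",
        re.compile(r"^which topics have the most insights\?*$", re.I),
    ),
]


def parse_question(question: str) -> Optional[Dict[str, str]]:
    """Return the matched pattern kind plus extracted fields, or None."""
    text = question.strip()
    for kind, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            fields = {k: v for k, v in match.groupdict().items() if v}
            return {"kind": kind, **fields}
    return None  # deterministic: unknown phrasings are not guessed


print(parse_question("What did Sam say about inflation?"))
print(parse_question("Which topics have the most insights?"))
print(parse_question("Tell me something interesting"))
```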

Semantic corpus search (search, index)

Meaning-based retrieval over indexed GIL, summaries, and transcripts (RFC-061). See Semantic Search Guide. Enable indexing with vector_search: true in config or run index manually.

python -m podcast_scraper.cli search "supply chain disruptions" --output-dir ./output
python -m podcast_scraper.cli search "quantum computing" --output-dir ./output --type insight --format json
python -m podcast_scraper.cli index --output-dir ./output
python -m podcast_scraper.cli index --output-dir ./output --rebuild
python -m podcast_scraper.cli index --output-dir ./output --stats

Topic clusters (topic-clusters)

Corpus-wide clustering of KG topic embeddings already stored as kg_topic rows in the FAISS index (RFC-075). Writes topic_clusters.json next to the index (default: <output-dir>/search/topic_clusters.json). Requires an indexed corpus with kg_topic included. The JSON schema_version is "2" for new writes: graph_compound_parent_id / cil_alias_target_topic_id separate viewer compound ids (tc:) from CIL merge targets (topic:). See RFC-075 §3.

python -m podcast_scraper.cli topic-clusters --output-dir ./output
python -m podcast_scraper.cli topic-clusters --output-dir ./output --threshold 0.8
python -m podcast_scraper.cli topic-clusters --output-dir ./output \
  --validate-config ./path/to/my_topic_cluster_validation.yaml
python -m podcast_scraper.cli topic-clusters --output-dir ./output --merge-cil-overrides

--merge-cil-overrides merges auto topic_id_aliases into cil_lift_overrides.json (existing alias keys in that file take precedence). Run after indexing when kg_topic rows exist.

Insight clusters (insight-clusters)

Cross-episode clustering of GI insight texts from *.gi.json artifacts (#599). Uses the same average-linkage algorithm as topic clustering (RFC-075). Writes insight_clusters.json to <output-dir>/search/insight_clusters.json. Requires a corpus with GI artifacts.

python -m podcast_scraper.cli insight-clusters --output-dir ./output
python -m podcast_scraper.cli insight-clusters --output-dir ./output --threshold 0.8
python -m podcast_scraper.cli insight-clusters --output-dir ./output --output-file ./custom_path.json
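For intuition about average-linkage clustering with a threshold, the following pure-Python sketch merges the two most similar clusters until no pair reaches the threshold. It assumes the threshold is a cosine similarity (an assumption; the CLI's exact metric and implementation are described in RFC-075), and the function names are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_linkage(vectors: Sequence[Vector], threshold: float) -> List[List[int]]:
    """Greedy agglomerative clustering; returns clusters as lists of indices."""
    clusters: List[List[int]] = [[i] for i in range(len(vectors))]

    def avg_similarity(c1: List[int], c2: List[int]) -> float:
        # Average-linkage: mean similarity over all cross-cluster pairs.
        pairs = list(product(c1, c2))
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        best_sim, best_a, best_b = -1.0, 0, 1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = avg_similarity(clusters[a], clusters[b])
                if sim > best_sim:
                    best_sim, best_a, best_b = sim, a, b
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return clusters


embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
print(average_linkage(embeddings, threshold=0.8))
```

Under this reading, a higher threshold yields more, tighter clusters, while a lower one merges more aggressively.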

The gi explore --expand-clusters flag loads insight_clusters.json and adds cross-episode cluster context (supporting quotes from other episodes) to each matching insight:

python -m podcast_scraper.cli gi explore --topic "AI safety" --output-dir ./output --expand-clusters
python -m podcast_scraper.cli gi explore --topic "index funds" --output-dir ./output --expand-clusters --format json

Corpus status (corpus-status)

Offline summary for a corpus parent (multi-feed: directory containing feeds/): whether corpus_manifest.json is present, per-feed metadata/ file counts, a sample of index.json failures when available, and whether <corpus_parent>/search holds a FAISS index plus embedding model id from index_meta.json (GitHub #506 / RFC-063 §7).

python -m podcast_scraper.cli corpus-status --output-dir ./corpus_parent
python -m podcast_scraper.cli corpus-status --output-dir ./corpus_parent --format json

PRD-017 quality metrics (grounding rate, quote validity, density) over a run directory:

make gil-quality-metrics DIR=./output
make gil-quality-metrics DIR=./output ARGS='--enforce --min-avg-insights 3 --min-avg-quotes 5'
make compare-gil-runs REF=./output/run_ref CAND=./output/run_ml
make kg-quality-metrics DIR=./output ARGS='--enforce --json'
make quality-metrics-ci

compare-gil-runs expects each path to be a pipeline run root with metadata/*.gi.json (see docs/wip/gil-ml-vs-openai-outcome-benchmark.md).

Knowledge Graph (kg) subcommands

Inspect and export Knowledge Graph artifacts (*.kg.json) after a run with generate_kg enabled. Symmetric to the gi subcommand for GIL.

# Validate all kg.json under a directory (strict schema)
python -m podcast_scraper.cli kg validate ./output/metadata --strict

# Inspect one episode (by file or by episode id under output dir)
python -m podcast_scraper.cli kg inspect --episode-path ./output/metadata/1_ep.kg.json
python -m podcast_scraper.cli kg inspect --output-dir ./output --episode-id 'sha256:...'
python -m podcast_scraper.cli kg inspect --output-dir ./mycorpus --episode-id 'sha256:...' --feed-id 'rss_example.com_abc123'

# Export corpus: NDJSON (one artifact per line) or single merged JSON bundle
python -m podcast_scraper.cli kg export --output-dir ./output --format ndjson --out kg.ndjson
python -m podcast_scraper.cli kg export --output-dir ./output --format merged --out kg-bundle.json

# Aggregate entities across episodes; topic pairs that co-occur in the same episode
python -m podcast_scraper.cli kg entities --output-dir ./output --min-episodes 2 --format json
python -m podcast_scraper.cli kg topics --output-dir ./output --min-support 2 --format json

Details: Knowledge Graph Guide, RFC-056. Shallow v1 scope (extraction + ML, no kg query IR, Postgres deferral): Recorded product decisions (v1, KG shallow). GIL companion: Recorded product decisions (v1, issue 460).

See Also