Pipeline and Workflow Guide

This guide describes how the podcast_scraper pipeline runs: entry points, flow, module roles, and behavioral quirks. For strategic architecture, ADRs, and planned evolution, see Architecture.

High-Level Flow

Entry: podcast_scraper.cli.main parses CLI args (optionally merging JSON/YAML from --config, named preset from --profile, and feed lists from --feeds-spec / --rss-file / argv) into a validated Config object and applies global logging preferences. Human-oriented overview: CLI.md — Quick Start.
Multi-feed outer loop (GitHub #440): When Config.rss_urls contains two or more feeds, cli.main and service.run iterate feeds sequentially: for each URL they derive a child output_dir under <corpus_parent>/feeds/<stable_feed_id>/ and invoke the same inner pipeline with a single-feed sub-config. Failures are logged per feed; the overall process exit code reflects whether any feed failed.
Run orchestration: workflow.orchestration.run_pipeline coordinates the end-to-end job for one feed at a time: output setup, RSS acquisition, episode selection (prepare_episodes_from_feed: episode_order, optional episode_since / episode_until, episode_offset, max_episodes; see CONFIGURATION.md — Episode selection), episode materialization, transcript download, optional Whisper transcription, optional metadata generation, optional summarization, and cleanup. When Config.monitor is true (--monitor), a child process samples the pipeline PID and renders a live RSS/CPU/stage view (or plain .monitor.log when stderr is not a TTY) while updating .pipeline_status.json; the parent may start a stdin listener for optional py-spy (f) when .[monitor] is installed (see Live Pipeline Monitor, RFC-065).
Episode handling: For each Episode, workflow.episode_processor.process_episode_download either saves an existing transcript or enqueues media for Whisper.
Speaker detection (RFC-010): When automatic speaker detection is enabled, host names are extracted from RSS author tags (channel-level <author>, <itunes:author>, <itunes:owner>) as the primary source, falling back to NER extraction from feed metadata if no author tags exist. Guest names are extracted from episode-specific metadata (titles and descriptions) using Named Entity Recognition (NER) with spaCy. Manual speaker names are only used as fallback when detection fails. Note: The pipeline logs debug messages when transcription parallelism is ignored due to provider limitations (e.g., Whisper always uses sequential processing).
Audio Preprocessing (RFC-040, GitHub #561): When preprocessing is enabled, audio files are optimized before transcription: converted to mono, resampled to the configured rate (default 16 kHz), silence removed via VAD, loudness normalized, then encoded to MP3 (libmp3lame). Default MP3 bitrate is auto: 64 kbps for local transcription (for example Whisper) and 48 kbps for OpenAI / Gemini to reduce payload size. If a cached or freshly encoded file is still over the shared post-preprocess cap for those providers, the pipeline may run additional FFmpeg re-encodes at lower standard bitrates until under a target size or the minimum bitrate. The preprocessing cache key includes the MP3 bitrate so different rungs do not collide. Preprocessing runs in workflow.episode_processor.transcribe_media_to_text before any provider receives the audio.
Transcription: When Whisper fallback is enabled, workflow.episode_processor.download_media_for_transcription downloads media to a temp area and workflow.episode_processor.transcribe_media_to_text persists Whisper output using deterministic naming. Detected speaker names are integrated into screenplay formatting when enabled.
Metadata generation (PRD-004/RFC-011): When enabled, per-episode metadata documents are generated alongside transcripts, capturing feed-level and episode-level information, detected speaker names, and processing metadata in JSON/YAML format.
Summarization (PRD-005/RFC-012): When enabled, episode transcripts are summarized using the configured provider — local transformer models (BART, PEGASUS, LED) via MLProvider; the hybrid_ml provider (MAP with LongT5 + REDUCE via Ollama, llama.cpp, or transformers; layered cleaning with transcript_cleaning_strategy + internal preprocessing profiles per RFC-042); or any of 7 LLM providers (OpenAI, Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama) via prompt templates. See ML Provider Reference for ML architecture details.
Run Tracking (Issue #379): Run manifests capture system state at pipeline start. Per-episode stage timings track processing duration. Run summaries combine manifest and metrics. Episode index files list all processed episodes with status.
Progress/UI: All long-running operations report progress through the pluggable factory in utils.progress, defaulting to rich in the CLI.
GIL Extraction (PRD-017): When enabled, the Grounded Insight Layer extracts structured insights and verbatim quotes from transcripts, links them via grounding relationships, and writes a gi.json file per episode. This step runs after summarization and uses the same multi-provider architecture. See Grounded Insights Guide and Architecture for details.
KG Extraction (RFC-055): When enabled, Knowledge Graph extraction produces structured topic graphs from transcripts and summaries, writing *.kg.json per episode. See Knowledge Graph Guide for details.
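
The episode selection step named above (order, optional publish-date window, offset, then cap) can be sketched roughly as follows. This is a minimal illustration and not the body of prepare_episodes_from_feed: the `Item` class, the `select_items` helper, the "newest" / "oldest" values, and the inclusive date bounds are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a parsed RSS item."""
    title: str
    published: date


def select_items(
    items: list[Item],
    episode_order: str = "newest",  # assumed values: "newest" or "oldest"
    episode_since: Optional[date] = None,
    episode_until: Optional[date] = None,
    episode_offset: int = 0,
    max_episodes: Optional[int] = None,
) -> list[Item]:
    # 1. Order by publication date.
    ordered = sorted(
        items, key=lambda i: i.published, reverse=(episode_order == "newest")
    )
    # 2. Optional publish-date window (inclusive on both ends in this sketch).
    if episode_since is not None:
        ordered = [i for i in ordered if i.published >= episode_since]
    if episode_until is not None:
        ordered = [i for i in ordered if i.published <= episode_until]
    # 3. Skip the first `episode_offset` items, then 4. cap the count.
    ordered = ordered[episode_offset:]
    if max_episodes is not None:
        ordered = ordered[:max_episodes]
    return ordered


feed = [Item(f"ep{n}", date(2024, 1, n)) for n in range(1, 8)]
picked = select_items(
    feed, episode_since=date(2024, 1, 3), episode_offset=1, max_episodes=2
)
print([i.title for i in picked])  # ['ep6', 'ep5']
```

Applying the offset before the cap means `max_episodes` counts only the episodes that are actually processed, which is the reading this sketch takes of the order listed above.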

Pipeline Flow Diagram

flowchart TD
    Start([CLI Entry]) --> Parse[Parse CLI Args & Config Files]
    Parse --> Validate[Validate & Normalize Config]
    Validate --> Setup[Setup Output Directory]
    Setup --> FetchRSS[Fetch & Parse RSS Feed]
    FetchRSS --> ExtractEpisodes[Select items then extract episode metadata]
    ExtractEpisodes --> DetectSpeakers{Speaker Detection Enabled?}
    DetectSpeakers -->|Yes| ExtractHosts[Extract Host Names from RSS]
    ExtractHosts --> ExtractGuests[Extract Guest Names via NER]
    DetectSpeakers -->|No| ProcessEpisodes[Process Episodes]
    ExtractGuests --> ProcessEpisodes
    ProcessEpisodes --> CheckTranscript{Transcript Available?}
    CheckTranscript -->|Yes| DownloadTranscript[Download Transcript]
    CheckTranscript -->|No| QueueWhisper[Queue for Whisper]
    DownloadTranscript --> SaveTranscript[Save Transcript File]
    QueueWhisper --> DownloadMedia[Download Media File]
    DownloadMedia --> Preprocess{Preprocessing Enabled?}
    Preprocess -->|Yes| PreprocessAudio[Preprocess Audio: Mono, 16kHz, VAD, Normalize, MP3]
    Preprocess -->|No| Transcribe
    PreprocessAudio --> Transcribe[Whisper Transcription]
    Transcribe --> FormatScreenplay[Format with Speaker Names]
    FormatScreenplay --> SaveTranscript
    SaveTranscript --> GenerateMetadata{Metadata Generation?}
    GenerateMetadata -->|Yes| CreateMetadata[Generate Metadata JSON/YAML]
    GenerateMetadata -->|No| Cleanup
    CreateMetadata --> GenerateSummary{Summarization Enabled?}
    GenerateSummary -->|Yes| Summarize[Generate Summary]
    GenerateSummary -->|No| GILCheck
    Summarize --> AddSummaryToMetadata[Add Summary to Metadata]
    AddSummaryToMetadata --> GILCheck{GIL Extraction?}
    GILCheck -->|Yes| ExtractGIL[Extract Insights + Quotes]
    GILCheck -->|No| KGCheck
    ExtractGIL --> GroundInsights[Ground Insights with Quotes]
    GroundInsights --> WriteGI[Write gi.json]
    WriteGI --> KGCheck{KG Extraction?}
    KGCheck -->|Yes| ExtractKG[Extract Topic Graph]
    KGCheck -->|No| Cleanup
    ExtractKG --> WriteKG[Write kg.json]
    WriteKG --> Cleanup[Cleanup Temp Files]
    Cleanup --> End([Complete])

    style Start fill:#e1f5ff
    style End fill:#d4edda
    style ProcessEpisodes fill:#fff3cd
    style Transcribe fill:#f8d7da
    style GenerateMetadata fill:#d1ecf1
    style ExtractGIL fill:#e8daef
    style ExtractKG fill:#d5f5e3

Module Roles in the Pipeline

cli.py: Parse/validate CLI arguments, integrate config files, set up progress reporting, trigger run_pipeline. Optimized for interactive command-line use.
service.py: Service API for programmatic/daemon use. Provides service.run() and service.run_from_config_file() functions that return structured ServiceResult objects. Works exclusively with configuration files (no CLI arguments), optimized for non-interactive use (supervisor, systemd, etc.). Entry point: python -m podcast_scraper.service --config config.yaml.
config.py: Immutable Pydantic model representing all runtime options; JSON/YAML loader with strict validation and normalization helpers. Includes language configuration, NER settings, and speaker detection flags (RFC-010).
workflow.orchestration: Pipeline coordinator that orchestrates directory prep, RSS parsing, download concurrency, Whisper lifecycle, speaker detection coordination, and cleanup. Emits .pipeline_status.json stage updates when monitor is enabled.
monitor/ (RFC-065): start_monitor_subprocess() (multiprocessing spawn), .pipeline_status.json helpers, cross-process psutil sampler, rich.Live on stderr or .monitor.log, memray_util / py_spy_listener. See Live Pipeline Monitor.
workflow.stages.scraping: RSS fetch/parse (fetch_and_parse_feed) and episode selection (prepare_episodes_from_feed) before Episode construction — order, optional publish-date window, offset, and cap (CONFIGURATION.md — Episode selection). Guide: RSS and feed ingestion.
rss.parser: Safe RSS/XML parsing using defusedxml (ADR-002), discovery of transcript/enclosure URLs, publication-date helpers for filtering, and creation of Episode models from selected items.
rss.downloader / rss.http_policy: HTTP session pooling with retry-enabled adapters (configurable via Config; configure_downloader() and configure_http_policy() at pipeline start for retries and optional Issue #522 fair HTTP), streaming downloads, and shared progress hooks. Guide: RSS and feed ingestion.
workflow.episode_processor: Episode-level decision logic, transcript storage, Whisper job management, delay handling, and file naming rules. Integrates detected speaker names into Whisper screenplay formatting.
utils.filesystem: Filename sanitization, output directory derivation based on feed hash (ADR-003), run suffix logic, and helper utilities for Whisper output paths.
Provider System (RFC-013, RFC-029): Protocol-based provider architecture for transcription, speaker detection, and summarization (ADR-020). Each capability has a protocol interface (TranscriptionProvider, SpeakerDetector, SummarizationProvider) and factory functions that create provider instances based on configuration. Providers implement initialize(), protocol methods (e.g., transcribe(), summarize()), and cleanup(). See Provider Implementation Guide for details.
Unified Providers (RFC-029): Nine summarization options; nine unified provider classes (1 local ML + 1 hybrid ML + 7 LLM) implement protocol combinations (ADR-024):

Provider	Transcription	Speaker Detection	Summarization	Notes
`MLProvider`	Whisper	spaCy NER	Transformers	Local, no API cost
`HybridMLProvider`	No	No	MAP-REDUCE	LongT5 MAP + Ollama/llama_cpp/transformers REDUCE (RFC-042)
`OpenAIProvider`	Whisper API	GPT API	GPT API	Cloud, prompt-managed
`GeminiProvider`	Gemini API	Gemini API	Gemini API	2M context, native audio
`AnthropicProvider`	No	Claude API	Claude API	High quality reasoning
`MistralProvider`	No	Mistral API	Mistral API	OpenAI alternative
`DeepSeekProvider`	No	DeepSeek API	DeepSeek API	Ultra low-cost
`GrokProvider`	No	Grok API	Grok API	Real-time info (xAI)
`OllamaProvider`	No	Ollama API	Ollama API	Local LLM, zero cost
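
The protocol-plus-factory arrangement behind this table can be illustrated with a small, self-contained sketch. Only the method names (initialize(), summarize(), cleanup()) and the protocol name come from this guide; `FirstSentenceProvider`, the `_REGISTRY` mapping, and `create_summarization_provider` are hypothetical names invented for the example, and the real factories select providers from Config rather than from a plain string.

```python
from typing import Protocol


class SummarizationProvider(Protocol):
    """Shape suggested by the guide: initialize(), a protocol method, cleanup()."""

    def initialize(self) -> None: ...
    def summarize(self, text: str) -> str: ...
    def cleanup(self) -> None: ...


class FirstSentenceProvider:
    """Hypothetical toy provider; it satisfies the protocol structurally."""

    def __init__(self) -> None:
        self.ready = False

    def initialize(self) -> None:
        # A real provider would load a model or open an API client here.
        self.ready = True

    def summarize(self, text: str) -> str:
        if not self.ready:
            raise RuntimeError("initialize() must be called first")
        return text.split(".")[0].strip() + "."

    def cleanup(self) -> None:
        self.ready = False


# Hypothetical registry keyed by a provider name.
_REGISTRY = {"first_sentence": FirstSentenceProvider}


def create_summarization_provider(name: str) -> SummarizationProvider:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown summarization provider: {name!r}") from None


provider = create_summarization_provider("first_sentence")
provider.initialize()
try:
    summary = provider.summarize("Hosts discuss RSS parsing. Then they move on.")
finally:
    provider.cleanup()  # always release resources, even on failure
print(summary)  # Hosts discuss RSS parsing.
```

Because `typing.Protocol` relies on structural typing, the toy provider does not inherit from the protocol class; any object with matching methods is accepted by a type checker.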

Factories: Factory functions in transcription/factory.py, speaker_detectors/factory.py, and summarization/factory.py create the appropriate unified provider based on configuration.
Capabilities: providers/capabilities.py defines ProviderCapabilities — a dataclass describing what each provider supports (JSON mode, tool calls, streaming, etc.). Used by factories and orchestration to select appropriate providers.
Prompt Management (RFC-017): prompts/store.py implements versioned Jinja2 prompt templates organized by <provider>/<task>/<version>.j2. Each of the 9 providers has tuned templates for summarization and NER. LLM providers load prompts via PromptStore.render() ensuring consistent, version-tracked prompt engineering.
providers/ml/whisper_utils.py: Lazy loading of the third-party openai-whisper library, transcription invocation with language-aware model selection (preferring .en variants for English), and screenplay formatting helpers that use detected speaker names. Accessed via MLProvider (unified provider pattern).
speaker_detection.py (RFC-010): Named Entity Recognition using spaCy to extract PERSON entities from episode metadata, distinguish hosts from guests, and provide speaker names for Whisper screenplay formatting. spaCy is a required dependency. Now accessed via MLProvider (unified provider pattern).
providers/ml/summarizer.py (PRD-005/RFC-012): Episode summarization using local transformer models (BART, PEGASUS, LED) to generate concise summaries from transcripts. Implements a hybrid map-reduce strategy. Accessed via MLProvider (unified provider pattern). See ML Provider Reference for details.
gi/ (PRD-017): Grounded Insight Layer — structured insight and quote extraction with evidence grounding. Key modules: pipeline.py (orchestration), schema.py (validation), grounding.py (insight-quote linking), contracts.py (grounding contract), explore.py (CLI exploration), corpus.py (cross-episode operations), io.py (serialization), provenance.py (provenance tracking), compare_runs.py (cross-run comparison), quality_metrics.py (quality scoring).
kg/ (RFC-055): Knowledge Graph extraction — structured topic graphs from transcripts and summaries. Key modules: pipeline.py (orchestration), schema.py (validation), llm_extract.py (LLM-based extraction), cli_handlers.py (CLI subcommands), io.py (serialization), corpus.py (cross-episode operations), contracts.py (KG contract enforcement), quality_metrics.py (quality scoring).
utils.progress: Minimal global progress publishing API so callers can swap in alternative UIs.
models/ (package): Simple dataclasses (RssFeed, Episode, TranscriptionJob in entities.py) shared across modules.
workflow.metadata_generation (PRD-004/RFC-011): Per-episode metadata document generation, capturing feed-level and episode-level information, detected speaker names, transcript sources, processing metadata, and optional summaries in structured JSON/YAML format. Opt-in feature for backwards compatibility.
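
As an illustration of the language-aware model selection mentioned for providers/ml/whisper_utils.py, the sketch below prefers an English-only `.en` variant when one exists. The helper name `pick_whisper_model` and its exact rules are assumptions rather than the project's logic; the set of sizes reflects the openai-whisper models that ship with an `.en` variant.

```python
# Model sizes that openai-whisper publishes with an English-only ".en" variant.
_EN_VARIANT_SIZES = {"tiny", "base", "small", "medium"}


def pick_whisper_model(requested: str, language: str) -> str:
    """Prefer the English-only variant when the language is English.

    Hypothetical helper: the name and rules are assumptions of this sketch.
    """
    lang = language.strip().lower()
    is_english = lang == "en" or lang.startswith("en-")
    if is_english and requested in _EN_VARIANT_SIZES:
        return f"{requested}.en"
    if not is_english and requested.endswith(".en"):
        # Assumption: an English-only model is swapped for its multilingual
        # counterpart when another language is configured.
        return requested[: -len(".en")]
    return requested


print(pick_whisper_model("base", "en"))   # base.en
print(pick_whisper_model("large", "en"))  # large
```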

Module Dependencies Diagram

graph TB
    subgraph "Public API"
        CLI[cli.py]
        Config[config.py]
        Service[service.py]
        Workflow[workflow/orchestration.py]
    end

    subgraph "Core Processing"
        RSSParser[rss/parser.py]
        EpisodeProc[workflow/episode_processor.py]
        Downloader[rss/downloader.py]
        AudioPreproc[preprocessing/]
    end

    subgraph "Support Modules"
        Filesystem[utils/filesystem.py]
        Models[models/]
        Progress[utils/progress.py]
        Schemas[schemas/summary_schema.py]
    end

    subgraph "Prompt Management"
        PromptStore[prompts/store.py]
        PromptTemplates[prompts/provider/task/v1.j2]
    end

    subgraph "Provider System"
        TranscriptionFactory[transcription/factory.py]
        SpeakerFactory[speaker_detectors/factory.py]
        SummaryFactory[summarization/factory.py]
        Capabilities[providers/capabilities.py]
    end

    subgraph "Local ML Provider"
        MLProvider[providers/ml/ml_provider.py]
        HybridMLProvider[providers/ml/hybrid_ml_provider.py]
        Whisper[providers/ml/whisper_utils.py]
        SpeakerDetect[providers/ml/speaker_detection.py]
        Summarizer[providers/ml/summarizer.py]
        NER[providers/ml/ner_extraction.py]
    end

    subgraph "LLM Providers"
        OpenAIProvider[providers/openai/]
        GeminiProvider[providers/gemini/]
        AnthropicProvider[providers/anthropic/]
        MistralProvider[providers/mistral/]
        DeepSeekProvider[providers/deepseek/]
        GrokProvider[providers/grok/]
        OllamaProvider[providers/ollama/]
    end

    subgraph "Knowledge Extraction"
        GI[gi/pipeline.py]
        GISchema[gi/schema.py]
        GIGrounding[gi/grounding.py]
        KG[kg/pipeline.py]
        KGSchema[kg/schema.py]
        KGExtract[kg/llm_extract.py]
    end

    subgraph "Optional Features"
        Metadata[workflow/metadata_generation.py]
        Evaluation[evaluation/]
        Cleaning[cleaning/]
    end

    CLI --> Config
    CLI --> Workflow
    Service --> Config
    Service --> Workflow
    Workflow --> RSSParser
    Workflow --> EpisodeProc
    Workflow --> Downloader
    Workflow --> TranscriptionFactory
    Workflow --> SpeakerFactory
    Workflow --> SummaryFactory
    Workflow --> AudioPreproc
    TranscriptionFactory --> MLProvider
    TranscriptionFactory --> OpenAIProvider
    TranscriptionFactory --> GeminiProvider
    SpeakerFactory --> MLProvider
    SpeakerFactory --> OpenAIProvider
    SpeakerFactory --> GeminiProvider
    SpeakerFactory --> AnthropicProvider
    SpeakerFactory --> MistralProvider
    SpeakerFactory --> DeepSeekProvider
    SpeakerFactory --> GrokProvider
    SpeakerFactory --> OllamaProvider
    SummaryFactory --> MLProvider
    SummaryFactory --> HybridMLProvider
    SummaryFactory --> OpenAIProvider
    SummaryFactory --> GeminiProvider
    SummaryFactory --> AnthropicProvider
    SummaryFactory --> MistralProvider
    SummaryFactory --> DeepSeekProvider
    SummaryFactory --> GrokProvider
    SummaryFactory --> OllamaProvider
    MLProvider --> Whisper
    MLProvider --> SpeakerDetect
    MLProvider --> Summarizer
    MLProvider --> NER
    HybridMLProvider --> Summarizer
    OpenAIProvider --> PromptStore
    GeminiProvider --> PromptStore
    AnthropicProvider --> PromptStore
    MistralProvider --> PromptStore
    DeepSeekProvider --> PromptStore
    GrokProvider --> PromptStore
    OllamaProvider --> PromptStore
    PromptStore --> PromptTemplates
    Workflow --> Metadata
    Workflow --> GI
    Workflow --> KG
    Workflow --> Filesystem
    GI --> GISchema
    GI --> GIGrounding
    GI --> SummaryFactory
    KG --> KGSchema
    KG --> KGExtract
    KGExtract --> SummaryFactory
    EpisodeProc --> Downloader
    EpisodeProc --> Filesystem
    RSSParser --> Models
    Summarizer --> Models
    SpeakerDetect --> Models
    TranscriptionFactory --> Capabilities
    SpeakerFactory --> Capabilities
    SummaryFactory --> Capabilities

Generated diagrams:

Module dependency graph (pydeps) — Regenerate with make visualize (requires Graphviz).
Workflow call graph (pyan3) — Function-level calls from workflow/orchestration.py. Regenerate with make call-graph.
Orchestration flowchart, Service API flowchart — Regenerate with make flowcharts (code2flow).

Pipeline and Workflow Behavior (Quirks)

Typed, immutable configuration: Config is a frozen Pydantic model, ensuring every module receives canonicalized values (e.g., normalized URLs, integer coercions, validated Whisper models). This centralizes validation and guards downstream logic.
Resilient HTTP interactions: A per-thread requests.Session with exponential backoff retry (LoggingRetry) handles transient network issues while logging retries for observability. Retry counts and backoff factors are configurable via Config fields (http_retry_total, http_backoff_factor, rss_retry_total, rss_backoff_factor) with resilient defaults (8 retries / 1.0 s backoff for media; 10 retries / 2.0 s backoff for RSS). An additional application-level episode retry (episode_retry_max, default 1) re-runs the full episode download on transient network errors after urllib3 retries are exhausted. Optional per-host pacing, circuit breaker, and RSS conditional GET (Issue #522) are in the same section, with recommended presets. See CONFIGURATION.md -- Download Resilience (canonical). Model loading operations use retry_with_exponential_backoff for transient errors (network failures, timeouts).
Concurrent transcript pulls: Transcript downloads are parallelized via ThreadPoolExecutor, guarded with locks when mutating shared counters/job queues. Whisper remains sequential to avoid GPU/CPU thrashing and to keep the UX predictable.
Deterministic filesystem layout: Output folders follow output/rss_<host>_<hash> conventions. Optional run_id and Whisper suffixes create run-scoped subdirectories while sanitize_filename protects against filesystem hazards.
Dry-run and resumability: --dry-run walks the entire plan without touching disk, while --skip-existing short-circuits work per episode, making repeated runs idempotent.
Pluggable progress/UI: A narrow ProgressFactory abstraction lets embedding applications replace the default tqdm progress without touching business logic.
Optional Whisper dependency: Whisper is imported lazily and guarded so environments without GPU support or openai-whisper can still run transcript-only workloads.
Optional summarization dependency (PRD-005/RFC-012): Summarization requires torch and transformers dependencies and is imported lazily. When dependencies are unavailable, summarization is gracefully skipped. Models are automatically selected based on available hardware (MPS for Apple Silicon, CUDA for NVIDIA GPUs, CPU fallback). See ML Provider Reference for details.
Language-aware processing (RFC-010): A single language configuration drives both Whisper model selection (preferring English-only .en variants) and NER model selection (e.g., en_core_web_sm), ensuring consistent language handling across the pipeline.
Automatic speaker detection (RFC-010): Named Entity Recognition extracts speaker names from episode metadata transparently. Manual speaker names (--speaker-names) are used only as a fallback when automatic detection fails, not as an override. spaCy is a required dependency for speaker detection.
Host/guest distinction: Host detection prioritizes RSS author tags (channel-level only) as the most reliable source, falling back to NER extraction from feed metadata when author tags are unavailable. Guests are always detected from episode-specific metadata using NER, ensuring accurate speaker labeling in Whisper screenplay output.
When the feed author is an organization (Issue #393): If the RSS channel author is an organization name (e.g. NPR, BBC, CNN), the pipeline does not treat it as a host. In that case, host detection uses, in order: NER on feed title/description, episode-level authors (e.g. itunes:author per episode), and, if set, the known_hosts config option. For network podcasts, set known_hosts in config to supply host names explicitly when auto-detection finds none.
Provider-based architecture (RFC-013): All capabilities (transcription, speaker detection, summarization) use a protocol-based provider system. Providers are created via factory functions based on configuration, allowing pluggable implementations. Providers implement consistent interfaces (initialize(), protocol methods, cleanup()). See Provider Implementation Guide for complete implementation details.
Local-first summarization (PRD-005/RFC-012): Summarization defaults to local transformer models for privacy and cost-effectiveness. API-based providers (for example OpenAI) are supported via the provider system. Long transcripts are handled via chunking strategies, and memory optimization is applied for GPU backends (CUDA/MPS). Models are automatically cached and reused across runs, with cache management utilities available via CLI and programmatic APIs. Model loading prefers safetensors format for security and performance (Issue #379). Pinned model revisions ensure reproducibility (Issue #379).
Reproducibility (Issue #379): Deterministic runs via seed control (torch, numpy, transformers). Run manifests capture complete system state. Per-episode stage timings enable performance analysis. Run summaries combine manifest and metrics for complete run records.
Operational Hardening (Issue #379): Three-layer retry with configurable parameters and resilient defaults (see CONFIGURATION.md -- Download Resilience). Timeout enforcement for transcription and summarization. Failure handling flags (--fail-fast, --max-failures) for pipeline control. Structured JSON logging for log aggregation. Path validation and model allowlist validation for security. End-of-run failure_summary in run.json aggregates failed episodes by error type for triage.
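
The "concurrent transcript pulls, sequential Whisper" behavior above can be sketched with the standard library alone. Everything here is illustrative: `DownloadStats`, `process_episode`, and the dictionary-shaped episodes are hypothetical stand-ins, and the network call is replaced by a simple check so the example runs offline.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DownloadStats:
    """Shared counter and job queue mutated from worker threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.saved = 0
        self.whisper_jobs: list[str] = []

    def record_saved(self) -> None:
        with self._lock:
            self.saved += 1

    def enqueue_for_whisper(self, episode_id: str) -> None:
        with self._lock:
            self.whisper_jobs.append(episode_id)


def process_episode(episode: dict, stats: DownloadStats) -> None:
    # Stand-in for the network call: a transcript URL means "download and save".
    if episode.get("transcript_url"):
        stats.record_saved()
    else:
        stats.enqueue_for_whisper(episode["id"])


episodes = [
    {"id": f"ep{n}", "transcript_url": "https://example.invalid/t" if n % 3 else None}
    for n in range(1, 10)
]
stats = DownloadStats()

# Downloads fan out across threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    # list() drains the iterator so worker exceptions surface here.
    list(pool.map(lambda ep: process_episode(ep, stats), episodes))

# Transcription stays sequential, one queued job at a time.
for episode_id in sorted(stats.whisper_jobs):
    print(f"transcribing {episode_id}")
```

Keeping the lock inside the stats object means callers cannot forget to take it, which is one simple way to guard shared counters and queues.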

Run Tracking Files (Issue #379, #429)

The pipeline writes several tracking files in each run directory (output_dir/run_<suffix>/):

File	Purpose
run.json	Top-level run summary combining run manifest and pipeline metrics. Includes `index_file` and `run_manifest_file` so consumers can locate the episode index and run manifest. When episodes fail, includes a `failure_summary` section with `total_failed`, `by_error_type` (counts), and `failed_episode_ids`.
index.json	Episode index listing every episode in the run with `status` (`ok` / `failed` / `skipped`), paths (transcript, metadata), and on failure: `error_type`, `error_message`, `error_stage`. Use for scripting or debugging partial runs.
run_manifest.json	System state at pipeline start: git SHA, config hash, Python version, OS, GPU info, model versions, seed (reproducibility).
metrics.json	Pipeline metrics: episode statuses, stage timings, performance data.

Run summaries and metrics use schema_version: "1.0.0". index.json uses 1.0.0 by default; runs with append (GitHub #444) write schema_version: "1.1.0" and may include optional pipeline_append: true. Older 1.0.0 index files remain valid; append resume prefers on-disk metadata + episode_id over index rows alone (index may be stale after a crash). For exit codes and partial failures, see Troubleshooting - Exit codes and partial failures.
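
For scripting against these files, a reader might start from something like the sketch below, which prints a one-line triage summary from run.json. The field names come from the table above; treating `by_error_type` as a name-to-count mapping, reading a missing `failure_summary` as "no failures", and the helper name `summarize_failures` are assumptions. The sample data is fabricated only to exercise the helper.

```python
import json
import tempfile
from pathlib import Path


def summarize_failures(run_dir: Path) -> str:
    """Return a one-line triage summary built from run.json."""
    run = json.loads((run_dir / "run.json").read_text(encoding="utf-8"))
    failures = run.get("failure_summary")
    if not failures:
        return "no failed episodes"
    by_type = ", ".join(
        f"{name}={count}"
        for name, count in sorted(failures["by_error_type"].items())
    )
    return f"{failures['total_failed']} failed ({by_type})"


# Fabricated sample run directory, written to a temporary location.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    sample = {
        "schema_version": "1.0.0",
        "failure_summary": {
            "total_failed": 3,
            "by_error_type": {"TimeoutError": 2, "ValueError": 1},
            "failed_episode_ids": ["ep2", "ep5", "ep9"],
        },
    }
    (run_dir / "run.json").write_text(json.dumps(sample), encoding="utf-8")
    report = summarize_failures(run_dir)

print(report)  # 3 failed (TimeoutError=2, ValueError=1)
```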

Operational note (multi-feed + append): With feeds: / rss_urls: (or --feeds-spec) and append: true, each feed still has its own feeds/<stable_feed_id>/run_append_*/ tree. Run the same command twice (fixtures or real RSS) to confirm the second pass skips complete episodes without creating duplicate run_* directories. Example: --feeds-spec path/to/your-two-feed.yaml with append: true on the operator YAML.

Partial multi-feed success: exit code lessons

Status: Draft notes (operator / design). Related GitHub: #557 (Whisper 25 MB + incident log), #558 (FFmpeg/ffprobe subprocess UTF-8 replace + preprocessing fallback incident rows). #559 (soft-failure taxonomy; default lenient with multi_feed_strict: false; strict CI uses multi_feed_strict: true or CLI --multi-feed-strict) is implemented in code and docs (CONFIGURATION / CLI).

What went well (resilience that already exists)

Per-feed isolation in service._run_multi_feed: One feed’s run_pipeline exception (e.g. RSS fetch ValueError) is caught; other feeds still run. See src/podcast_scraper/service.py (try / except around workflow.run_pipeline(sub_cfg)).

Finalize on partial batches: finalize_multi_feed_batch in src/podcast_scraper/workflow/corpus_operations.py still writes manifest and corpus_run_summary.json, and runs parent index_corpus where configured; vector index failure there is logged as non-fatal in a try / except.

Episode-level transcription errors: Many paths log transcription raised an unexpected error, increment metrics, and continue the loop unless fail_fast / max_failures triggers (see src/podcast_scraper/workflow/stages/transcription.py).

So the long manual run did complete batch work: artifacts, corpus summary, and indexing for feeds that succeeded.
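
The per-feed isolation pattern described above amounts to a try / except around each feed's run, with errors collected rather than raised. The sketch below is a simplified illustration, not the body of service._run_multi_feed: `run_feeds`, `fake_pipeline`, and the `feed_<index>` directory name are hypothetical, and the real `<stable_feed_id>` derivation is not shown in this guide.

```python
from pathlib import Path


def run_feeds(feed_urls, corpus_parent: Path, run_one) -> dict[str, str]:
    """Run each feed in turn; one feed's exception never stops the others.

    `run_one(url, output_dir)` is a hypothetical stand-in for the inner
    single-feed pipeline call.
    """
    errors: dict[str, str] = {}
    for index, url in enumerate(feed_urls):
        # Placeholder ID standing in for the real stable feed identifier.
        out_dir = corpus_parent / "feeds" / f"feed_{index}"
        try:
            run_one(url, out_dir)
        except Exception as exc:  # broad on purpose: isolate any per-feed failure
            errors[url] = f"{type(exc).__name__}: {exc}"
    return errors


completed: list[str] = []


def fake_pipeline(url: str, output_dir: Path) -> None:
    # Simulates an RSS fetch failure for one feed; nothing is written to disk.
    if "missing" in url:
        raise ValueError("RSS fetch failed")
    completed.append(url)


errors = run_feeds(
    [
        "https://example.invalid/a.xml",
        "https://example.invalid/missing.xml",
        "https://example.invalid/c.xml",
    ],
    Path("corpus"),
    fake_pipeline,
)
print(completed)  # the first and third feeds still ran
print(errors)     # only the failing feed is recorded
```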

What felt like “blowing up” at the end

Process exit code: Operators often want exit 0 when “best effort” corpus work finished and artifacts are usable, with failures only in structured JSON / logs.

Today, any feed-level error collection in service._run_multi_feed sets ServiceResult.success=False and aggregates error text; the CLI / acceptance harness then treats the run as failed even when most feeds succeeded.

Terminal Error: line: A single concatenated message listing every bad RSS URL is easy to read as a fatal crash, even though the session already wrote summaries and ran follow-up steps (e.g. acceptance analysis).

Semantic mismatch: Some failures are data / network (404 RSS), some are bugs (#558), some are policy / limits (#557). Lumping them all into one “failed run” bucket makes CI and manual runs harder to interpret.

Design intent to capture

Layer	Desired behavior
Orchestration	Always finish finalize (manifest, summary, optional index) when safe; never lose partial work silently.
Reporting	Keep `corpus_run_summary.json` `overall_ok` accurate (false when not all feeds are ok). Optionally add `soft_failures` or `hard_failures` counts for triage.
Exit code	Separate “artifacts written” from “all feeds green”. Options: new config flag (e.g. `treat_feed_errors_as_warnings`), or CLI flag `--best-effort-exit-zero`, or distinct exit codes (documented).
Failures	Skip with reason for: RSS 404, optional oversize Whisper (after #557), optional decode (after #558). Fail or non-zero for: config invalid, lock failure, total loss of corpus root.
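
One way to read the "Exit code" and "Failures" rows together is as a small classification step followed by an exit-code decision. The sketch below explores that single option and does not describe shipped behavior: the failure-kind labels, the `Severity` enum, and the `exit_code` helper are hypothetical, and `strict` only loosely mirrors the multi_feed_strict idea mentioned in the status note.

```python
from enum import Enum


class Severity(Enum):
    SOFT = "soft"  # skip with reason; artifacts remain usable
    HARD = "hard"  # the run cannot be trusted


# Illustrative labels only; these are not identifiers from the codebase.
_SEVERITY_BY_KIND = {
    "rss_404": Severity.SOFT,
    "whisper_oversize": Severity.SOFT,
    "audio_decode": Severity.SOFT,
    "config_invalid": Severity.HARD,
    "lock_failure": Severity.HARD,
    "corpus_root_lost": Severity.HARD,
}


def exit_code(failure_kinds: list[str], strict: bool = False) -> int:
    """Map collected failure kinds to a process exit code.

    Unknown kinds count as hard failures so that new error types are never
    silently downgraded (an assumption of this sketch).
    """
    severities = [_SEVERITY_BY_KIND.get(kind, Severity.HARD) for kind in failure_kinds]
    if Severity.HARD in severities:
        return 1
    if strict and severities:
        return 1  # strict mode: even soft failures fail the run
    return 0


print(exit_code(["rss_404"]))               # 0
print(exit_code(["rss_404"], strict=True))  # 1
```

Under this reading, lenient runs exit 0 when only soft failures occurred, while the structured summary still records every failure for triage.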

Follow-up

#559 exit semantics and optional extra triage fields remain as design notes. #557 and #558 behavior is implemented in code (CONFIGURATION.md). Keep this file as narrative until exit-code / reporting follow-ups ship, then trim or link from docs/guides/ if maintainers want it public.