Pipeline and Workflow Guide
This guide describes how the podcast_scraper pipeline runs: entry points, flow, module roles, and behavioral quirks. For strategic architecture, ADRs, and planned evolution, see Architecture.
High-Level Flow
- Entry: `podcast_scraper.cli.main` parses CLI args (optionally merging JSON/YAML from `--config`, a named preset from `--profile`, and feed lists from `--feeds-spec` / `--rss-file` / argv) into a validated `Config` object and applies global logging preferences. Human-oriented overview: CLI.md - Quick Start.
- Multi-feed outer loop (GitHub #440): When `Config.rss_urls` contains two or more feeds, `cli.main` and `service.run` iterate feeds sequentially: for each URL they derive a child `output_dir` under `<corpus_parent>/feeds/<stable_feed_id>/` and invoke the same inner pipeline with a single-feed sub-config. Failures are logged per feed; the overall process exit code reflects whether any feed failed.
- Run orchestration: `workflow.orchestration.run_pipeline` coordinates the end-to-end job for one feed at a time: output setup, RSS acquisition, episode selection (`prepare_episodes_from_feed`: `episode_order`, optional `episode_since` / `episode_until`, `episode_offset`, `max_episodes`; see CONFIGURATION.md - Episode selection), episode materialization, transcript download, optional Whisper transcription, optional metadata generation, optional summarization, and cleanup. When `Config.monitor` is true (`--monitor`), a child process samples the pipeline PID and renders a live RSS/CPU/stage view (or plain `.monitor.log` when stderr is not a TTY) while updating `.pipeline_status.json`; the parent may start a stdin listener for optional py-spy (`f`) when `.[monitor]` is installed (see Live Pipeline Monitor, RFC-065).
- Episode handling: For each `Episode`, `workflow.episode_processor.process_episode_download` either saves an existing transcript or enqueues media for Whisper.
- Speaker detection (RFC-010): When automatic speaker detection is enabled, host names are extracted from RSS author tags (channel-level `<author>`, `<itunes:author>`, `<itunes:owner>`) as the primary source, falling back to NER extraction from feed metadata if no author tags exist. Guest names are extracted from episode-specific metadata (titles and descriptions) using Named Entity Recognition (NER) with spaCy. Manual speaker names are only used as fallback when detection fails. Note: The pipeline logs debug messages when transcription parallelism is ignored due to provider limitations (e.g., Whisper always uses sequential processing).
- Audio Preprocessing (RFC-040, GitHub #561): When preprocessing is enabled, audio files are optimized before transcription: converted to mono, resampled to the configured rate (default 16 kHz), silence removed via VAD, loudness normalized, then encoded to MP3 (`libmp3lame`). Default MP3 bitrate is auto: 64 kbps for local transcription (for example Whisper) and 48 kbps for OpenAI / Gemini to reduce payload size. If a cached or freshly encoded file is still over the shared post-preprocess cap for those providers, the pipeline may run additional FFmpeg re-encodes at lower standard bitrates until under a target size or the minimum bitrate. The preprocessing cache key includes the MP3 bitrate so different rungs do not collide. Preprocessing runs in `workflow.episode_processor.transcribe_media_to_text` before any provider receives the audio.
- Transcription: When Whisper fallback is enabled, `workflow.episode_processor.download_media_for_transcription` downloads media to a temp area and `workflow.episode_processor.transcribe_media_to_text` persists Whisper output using deterministic naming. Detected speaker names are integrated into screenplay formatting when enabled.
- Metadata generation (PRD-004/RFC-011): When enabled, per-episode metadata documents are generated alongside transcripts, capturing feed-level and episode-level information, detected speaker names, and processing metadata in JSON/YAML format.
- Summarization (PRD-005/RFC-012): When enabled, episode transcripts are summarized using the configured provider: local transformer models (BART, PEGASUS, LED) via `MLProvider`; the hybrid_ml provider (MAP with LongT5 + REDUCE via Ollama, llama.cpp, or transformers; layered cleaning with `transcript_cleaning_strategy` + internal preprocessing profiles per RFC-042); or any of 7 LLM providers (OpenAI, Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama) via prompt templates. See ML Provider Reference for ML architecture details.
- Run Tracking (Issue #379): Run manifests capture system state at pipeline start. Per-episode stage timings track processing duration. Run summaries combine manifest and metrics. Episode index files list all processed episodes with status.
- Progress/UI: All long-running operations report progress through the pluggable factory in `utils.progress`, defaulting to `rich` in the CLI.
- GIL Extraction (PRD-017): When enabled, the Grounded Insight Layer extracts structured insights and verbatim quotes from transcripts, links them via grounding relationships, and writes a `gi.json` file per episode. This step runs after summarization and uses the same multi-provider architecture. See Grounded Insights Guide and Architecture for details.
- KG Extraction (RFC-055): When enabled, Knowledge Graph extraction produces structured topic graphs from transcripts and summaries, writing `*.kg.json` per episode. See Knowledge Graph Guide for details.
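The episode selection step named in the run-orchestration bullet can be pictured as a short filter chain. The sketch below is illustrative only: `Item`, `select_items`, the `"newest"` / `"oldest"` order values, and the inclusive date bounds are assumptions made for this example, not the actual `prepare_episodes_from_feed` implementation. It applies order, then the optional publish-date window, then the offset, then the cap.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a parsed RSS item."""
    title: str
    published: date


def select_items(
    items: List[Item],
    episode_order: str = "newest",          # assumed values: "newest" or "oldest"
    episode_since: Optional[date] = None,   # assumed inclusive lower bound
    episode_until: Optional[date] = None,   # assumed inclusive upper bound
    episode_offset: int = 0,
    max_episodes: Optional[int] = None,
) -> List[Item]:
    """Apply order, then date window, then offset, then cap."""
    ordered = sorted(
        items, key=lambda i: i.published, reverse=(episode_order == "newest")
    )
    windowed = [
        i for i in ordered
        if (episode_since is None or i.published >= episode_since)
        and (episode_until is None or i.published <= episode_until)
    ]
    sliced = windowed[episode_offset:]
    return sliced if max_episodes is None else sliced[:max_episodes]


# Five items published on 1-5 January 2024.
feed = [Item(f"ep{n}", date(2024, 1, n)) for n in range(1, 6)]
picked = select_items(
    feed, episode_since=date(2024, 1, 2), episode_offset=1, max_episodes=2
)
print([i.title for i in picked])  # ['ep4', 'ep3']
```

Selecting before `Episode` construction, as the guide describes, means only the surviving items need to be materialized.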
Pipeline Flow Diagram
flowchart TD
Start([CLI Entry]) --> Parse[Parse CLI Args & Config Files]
Parse --> Validate[Validate & Normalize Config]
Validate --> Setup[Setup Output Directory]
Setup --> FetchRSS[Fetch & Parse RSS Feed]
FetchRSS --> ExtractEpisodes[Select items then extract episode metadata]
ExtractEpisodes --> DetectSpeakers{Speaker Detection Enabled?}
DetectSpeakers -->|Yes| ExtractHosts[Extract Host Names from RSS]
ExtractHosts --> ExtractGuests[Extract Guest Names via NER]
DetectSpeakers -->|No| ProcessEpisodes[Process Episodes]
ExtractGuests --> ProcessEpisodes
ProcessEpisodes --> CheckTranscript{Transcript Available?}
CheckTranscript -->|Yes| DownloadTranscript[Download Transcript]
CheckTranscript -->|No| QueueWhisper[Queue for Whisper]
DownloadTranscript --> SaveTranscript[Save Transcript File]
QueueWhisper --> DownloadMedia[Download Media File]
DownloadMedia --> Preprocess{Preprocessing Enabled?}
Preprocess -->|Yes| PreprocessAudio[Preprocess Audio: Mono, 16kHz, VAD, Normalize, MP3]
Preprocess -->|No| Transcribe
PreprocessAudio --> Transcribe[Whisper Transcription]
Transcribe --> FormatScreenplay[Format with Speaker Names]
FormatScreenplay --> SaveTranscript
SaveTranscript --> GenerateMetadata{Metadata Generation?}
GenerateMetadata -->|Yes| CreateMetadata[Generate Metadata JSON/YAML]
GenerateMetadata -->|No| Cleanup
CreateMetadata --> GenerateSummary{Summarization Enabled?}
GenerateSummary -->|Yes| Summarize[Generate Summary]
GenerateSummary -->|No| GILCheck
Summarize --> AddSummaryToMetadata[Add Summary to Metadata]
AddSummaryToMetadata --> GILCheck{GIL Extraction?}
GILCheck -->|Yes| ExtractGIL[Extract Insights + Quotes]
GILCheck -->|No| KGCheck
ExtractGIL --> GroundInsights[Ground Insights with Quotes]
GroundInsights --> WriteGI[Write gi.json]
WriteGI --> KGCheck{KG Extraction?}
KGCheck -->|Yes| ExtractKG[Extract Topic Graph]
KGCheck -->|No| Cleanup
ExtractKG --> WriteKG[Write kg.json]
WriteKG --> Cleanup[Cleanup Temp Files]
Cleanup --> End([Complete])
style Start fill:#e1f5ff
style End fill:#d4edda
style ProcessEpisodes fill:#fff3cd
style Transcribe fill:#f8d7da
style GenerateMetadata fill:#d1ecf1
style ExtractGIL fill:#e8daef
style ExtractKG fill:#d5f5e3
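The preprocessing node in the diagram hides the bitrate fallback described under Audio Preprocessing in the High-Level Flow. A rough sketch of that idea follows. The ladder values, the 16 kbps floor, the 25 MiB cap, the provider strings, and every function name here are assumptions for illustration; the real pipeline shells out to FFmpeg, which is replaced below by a size estimate so the example runs on its own.

```python
from typing import Callable, Tuple

# Hypothetical ladder of standard MP3 bitrates (kbps), highest first.
BITRATE_LADDER_KBPS = (64, 48, 40, 32, 24, 16)


def default_bitrate_kbps(provider: str) -> int:
    """'auto' default from the guide: 48 kbps for OpenAI / Gemini, else 64 kbps."""
    return 48 if provider in {"openai", "gemini"} else 64


def cache_key(audio_hash: str, bitrate_kbps: int) -> str:
    """Bitrate is part of the key so different rungs do not collide."""
    return f"{audio_hash}_{bitrate_kbps}k"


def encode_under_cap(
    encode: Callable[[int], int],  # encodes at a bitrate, returns size in bytes
    start_kbps: int,
    cap_bytes: int,
    min_kbps: int = 16,            # assumed minimum bitrate
) -> Tuple[int, int]:
    """Step down the ladder until the file fits or the minimum is reached."""
    rungs = [b for b in BITRATE_LADDER_KBPS if min_kbps <= b <= start_kbps]
    rungs = rungs or [start_kbps]
    bitrate, size = rungs[0], 0
    for bitrate in rungs:
        size = encode(bitrate)
        if size <= cap_bytes:
            break
    return bitrate, size


def estimate_size(duration_s: int) -> Callable[[int], int]:
    """Stand-in for an FFmpeg re-encode: constant-bitrate size estimate."""
    return lambda kbps: duration_s * kbps * 1000 // 8


cap = 25 * 1024 * 1024  # assumed cap, for illustration only
start = default_bitrate_kbps("openai")
result = encode_under_cap(estimate_size(2 * 60 * 60), start, cap)
print(result)  # (24, 21600000)
```

For a two-hour episode the 48, 40, and 32 kbps rungs all exceed the assumed cap, so the loop settles on 24 kbps.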
Module Roles in the Pipeline
- cli.py: Parse/validate CLI arguments, integrate config files, set up progress reporting, trigger `run_pipeline`. Optimized for interactive command-line use.
- service.py: Service API for programmatic/daemon use. Provides `service.run()` and `service.run_from_config_file()` functions that return structured `ServiceResult` objects. Works exclusively with configuration files (no CLI arguments), optimized for non-interactive use (supervisor, systemd, etc.). Entry point: `python -m podcast_scraper.service --config config.yaml`.
- config.py: Immutable Pydantic model representing all runtime options; JSON/YAML loader with strict validation and normalization helpers. Includes language configuration, NER settings, and speaker detection flags (RFC-010).
- workflow.orchestration: Pipeline coordinator that orchestrates directory prep, RSS parsing, download concurrency, Whisper lifecycle, speaker detection coordination, and cleanup. Emits `.pipeline_status.json` stage updates when `monitor` is enabled.
- monitor/ (RFC-065): `start_monitor_subprocess()` (multiprocessing `spawn`), `.pipeline_status.json` helpers, cross-process psutil sampler, `rich.Live` on stderr or `.monitor.log`, `memray_util` / `py_spy_listener`. See Live Pipeline Monitor.
- workflow.stages.scraping: RSS fetch/parse (`fetch_and_parse_feed`) and episode selection (`prepare_episodes_from_feed`) before `Episode` construction: order, optional publish-date window, offset, and cap (CONFIGURATION.md - Episode selection). Guide: RSS and feed ingestion.
- rss.parser: Safe RSS/XML parsing using `defusedxml` (ADR-002), discovery of transcript/enclosure URLs, publication-date helpers for filtering, and creation of `Episode` models from selected items.
- rss.downloader / rss.http_policy: HTTP session pooling with retry-enabled adapters (configurable via `Config`; `configure_downloader()` and `configure_http_policy()` at pipeline start for retries and optional Issue #522 fair HTTP), streaming downloads, and shared progress hooks. Guide: RSS and feed ingestion.
- workflow.episode_processor: Episode-level decision logic, transcript storage, Whisper job management, delay handling, and file naming rules. Integrates detected speaker names into Whisper screenplay formatting.
- utils.filesystem: Filename sanitization, output directory derivation based on feed hash (ADR-003), run suffix logic, and helper utilities for Whisper output paths.
- Provider System (RFC-013, RFC-029): Protocol-based provider architecture for transcription, speaker detection, and summarization (ADR-020). Each capability has a protocol interface (`TranscriptionProvider`, `SpeakerDetector`, `SummarizationProvider`) and factory functions that create provider instances based on configuration. Providers implement `initialize()`, protocol methods (e.g., `transcribe()`, `summarize()`), and `cleanup()`. See Provider Implementation Guide for details.
- Unified Providers (RFC-029): Nine summarization options; nine unified provider classes (1 local ML + 1 hybrid ML + 7 LLM) implement protocol combinations (ADR-024):
| Provider | Transcription | Speaker Detection | Summarization | Notes |
|---|---|---|---|---|
| MLProvider | Whisper | spaCy NER | Transformers | Local, no API cost |
| HybridMLProvider | No | No | MAP-REDUCE | LongT5 MAP + Ollama/llama_cpp/transformers REDUCE (RFC-042) |
| OpenAIProvider | Whisper API | GPT API | GPT API | Cloud, prompt-managed |
| GeminiProvider | Gemini API | Gemini API | Gemini API | 2M context, native audio |
| AnthropicProvider | No | Claude API | Claude API | High quality reasoning |
| MistralProvider | No | Mistral API | Mistral API | OpenAI alternative |
| DeepSeekProvider | No | DeepSeek API | DeepSeek API | Ultra low-cost |
| GrokProvider | No | Grok API | Grok API | Real-time info (xAI) |
| OllamaProvider | No | Ollama API | Ollama API | Local LLM, zero cost |
- Factories: Factory functions in `transcription/factory.py`, `speaker_detectors/factory.py`, and `summarization/factory.py` create the appropriate unified provider based on configuration.
- Capabilities: `providers/capabilities.py` defines `ProviderCapabilities`, a dataclass describing what each provider supports (JSON mode, tool calls, streaming, etc.). Used by factories and orchestration to select appropriate providers.
- Prompt Management (RFC-017): `prompts/store.py` implements versioned Jinja2 prompt templates organized by `<provider>/<task>/<version>.j2`. Each of the 9 providers has tuned templates for summarization and NER. LLM providers load prompts via `PromptStore.render()`, ensuring consistent, version-tracked prompt engineering.
- providers/ml/whisper_utils.py: Lazy loading of the third-party `openai-whisper` library, transcription invocation with language-aware model selection (preferring `.en` variants for English), and screenplay formatting helpers that use detected speaker names. Accessed via `MLProvider` (unified provider pattern).
- speaker_detection.py (RFC-010): Named Entity Recognition using spaCy to extract PERSON entities from episode metadata, distinguish hosts from guests, and provide speaker names for Whisper screenplay formatting. spaCy is a required dependency. Now accessed via `MLProvider` (unified provider pattern).
- providers/ml/summarizer.py (PRD-005/RFC-012): Episode summarization using local transformer models (BART, PEGASUS, LED) to generate concise summaries from transcripts. Implements a hybrid map-reduce strategy. Accessed via `MLProvider` (unified provider pattern). See ML Provider Reference for details.
- gi/ (PRD-017): Grounded Insight Layer: structured insight and quote extraction with evidence grounding. Key modules: `pipeline.py` (orchestration), `schema.py` (validation), `grounding.py` (insight-quote linking), `contracts.py` (grounding contract), `explore.py` (CLI exploration), `corpus.py` (cross-episode operations), `io.py` (serialization), `provenance.py` (provenance tracking), `compare_runs.py` (cross-run comparison), `quality_metrics.py` (quality scoring).
- kg/ (RFC-055): Knowledge Graph extraction: structured topic graphs from transcripts and summaries. Key modules: `pipeline.py` (orchestration), `schema.py` (validation), `llm_extract.py` (LLM-based extraction), `cli_handlers.py` (CLI subcommands), `io.py` (serialization), `corpus.py` (cross-episode operations), `contracts.py` (KG contract enforcement), `quality_metrics.py` (quality scoring).
- utils.progress: Minimal global progress publishing API so callers can swap in alternative UIs.
- models/ (package): Simple dataclasses (`RssFeed`, `Episode`, `TranscriptionJob` in `entities.py`) shared across modules.
- workflow.metadata_generation (PRD-004/RFC-011): Per-episode metadata document generation, capturing feed-level and episode-level information, detected speaker names, transcript sources, processing metadata, and optional summaries in structured JSON/YAML format. Opt-in feature for backwards compatibility.
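The provider pattern described above (a protocol interface, a factory, and an `initialize()` / protocol method / `cleanup()` lifecycle) can be sketched with `typing.Protocol`. This is a toy illustration under stated assumptions: the `summarize(text)` signature, `FirstSentenceProvider`, the registry, and a factory keyed by a plain string are all hypothetical; the project's factories select providers from `Config`.

```python
from typing import Callable, Dict, Protocol


class SummarizationProvider(Protocol):
    """Assumed shape of the protocol; the real signatures may differ."""

    def initialize(self) -> None: ...
    def summarize(self, text: str) -> str: ...
    def cleanup(self) -> None: ...


class FirstSentenceProvider:
    """Toy provider standing in for a model-backed implementation."""

    def __init__(self) -> None:
        self.ready = False

    def initialize(self) -> None:
        self.ready = True  # a real provider would load models or clients here

    def summarize(self, text: str) -> str:
        if not self.ready:
            raise RuntimeError("call initialize() first")
        return text.split(".")[0].strip() + "."

    def cleanup(self) -> None:
        self.ready = False  # a real provider would release resources here


# Hypothetical registry mapping a provider name to a constructor.
_REGISTRY: Dict[str, Callable[[], SummarizationProvider]] = {
    "first_sentence": FirstSentenceProvider,
}


def create_summarization_provider(name: str) -> SummarizationProvider:
    """Factory: callers depend on the protocol, never on a concrete class."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown summarization provider: {name!r}") from None


provider = create_summarization_provider("first_sentence")
provider.initialize()
try:
    summary = provider.summarize("Hosts discuss RSS parsing. Then they move on.")
finally:
    provider.cleanup()  # always release resources, even if summarize() fails
print(summary)  # Hosts discuss RSS parsing.
```

Because the protocol is structural, a provider class only needs matching method signatures and does not have to inherit from anything.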
Module Dependencies Diagram
graph TB
subgraph "Public API"
CLI[cli.py]
Config[config.py]
Service[service.py]
Workflow[workflow/orchestration.py]
end
subgraph "Core Processing"
RSSParser[rss/parser.py]
EpisodeProc[workflow/episode_processor.py]
Downloader[rss/downloader.py]
AudioPreproc[preprocessing/]
end
subgraph "Support Modules"
Filesystem[utils/filesystem.py]
Models[models/]
Progress[utils/progress.py]
Schemas[schemas/summary_schema.py]
end
subgraph "Prompt Management"
PromptStore[prompts/store.py]
PromptTemplates[prompts/provider/task/v1.j2]
end
subgraph "Provider System"
TranscriptionFactory[transcription/factory.py]
SpeakerFactory[speaker_detectors/factory.py]
SummaryFactory[summarization/factory.py]
Capabilities[providers/capabilities.py]
end
subgraph "Local ML Provider"
MLProvider[providers/ml/ml_provider.py]
HybridMLProvider[providers/ml/hybrid_ml_provider.py]
Whisper[providers/ml/whisper_utils.py]
SpeakerDetect[providers/ml/speaker_detection.py]
Summarizer[providers/ml/summarizer.py]
NER[providers/ml/ner_extraction.py]
end
subgraph "LLM Providers"
OpenAIProvider[providers/openai/]
GeminiProvider[providers/gemini/]
AnthropicProvider[providers/anthropic/]
MistralProvider[providers/mistral/]
DeepSeekProvider[providers/deepseek/]
GrokProvider[providers/grok/]
OllamaProvider[providers/ollama/]
end
subgraph "Knowledge Extraction"
GI[gi/pipeline.py]
GISchema[gi/schema.py]
GIGrounding[gi/grounding.py]
KG[kg/pipeline.py]
KGSchema[kg/schema.py]
KGExtract[kg/llm_extract.py]
end
subgraph "Optional Features"
Metadata[workflow/metadata_generation.py]
Evaluation[evaluation/]
Cleaning[cleaning/]
end
CLI --> Config
CLI --> Workflow
Service --> Config
Service --> Workflow
Workflow --> RSSParser
Workflow --> EpisodeProc
Workflow --> Downloader
Workflow --> TranscriptionFactory
Workflow --> SpeakerFactory
Workflow --> SummaryFactory
Workflow --> AudioPreproc
TranscriptionFactory --> MLProvider
TranscriptionFactory --> OpenAIProvider
TranscriptionFactory --> GeminiProvider
SpeakerFactory --> MLProvider
SpeakerFactory --> OpenAIProvider
SpeakerFactory --> GeminiProvider
SpeakerFactory --> AnthropicProvider
SpeakerFactory --> MistralProvider
SpeakerFactory --> DeepSeekProvider
SpeakerFactory --> GrokProvider
SpeakerFactory --> OllamaProvider
SummaryFactory --> MLProvider
SummaryFactory --> HybridMLProvider
SummaryFactory --> OpenAIProvider
SummaryFactory --> GeminiProvider
SummaryFactory --> AnthropicProvider
SummaryFactory --> MistralProvider
SummaryFactory --> DeepSeekProvider
SummaryFactory --> GrokProvider
SummaryFactory --> OllamaProvider
MLProvider --> Whisper
MLProvider --> SpeakerDetect
MLProvider --> Summarizer
MLProvider --> NER
HybridMLProvider --> Summarizer
OpenAIProvider --> PromptStore
GeminiProvider --> PromptStore
AnthropicProvider --> PromptStore
MistralProvider --> PromptStore
DeepSeekProvider --> PromptStore
GrokProvider --> PromptStore
OllamaProvider --> PromptStore
PromptStore --> PromptTemplates
Workflow --> Metadata
Workflow --> GI
Workflow --> KG
Workflow --> Filesystem
GI --> GISchema
GI --> GIGrounding
GI --> SummaryFactory
KG --> KGSchema
KG --> KGExtract
KGExtract --> SummaryFactory
EpisodeProc --> Downloader
EpisodeProc --> Filesystem
RSSParser --> Models
Summarizer --> Models
SpeakerDetect --> Models
TranscriptionFactory --> Capabilities
SpeakerFactory --> Capabilities
SummaryFactory --> Capabilities
Generated diagrams:
- Module dependency graph (pydeps): Regenerate with `make visualize` (requires Graphviz).
- Workflow call graph (pyan3): Function-level calls from `workflow/orchestration.py`. Regenerate with `make call-graph`.
- Orchestration flowchart, Service API flowchart: Regenerate with `make flowcharts` (code2flow).
Pipeline and Workflow Behavior (Quirks)
- Typed, immutable configuration: `Config` is a frozen Pydantic model, ensuring every module receives canonicalized values (e.g., normalized URLs, integer coercions, validated Whisper models). This centralizes validation and guards downstream logic.
- Resilient HTTP interactions: A per-thread `requests.Session` with exponential backoff retry (`LoggingRetry`) handles transient network issues while logging retries for observability. Retry counts and backoff factors are configurable via `Config` fields (`http_retry_total`, `http_backoff_factor`, `rss_retry_total`, `rss_backoff_factor`) with resilient defaults (8 retries / 1.0 s backoff for media; 10 retries / 2.0 s backoff for RSS). An additional application-level episode retry (`episode_retry_max`, default 1) re-runs the full episode download on transient network errors after urllib3 retries are exhausted. Optional per-host pacing, circuit breaker, and RSS conditional GET (Issue #522) are in the same section, with recommended presets. See CONFIGURATION.md - Download Resilience (canonical). Model loading operations use `retry_with_exponential_backoff` for transient errors (network failures, timeouts).
- Concurrent transcript pulls: Transcript downloads are parallelized via `ThreadPoolExecutor`, guarded with locks when mutating shared counters/job queues. Whisper remains sequential to avoid GPU/CPU thrashing and to keep the UX predictable.
- Deterministic filesystem layout: Output folders follow `output/rss_<host>_<hash>` conventions. Optional `run_id` and Whisper suffixes create run-scoped subdirectories while `sanitize_filename` protects against filesystem hazards.
- Dry-run and resumability: `--dry-run` walks the entire plan without touching disk, while `--skip-existing` short-circuits work per episode, making repeated runs idempotent.
- Pluggable progress/UI: A narrow `ProgressFactory` abstraction lets embedding applications replace the default `tqdm` progress without touching business logic.
- Optional Whisper dependency: Whisper is imported lazily and guarded so environments without GPU support or `openai-whisper` can still run transcript-only workloads.
- Optional summarization dependency (PRD-005/RFC-012): Summarization requires `torch` and `transformers` dependencies and is imported lazily. When dependencies are unavailable, summarization is gracefully skipped. Models are automatically selected based on available hardware (MPS for Apple Silicon, CUDA for NVIDIA GPUs, CPU fallback). See ML Provider Reference for details.
- Language-aware processing (RFC-010): A single `language` configuration drives both Whisper model selection (preferring English-only `.en` variants) and NER model selection (e.g., `en_core_web_sm`), ensuring consistent language handling across the pipeline.
- Automatic speaker detection (RFC-010): Named Entity Recognition extracts speaker names from episode metadata transparently. Manual speaker names (`--speaker-names`) are only used as fallback when automatic detection fails, not as override. spaCy is a required dependency for speaker detection.
- Host/guest distinction: Host detection prioritizes RSS author tags (channel-level only) as the most reliable source, falling back to NER extraction from feed metadata when author tags are unavailable. Guests are always detected from episode-specific metadata using NER, ensuring accurate speaker labeling in Whisper screenplay output.
- When the feed author is an organization (Issue #393): If the RSS channel author is an organization name (e.g. NPR, BBC, CNN), the pipeline correctly does not treat it as a host. In that case, host detection uses, in order: NER on feed title/description, episode-level authors (e.g. `itunes:author` per episode), and if set, the `known_hosts` config option. For network podcasts, set `known_hosts` in config to supply host names explicitly when auto-detection finds none.
- Provider-based architecture (RFC-013): All capabilities (transcription, speaker detection, summarization) use a protocol-based provider system. Providers are created via factory functions based on configuration, allowing pluggable implementations. Providers implement consistent interfaces (`initialize()`, protocol methods, `cleanup()`). See Provider Implementation Guide for complete implementation details.
- Local-first summarization (PRD-005/RFC-012): Summarization defaults to local transformer models for privacy and cost-effectiveness. API-based providers (OpenAI) are supported via the provider system. Long transcripts are handled via chunking strategies, and memory optimization is applied for GPU backends (CUDA/MPS). Models are automatically cached and reused across runs, with cache management utilities available via CLI and programmatic APIs. Model loading prefers safetensors format for security and performance (Issue #379). Pinned model revisions ensure reproducibility (Issue #379).
- Reproducibility (Issue #379): Deterministic runs via seed control (`torch`, `numpy`, `transformers`). Run manifests capture complete system state. Per-episode stage timings enable performance analysis. Run summaries combine manifest and metrics for complete run records.
- Operational Hardening (Issue #379): Three-layer retry with configurable parameters and resilient defaults (see CONFIGURATION.md - Download Resilience). Timeout enforcement for transcription and summarization. Failure handling flags (`--fail-fast`, `--max-failures`) for pipeline control. Structured JSON logging for log aggregation. Path validation and model allowlist validation for security. End-of-run `failure_summary` in `run.json` aggregates failed episodes by error type for triage.
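The retry behavior mentioned under resilient HTTP interactions and model loading follows the usual exponential backoff shape. The helper below is a generic sketch, not the project's `retry_with_exponential_backoff` or `LoggingRetry`: its name, parameters, the doubling schedule, and the choice of retryable exceptions are assumptions. The `sleep` argument is injectable so the demo records delays instead of waiting.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    backoff_factor: float = 1.0,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with delays of factor * 2**attempt."""
    attempt = 0
    while True:
        try:
            return func()
        except retryable:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            sleep(backoff_factor * (2 ** attempt))
            attempt += 1


calls = {"n": 0}
delays: List[float] = []


def flaky_load() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "model-loaded"


result = retry_with_backoff(flaky_load, sleep=delays.append)
print(result, delays)  # model-loaded [1.0, 2.0]
```

Non-retryable exceptions (for example a `ValueError` from bad configuration) propagate immediately, which keeps permanent failures from being retried.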
Run Tracking Files (Issue #379, #429)
The pipeline writes several tracking files in each run directory (output_dir/run_<suffix>/):
| File | Purpose |
|---|---|
| run.json | Top-level run summary combining run manifest and pipeline metrics. Includes index_file and run_manifest_file so consumers can locate the episode index and run manifest. When episodes fail, includes a failure_summary section with total_failed, by_error_type (counts), and failed_episode_ids. |
| index.json | Episode index listing every episode in the run with status (ok / failed / skipped), paths (transcript, metadata), and on failure: error_type, error_message, error_stage. Use for scripting or debugging partial runs. |
| run_manifest.json | System state at pipeline start: git SHA, config hash, Python version, OS, GPU info, model versions, seed (reproducibility). |
| metrics.json | Pipeline metrics: episode statuses, stage timings, performance data. |
Run summaries and metrics use schema_version: "1.0.0". index.json uses 1.0.0 by default; runs with append (GitHub #444) write schema_version: "1.1.0" and may include optional pipeline_append: true. Older 1.0.0 index files remain valid; append resume prefers on-disk metadata + episode_id over index rows alone (index may be stale after a crash). For exit codes and partial failures, see Troubleshooting - Exit codes and partial failures.
Operational note (multi-feed + append): With feeds: / rss_urls: (or --feeds-spec) and append: true, each feed still has its own feeds/<stable_feed_id>/run_append_*/ tree. Run the same command twice (fixtures or real RSS) to confirm the second pass skips complete episodes without creating duplicate run_* directories. Example: --feeds-spec path/to/your-two-feed.yaml with append: true on the operator YAML.
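The `failure_summary` fields listed in the table (`total_failed`, `by_error_type`, `failed_episode_ids`) can be derived from episode index rows. The following sketch assumes each row is a dict with `episode_id`, `status`, and `error_type` keys; that row shape and the `build_failure_summary` helper are illustrative guesses, not the pipeline's actual code or schema.

```python
from collections import Counter
from typing import Any, Dict, List


def build_failure_summary(episodes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate failed index rows into a triage-friendly summary."""
    failed = [e for e in episodes if e.get("status") == "failed"]
    return {
        "total_failed": len(failed),
        "by_error_type": dict(
            Counter(e.get("error_type", "unknown") for e in failed)
        ),
        "failed_episode_ids": [e["episode_id"] for e in failed],
    }


# Hypothetical index rows; real index.json entries carry more fields.
rows = [
    {"episode_id": "ep-1", "status": "ok"},
    {"episode_id": "ep-2", "status": "failed", "error_type": "TimeoutError"},
    {"episode_id": "ep-3", "status": "skipped"},
    {"episode_id": "ep-4", "status": "failed", "error_type": "TimeoutError"},
    {"episode_id": "ep-5", "status": "failed", "error_type": "ValueError"},
]
summary = build_failure_summary(rows)
print(summary)
```

Grouping by `error_type` gives a quick view of whether a partial run failed for one recurring reason or for several unrelated ones.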
Partial multi-feed success: exit code lessons
Status: Draft notes (operator / design). Related GitHub: #557 (Whisper 25 MB + incident log), #558 (FFmpeg/ffprobe subprocess UTF-8 replace + preprocessing fallback incident rows). #559 (soft-failure taxonomy; default lenient with multi_feed_strict: false; strict CI uses multi_feed_strict: true or CLI --multi-feed-strict) is implemented in code and docs (CONFIGURATION / CLI).
What went well (resilience that already exists)
- Per-feed isolation in `service._run_multi_feed`: One feed’s `run_pipeline` exception (e.g. RSS fetch `ValueError`) is caught; other feeds still run. See src/podcast_scraper/service.py (try / except around workflow.run_pipeline(sub_cfg)).
- Finalize on partial batches: `finalize_multi_feed_batch` in `src/podcast_scraper/workflow/corpus_operations.py` still writes manifest and corpus_run_summary.json, and runs parent index_corpus where configured; vector index failure there is logged as non-fatal in a try / except.
- Episode-level transcription errors: Many paths log `transcription raised an unexpected error`, increment metrics, and continue the loop unless `fail_fast` / `max_failures` triggers (see src/podcast_scraper/workflow/stages/transcription.py).
So the long manual run did complete batch work: artifacts, corpus summary, and indexing for feeds that succeeded.
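The per-feed isolation and finalize-on-partial-batch behavior above can be summarized in a few lines. This is a simplified sketch, not the code in `service._run_multi_feed`: `BatchResult`, `run_multi_feed`, and the callable parameters are hypothetical, and feed IDs are passed in directly because how `<stable_feed_id>` is derived is not covered here.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class BatchResult:
    """Hypothetical container for a multi-feed batch outcome."""
    succeeded: List[str] = field(default_factory=list)
    errors: Dict[str, str] = field(default_factory=dict)
    finalized: bool = False


def run_multi_feed(
    feeds: Dict[str, str],                       # feed_id -> RSS URL
    corpus_parent: Path,
    run_pipeline: Callable[[str, Path], None],   # stand-in for the inner pipeline
    finalize: Callable[[BatchResult], None],     # stand-in for batch finalization
) -> BatchResult:
    result = BatchResult()
    for feed_id, url in feeds.items():
        output_dir = corpus_parent / "feeds" / feed_id
        try:
            run_pipeline(url, output_dir)
            result.succeeded.append(feed_id)
        except Exception as exc:
            # One feed's failure is recorded; the loop moves on to the next feed.
            result.errors[feed_id] = f"{type(exc).__name__}: {exc}"
    finalize(result)  # runs even when some feeds failed
    result.finalized = True
    return result


def fake_pipeline(url: str, output_dir: Path) -> None:
    if "broken" in url:
        raise ValueError("RSS fetch failed (404)")


feeds = {
    "feed_a": "https://example.com/a.xml",
    "feed_b": "https://example.com/broken.xml",
    "feed_c": "https://example.com/c.xml",
}
finalized_with: List[BatchResult] = []
result = run_multi_feed(feeds, Path("corpus"), fake_pipeline, finalized_with.append)
print(result.succeeded, result.errors)
```

In the demo, `feed_b` fails while `feed_a` and `feed_c` complete, and the finalize hook still receives the full result.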
What felt like “blowing up” at the end
- Process exit code: Operators often want exit 0 when “best effort” corpus work finished and artifacts are usable, with failures only in structured JSON / logs. Today, any feed-level error collection in service._run_multi_feed sets ServiceResult.success=False and aggregates error text; the CLI / acceptance harness then treats the run as failed even when most feeds succeeded.
- Terminal `Error:` line: A single concatenated message listing every bad RSS URL is easy to read as a fatal crash, even though the session already wrote summaries and ran follow-up steps (e.g. acceptance analysis).
- Semantic mismatch: Some failures are data / network (404 RSS), some are bugs (#558), some are policy / limits (#557). Lumping them all into one “failed run” bucket makes CI and manual runs harder to interpret.
Design intent to capture
| Layer | Desired behavior |
|---|---|
| Orchestration | Always finish finalize (manifest, summary, optional index) when safe; never lose partial work silently. |
| Reporting | Keep corpus_run_summary.json overall_ok accurate (not all feeds ok). Optionally add soft_failures or hard_failures counts for triage. |
| Exit code | Separate “artifacts written” from “all feeds green”. Options: new config flag (e.g. treat_feed_errors_as_warnings), or CLI flag --best-effort-exit-zero, or distinct exit codes (documented). |
| Failures | Skip with reason for: RSS 404, optional oversize Whisper (after #557), optional decode (after #558). Fail or non-zero for: config invalid, lock failure, total loss of corpus root. |
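One way to read the Exit code and Failures rows together is as a small decision function. The sketch below is only an illustration of the design intent in the table: the reason labels, the 0 / 1 exit values, and the rule that unknown reasons count as hard failures are assumptions, and it does not describe how `multi_feed_strict` is implemented.

```python
from enum import Enum
from typing import Iterable


class FailureKind(Enum):
    SOFT = "soft"  # skip with reason; artifacts are still usable
    HARD = "hard"  # the run itself cannot be trusted


# Hypothetical reason labels modeled on the Failures row of the table.
SOFT_REASONS = {"rss_404", "oversize_audio", "decode_error"}
HARD_REASONS = {"config_invalid", "lock_failure", "corpus_root_lost"}


def classify(reason: str) -> FailureKind:
    """Unknown reasons are treated as hard failures (conservative assumption)."""
    return FailureKind.SOFT if reason in SOFT_REASONS else FailureKind.HARD


def exit_code(reasons: Iterable[str], multi_feed_strict: bool = False) -> int:
    """Return 0 when the run counts as successful, 1 otherwise."""
    kinds = [classify(r) for r in reasons]
    if any(k is FailureKind.HARD for k in kinds):
        return 1
    if kinds and multi_feed_strict:
        return 1  # strict mode: any failure at all is non-zero
    return 0


print(exit_code(["rss_404"]))                          # 0 (lenient)
print(exit_code(["rss_404"], multi_feed_strict=True))  # 1 (strict)
print(exit_code(["rss_404", "lock_failure"]))          # 1 (hard failure)
```

Keeping classification separate from the exit decision lets reporting (for example soft and hard failure counts) reuse the same taxonomy.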
Follow-up
#559 exit semantics and optional extra triage fields remain as design notes. #557 and #558 behavior is implemented in code (CONFIGURATION.md). Keep this file as narrative until exit-code / reporting follow-ups ship, then trim or link from docs/guides/ if maintainers want it public.