# Podcast Scraper Architecture

**Strategic Overview:** This document provides high-level architectural decisions, design principles, and system structure. For detailed implementation guides, see the Development Guide and other specialized documents linked below.
## Navigation
This architecture document is the central hub for understanding the system. For detailed information, see:
### Core Documentation
- ADR Index — The immutable record of architectural laws and decisions
- Architecture — Ways to run and deploy — CLI vs service vs Docker; one pipeline, one Config
- Non-Functional Requirements — Performance, security, reliability, observability, maintainability, scalability
- Development Guide — Detailed implementation instructions, dependency management, code patterns, and development workflows
- Pipeline and Workflow Guide — Pipeline flow, module roles, behavioral quirks, run tracking
- Testing Strategy — Testing philosophy, test pyramid, and quality standards
- Testing Guide — Quick reference and test execution commands
- Unit Testing Guide — Unit test mocking patterns
- Integration Testing Guide — Integration test guidelines
- E2E Testing Guide — E2E server, real ML models
- Critical Path Testing Guide — What to test, prioritization
- CI/CD — Continuous integration and deployment pipeline
### API Documentation
- API Reference — Complete API documentation
- Configuration — Configuration options and examples
- CLI Reference — Command-line interface documentation
### Feature Documentation
- Provider Implementation Guide — Complete guide for implementing new providers (includes OpenAI example, testing, E2E server mocking)
- ML Provider Reference — ML implementation details
- Configuration API — Configuration API reference (includes environment variables)
### Specifications
- PRDs — Product Requirements Documents
- RFCs — Request for Comments (design decisions)
- GIL Ontology — Grounded Insight Layer node/edge types and grounding contract
- GIL Schema — Machine-readable JSON schema for `gi.json` validation
## Goals and Scope
- Provide a resilient pipeline that collects podcast episode transcripts from RSS feeds and fills gaps via Whisper transcription.
- Offer both CLI and Python APIs with a single configuration surface (`Config`) and a deterministic filesystem layout.
- Keep the public surface area small (`Config`, `load_config_file`, `run_pipeline`, `service.run`, `service.run_from_config_file`, `cli.main`) while exposing well-factored submodules for advanced use.
- Provide a service API (`service.py`) optimized for non-interactive use (daemons, process managers) with structured error handling and exit codes.
- Support a multi-provider architecture with local ML (`MLProvider`), hybrid MAP-REDUCE (`HybridMLProvider`), and 7 LLM providers (OpenAI, Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama), enabling choice across cost, quality, privacy, and latency dimensions.
- Enable structured knowledge extraction via the Grounded Insight Layer (GIL) — extracting insights and verbatim quotes with evidence grounding from podcast transcripts (PRD-017).
- Enable Knowledge Graph (KG) extraction from transcripts and summaries, producing structured topic graphs per episode.
## Ways to run and deploy
The system has one pipeline (`workflow.run_pipeline`) and one configuration model (`Config`). Both entry points produce a `Config` and call the same pipeline; only how config is supplied and how results are surfaced differs.
| Mode | Use case | Config source | Entry / deployment |
|---|---|---|---|
| CLI | Interactive runs, ad-hoc flags, progress bars | CLI args + optional `--config` file | `podcast-scraper <rss_url>`, `python -m podcast_scraper.cli`. Subcommands: `doctor`, `cache`. |
| Service | Daemons, automation, process managers | Config file only (no CLI args) | `python -m podcast_scraper.service --config config.yaml`. Returns `ServiceResult`; exit code 0/1. For supervisor, systemd, etc. |
| Docker | Service-oriented deployment | Config file (default `/app/config.yaml` or `PODCAST_SCRAPER_CONFIG`) | Container runs service mode: no CLI arguments, config from file and env. See Docker Service Guide. |
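For service and Docker modes, a minimal configuration file might look like the sketch below. This is illustrative only: the key names `rss_url` and `output_dir` are assumptions, not the authoritative schema (see the Configuration reference); `summary_provider: hybrid_ml` is the value documented for RFC-042.

```yaml
# Illustrative config.yaml sketch — key names other than summary_provider
# are assumptions; consult the Configuration reference for the real schema.
rss_url: https://example.com/feed.xml   # hypothetical key: feed to scrape
output_dir: ./output                    # hypothetical key: archive root
summary_provider: hybrid_ml             # hybrid MAP-REDUCE summarization (RFC-042)
```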
**Programmatic use:** Import `config.load_config_file`, `Config`, and either `workflow.run_pipeline` (returns count + summary) or `service.run` / `service.run_from_config_file` (returns `ServiceResult`). See API Reference and Service API.
Providers: Nine providers (1 local ML + 1 hybrid ML + 7 LLM) supply transcription, speaker detection, and summarization; capability matrix and selection are in Pipeline and Workflow Guide. Adding or extending providers: Provider Implementation Guide.
## Architectural Decisions (ADRs)
The following architectural principles govern this system. For the full history and rationale, see the ADR Index.
### Core Patterns
- Concurrency: IO-bound threading for downloads, sequential CPU/GPU tasks for ML (ADR-001). MPS exclusive mode (ADR-046) serializes GPU work on Apple Silicon to prevent memory contention when both Whisper and summarization use MPS (enabled by default).
- Providers: Protocol-based discovery (ADR-020) using unified provider classes (ADR-024), library-based naming (ADR-025), and per-capability provider selection (ADR-026).
- Lazy Loading: Heavy ML dependencies are loaded only when needed (ADR-005).
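The lazy-loading idea behind ADR-005 can be sketched in a few lines. This is an illustration of the pattern, not the project's actual import machinery; the helper name `lazy_import` is an assumption.

```python
import importlib

_cache = {}


def lazy_import(name):
    """Import a heavy optional dependency only on first use; None if unavailable."""
    if name not in _cache:
        try:
            _cache[name] = importlib.import_module(name)
        except ImportError:
            _cache[name] = None  # absent extras do not crash startup (cf. ADR-005)
    return _cache[name]


# Heavy ML libraries are only paid for when a feature actually needs them:
json_mod = lazy_import("json")            # stand-in for a heavy dependency
missing = lazy_import("no_such_pkg_xyz")  # missing extras degrade gracefully
```

Callers that receive `None` skip the ML feature (for example, transcript-only workloads run without `openai-whisper` installed).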
### Data & Filesystem
- Storage: Hash-based deterministic directory layout (ADR-003) with flat archives (ADR-004).
- Identity: Universal GUID-based episode identification (ADR-007).
- Metadata: Unified JSON schema compatible with SQL and NoSQL (ADR-008).
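The hash-based layout (ADR-003) can be illustrated with a small sketch. The `output/rss_<host>_<hash>` convention appears elsewhere in this document; the specific hash algorithm and truncation length below are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlparse


def output_dir_for(feed_url: str) -> str:
    """Derive a deterministic archive directory for a feed URL (illustrative).

    The project convention is output/rss_<host>_<hash>; the sha256 digest and
    8-character truncation used here are assumptions, not the real scheme.
    """
    host = urlparse(feed_url).netloc.replace(".", "_")
    digest = hashlib.sha256(feed_url.encode("utf-8")).hexdigest()[:8]
    return f"output/rss_{host}_{digest}"


# The same URL always maps to the same directory, which makes re-runs idempotent:
print(output_dir_for("https://example.com/feed.xml"))
```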
### ML & AI Processing
- Summarization: Hybrid MAP-REDUCE strategy (ADR-010, ADR-043) favoring local models (ADR-009).
- Audio: Mandatory preprocessing (ADR-036) with content-hash caching (ADR-037) using FFmpeg (ADR-038) and Opus (ADR-039).
- Governance: Explicit benchmarking gates (ADR-042) and golden dataset versioning (ADR-040).
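Content-hash caching (ADR-037) keys preprocessed audio by the bytes of the input, so re-downloading the same media reuses prior FFmpeg/Opus work. A minimal sketch, assuming a flat cache directory of `.opus` artifacts (the helper names here are hypothetical):

```python
import hashlib
from pathlib import Path


def cache_key(audio_bytes: bytes) -> str:
    """Content-hash cache key: identical input audio maps to the same artifact."""
    return hashlib.sha256(audio_bytes).hexdigest()


def cached_path(cache_dir: Path, audio_bytes: bytes) -> Path:
    # One preprocessed .opus file per unique input; cache hits skip FFmpeg entirely.
    return cache_dir / f"{cache_key(audio_bytes)}.opus"


p = cached_path(Path("/tmp/preproc-cache"), b"fake-audio-bytes")
```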
### Development & CI
- Workflow: Git worktree-based isolation (ADR-032) with independent environments (ADR-034).
- Quality: Three-tier test pyramid (ADR-019) with automated health metrics (ADR-023).
### Reproducibility & Operational Hardening (Issue #379)

- Determinism: Seed-based reproducibility for `torch`, `numpy`, and `transformers` ensures consistent outputs across runs. Seeds are configurable via environment variables or config files.
- Run Tracking: Comprehensive run manifests capture system state (Python version, OS, GPU info, model versions, git commit SHA, config hash) for complete reproducibility. Per-episode stage timings track processing duration for performance analysis.
- Failure Handling: Configurable `--fail-fast` and `--max-failures` flags allow operators to control pipeline behavior on episode failures. Episode-level failures are tracked in metrics without affecting exit codes.
- Retry Policies: Exponential backoff retry for transient errors (network failures, model loading errors) with configurable retry counts and delays. HTTP requests use urllib3 retry adapters.
- Timeout Enforcement: Configurable timeouts for transcription and summarization stages prevent hung operations.
- Security: Path validation prevents directory traversal attacks. Model allowlist validation restricts HuggingFace model sources. Safetensors format preference improves security and performance. `trust_remote_code=False` is enforced on all model loading.
- Structured Logging: The `--json-logs` flag enables structured JSON logging for log aggregation systems (ELK, Splunk, CloudWatch).
- Diagnostics: The `podcast_scraper doctor` command validates the environment (Python version, ffmpeg, write permissions, model cache, network connectivity).
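The retry policy above can be sketched as a small helper. This is an illustration of the exponential-backoff idea behind the project's `retry_with_exponential_backoff`; the signature and defaults shown here are assumptions.

```python
import time


def retry_with_backoff(fn, retries=3, base_delay=0.1, exceptions=(OSError,)):
    """Retry fn on transient errors with exponential backoff (illustrative sketch)."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except exceptions:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...


calls = {"n": 0}


def flaky():
    # Simulated transient failure: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network failure")
    return "ok"


result = retry_with_backoff(flaky)  # succeeds on the third attempt
```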
## Pipeline and Workflow

- Entry: `podcast_scraper.cli.main` parses CLI args (optionally merging JSON/YAML configs) into a validated `Config` object and applies global logging preferences.
- Run orchestration: `workflow.orchestration.run_pipeline` coordinates the end-to-end job: output setup, RSS acquisition, episode materialization, transcript download, optional Whisper transcription, optional metadata generation, optional summarization, and cleanup.
- Episode handling: For each `Episode`, `workflow.episode_processor.process_episode_download` either saves an existing transcript or enqueues media for Whisper.
- Speaker detection (RFC-010): When automatic speaker detection is enabled, host names are extracted from RSS author tags (channel-level `<author>`, `<itunes:author>`, `<itunes:owner>`) as the primary source, falling back to NER extraction from feed metadata if no author tags exist. Guest names are extracted from episode-specific metadata (titles and descriptions) using Named Entity Recognition (NER) with spaCy. Manual speaker names are only used as fallback when detection fails. Note: The pipeline logs debug messages when transcription parallelism is ignored due to provider limitations (e.g., Whisper always uses sequential processing).
- Audio Preprocessing (RFC-040): When preprocessing is enabled, audio files are optimized before transcription: converted to mono, resampled to 16 kHz, silence removed via VAD, loudness normalized, and compressed with the Opus codec. This reduces file size (typically 10-25× smaller) and ensures API compatibility (e.g., OpenAI's 25 MB limit). Preprocessing happens at the pipeline level in `workflow.episode_processor.transcribe_media_to_text` before any provider receives the audio, so all providers benefit from optimized audio.
- Transcription: When Whisper fallback is enabled, `workflow.episode_processor.download_media_for_transcription` downloads media to a temp area and `workflow.episode_processor.transcribe_media_to_text` persists Whisper output using deterministic naming. Detected speaker names are integrated into screenplay formatting when enabled.
- Metadata generation (PRD-004/RFC-011): When enabled, per-episode metadata documents are generated alongside transcripts, capturing feed-level and episode-level information, detected speaker names, and processing metadata in JSON/YAML format.
- Summarization (PRD-005/RFC-012): When enabled, episode transcripts are summarized using the configured provider — local transformer models (BART, PEGASUS, LED) via `MLProvider`; the `hybrid_ml` provider (MAP with LongT5 + REDUCE via Ollama, llama.cpp, or transformers); or any of 7 LLM providers (OpenAI, Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama) via prompt templates. See ML Provider Reference for ML architecture details.
- Run Tracking (Issue #379): Run manifests capture system state at pipeline start. Per-episode stage timings track processing duration. Run summaries combine manifest and metrics. Episode index files list all processed episodes with status.
- Progress/UI: All long-running operations report progress through the pluggable factory in `utils.progress`, defaulting to `rich` in the CLI.
- GIL Extraction (PRD-017): When enabled, the Grounded Insight Layer extracts structured insights and verbatim quotes from transcripts, links them via grounding relationships, and writes a `gi.json` file per episode. This step runs after summarization and uses the same multi-provider architecture. See the `gi/` module.
- KG Extraction: When enabled, Knowledge Graph extraction produces structured topic graphs from transcripts and summaries, writing `kg.json` per episode. See the `kg/` module.
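The per-episode branching above can be reduced to a tiny decision function. A minimal sketch: the `Episode` fields and return values below are illustrative stand-ins, not the project's real model or `process_episode_download` behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    # Minimal stand-in for the project's Episode model (illustrative fields only).
    title: str
    transcript_url: Optional[str]
    media_url: Optional[str]


def process_episode_download(ep: Episode) -> str:
    """Sketch of the per-episode decision: save an existing transcript,
    otherwise enqueue the media for Whisper transcription."""
    if ep.transcript_url:
        return "download_transcript"
    if ep.media_url:
        return "queue_for_whisper"
    return "skip"  # nothing downloadable for this episode


print(process_episode_download(Episode("E1", "https://x/t.vtt", None)))
```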
### Pipeline Flow Diagram

```mermaid
flowchart TD
    Start([CLI Entry]) --> Parse[Parse CLI Args & Config Files]
    Parse --> Validate[Validate & Normalize Config]
    Validate --> Setup[Setup Output Directory]
    Setup --> FetchRSS[Fetch & Parse RSS Feed]
    FetchRSS --> ExtractEpisodes[Extract Episode Metadata]
    ExtractEpisodes --> DetectSpeakers{Speaker Detection Enabled?}
    DetectSpeakers -->|Yes| ExtractHosts[Extract Host Names from RSS]
    ExtractHosts --> ExtractGuests[Extract Guest Names via NER]
    DetectSpeakers -->|No| ProcessEpisodes[Process Episodes]
    ExtractGuests --> ProcessEpisodes
    ProcessEpisodes --> CheckTranscript{Transcript Available?}
    CheckTranscript -->|Yes| DownloadTranscript[Download Transcript]
    CheckTranscript -->|No| QueueWhisper[Queue for Whisper]
    DownloadTranscript --> SaveTranscript[Save Transcript File]
    QueueWhisper --> DownloadMedia[Download Media File]
    DownloadMedia --> Preprocess{Preprocessing Enabled?}
    Preprocess -->|Yes| PreprocessAudio[Preprocess Audio: Mono, 16kHz, VAD, Normalize, Opus]
    Preprocess -->|No| Transcribe
    PreprocessAudio --> Transcribe[Whisper Transcription]
    Transcribe --> FormatScreenplay[Format with Speaker Names]
    FormatScreenplay --> SaveTranscript
    SaveTranscript --> GenerateMetadata{Metadata Generation?}
    GenerateMetadata -->|Yes| CreateMetadata[Generate Metadata JSON/YAML]
    GenerateMetadata -->|No| Cleanup
    CreateMetadata --> GenerateSummary{Summarization Enabled?}
    GenerateSummary -->|Yes| Summarize[Generate Summary]
    GenerateSummary -->|No| GILCheck
    Summarize --> AddSummaryToMetadata[Add Summary to Metadata]
    AddSummaryToMetadata --> GILCheck{GIL Extraction?}
    GILCheck -->|Yes| ExtractGIL[Extract Insights + Quotes]
    GILCheck -->|No| KGCheck
    ExtractGIL --> GroundInsights[Ground Insights with Quotes]
    GroundInsights --> WriteGI[Write gi.json]
    WriteGI --> KGCheck{KG Extraction?}
    KGCheck -->|Yes| ExtractKG[Extract Topic Graph]
    KGCheck -->|No| Cleanup
    ExtractKG --> WriteKG[Write kg.json]
    WriteKG --> Cleanup[Cleanup Temp Files]
    Cleanup --> End([Complete])
    style Start fill:#e1f5ff
    style End fill:#d4edda
    style ProcessEpisodes fill:#fff3cd
    style Transcribe fill:#f8d7da
    style GenerateMetadata fill:#d1ecf1
    style ExtractGIL fill:#e8daef
    style ExtractKG fill:#d5f5e3
```
- `cli.py`: Parse/validate CLI arguments, integrate config files, set up progress reporting, trigger `run_pipeline`. Optimized for interactive command-line use.
- `service.py`: Service API for programmatic/daemon use. Provides `service.run()` and `service.run_from_config_file()` functions that return structured `ServiceResult` objects. Works exclusively with configuration files (no CLI arguments), optimized for non-interactive use (supervisor, systemd, etc.). Entry point: `python -m podcast_scraper.service --config config.yaml`.
- `config.py`: Immutable Pydantic model representing all runtime options; JSON/YAML loader with strict validation and normalization helpers. Includes language configuration, NER settings, and speaker detection flags (RFC-010).
- `workflow.orchestration`: Pipeline coordinator that orchestrates directory prep, RSS parsing, download concurrency, Whisper lifecycle, speaker detection coordination, and cleanup.
- `rss.parser`: Safe RSS/XML parsing using `defusedxml` (ADR-002), discovery of transcript/enclosure URLs, and creation of `Episode` models.
- `rss.downloader`: HTTP session pooling with retry-enabled adapters, streaming downloads, and shared progress hooks.
- `workflow.episode_processor`: Episode-level decision logic, transcript storage, Whisper job management, delay handling, and file naming rules. Integrates detected speaker names into Whisper screenplay formatting.
- `utils.filesystem`: Filename sanitization, output directory derivation based on feed hash (ADR-003), run suffix logic, and helper utilities for Whisper output paths.
- Provider System (RFC-013, RFC-029): Protocol-based provider architecture for transcription, speaker detection, and summarization (ADR-020). Each capability has a protocol interface (`TranscriptionProvider`, `SpeakerDetector`, `SummarizationProvider`) and factory functions that create provider instances based on configuration. Providers implement `initialize()`, protocol methods (e.g., `transcribe()`, `summarize()`), and `cleanup()`. See Provider Implementation Guide for details.
- Unified Providers (RFC-029): Nine unified provider classes implement protocol combinations (ADR-024):
| Provider | Transcription | Speaker Detection | Summarization | Notes |
|---|---|---|---|---|
| `MLProvider` | ✅ Whisper | ✅ spaCy NER | ✅ Transformers | Local, no API cost |
| `HybridMLProvider` | ❌ | ❌ | ✅ MAP-REDUCE | LongT5 MAP + Ollama/llama_cpp/transformers REDUCE (RFC-042) |
| `OpenAIProvider` | ✅ Whisper API | ✅ GPT API | ✅ GPT API | Cloud, prompt-managed |
| `GeminiProvider` | ✅ Gemini API | ✅ Gemini API | ✅ Gemini API | 2M context, native audio |
| `AnthropicProvider` | ❌ | ✅ Claude API | ✅ Claude API | High-quality reasoning |
| `MistralProvider` | ❌ | ✅ Mistral API | ✅ Mistral API | OpenAI alternative |
| `DeepSeekProvider` | ❌ | ✅ DeepSeek API | ✅ DeepSeek API | Ultra low-cost |
| `GrokProvider` | ❌ | ✅ Grok API | ✅ Grok API | Real-time info (xAI) |
| `OllamaProvider` | ❌ | ✅ Ollama API | ✅ Ollama API | Local LLM, zero cost |
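The protocol-plus-factory shape described above can be sketched with `typing.Protocol`. This is a simplified illustration: the method signatures and the `EchoProvider`/`create_summarizer` names are assumptions, not the project's actual interfaces.

```python
from typing import Protocol


class SummarizationProvider(Protocol):
    # Protocol named in this document; signatures here are simplified assumptions.
    def initialize(self) -> None: ...
    def summarize(self, text: str) -> str: ...
    def cleanup(self) -> None: ...


class EchoProvider:
    """Toy provider that satisfies the protocol structurally (duck typing)."""

    def initialize(self) -> None:
        pass

    def summarize(self, text: str) -> str:
        return text[:40]  # trivially "summarize" by truncation

    def cleanup(self) -> None:
        pass


def create_summarizer(name: str) -> SummarizationProvider:
    # Sketch of a factory: map a config value to a provider instance.
    registry = {"echo": EchoProvider}
    return registry[name]()


p = create_summarizer("echo")
p.initialize()
summary = p.summarize("A very long transcript that should be truncated by the toy provider.")
p.cleanup()
```

Because `Protocol` checks structure rather than inheritance, real providers never need to subclass anything shared, which keeps local ML and cloud LLM implementations decoupled.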
- Factories: Factory functions in `transcription/factory.py`, `speaker_detectors/factory.py`, and `summarization/factory.py` create the appropriate unified provider based on configuration.
- Capabilities: `providers/capabilities.py` defines `ProviderCapabilities` — a dataclass describing what each provider supports (JSON mode, tool calls, streaming, etc.). Used by factories and orchestration to select appropriate providers.
- Prompt Management (RFC-017): `prompts/store.py` implements versioned Jinja2 prompt templates organized by `<provider>/<task>/<version>.j2` (e.g., `openai/summarization/long_v1.j2`). Each of the 9 providers has tuned templates for summarization and NER. LLM providers load prompts via `PromptStore.render()`, ensuring consistent, version-tracked prompt engineering.
- `providers/ml/whisper_utils.py`: Lazy loading of the third-party `openai-whisper` library, transcription invocation with language-aware model selection (preferring `.en` variants for English), and screenplay formatting helpers that use detected speaker names. Accessed via `MLProvider` (unified provider pattern).
- `providers/ml/speaker_detection.py` (RFC-010): Named Entity Recognition using spaCy to extract PERSON entities from episode metadata, distinguish hosts from guests, and provide speaker names for Whisper screenplay formatting. spaCy is a required dependency. Accessed via `MLProvider` (unified provider pattern).
- `providers/ml/summarizer.py` (PRD-005/RFC-012): Episode summarization using local transformer models (BART, PEGASUS, LED) to generate concise summaries from transcripts. Implements a hybrid map-reduce strategy. Accessed via `MLProvider` (unified provider pattern). See ML Provider Reference for details.
- `providers/ml/model_registry.py` (RFC-044): Centralized model metadata registry (`ModelRegistry` class) with a `ModelCapabilities` dataclass for all models (summarization, embedding, QA, NLI).
- `gi/` (PRD-017): Grounded Insight Layer — structured insight and quote extraction with evidence grounding. Key modules: `pipeline.py` (orchestration), `schema.py` (validation), `grounding.py` (insight↔quote linking), `contracts.py` (grounding contract), `explore.py` (CLI exploration), `corpus.py` (cross-episode operations).
- `kg/` (RFC-055): Knowledge Graph extraction — structured topic graphs from transcripts and summaries. Key modules: `pipeline.py` (orchestration), `schema.py` (validation), `llm_extract.py` (LLM-based extraction), `cli_handlers.py` (CLI subcommands).
- `utils.progress`: Minimal global progress publishing API so callers can swap in alternative UIs.
- `models/` (package): Simple dataclasses (`RssFeed`, `Episode`, `TranscriptionJob` in `entities.py`) shared across modules.
- `workflow.metadata_generation` (PRD-004/RFC-011): Per-episode metadata document generation, capturing feed-level and episode-level information, detected speaker names, transcript sources, processing metadata, and optional summaries in structured JSON/YAML format. Opt-in feature for backwards compatibility.
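The `<provider>/<task>/<version>.j2` lookup can be sketched as follows. Illustration only: the real store renders Jinja2 templates; stdlib `string.Template` stands in here so the example needs no third-party dependency, and this `render()` signature is an assumption.

```python
import tempfile
from pathlib import Path
from string import Template


class PromptStore:
    """Sketch of a versioned prompt store resolving <provider>/<task>/<version>.j2.

    The project's store uses Jinja2; string.Template is a dependency-free stand-in.
    """

    def __init__(self, root: Path):
        self.root = root

    def render(self, provider: str, task: str, version: str, **vars) -> str:
        path = self.root / provider / task / f"{version}.j2"  # versioned on disk
        return Template(path.read_text()).substitute(**vars)


# Build a throwaway template tree and render one template from it:
root = Path(tempfile.mkdtemp())
(root / "openai" / "summarization").mkdir(parents=True)
(root / "openai" / "summarization" / "long_v1.j2").write_text(
    "Summarize this transcript:\n$transcript"
)
prompt = PromptStore(root).render(
    "openai", "summarization", "long_v1", transcript="...episode text..."
)
```

Keeping version in the filename means prompt changes are new files, so a run manifest can pin exactly which prompt text produced an output.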
### Module Dependencies Diagram
```mermaid
graph TB
    subgraph "Public API"
        CLI[cli.py]
        Config[config.py]
        Service[service.py]
        Workflow[workflow/orchestration.py]
    end
    subgraph "Core Processing"
        RSSParser[rss/parser.py]
        EpisodeProc[workflow/episode_processor.py]
        Downloader[rss/downloader.py]
        AudioPreproc[preprocessing/]
    end
    subgraph "Support Modules"
        Filesystem[utils/filesystem.py]
        Models[models/]
        Progress[utils/progress.py]
        Schemas[schemas/summary_schema.py]
    end
    subgraph "Prompt Management"
        PromptStore[prompts/store.py]
        PromptTemplates[prompts/provider/task/v1.j2]
    end
    subgraph "Provider System"
        TranscriptionFactory[transcription/factory.py]
        SpeakerFactory[speaker_detectors/factory.py]
        SummaryFactory[summarization/factory.py]
        Capabilities[providers/capabilities.py]
    end
    subgraph "Local ML Provider"
        MLProvider[providers/ml/ml_provider.py]
        HybridMLProvider[providers/ml/hybrid_ml_provider.py]
        ModelRegistry[providers/ml/model_registry.py]
        Whisper[providers/ml/whisper_utils.py]
        SpeakerDetect[providers/ml/speaker_detection.py]
        Summarizer[providers/ml/summarizer.py]
        NER[providers/ml/ner_extraction.py]
    end
    subgraph "LLM Providers"
        OpenAIProvider[providers/openai/]
        GeminiProvider[providers/gemini/]
        AnthropicProvider[providers/anthropic/]
        MistralProvider[providers/mistral/]
        DeepSeekProvider[providers/deepseek/]
        GrokProvider[providers/grok/]
        OllamaProvider[providers/ollama/]
    end
    subgraph "Knowledge Extraction"
        GI[gi/pipeline.py]
        GISchema[gi/schema.py]
        GIGrounding[gi/grounding.py]
        KG[kg/pipeline.py]
        KGSchema[kg/schema.py]
        KGExtract[kg/llm_extract.py]
    end
    subgraph "Optional Features"
        Metadata[workflow/metadata_generation.py]
        Evaluation[evaluation/]
        Cleaning[cleaning/]
    end
    CLI --> Config
    CLI --> Workflow
    Service --> Config
    Service --> Workflow
    Workflow --> RSSParser
    Workflow --> EpisodeProc
    Workflow --> Downloader
    Workflow --> TranscriptionFactory
    Workflow --> SpeakerFactory
    Workflow --> SummaryFactory
    Workflow --> AudioPreproc
    TranscriptionFactory --> MLProvider
    TranscriptionFactory --> OpenAIProvider
    TranscriptionFactory --> GeminiProvider
    SpeakerFactory --> MLProvider
    SpeakerFactory --> OpenAIProvider
    SpeakerFactory --> GeminiProvider
    SpeakerFactory --> AnthropicProvider
    SpeakerFactory --> MistralProvider
    SpeakerFactory --> DeepSeekProvider
    SpeakerFactory --> GrokProvider
    SpeakerFactory --> OllamaProvider
    SummaryFactory --> MLProvider
    SummaryFactory --> HybridMLProvider
    SummaryFactory --> OpenAIProvider
    SummaryFactory --> GeminiProvider
    SummaryFactory --> AnthropicProvider
    SummaryFactory --> MistralProvider
    SummaryFactory --> DeepSeekProvider
    SummaryFactory --> GrokProvider
    SummaryFactory --> OllamaProvider
    MLProvider --> Whisper
    MLProvider --> SpeakerDetect
    MLProvider --> Summarizer
    MLProvider --> NER
    MLProvider --> ModelRegistry
    HybridMLProvider --> Summarizer
    HybridMLProvider --> ModelRegistry
    Summarizer --> ModelRegistry
    OpenAIProvider --> PromptStore
    GeminiProvider --> PromptStore
    AnthropicProvider --> PromptStore
    MistralProvider --> PromptStore
    DeepSeekProvider --> PromptStore
    GrokProvider --> PromptStore
    OllamaProvider --> PromptStore
    PromptStore --> PromptTemplates
    Workflow --> Metadata
    Workflow --> GI
    Workflow --> KG
    Workflow --> Filesystem
    GI --> GISchema
    GI --> GIGrounding
    GI --> SummaryFactory
    KG --> KGSchema
    KG --> KGExtract
    KGExtract --> SummaryFactory
    EpisodeProc --> Downloader
    EpisodeProc --> Filesystem
    RSSParser --> Models
    Summarizer --> Models
    SpeakerDetect --> Models
    TranscriptionFactory --> Capabilities
    SpeakerFactory --> Capabilities
    SummaryFactory --> Capabilities
```
**Actual import relationships (pydeps):**

The diagram below shows the actual import relationships between modules, generated from code analysis. Compare with the Mermaid diagram above to validate that implementation matches design.

Generated by `pydeps`. Regenerate with `make visualize` (requires Graphviz).

**Workflow call graph (function-level):**

The following diagram shows which functions call which in the pipeline entry point (`workflow/orchestration.py`). Useful for understanding orchestration flow and hot paths.

Generated by `pyan3`.

**Flowcharts:** For control flow within key modules, see the orchestration flowchart and service API flowchart (generated by `code2flow`).
- Typed, immutable configuration: `Config` is a frozen Pydantic model, ensuring every module receives canonicalized values (e.g., normalized URLs, integer coercions, validated Whisper models). This centralizes validation and guards downstream logic.
- Resilient HTTP interactions: A per-thread `requests.Session` with exponential backoff retry (`LoggingRetry`) handles transient network issues while logging retries for observability. Model loading operations use `retry_with_exponential_backoff` for transient errors (network failures, timeouts).
- Concurrent transcript pulls: Transcript downloads are parallelized via `ThreadPoolExecutor`, guarded with locks when mutating shared counters/job queues. Whisper remains sequential to avoid GPU/CPU thrashing and to keep the UX predictable.
- Deterministic filesystem layout: Output folders follow `output/rss_<host>_<hash>` conventions. Optional `run_id` and Whisper suffixes create run-scoped subdirectories, while `sanitize_filename` protects against filesystem hazards.
- Dry-run and resumability: `--dry-run` walks the entire plan without touching disk, while `--skip-existing` short-circuits work per episode, making repeated runs idempotent.
- Pluggable progress/UI: A narrow `ProgressFactory` abstraction lets embedding applications replace the default `tqdm` progress without touching business logic.
- Optional Whisper dependency: Whisper is imported lazily and guarded so environments without GPU support or `openai-whisper` can still run transcript-only workloads.
- Optional summarization dependency (PRD-005/RFC-012): Summarization requires the `torch` and `transformers` dependencies and is imported lazily. When dependencies are unavailable, summarization is gracefully skipped. Models are automatically selected based on available hardware (MPS for Apple Silicon, CUDA for NVIDIA GPUs, CPU fallback). See ML Provider Reference for details.
- Language-aware processing (RFC-010): A single `language` configuration drives both Whisper model selection (preferring English-only `.en` variants) and NER model selection (e.g., `en_core_web_sm`), ensuring consistent language handling across the pipeline.
- Automatic speaker detection (RFC-010): Named Entity Recognition extracts speaker names from episode metadata transparently. Manual speaker names (`--speaker-names`) are ONLY used as fallback when automatic detection fails, not as an override. spaCy is a required dependency for speaker detection.
- Host/guest distinction: Host detection prioritizes RSS author tags (channel-level only) as the most reliable source, falling back to NER extraction from feed metadata when author tags are unavailable. Guests are always detected from episode-specific metadata using NER, ensuring accurate speaker labeling in Whisper screenplay output.
- Provider-based architecture (RFC-013): All capabilities (transcription, speaker detection, summarization) use a protocol-based provider system. Providers are created via factory functions based on configuration, allowing pluggable implementations (e.g., Whisper vs OpenAI for transcription, NER vs OpenAI for speaker detection, local transformers vs OpenAI for summarization). Providers implement consistent interfaces (`initialize()`, protocol methods, `cleanup()`), ensuring type safety and easy testing. See Provider Implementation Guide for complete implementation details.
- Local-first summarization (PRD-005/RFC-012): Summarization defaults to local transformer models for privacy and cost-effectiveness. API-based providers (OpenAI) are supported via the provider system. Long transcripts are handled via chunking strategies, and memory optimization is applied for GPU backends (CUDA/MPS). Models are automatically cached and reused across runs, with cache management utilities available via CLI and programmatic APIs. Model loading prefers the safetensors format for security and performance (Issue #379). Pinned model revisions ensure reproducibility (Issue #379).
- Reproducibility (Issue #379): Deterministic runs via seed control (`torch`, `numpy`, `transformers`). Run manifests capture complete system state. Per-episode stage timings enable performance analysis. Run summaries combine manifest and metrics for complete run records.
- Operational Hardening (Issue #379): Retry policies with exponential backoff for transient errors. Timeout enforcement for transcription and summarization. Failure handling flags (`--fail-fast`, `--max-failures`) for pipeline control. Structured JSON logging for log aggregation. Path validation and model allowlist validation for security.
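The path-validation hardening mentioned above can be sketched with `pathlib`. This is an illustration of the directory-traversal check, not the project's actual helper (whose name and signature are not shown in this document); it assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def safe_join(base: Path, user_supplied: str) -> Path:
    """Reject paths that escape the base directory (directory traversal).

    Illustrative sketch; the real validation helper may differ.
    """
    resolved_base = base.resolve()
    candidate = (resolved_base / user_supplied).resolve()
    if not candidate.is_relative_to(resolved_base):  # Python 3.9+
        raise ValueError(f"path escapes {resolved_base}: {user_supplied}")
    return candidate


# A normal relative path is accepted; "../.." escapes and is rejected:
ok = safe_join(Path("/tmp/output"), "episode_01/transcript.txt")
```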
## Architecture Evolution
The system has evolved through four phases that expand ML capabilities and enable structured knowledge extraction. This section documents the architectural progression and how components integrate with the existing system.
### Phase 1: Model Registry (RFC-044)
Status: Implemented
Centralizes all model metadata (architecture limits, capabilities, memory footprint, device defaults) into a `ModelRegistry` class in `providers/ml/model_registry.py`. Eliminates hardcoded model limits scattered across the codebase. Every model (summarization, embedding, QA, NLI) is registered with a `ModelCapabilities` dataclass.

Impact on existing architecture:

- `providers/ml/summarizer.py` — uses the registry for token limits instead of hardcoded values
- `config.py` — the registry validates model selections
- Module: `providers/ml/model_registry.py`
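The registry idea can be sketched as follows. The field names on `ModelCapabilities` and the `register`/`get` methods are illustrative assumptions; only the class names and the registry's purpose come from RFC-044 as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    # Field names are illustrative assumptions; the real dataclass differs.
    name: str
    task: str             # e.g. "summarization", "embedding", "qa", "nli"
    max_input_tokens: int
    device_default: str   # e.g. "cpu", "cuda", "mps"


class ModelRegistry:
    """Minimal sketch of a central registry replacing scattered hardcoded limits."""

    def __init__(self):
        self._models = {}

    def register(self, caps: ModelCapabilities) -> None:
        self._models[caps.name] = caps

    def get(self, name: str) -> ModelCapabilities:
        return self._models[name]


registry = ModelRegistry()
registry.register(
    ModelCapabilities("facebook/bart-large-cnn", "summarization", 1024, "cpu")
)
# Consumers ask the registry instead of hardcoding the 1024-token BART limit:
limit = registry.get("facebook/bart-large-cnn").max_input_tokens
```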
### Phase 2: Hybrid ML Platform (RFC-042)
Status: Implemented (Hybrid MAP-REDUCE summarization + embedding/QA/NLI extensions).
Implemented:

- Hybrid MAP-REDUCE summarization: Use `summary_provider: hybrid_ml` with a MAP phase (LongT5-base or other transformers) and a REDUCE phase via transformers (FLAN-T5), ollama (local LLMs, e.g. llama3.1:8b, mistral:7b), or llama_cpp (GGUF). See ML Provider Reference and the Ollama Provider Guide (Ollama as REDUCE backend).
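The MAP-REDUCE shape can be sketched with stub "models" so the example runs without ML dependencies. Everything below is an illustration of the control flow only: the chunking rule and the `first_sentence` stub stand in for LongT5 (MAP) and FLAN-T5/Ollama/llama.cpp (REDUCE).

```python
def chunk(text: str, size: int) -> list:
    """Split a long transcript into fixed-size chunks for the MAP phase."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def map_summarize(chunks, summarize_chunk):
    # MAP: summarize each chunk independently (LongT5 in the real provider).
    return [summarize_chunk(c) for c in chunks]


def reduce_summaries(partials, summarize_all):
    # REDUCE: fuse partial summaries into one final summary
    # (FLAN-T5 / Ollama / llama.cpp in the real provider).
    return summarize_all("\n".join(partials))


def first_sentence(text: str) -> str:
    # Toy stand-in for a summarization model.
    return text.split(".")[0].strip() + "."


transcript = "Intro banter. Guest explains topic. Hosts recap takeaways."
summary = reduce_summaries(
    map_summarize(chunk(transcript, 30), first_sentence),
    first_sentence,
)
```

The point of the split is that the MAP model only ever sees chunk-sized inputs, so transcripts far beyond any single model's context window still summarize.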
New modules (present):

- `providers/ml/hybrid_ml_provider.py` — Hybrid MAP-REDUCE provider; REDUCE backends: transformers, Ollama, llama.cpp
Implemented (Phase 2 extensions):

- `providers/ml/embedding_loader.py` — Sentence-transformers for topic deduplication
- `providers/ml/extractive_qa.py` — Extractive QA for grounding validation
- `providers/ml/nli_loader.py` — NLI cross-encoders for entailment checking
- `MLProvider` extensions with embedding, QA, NLI models (lazy-loaded)
### Phase 2b: Local LLM Prompt Optimization (RFC-052)
Status: Planned (parallel with Phase 2)
Creates model-specific prompt templates for Ollama models (Qwen2.5, Llama 3.1, Mistral 7B, Gemma 2, Phi-3). Adds GIL extraction prompts (`extraction/insight_v1.j2`, `topic_v1.j2`, `quote_v1.j2`) to the existing `prompts/` structure.
Impact on existing architecture:

- New prompt directories under `prompts/ollama/`
- No structural changes; extends the existing `PromptStore`
### Phase 3: Grounded Insight Layer (RFC-049)
Status: Implemented
Structured knowledge extraction in the pipeline:

Transcript → Insight Extraction → Quote Extraction → Grounding (Insight↔Quote linking) → Topic Assignment → `gi.json`
Modules (implemented):

- `gi/pipeline.py` — GIL orchestration (called after summarization in the pipeline)
- `gi/schema.py` — `gi.json` validation against `docs/architecture/gi/gi.schema.json`
- `gi/io.py` — Serializes GIL output to `gi.json` per episode
- `gi/grounding.py` — Insight↔Quote grounding logic
- `gi/contracts.py` — Grounding contract enforcement
- `gi/explore.py` — GIL data exploration utilities
- `gi/corpus.py` — Cross-episode corpus operations
- `gi/quality_metrics.py` — GIL quality scoring
- `gi/provenance.py` — Provenance tracking
- `gi/compare_runs.py` — Cross-run comparison
Three extraction tiers:
| Tier | Models | Quality | Cost |
|---|---|---|---|
| ML-only | FLAN-T5 + RoBERTa QA + NLI | Good | Free |
| Hybrid | Ollama (Qwen/Llama) + QA | Better | Free |
| Cloud LLM | OpenAI/Gemini + QA | Best | API cost |
All tiers use extractive QA for grounding contract compliance (every quote must be verbatim).
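The verbatim requirement can be sketched as a simple check. This is an illustration only: the real contract lives in `gi/contracts.py`, and the whitespace-insensitive matching rule below is an assumption about how "verbatim" tolerates line wrapping.

```python
import re


def normalize_ws(text: str) -> str:
    """Collapse whitespace so line wrapping doesn't break verbatim matching."""
    return re.sub(r"\s+", " ", text).strip()


def quote_is_verbatim(quote: str, transcript: str) -> bool:
    # Grounding contract (sketch): every extracted quote must appear verbatim
    # in the transcript, modulo whitespace. Assumed matching rule, not the
    # project's actual implementation.
    return normalize_ws(quote) in normalize_ws(transcript)


transcript = "HOST: Welcome back.\nGUEST: I think caching is\nthe whole game."
assert quote_is_verbatim("I think caching is the whole game.", transcript)
assert not quote_is_verbatim("Caching is everything.", transcript)
```

A paraphrase fails the check even when it is semantically faithful, which is exactly what keeps insights auditable back to the source audio.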
### Phase 3a: KG Extraction (RFC-055)
Status: Implemented
Knowledge Graph extraction produces structured topic graphs from transcripts and summaries.
Modules (implemented):

- `kg/pipeline.py` — KG extraction orchestration
- `kg/schema.py` — KG schema validation
- `kg/io.py` — Serializes KG output to `kg.json`
- `kg/llm_extract.py` — LLM-based KG extraction
- `kg/contracts.py` — KG contract enforcement
- `kg/corpus.py` — Cross-episode KG operations
- `kg/quality_metrics.py` — KG quality scoring
- `kg/cli_handlers.py` — KG CLI subcommands
### Phase 3b: Use Cases & DB Projection
RFC-050 (Use Cases): Implemented. CLI commands (`gi inspect`, `gi show-insight`, `gi explore`) for consuming GIL data. See `gi/explore.py`.

RFC-051 (Database Projection): Projects `gi.json` (GIL) and KG artifacts (RFC-055) into separate Postgres tables for fast SQL queries (e.g. GIL: insights, quotes, insight_support; KG: kg_nodes, kg_edges per RFC-051). Enables Insight Explorer, notebook workflows, and KG discovery queries.
Phase 4: Adaptive Routing (RFC-053)¶
Status: Planned
Selects optimal summarization and extraction strategies based on episode characteristics (duration, structure, content type). Uses episode profiling and deterministic routing rules. Enables expansion beyond podcasts to interviews, lectures, panels, etc.
Execution Order Summary¶
Phase 1: RFC-044 (Model Registry) ✅ Implemented
Phase 2: RFC-042 (Hybrid ML Platform) ✅ Implemented (core + extensions)
2b: RFC-052 (LLM Prompts) parallel
Phase 3: RFC-049 (GIL Core) ✅ Implemented
3a: RFC-050 (Use Cases) ✅ Implemented
3b: RFC-051 (DB Projection) planned
3c: KG Extraction (RFC-055) ✅ Implemented
Phase 4: RFC-053 (Adaptive Routing) planned
Third-Party Dependencies¶
The project uses a layered dependency approach: core dependencies (always required) provide essential functionality, while ML dependencies (optional) enable advanced features like transcription and summarization.
Core Dependencies: `requests`, `pydantic`, `defusedxml`, `tqdm`, `platformdirs`, `PyYAML`, `Jinja2`
ML Dependencies (optional, install via `pip install -e .[ml]`): `openai-whisper`, `spacy`, `torch`, `transformers`, `sentencepiece`, `accelerate`, `protobuf`
API Provider Dependencies (optional):
- `openai` — OpenAI, DeepSeek, Grok (OpenAI-compatible)
- `google-genai` — Google Gemini (migrated from `google-generativeai` in Issue #415)
- `anthropic` — Anthropic Claude
- `mistralai` — Mistral AI
- `ollama` — Ollama (local LLMs)
Additional ML Dependencies (RFC-042): `sentence-transformers`, `llama-cpp-python` (optional)
For detailed dependency information including rationale, alternatives considered, version requirements, and dependency management philosophy, see Dependencies Guide.
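The layered approach typically pairs with a lazy-import guard, so a missing ML extra fails with an actionable message rather than a bare `ImportError`. A minimal sketch (the project's actual guard helpers, if any, may be named differently):

```python
import importlib

# Hypothetical guard for optional ML dependencies: import lazily and
# point the user at the extras install command on failure.


def require_ml(package: str):
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise RuntimeError(
            f"{package!r} is required for this feature. "
            "Install ML extras with: pip install -e .[ml]"
        ) from exc
```

Core code paths never pay the import cost, and the error message tells the operator exactly which install target to use.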
Module Dependency Analysis¶
The project uses pydeps for visualizing module dependencies, detecting circular imports, and tracking architectural health over time. This tooling helps maintain clean module boundaries and identify coupling issues early. The dependency graph is integrated into this document above (see Module Dependencies Diagram).
Tools and Commands¶
Makefile Targets:
- `make deps-graph` — Generate module dependency graphs (SVG) in `docs/architecture/`
- `make deps-graph-full` — Generate full module dependency graph with all dependencies
- `make call-graph` — Generate workflow call graph (pyan3, orchestration entry point)
- `make flowcharts` — Generate flowcharts for orchestration and service (code2flow)
- `make visualize` — Generate all architecture visualizations (deps + call graph + flowcharts) into `docs/architecture/`
- `make release-docs-prep` — Regenerate diagrams and create release notes draft before release, then commit
- `make deps-check` — Check dependencies and exit with error if issues found
- `make deps-analyze` — Run full dependency analysis with JSON report
Analysis Script:
- `python scripts/tools/analyze_dependencies.py` — Analyze dependencies and detect issues
  - `--check` — Exit with error if issues found
  - `--report` — Generate detailed JSON report
Note: Generating SVG graphs requires Graphviz (`dot` on PATH). Install with `brew install graphviz` (macOS) or your system package manager.
Key Metrics¶
| Metric | Description | Threshold |
|---|---|---|
| Max depth | Longest dependency chain | <5 levels |
| Circular imports | Cycles in import graph | 0 |
| Fan-out | Modules a file imports | <15 |
| Fan-in | Modules importing a file | Monitor only |
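These metrics can be computed directly from an import adjacency map. The sketch below (toy module names, a plain-Python DFS rather than pydeps internals) shows fan-out counting and cycle detection:

```python
# Toy import graph: module -> modules it imports (illustrative names).
imports = {
    "cli": ["config", "workflow"],
    "workflow": ["config", "rss"],
    "rss": ["config"],
    "config": [],
}

# Fan-out: how many modules each file imports (threshold < 15).
fan_out = {mod: len(deps) for mod, deps in imports.items()}


# Circular imports: detect any cycle with a coloring DFS (threshold 0).
def has_cycle(graph: dict) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {m: WHITE for m in graph}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for dep in graph.get(node, []):
            if color.get(dep) == GRAY:
                return True  # back edge => cycle
            if color.get(dep, BLACK) == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[m] == WHITE and visit(m) for m in graph)
```

Fan-in is the same count taken over the transposed graph, and max depth is the longest path in the (acyclic) graph.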
CI Integration¶
Dependency analysis runs automatically in the nightly workflow:
- Generates dependency graphs (SVG) for visualization
- Checks for circular imports
- Runs full dependency analysis with JSON report
- Artifacts are uploaded for download (90-day retention)
Output Files¶
- `docs/architecture/diagrams/dependency-graph.svg` — Module dependency graph (clustered, for documentation)
- `docs/architecture/diagrams/dependency-graph-simple.svg` — Simplified dependency graph (clustered, max-bacon=2)
- `docs/architecture/diagrams/dependency-graph-full.svg` — Full dependency graph with all dependencies (from `make deps-graph-full`)
- `docs/architecture/diagrams/workflow-call-graph.svg` — Function call graph for orchestration (from `make call-graph`, pyan3)
- `docs/architecture/diagrams/orchestration-flow.svg`, `service-flow.svg` — Flowcharts for orchestration and service (from `make flowcharts`, code2flow)
- `reports/deps-analysis.json` — Detailed analysis report (when using `make deps-analyze` or the script's `--report` flag)
Related Documentation¶
- Architecture visualizations - Generated diagrams in `docs/architecture/diagrams/`
- RFC-038: Continuous Review Tooling - Module coupling analysis implementation
- Issue #170 - Module coupling analysis tooling
- Issue #425 - Codebase visualization tools for documentation
- CI/CD Documentation - Nightly workflow details
Constraints and Assumptions¶
- Python 3.10+ with third-party packages: `requests`, `tqdm`, `defusedxml`, `platformdirs`, `pydantic`, `PyYAML`, `Jinja2`, `spacy` (required for speaker detection), and optionally `openai-whisper` + `ffmpeg` when ML transcription is required, and optionally `torch` + `transformers` when ML summarization is required.
- Network-facing operations assume well-formed HTTPS endpoints; malformed feeds raise early during parsing to avoid partial state.
- The system supports 9 providers with different capability profiles (see provider table above). Provider selection is driven by `Config` fields (`transcription_provider`, `speaker_detector_provider`, `summary_provider`).
- Whisper transcription supports multiple languages via `language` configuration, with English (`"en"`) as the default. Transcription remains sequential by design; concurrent transcription is intentionally out of scope due to typical hardware limits.
- Speaker name detection via NER (RFC-010) requires spaCy. When automatic detection fails, the system falls back to manual speaker names (if provided) or default `["Host", "Guest"]` labels.
- Output directories must live in safe roots (cwd, user home, or platform data/cache dirs); other locations trigger warnings for operator review.
- GIL extraction produces `gi.json` per episode conforming to a versioned schema. The grounding contract requires every quote to be verbatim and every insight to declare grounding status.
- KG extraction produces `kg.json` per episode with structured topic graphs derived from transcripts and summaries.
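The safe-roots constraint can be sketched as a simple ancestry check (illustrative; the real check also consults `platformdirs` for platform data/cache directories):

```python
from pathlib import Path

# Sketch: an output directory is "safe" if it sits under a trusted root.
# Only cwd and home are checked here; the real implementation also covers
# platform data/cache dirs.


def in_safe_root(output_dir: Path) -> bool:
    target = output_dir.resolve()
    roots = [Path.cwd().resolve(), Path.home().resolve()]
    return any(target == root or root in target.parents for root in roots)
```

Callers would emit a warning, not an error, when this returns `False`, leaving the decision to the operator.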
Configuration Flow¶
flowchart TD
Input[CLI Args + Config Files] --> Merge[Merge Sources]
Merge --> Validate[Pydantic Validation]
Validate --> Normalize[Normalize Values]
Normalize --> Config[Immutable Config Object]
Config --> Workflow[workflow.orchestration.run_pipeline]
Config --> EpisodeProc[workflow.episode_processor]
Config --> RSSParser[rss.parser]
Config --> Downloader[rss.downloader]
Config --> Whisper[providers.ml.whisper_utils]
Config --> SpeakerDetect[providers.ml.speaker_detection]
Config --> Metadata[workflow.metadata_generation]
Config --> Summarizer[providers.ml.summarizer]
style Input fill:#e1f5ff
style Config fill:#fff3cd
style Validate fill:#f8d7da
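The flow above can be sketched with a frozen dataclass standing in for the real Pydantic `Config` (field names and the provider set here are illustrative): merge, validate, normalize, freeze.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Immutable config object (the project uses Pydantic; fields illustrative)."""
    summary_provider: str
    language: str = "en"


def build_config(file_values: dict, cli_args: dict) -> Config:
    # Merge: CLI args override file values; unset CLI args (None) are ignored.
    merged = {**file_values, **{k: v for k, v in cli_args.items() if v is not None}}
    # Validate (Pydantic does this declaratively in the real pipeline).
    if merged.get("summary_provider") not in {"local", "openai", "ollama"}:
        raise ValueError(f"unknown summary_provider: {merged.get('summary_provider')}")
    # Normalize, e.g. lowercase the language code.
    merged["language"] = merged.get("language", "en").lower()
    return Config(**merged)
```

Because the object is frozen, every downstream consumer in the diagram reads the same validated values.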
- `models.Episode` encapsulates the RSS item, chosen transcript URLs, and media enclosure metadata, keeping parsing concerns separate from processing.
- Transcript filenames follow `<####> - <episode_title>[_<run_suffix>].<ext>` with extensions inferred from declared types, HTTP headers, or URL heuristics.
- Whisper output names append the Whisper model/run identifier to differentiate multiple experimental runs inside the same base directory. Screenplay formatting uses detected speaker names when available.
- Temporary media downloads land in `<output>/.tmp_media/` and always get cleaned up (best effort) after transcription completes.
- Episode metadata documents (per PRD-004/RFC-011) are generated when `generate_metadata` is enabled, storing detected speaker names, feed information, transcript sources, and other episode details alongside transcripts in JSON/YAML format for downstream use cases. When summarization is enabled, metadata documents include summary and key takeaways fields with model information and generation timestamps.
- GIL artifacts (PRD-017): When GIL extraction is enabled, a `gi.json` file is generated per episode containing structured insights, verbatim quotes, topics, and their grounding relationships. The file conforms to `docs/architecture/gi/gi.schema.json` and is co-located with other episode artifacts.
- KG artifacts (RFC-055): When KG extraction is enabled, a `kg.json` file is generated per episode containing structured topic graphs.
- Run tracking files (Issue #379, #429): The pipeline writes `run.json`, `index.json`, `run_manifest.json`, and `metrics.json` in each run directory. See Pipeline and Workflow Guide - Run tracking files for details.
Filesystem Layout¶
graph TD
Root[output/rss_hostname_hash/] --> RunDir{Run ID<br/>Specified?}
RunDir -->|Yes| RunSubdir[run_id/]
RunDir -->|No| BaseDir[Base Directory]
RunSubdir --> Episodes[Episode Files]
BaseDir --> Episodes
Episodes --> Transcript1["0001 - Episode Title.txt"]
Episodes --> Transcript2["0002 - Episode Title.txt"]
Episodes --> WhisperDir{Whisper<br/>Used?}
WhisperDir -->|Yes| WhisperSubdir[whisper_model_run/]
WhisperSubdir --> WhisperFiles["0001 - Episode Title_whisper_base.txt"]
Episodes --> Metadata{Metadata<br/>Enabled?}
Metadata -->|Yes| MetaFiles["0001 - Episode Title.json<br/>0001 - Episode Title.yaml"]
Episodes --> GIL{GIL<br/>Enabled?}
GIL -->|Yes| GIFile["gi.json<br/>Insights + Quotes + Edges"]
Episodes --> KGCheck{KG<br/>Enabled?}
KGCheck -->|Yes| KGFile["kg.json<br/>Topic Graph"]
Root --> TempDir[.tmp_media/]
TempDir --> TempFiles[Temporary Media Files<br/>Cleaned Up After Use]
style Root fill:#e1f5ff
style Episodes fill:#fff3cd
style TempDir fill:#f8d7da
style Metadata fill:#d1ecf1
- RSS and HTTP failures raise `ValueError` early with descriptive messages; CLI wraps these in exit codes for scripting.
- Transcript/media downloads log warnings rather than hard-fail the pipeline, allowing other episodes to proceed.
- Filesystem operations sanitize user-provided paths, emit warnings when outside trusted roots, and handle I/O errors gracefully.
- Unexpected exceptions inside worker futures are caught and logged without terminating the executor loop.
For detailed error handling patterns and implementation guidelines, see Development Guide - Error Handling.
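The worker-future behavior can be sketched as follows (illustrative; the real executor loop lives in the workflow modules): a failing episode is logged, and the loop continues.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def process(episode: str) -> str:
    # Stand-in for per-episode work; one episode deliberately fails.
    if episode == "bad":
        raise RuntimeError("download failed")
    return f"done: {episode}"


episodes = ["ep1", "bad", "ep2"]
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(process, ep): ep for ep in episodes}
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except Exception:
            # One failing episode must not terminate the executor loop.
            logger.warning("episode %s failed", futures[fut], exc_info=True)
```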
Extensibility Points¶
- Configuration: Extend `Config` (and CLI) when introducing new features; validation rules keep downstream logic defensive. Language and NER configuration (RFC-010) demonstrate this pattern.
- Progress: Replace `progress.set_progress_factory` to integrate with custom UIs or disable progress output entirely.
- Download strategy: `downloader` centralizes HTTP behavior; alternate adapters or auth strategies can be injected by decorating `fetch_url`/`http_get`.
- Episode transforms: New transcript processors can reuse `models.Episode` and `filesystem` helpers without modifying the main pipeline.
- CLI embedding: `cli.main` accepts override callables (`apply_log_level_fn`, `run_pipeline_fn`, `logger`) to facilitate testing and reuse from other entry points.
- Language support: Language configuration drives both Whisper and NER model selection, enabling multi-language support through consistent configuration. New languages can be added by extending model selection logic and spaCy model support.
- Metadata generation (PRD-004/RFC-011): Metadata document generation is opt-in and can be extended with additional fields or alternative output formats. The schema is versioned to support future evolution.
- Provider system (RFC-013): The provider architecture enables extensibility for all capabilities. New providers can be added by implementing protocol interfaces and registering in factory functions. The system supports 9 providers: 1 local ML + 1 hybrid ML + 7 LLM (OpenAI, Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama). E2E testing infrastructure includes mock endpoints for API providers. See Provider Implementation Guide for complete implementation patterns.
- Prompt templates (RFC-017): LLM providers use versioned Jinja2 prompt templates managed by `PromptStore`. New tasks can be added by creating `<provider>/<task>/<version>.j2` files. GIL extraction prompts will follow this pattern.
- Summarization (PRD-005/RFC-012): Summarization is opt-in and integrated with metadata generation. Local transformer models are preferred for privacy and cost-effectiveness, with automatic hardware-aware model selection. The implementation supports multiple model architectures (BART, PEGASUS, LED, DistilBART) and all 7 LLM providers via prompt templates. Long transcript handling via chunking strategies ensures scalability.
- GIL Extraction (PRD-017): The Grounded Insight Layer is an opt-in pipeline stage that produces `gi.json` files per episode. The three-tier extraction model (ML-only, Hybrid, Cloud LLM) reuses the existing provider architecture. New extraction capabilities can be added by implementing the `StructuredExtractor` protocol (RFC-042). The `gi.json` schema is versioned and validated against `docs/architecture/gi/gi.schema.json`.
- KG Extraction (RFC-055): Knowledge Graph extraction is an opt-in pipeline stage that produces `kg.json` files per episode. Supports stub, summary-bullet-derived, and LLM-based extraction modes via the `kg_extraction_source` config.
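As a sketch of the protocol-based extension point, assuming an `extract(transcript) -> dict` shape (the `StructuredExtractor` name comes from RFC-042, but this exact signature is illustrative):

```python
from typing import Protocol


class StructuredExtractor(Protocol):
    """Assumed protocol shape; the RFC-042 interface may differ."""

    def extract(self, transcript: str) -> dict: ...


class KeywordExtractor:
    """Toy implementation: one 'insight' per sentence containing a keyword."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword

    def extract(self, transcript: str) -> dict:
        hits = [s.strip() for s in transcript.split(".") if self.keyword in s]
        return {"insights": [{"text": h} for h in hits]}


def run_extraction(extractor: StructuredExtractor, transcript: str) -> dict:
    # The pipeline only depends on the protocol, not a concrete class.
    return extractor.extract(transcript)
```

Any object satisfying the protocol can be dropped into the pipeline without registration-time inheritance, which is what keeps the three extraction tiers interchangeable.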
Testing¶
The project follows a three-tier testing strategy (Unit, Integration, E2E). For comprehensive testing information:
| Document | Purpose |
|---|---|
| Testing Strategy | Testing philosophy, test pyramid, decision criteria |
| Testing Guide | Quick reference, test execution commands |
| Unit Testing Guide | Unit test mocking patterns and isolation |
| Integration Testing Guide | Integration test guidelines |
| E2E Testing Guide | E2E server, real ML models |
| Critical Path Testing Guide | What to test, prioritization |
| Provider Implementation Guide | Provider-specific testing |