RFC-003: Transcript Download Processing
- Status: Completed
- Authors: GPT-5 Codex (initial documentation)
- Stakeholders: Maintainers, networking contributors
- Related PRD: `docs/prd/PRD-001-transcript-pipeline.md`
- Related guide: RSS and feed ingestion (feed fetch, parsing, and `Episode` construction before transcript download)
- Related ADRs:
  - ADR-001: Hybrid Concurrency Strategy
  - ADR-003: Deterministic Feed Storage
  - ADR-004: Flat Filesystem Archive Layout
Abstract
This RFC explains how the system selects, downloads, and stores transcript assets derived from RSS metadata, including error handling, content-type reconciliation, and delay controls.
Problem Statement
Transcript URLs vary in format, may expose inaccurate MIME types, and can include duplicates. The pipeline must consistently choose the best candidate, download it reliably, and save it under deterministic filenames while respecting user controls (dry-run, skip-existing, delay-ms).
Constraints & Assumptions
- The HTTP stack uses `requests` with retry-enabled adapters (RFC-004 covers filesystem interactions). Retry counts and backoff factors are configurable via `Config` fields (`http_retry_total`, `http_backoff_factor`, `rss_retry_total`, `rss_backoff_factor`) with resilient defaults. An additional application-level episode retry (`episode_retry_max`, default 1) re-runs the full episode download on transient network errors after urllib3 retries are exhausted. These settings, along with optional per-host pacing, a circuit breaker, and RSS conditional GET (Issue #522), are documented in CONFIGURATION.md under Download Resilience.
- Downloads should not halt the overall pipeline when individual episodes fail. The end-of-run `failure_summary` in `run.json` aggregates failures by error type for triage.
- User-specified preferences (`prefer_type`) are honored when selecting candidates.
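The application-level episode retry described above can be illustrated with a minimal sketch. It is not the project's implementation: `run_with_episode_retry`, `retry_delay_s`, and the choice of `ConnectionError`/`TimeoutError` as the "transient" exception types are assumptions made for illustration. Only the `episode_retry_max` name and its default of 1 come from this RFC.

```python
import time
from typing import Callable

# Stand-ins for "transient network errors"; the exception types the real
# pipeline treats as transient are not specified in this RFC.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def run_with_episode_retry(
    download_episode: Callable[[], bool],
    episode_retry_max: int = 1,
    retry_delay_s: float = 0.0,  # hypothetical pause between attempts
) -> bool:
    """Run the download once, then retry up to episode_retry_max more times."""
    for attempt in range(episode_retry_max + 1):
        try:
            return download_episode()
        except TRANSIENT_ERRORS:
            if attempt == episode_retry_max:
                # Retries exhausted: report failure instead of raising so the
                # rest of the pipeline keeps going.
                return False
            time.sleep(retry_delay_s)
    return False
```

With the default of 1, a download that fails once with a transient error gets exactly one more full attempt before the episode is counted as failed.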
Design & Implementation
- Candidate selection
  - `rss_parser.find_transcript_urls` provides candidates; `choose_transcript_url` applies preference ordering.
  - The preference list is compared case-insensitively against MIME types and URL suffixes.
- Download execution
  - `episode_processor.process_transcript_download` fetches bytes via `downloader.http_get` (streaming with progress updates).
  - Content-Type headers inform extension inference alongside declared types and URL heuristics.
- Filename derivation
  - Base pattern: `<idx:04d> - <title_safe>[ _<run_suffix>]`.
  - Extension is re-evaluated post-download to capture the actual media type.
  - Idempotency: `--skip-existing` checks for any file matching the base pattern before download (supports historical runs with different extensions).
- Operational flags
  - `--dry-run`: logs the planned URL and destination path without network calls.
  - `--delay-ms`: optional sleep between episodes to respect rate limits.
- Error handling
  - Network exceptions log warnings and return `False`; the pipeline continues.
  - Filesystem errors are logged and treated as failures for that episode.
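The candidate-selection step above can be sketched as follows. This is an illustration under stated assumptions, not the behavior of `choose_transcript_url`: the function name `choose_by_preference`, the `(url, declared_mime_type)` candidate shape, substring matching on the MIME type, and the fallback to the first candidate are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Each candidate is (url, declared_mime_type); the declared type may be None.
Candidate = Tuple[str, Optional[str]]


def choose_by_preference(
    candidates: Sequence[Candidate],
    prefer_type: Sequence[str],
) -> Optional[Candidate]:
    """Return the first candidate matching the earliest preference.

    Falls back to the first candidate when no preference matches
    (an assumed policy), and to None when there are no candidates.
    """
    for pref in prefer_type:
        needle = pref.lower()
        for url, mime in candidates:
            # Case-insensitive match against the declared MIME type.
            mime_match = mime is not None and needle in mime.lower()
            # Ignore query string and fragment when checking the URL suffix.
            path = url.split("?", 1)[0].split("#", 1)[0].lower()
            suffix_match = path.endswith(needle)
            if mime_match or suffix_match:
                return (url, mime)
    return candidates[0] if candidates else None
```

Iterating over preferences in the outer loop means an earlier preference always wins over a later one, regardless of the order in which the feed lists its candidates.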
Key Decisions
- Extension inference after download ensures actual content type is reflected even when RSS metadata is wrong.
- Glob-based skip check allows for transcripts saved with different extensions or run suffixes while still preventing duplicate work.
- Progress integration uses the shared factory to keep the download UI consistent with other operations.
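A glob-based skip check of the kind described above might look like the sketch below. The helper names `base_name` and `transcript_exists` are hypothetical, the separator before the run suffix is assumed to be a single underscore, and the sketch only covers the "different extension" case (matching across run suffixes would need a broader pattern).

```python
import glob
from pathlib import Path


def base_name(idx: int, title_safe: str, run_suffix: str = "") -> str:
    """Build the extension-less base name: '<idx:04d> - <title_safe>[_<run_suffix>]'."""
    name = f"{idx:04d} - {title_safe}"
    return f"{name}_{run_suffix}" if run_suffix else name


def transcript_exists(output_dir: Path, base: str) -> bool:
    """True if any file named '<base>.<ext>' exists, whatever the extension."""
    # glob.escape keeps characters like '[' in a title from acting as wildcards.
    pattern = glob.escape(base) + ".*"
    return any(p.is_file() for p in output_dir.glob(pattern))
```

Escaping the base name matters because episode titles can contain glob metacharacters; without it, a title such as `Pilot [live]` would be interpreted as a character class and the check could miss an existing file.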
Alternatives Considered
- Strict MIME enforcement: Requiring matching MIME types would drop transcripts due to inconsistent feeds; rejected in favor of heuristics.
- Always overwriting files: Rejected to maintain auditability and support manual curation between runs.
Testing Strategy
- Unit tests cover `derive_transcript_extension` edge cases.
- Integration tests simulate HTTP responses with various headers to ensure extension selection behaves as expected.
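To make the kind of edge cases concrete, here is a minimal sketch of post-download extension inference. It is not `derive_transcript_extension` itself: the function `infer_extension`, the MIME-to-extension table, the precedence order (response header, then declared type, then URL suffix, then a default), and the `.txt` default are all assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping; the project's real table may differ. Generic types
# such as text/plain are left out so they fall through to the URL suffix.
MIME_TO_EXT = {
    "text/vtt": ".vtt",
    "application/x-subrip": ".srt",
    "application/srt": ".srt",
    "application/json": ".json",
    "text/html": ".html",
}
KNOWN_EXTS = set(MIME_TO_EXT.values()) | {".txt"}


def _ext_from_mime(value: Optional[str]) -> Optional[str]:
    if not value:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = value.split(";", 1)[0].strip().lower()
    return MIME_TO_EXT.get(mime)


def infer_extension(
    declared_type: Optional[str],
    content_type_header: Optional[str],
    url: str,
    default: str = ".txt",
) -> str:
    """Assumed precedence: response header, declared type, URL suffix, default."""
    for source in (content_type_header, declared_type):
        ext = _ext_from_mime(source)
        if ext:
            return ext
    # urlparse strips the query string, so "?dl=1" does not affect the suffix.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix if suffix in KNOWN_EXTS else default
```

Cases worth covering in unit tests under these assumptions include a header with a charset parameter, a missing header with a useful declared type, an uninformative header combined with an uppercase URL suffix, and a URL with no suffix at all.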
Rollout & Monitoring
- Logging includes success, failure, and skip events for traceability.
- `run.json` includes a `failure_summary` when episodes fail, grouping failures by error type with counts and episode IDs.
- Future enhancements (e.g., checksum validation) can extend this RFC without breaking existing behavior.
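The grouping described for `failure_summary` can be sketched as below. The exact JSON shape in `run.json` is not given in this RFC, so the `count` and `episode_ids` keys, the function name `build_failure_summary`, and the error-type strings in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_failure_summary(failures: Iterable[Tuple[str, str]]) -> Dict[str, dict]:
    """Group (episode_id, error_type) pairs by error type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for episode_id, error_type in failures:
        grouped[error_type].append(episode_id)
    # Sort by error type so the summary is stable between runs.
    return {
        error_type: {"count": len(ids), "episode_ids": ids}
        for error_type, ids in sorted(grouped.items())
    }
```

For example, passing `[("ep-3", "timeout"), ("ep-7", "http_404"), ("ep-9", "timeout")]` yields one entry for `http_404` with a count of 1 and one for `timeout` with a count of 2.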
References
- Source: `podcast_scraper/episode_processor.py`
- HTTP stack: `docs/rfc/RFC-004-filesystem-layout.md` and `podcast_scraper/downloader.py`
- Orchestration: `docs/rfc/RFC-001-workflow-orchestration.md`