RSS and feed ingestion guide¶

This guide describes how RSS (and Atom-style) podcast feeds are fetched, cached, parsed, and turned into Episode objects before transcript download and downstream stages. It is the canonical description of ingestion for URL-based feeds today. Other content sources (manual catalogs, APIs, non-RSS manifests) may be added later; document them alongside this path rather than overloading this guide.

Scope¶

In scope	Out of scope (linked elsewhere)
Feed URL configuration, multi-feed layout	Whisper / LLM providers — ML Provider Reference, Provider Implementation
HTTP fetch for feed XML and policy (retries, Issue #522)	Transcript/media byte download after episodes exist — RFC-003: Transcript download processing
Safe XML parsing, item selection, `Episode` construction	Full filesystem layout — RFC-004, ADR-003
Feed-level disk cache vs conditional GET cache	Eval materialization — EXPERIMENT_GUIDE

End-to-end flow (single feed)¶

For one Config with a single rss_url, the pipeline does the following early in workflow.orchestration.run_pipeline (scraping + parsing stages):

fetch_and_parse_feed (workflow/stages/scraping.py): resolve feed XML bytes (cache or HTTP), then parse_rss_items (rss/parser.py) → RssFeed (title, authors, items, base_url).
extract_feed_metadata_for_generation: reuse the same XML bytes for optional feed-level metadata (description, image, last updated) without a second network fetch.
prepare_episodes_from_feed: apply episode selection (order, date window, offset, cap), then create_episode_from_item per retained item → list of Episode models.
Later stages use those episodes for transcript discovery, download, transcription, metadata, summaries, GIL, KG, etc.

Multi-feed mode repeats the same inner sequence once per feed URL with a per-feed output_dir; see Multi-feed corpus below.

flowchart LR
  subgraph ingress["RSS ingress"]
    A[Config rss_url] --> B[feed_cache optional]
    B --> C[fetch_rss_feed_url]
    C --> D[parse_rss_items]
    D --> E[RssFeed]
    E --> F[prepare_episodes_from_feed]
    F --> G[Episode list]
  end
  G --> H[episode_processor / downstream]

Module map¶

Concern	Primary code	Notes
Orchestration	`workflow/orchestration.py`, `workflow/stages/scraping.py`	Stages `scraping` / `parsing` timings in `metrics.json`
Feed HTTP GET	`rss/downloader.py` — `fetch_rss_feed_url`	RSS-tuned session; optional If-None-Match / If-Modified-Since (Issue #522)
Media/transcript HTTP	`rss/downloader.py` — `fetch_url`, `http_get`, …	Different retry defaults than feed GET
HTTP policy	`rss/http_policy.py`	Throttle, Retry-After, circuit breaker, conditional GET body cache
Feed XML disk cache	`rss/feed_cache.py`	When `PODCAST_SCRAPER_RSS_CACHE_DIR` is set
XML parsing	`rss/parser.py`	`defusedxml` — ADR-002
Episode selection	`prepare_episodes_from_feed`	Order, dates, offset, `max_episodes` — Episode selection
Transcript URL discovery	`find_transcript_urls`, `choose_transcript_url`	`prefer_type` / `prefer_types` in `Config`
Public re-exports	`rss/__init__.py`	Stable import surface for `fetch_rss_feed_url`, `parse_rss_items`, etc.

Configuring feed URLs¶

Single feed: rss: / rss_url in YAML, or positional / --rss on the CLI.
Multiple feeds: feeds: / rss_urls: in operator YAML (URL strings or objects with url + optional per-feed overrides), corpus feeds.spec.yaml / JSON (--feeds-spec), or legacy --rss-file line list — with a corpus parent output_dir; each feed writes under feeds/<stable_name>/. See CONFIGURATION.md — RSS and multi-feed corpus, RFC-063, RFC-077.
Normalization: Request URLs go through normalize_url in the downloader (encoding, scheme/host handling) before cache keys and some logs.
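To make the normalization step concrete, here is a minimal sketch of the kind of cleanup it might perform. `normalize_url_sketch` is a hypothetical stand-in, not the project's `normalize_url`; the exact rules (which characters are escaped, how the host is handled) are assumptions for illustration.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url_sketch(url: str) -> str:
    """Hypothetical stand-in for normalize_url; the real rules may differ."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    # Lowercase the whole netloc (host and port); good enough for a sketch.
    netloc = parts.netloc.lower()
    # Percent-encode unsafe path characters. "%" is kept safe so that
    # already-escaped sequences are not encoded a second time.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=~-._")
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


a = normalize_url_sketch("HTTPS://Example.COM/feeds/my show.xml")
b = normalize_url_sketch("https://example.com/feeds/my%20show.xml")
print(a)
print(a == b)
```

Two spellings of the same feed URL collapse to one string here, which is the property a cache key needs.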

CLI feed resolution (positional, --rss, --rss-file, --feeds-spec, config merge) lives in cli.py; programmatic use builds Config directly or via load_config_file().

CLI recipe: profile + operator YAML + feed file¶

For multi-feed (and single-feed) runs, the usual argv split is:

--profile <name> — Packaged defaults from config/profiles/<name>.yaml.
--config path/to/operator.yaml — Thin operator file: output_dir, max_episodes, download flags, etc. (overrides the profile). You do not need to copy the whole profile into this file.
--feeds-spec path/to/feeds.yaml — Structured { feeds: [...] } document (same shape as corpus feeds.spec.yaml). Legacy alternative: --rss-file (one URL per line) or a feed URL on the command line.

python -m podcast_scraper.cli \
  --profile cloud_balanced \
  --config config/manual/operator_defaults.yaml \
  --feeds-spec config/manual/feeds.spec.registry_10.yaml

Same pattern in the root README.md, CLI.md — Quick Start, and CONFIGURATION.md — Multi-feed compose.
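The layering in this recipe can be pictured as a dictionary merge. The following is a minimal sketch, not the project's loader: `layer_config` is a hypothetical helper, the shallow merge is an assumption, and the sample values are invented.

```python
from typing import Any, Dict, List, Union

# A feed entry is either a bare URL string or an object with a "url" key.
FeedEntry = Union[str, Dict[str, Any]]


def layer_config(
    profile: Dict[str, Any],
    operator: Dict[str, Any],
    feeds: List[FeedEntry],
) -> Dict[str, Any]:
    """Hypothetical merge: operator values override profile defaults."""
    merged = {**profile, **operator}
    # Per-feed overrides are ignored in this sketch; only the URL is kept.
    merged["rss_urls"] = [f["url"] if isinstance(f, dict) else f for f in feeds]
    return merged


profile = {"max_episodes": 50, "transcribe_missing": True}
operator = {"output_dir": "./corpus", "max_episodes": 5}
feeds = [
    "https://example.com/a.xml",
    {"url": "https://example.com/b.xml", "max_episodes": 2},
]
config = layer_config(profile, operator, feeds)
print(config)
```

The operator file stays thin because anything it omits falls through to the profile.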

HTTP layer for the feed document¶

fetch_rss_feed_url uses a thread-local RSS session with urllib3 Retry tuned for RSS (rss_retry_total, rss_backoff_factor in Download resilience).
configure_downloader and configure_http_policy run at pipeline start from orchestration._setup_pipeline_environment; optional Issue #522 behavior (per-host spacing, circuit breaker, RSS conditional GET) applies here.
304 Not Modified: When conditional GET is enabled and validators match, the downloader may synthesize a 200 response from the on-disk conditional cache. If the cache has no body for a 304, the fetch is treated as a failure (warning + None). See Download resilience and Troubleshooting in CONFIGURATION.md.
Failure: A None response from the fetch path surfaces as a hard error when building the feed; the pipeline does not continue without a parsable feed.
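The 304 handling described above can be sketched without any network access. This is an illustration under assumptions: `CachedFeed`, `conditional_headers`, and `resolve_body` are hypothetical names, and the real downloader may structure its cache entry and response handling differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedFeed:
    """Hypothetical conditional-cache entry: validators plus the stored body."""

    etag: Optional[str] = None
    last_modified: Optional[str] = None
    body: Optional[bytes] = None


def conditional_headers(entry: Optional[CachedFeed]) -> Dict[str, str]:
    """Build request headers from whatever validators are cached."""
    headers: Dict[str, str] = {}
    if entry is None:
        return headers
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def resolve_body(status: int, body: bytes, entry: Optional[CachedFeed]) -> Optional[bytes]:
    """Return feed bytes, reusing the cached body on 304; None means failure."""
    if status == 200:
        return body
    if status == 304 and entry is not None and entry.body:
        return entry.body
    # A 304 without a cached body (or any other status) is treated as a failure.
    return None


entry = CachedFeed(etag='"abc123"', body=b"<rss/>")
print(conditional_headers(entry))
print(resolve_body(304, b"", entry))
print(resolve_body(304, b"", CachedFeed(etag='"abc123"')))
```

Returning None for a 304 with no stored body mirrors the guide's rule that such a fetch counts as a failure rather than an empty feed.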

Two different “RSS caches”¶

Do not confuse these two mechanisms. They can share a directory: when rss_cache_dir is unset, the conditional GET cache uses PODCAST_SCRAPER_RSS_CACHE_DIR as its base.

Mechanism	Env / config	What is stored	When it runs
Feed XML cache	`PODCAST_SCRAPER_RSS_CACHE_DIR`	Full RSS XML blob per normalized URL (`rss_*.xml` under cache dir)	Before parse, in `feed_cache.read_cached_rss` / `write_cached_rss`
Conditional GET cache	`rss_conditional_get` + `rss_cache_dir` or same env	ETag / Last-Modified metadata + body for validators	Around HTTP in `fetch_rss_feed_url`

Details: CONFIGURATION.md — RSS feed cache and Download resilience (conditional GET table).
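As an illustration of the feed XML cache only, the sketch below stores one XML blob per normalized URL under an `rss_*.xml` name. The hashing scheme and the helper names (`cache_path`, `read_cached`, `write_cached`) are assumptions; the source states only the filename pattern, not how the key is derived.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: Path, normalized_url: str) -> Path:
    # Assumption: the key is a truncated SHA-256 of the normalized URL.
    digest = hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"rss_{digest}.xml"


def read_cached(cache_dir: Path, normalized_url: str) -> Optional[bytes]:
    """Return cached XML bytes, or None on a cache miss."""
    path = cache_path(cache_dir, normalized_url)
    return path.read_bytes() if path.is_file() else None


def write_cached(cache_dir: Path, normalized_url: str, xml_bytes: bytes) -> Path:
    """Store the XML blob and return the file it was written to."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, normalized_url)
    path.write_bytes(xml_bytes)
    return path


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "rss-cache"
    url = "https://example.com/feed.xml"
    miss = read_cached(cache_dir, url)  # nothing stored yet
    written = write_cached(cache_dir, url, b"<rss/>")
    hit = read_cached(cache_dir, url)  # same URL now hits the cache
    cached_name = written.name
    print(miss, hit, cached_name)
```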

Parsing and security¶

Parsing uses defusedxml to mitigate XML bombs and unsafe expansions (ADR-002).
parse_rss_items returns feed title, channel-level authors, and a list of items (episodes).
Publication dates are extracted with extract_episode_published_date and related helpers, which also drive episode_since / episode_until filtering. Items without a parseable date are kept (and logged) when a date filter is active; see the episode selection notes in CONFIGURATION.md.
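A small sketch of safe parsing plus tolerant date extraction follows. It is not the project's parser: `published_date` is a hypothetical helper, the sample feed is invented, and the fallback to the standard library parser exists only so the snippet runs where defusedxml is not installed (that fallback is not hardened).

```python
from datetime import date
from email.utils import parsedate_to_datetime
from typing import Optional

try:
    from defusedxml.ElementTree import fromstring
except ImportError:  # demo-only fallback; lacks defusedxml's protections
    from xml.etree.ElementTree import fromstring

SAMPLE = b"""<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Show</title>
  <item><title>Ep 2</title><pubDate>Tue, 02 Jan 2024 10:00:00 GMT</pubDate></item>
  <item><title>Ep 1</title><pubDate>not a date</pubDate></item>
</channel></rss>"""


def published_date(item) -> Optional[date]:
    """Return the item's calendar date, or None when pubDate is unusable."""
    raw = item.findtext("pubDate")
    if not raw:
        return None
    try:
        return parsedate_to_datetime(raw.strip()).date()
    except (TypeError, ValueError):
        # Unparseable dates are reported as None instead of raising.
        return None


root = fromstring(SAMPLE)
channel = root.find("channel")
feed_title = channel.findtext("title")
dates = [published_date(item) for item in channel.findall("item")]
print(feed_title, dates)
```

Returning None for a bad date lets the caller decide what to do with undated items instead of failing the whole feed.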

Episode selection (order, window, offset, cap)¶

Order of operations in prepare_episodes_from_feed:

Start from the feed’s item list in document order, which is typically newest-first.
episode_order: oldest — reverse the list.
Optional episode_since / episode_until — calendar-date filter on parsed pubDate; items with missing dates remain in the set (warning).
episode_offset — skip N items after the above.
max_episodes — keep at most N items.
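The order of operations above can be sketched as a single function. This is an illustration, not `prepare_episodes_from_feed` itself: `Item` and `select_items` are hypothetical, the inclusive date bounds are an assumption, and `"newest"` is an assumed name for the default order.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Item:
    title: str
    published: Optional[date]  # None when pubDate could not be parsed


def select_items(
    items: List[Item],
    *,
    episode_order: str = "newest",
    episode_since: Optional[date] = None,
    episode_until: Optional[date] = None,
    episode_offset: int = 0,
    max_episodes: Optional[int] = None,
) -> List[Item]:
    selected = list(items)  # 1. start from document order
    if episode_order == "oldest":
        selected.reverse()  # 2. optional reversal
    if episode_since or episode_until:  # 3. calendar-date window

        def in_window(item: Item) -> bool:
            if item.published is None:
                return True  # undated items stay in the set
            if episode_since and item.published < episode_since:
                return False
            if episode_until and item.published > episode_until:
                return False
            return True

        selected = [item for item in selected if in_window(item)]
    selected = selected[episode_offset:]  # 4. skip N items
    if max_episodes is not None:
        selected = selected[:max_episodes]  # 5. cap the result
    return selected


feed = [
    Item("e4", date(2024, 4, 1)),
    Item("e3", None),
    Item("e2", date(2024, 2, 1)),
    Item("e1", date(2024, 1, 1)),
]
picked = select_items(
    feed,
    episode_order="oldest",
    episode_since=date(2024, 2, 1),
    episode_offset=1,
    max_episodes=2,
)
print([item.title for item in picked])
```

Because the offset and cap run after the date filter, they count only the items that survived the window.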

Reference: CONFIGURATION.md — Episode selection (GitHub #521).

From RSS item to `Episode`¶

create_episode_from_item assigns a stable index (idx) in the selected list, resolves links, enclosures, and text fields, and sets base_url from the resolved feed URL when needed.
Transcript candidates: find_transcript_urls gathers links (including podcast:transcript and generic link discovery); choose_transcript_url applies prefer_type ordering.
Media: find_enclosure_media supports downstream transcription when transcripts are missing and transcribe_missing is enabled.
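Preference-ordered selection among transcript candidates might look like the sketch below. `choose_by_preference` is a hypothetical helper, not `choose_transcript_url`; matching on a MIME substring or URL suffix, and falling back to the first candidate, are assumptions about how such a preference could be applied.

```python
from typing import List, Optional, Sequence, Tuple

# A candidate is (url, MIME type or None), as might be gathered from an item.
Candidate = Tuple[str, Optional[str]]


def choose_by_preference(
    candidates: List[Candidate], prefer_types: Sequence[str]
) -> Optional[str]:
    """Return the first candidate matching the earliest preferred type."""
    if not candidates:
        return None
    for wanted in prefer_types:
        wanted = wanted.lower()
        for url, mime in candidates:
            matches_mime = mime is not None and wanted in mime.lower()
            matches_suffix = url.lower().endswith("." + wanted)
            if matches_mime or matches_suffix:
                return url
    # Assumption: with no preferred match, fall back to the first candidate.
    return candidates[0][0]


candidates = [
    ("https://example.com/ep1.srt", "application/x-subrip"),
    ("https://example.com/ep1.vtt", "text/vtt"),
    ("https://example.com/ep1.json", "application/json"),
]
print(choose_by_preference(candidates, ["vtt", "srt"]))
print(choose_by_preference(candidates, ["html"]))
```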

Full download and filename rules: RFC-003.

Multi-feed corpus (GitHub #440)¶

When rss_urls (or merged feeds) has two or more URLs:

cli.main and service.run loop over the feeds sequentially; failures are isolated per feed, so one failing feed does not abort the others.
Each inner run uses rss_url set to one URL and output_dir set to feeds/<stable_id>/ under the corpus parent (see filesystem.corpus_feed_output_dir).
Optional parent-level corpus_manifest.json, corpus_run_summary.json, and a single vector index pass when configured — see ARCHITECTURE.md and RFC-063.

Append / resume (#444) interacts with episode_id and on-disk metadata; it does not change RSS fetch semantics, only which episodes are skipped after selection.
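The sequential loop with per-feed isolation can be sketched as follows. Everything here is illustrative: `feed_output_dir`, `run_corpus`, and `fake_run` are hypothetical, and deriving the directory name from a URL hash is an assumption, not the behavior of `filesystem.corpus_feed_output_dir`.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def feed_output_dir(corpus_dir: Path, rss_url: str) -> Path:
    # Assumption: a truncated URL hash serves as the stable per-feed id.
    stable_id = hashlib.sha256(rss_url.encode("utf-8")).hexdigest()[:12]
    return corpus_dir / "feeds" / stable_id


def run_corpus(
    rss_urls: List[str],
    corpus_dir: Path,
    run_single_feed: Callable[[str, Path], None],
) -> Dict[str, str]:
    """Run each feed in turn; a failure in one feed does not stop the rest."""
    results: Dict[str, str] = {}
    for rss_url in rss_urls:
        out_dir = feed_output_dir(corpus_dir, rss_url)
        try:
            run_single_feed(rss_url, out_dir)
            results[rss_url] = "ok"
        except Exception as exc:  # isolate the failure to this feed
            results[rss_url] = f"failed: {exc}"
    return results


def fake_run(rss_url: str, out_dir: Path) -> None:
    # Stand-in for a single-feed pipeline run; fails for one URL on purpose.
    if "broken" in rss_url:
        raise RuntimeError("feed fetch returned None")


urls = ["https://example.com/a.xml", "https://example.com/broken.xml", "https://example.com/c.xml"]
summary = run_corpus(urls, Path("corpus"), fake_run)
print(summary)
```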

Operations and observability¶

Logs: Feed title and item counts after fetch/selection; cache hits log at INFO when using PODCAST_SCRAPER_RSS_CACHE_DIR.
Metrics: Stage timings scraping and parsing; download-resilience counters in metrics.json (see Experiment Guide).
Run artifacts: failure_summary in run.json aggregates episode failures after the full pipeline; RSS fetch failure aborts before episode processing.

Twelve-factor and deployment¶

Feed URLs and tuning are normal Config fields; secrets and per-deploy overrides typically come from the environment. See CONFIGURATION.md — Twelve-factor app alignment (config) and DOCKER_SERVICE_GUIDE for container-style injection.

Testing¶

Unit: Parser and URL choice tests under tests/unit/podcast_scraper/ (RSS / downloader).
Integration: Real HTTP against a local server — tests/integration/rss/test_http_integration.py (integration_http); workflow tests often mock fetch_url / fetch_rss_feed_url.
E2E: Fixture feeds and mock HTTP server — E2E Testing Guide.

Document	Why read it
CONFIGURATION.md	All RSS-related fields, multi-feed, episode selection, download resilience
CLI.md	Flags for RSS URL(s), episode selection, retries
PIPELINE_AND_WORKFLOW.md	Where RSS sits in the full pipeline
ARCHITECTURE.md	Multi-feed and orchestration summary
RFC-003: Transcript downloads	After `Episode` exists: choosing and fetching transcript bytes
ADR-002: XML security	Parsing threat model
ADR-003: Deterministic feed storage	Output path derivation from feed identity

Version: 1.0
Created: 2026-04-12 — consolidates RSS ingress for future non-RSS sources.