RSS and feed ingestion guide
This guide describes how RSS (and Atom-style) podcast feeds are fetched, cached, parsed, and
turned into Episode objects before transcript download and downstream stages. It is the
canonical ingress story for URL-based feeds today. Other content sources (manual catalogs,
APIs, non-RSS manifests) may be added later; they should be documented alongside this path rather
than overloading this document.
Scope
| In scope | Out of scope (linked elsewhere) |
|---|---|
| Feed URL configuration, multi-feed layout | Whisper / LLM providers — ML Provider Reference, Provider Implementation |
| HTTP fetch for feed XML and policy (retries, Issue #522) | Transcript/media byte download after episodes exist — RFC-003: Transcript download processing |
| Safe XML parsing, item selection, Episode construction | Full filesystem layout: RFC-004, ADR-003 |
| Feed-level disk cache vs conditional GET cache | Eval materialization — EXPERIMENT_GUIDE |
End-to-end flow (single feed)
For one `Config` with a single `rss_url`, the pipeline does the following early in
`workflow.orchestration.run_pipeline` (scraping + parsing stages):
- `fetch_and_parse_feed` (`workflow/stages/scraping.py`): resolve feed XML bytes (cache or HTTP), then `parse_rss_items` (`rss/parser.py`) → `RssFeed` (title, authors, items, `base_url`).
- `extract_feed_metadata_for_generation`: reuse the same XML bytes for optional feed-level metadata (description, image, last updated) without a second network fetch.
- `prepare_episodes_from_feed`: apply episode selection (order, date window, offset, cap), then `create_episode_from_item` per retained item → list of `Episode` models.
- Later stages use those episodes for transcript discovery, download, transcription, metadata, summaries, GIL, KG, etc.
Multi-feed mode repeats the same inner sequence once per feed URL with a per-feed
output_dir; see Multi-feed corpus below.
```mermaid
flowchart LR
    subgraph ingress["RSS ingress"]
        A[Config rss_url] --> B[feed_cache optional]
        B --> C[fetch_rss_feed_url]
        C --> D[parse_rss_items]
        D --> E[RssFeed]
        E --> F[prepare_episodes_from_feed]
        F --> G[Episode list]
    end
    G --> H[episode_processor / downstream]
```
Module map
| Concern | Primary code | Notes |
|---|---|---|
| Orchestration | `workflow/orchestration.py`, `workflow/stages/scraping.py` | Stages `scraping` / `parsing` timings in `metrics.json` |
| Feed HTTP GET | `rss/downloader.py`: `fetch_rss_feed_url` | RSS-tuned session; optional `If-None-Match` / `If-Modified-Since` (Issue #522) |
| Media/transcript HTTP | `rss/downloader.py`: `fetch_url`, `http_get`, … | Different retry defaults than feed GET |
| HTTP policy | `rss/http_policy.py` | Throttle, `Retry-After`, circuit breaker, conditional GET body cache |
| Feed XML disk cache | `rss/feed_cache.py` | When `PODCAST_SCRAPER_RSS_CACHE_DIR` is set |
| XML parsing | `rss/parser.py` | `defusedxml` (ADR-002) |
| Episode selection | `prepare_episodes_from_feed` | Order, dates, offset, `max_episodes`; see Episode selection |
| Transcript URL discovery | `find_transcript_urls`, `choose_transcript_url` | `prefer_type` / `prefer_types` in `Config` |
| Public re-exports | `rss/__init__.py` | Stable import surface for `fetch_rss_feed_url`, `parse_rss_items`, etc. |
Configuring feed URLs
- Single feed: `rss:` / `rss_url` in YAML, or positional / `--rss` on the CLI.
- Multiple feeds: `feeds:` / `rss_urls:` in operator YAML (URL strings or objects with `url` + optional per-feed overrides), corpus `feeds.spec.yaml` / JSON (`--feeds-spec`), or legacy `--rss-file` line list, with a corpus parent `output_dir`; each feed writes under `feeds/<stable_name>/`. See CONFIGURATION.md (RSS and multi-feed corpus), RFC-063, RFC-077.
- Normalization: Request URLs go through `normalize_url` in the downloader (encoding, scheme/host handling) before cache keys and some logs.
CLI feed resolution (positional, --rss, --rss-file, --feeds-spec, config merge) lives in cli.py;
programmatic use builds Config directly or via load_config_file().
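To make the normalization step concrete, here is a minimal sketch of what a URL normalizer of this kind might do. It is not the project's `normalize_url`: the function name `normalize_url_sketch` and the exact rules (lowercase scheme and host, drop default ports and fragments, percent-encode unsafe path characters) are assumptions for illustration, and the sketch ignores userinfo and IPv6 literals.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url_sketch(url: str) -> str:
    """Hypothetical stand-in for the downloader's normalize_url (rules are assumed)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it differs from the scheme default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    # Percent-encode unsafe path characters; "%" is kept safe so that
    # already-encoded sequences are not encoded a second time.
    path = quote(parts.path or "/", safe="/%:@!$&'()*+,;=")
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Because the result is stable under repeated application, it is suitable as a cache key: two spellings of the same feed URL map to one entry.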
CLI recipe: profile + operator YAML + feed file
For multi-feed (and single-feed) runs, the usual argv split is:
- `--profile <name>`: packaged defaults from `config/profiles/<name>.yaml`.
- `--config path/to/operator.yaml`: thin operator file with `output_dir`, `max_episodes`, download flags, etc. (overrides the profile). You do not need to copy the whole profile into this file.
- `--feeds-spec path/to/feeds.yaml`: structured `{ feeds: [...] }` document (same shape as corpus `feeds.spec.yaml`). Legacy alternative: `--rss-file` (one URL per line) or a feed URL on the command line.
```bash
python -m podcast_scraper.cli \
  --profile cloud_balanced \
  --config config/manual/operator_defaults.yaml \
  --feeds-spec config/manual/feeds.spec.registry_10.yaml
```
The same pattern appears in the root README.md, CLI.md (Quick Start), and CONFIGURATION.md (Multi-feed compose).
HTTP layer for the feed document
- `fetch_rss_feed_url` uses a thread-local RSS session with urllib3 `Retry` tuned for RSS (`rss_retry_total`, `rss_backoff_factor` in Download resilience).
- `configure_downloader` and `configure_http_policy` run at pipeline start from `orchestration._setup_pipeline_environment`; optional Issue #522 behavior (per-host spacing, circuit breaker, RSS conditional GET) applies here.
- 304 Not Modified: When conditional GET is enabled and validators match, the downloader may synthesize a 200 response from the on-disk conditional cache. If the cache has no body for a 304, the fetch is treated as a failure (warning + `None`). See Download resilience and Troubleshooting in CONFIGURATION.md.
- Failure: A `None` response from the fetch path surfaces as a hard error when building the feed (the pipeline does not continue without a parsable feed).
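The 304 handling above can be sketched without any network code. The example below is an illustration under stated assumptions, not the downloader's implementation: `CachedFeed`, `conditional_headers`, and `resolve_body` are hypothetical names, and only the header names (`If-None-Match`, `If-Modified-Since`) and the "304 without a cached body is a failure" rule come from this guide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedFeed:
    """Hypothetical cache record: validators from the last 200 response plus its body."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    body: Optional[bytes] = None


def conditional_headers(cached: Optional[CachedFeed]) -> Dict[str, str]:
    """Build request headers from stored validators (empty when nothing is cached)."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def resolve_body(status: int, body: bytes, cached: Optional[CachedFeed]) -> Optional[bytes]:
    """Return feed bytes, or None when the fetch must be treated as a failure."""
    if status == 200:
        return body
    if status == 304 and cached is not None and cached.body is not None:
        # Serve the cached body as if the server had returned a fresh 200.
        return cached.body
    # Covers a 304 with no cached body as well as any other status.
    return None
```

Returning `None` rather than raising mirrors the behavior described above, where the caller decides that a missing feed is a hard error.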
Two different “RSS caches”
Do not confuse these. Both can share one directory: when `rss_cache_dir` is unset and the
conditional GET cache falls back to `PODCAST_SCRAPER_RSS_CACHE_DIR` as its base, the two mechanisms write under the same path.
| Mechanism | Env / config | What is stored | When it runs |
|---|---|---|---|
| Feed XML cache | `PODCAST_SCRAPER_RSS_CACHE_DIR` | Full RSS XML blob per normalized URL (`rss_*.xml` under cache dir) | Before parse, in `feed_cache.read_cached_rss` / `write_cached_rss` |
| Conditional GET cache | `rss_conditional_get` + `rss_cache_dir` or same env | ETag / Last-Modified metadata + body for validators | Around HTTP in `fetch_rss_feed_url` |
Details: CONFIGURATION.md — RSS feed cache and Download resilience (conditional GET table).
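As an illustration of the feed XML cache row, the sketch below stores one XML blob per normalized URL under a cache directory. Only the `rss_*.xml` naming pattern comes from this guide; the SHA-256 key derivation, the helper names, and the write-then-rename step are assumptions, and the real `feed_cache` module may differ.

```python
import hashlib
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: Path, normalized_url: str) -> Path:
    # Assumption: the file name is derived from a hash of the normalized URL.
    digest = hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"rss_{digest}.xml"


def read_cached_feed(cache_dir: Path, normalized_url: str) -> Optional[bytes]:
    """Return cached XML bytes, or None on a cache miss."""
    path = cache_path(cache_dir, normalized_url)
    return path.read_bytes() if path.is_file() else None


def write_cached_feed(cache_dir: Path, normalized_url: str, xml_bytes: bytes) -> Path:
    """Write the blob to a temporary file, then rename it into place."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, normalized_url)
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(xml_bytes)
    tmp.replace(path)  # avoids leaving a half-written cache entry behind
    return path
```

Unlike the conditional GET cache, this mechanism never contacts the server on a hit, which is why the two should be reasoned about separately.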
Parsing and security
- Parsing uses `defusedxml` to mitigate XML bombs and unsafe expansions (ADR-002).
- `parse_rss_items` returns feed title, channel-level authors, and a list of items (episodes).
- Publication dates are extracted with `extract_episode_published_date` / helpers used for `episode_since` / `episode_until` filtering. Items without a parseable date are kept when a date filter is active (logged); see CONFIGURATION episode selection notes.
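The parsing points above can be illustrated with a small, self-contained sketch. It is not `parse_rss_items`: the function name, the returned shape, and the use of `email.utils.parsedate_to_datetime` for `pubDate` are assumptions. The stdlib fallback import exists only so the sketch runs where `defusedxml` is not installed; real ingestion of untrusted feeds should use the hardened parser.

```python
from email.utils import parsedate_to_datetime

try:
    from defusedxml.ElementTree import fromstring  # hardened parser, per ADR-002
except ImportError:
    # Fallback only so this sketch runs without the dependency; not for untrusted input.
    from xml.etree.ElementTree import fromstring

SAMPLE_FEED = b"""<rss version="2.0"><channel>
  <title>Demo</title>
  <item><title>Dated</title><pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate></item>
  <item><title>Undated</title><pubDate>not a date</pubDate></item>
</channel></rss>"""


def parse_feed_sketch(xml_bytes: bytes):
    """Return (feed title, list of item dicts with an optional published date)."""
    root = fromstring(xml_bytes)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("no <channel> element in feed")
    feed_title = (channel.findtext("title") or "").strip()
    items = []
    for item in channel.findall("item"):
        raw = item.findtext("pubDate")
        try:
            published = parsedate_to_datetime(raw).date() if raw else None
        except (TypeError, ValueError):
            # Unparseable dates become None instead of dropping the item.
            published = None
        items.append({"title": (item.findtext("title") or "").strip(), "published": published})
    return feed_title, items
```

Keeping undated items with `published=None` matches the behavior described above, where such items survive an active date filter.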
Episode selection (order, window, offset, cap)
Order of operations in `prepare_episodes_from_feed`:
- Start from the feed’s item list (document order, typically newest-first).
- With `episode_order: oldest`, reverse the list.
- Optional `episode_since` / `episode_until`: calendar-date filter on parsed `pubDate`; items with missing dates remain in the set (warning).
- `episode_offset`: skip N items after the above.
- `max_episodes`: keep at most N items.
Reference: CONFIGURATION.md — Episode selection (GitHub #521).
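The order of operations above matters because offset and cap apply to the already filtered list. The following sketch shows that ordering on plain data. It is not `prepare_episodes_from_feed`: the `Item` type, the `select_items` name, and the inclusive treatment of both date bounds are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class Item:
    """Hypothetical minimal feed item: a title and an optional publication date."""
    title: str
    published: Optional[date]


def select_items(
    items: Sequence[Item],
    *,
    oldest_first: bool = False,
    since: Optional[date] = None,
    until: Optional[date] = None,
    offset: int = 0,
    max_episodes: Optional[int] = None,
) -> List[Item]:
    selected = list(items)  # 1. document order, typically newest-first
    if oldest_first:
        selected.reverse()  # 2. episode_order: oldest
    if since is not None or until is not None:
        def in_window(item: Item) -> bool:
            if item.published is None:
                return True  # undated items stay in the set
            if since is not None and item.published < since:
                return False
            if until is not None and item.published > until:
                return False
            return True
        selected = [item for item in selected if in_window(item)]  # 3. date window
    selected = selected[offset:]  # 4. episode_offset
    if max_episodes is not None:
        selected = selected[:max_episodes]  # 5. max_episodes
    return selected
```

For example, with an offset of 1 and a `since` date, the item skipped by the offset is the first one that survived the date filter, not the first one in the feed.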
From RSS item to Episode
- `create_episode_from_item` assigns a stable index (`idx`) in the selected list, resolves links, enclosures, and text fields, and sets `base_url` from the resolved feed URL when needed.
- Transcript candidates: `find_transcript_urls` gathers links (including `podcast:transcript` and generic link discovery); `choose_transcript_url` applies `prefer_type` ordering.
- Media: `find_enclosure_media` supports downstream transcription when transcripts are missing and `transcribe_missing` is enabled.
Full download and filename rules: RFC-003.
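Preference-ordered transcript choice, as described for `choose_transcript_url`, might look roughly like the sketch below. The candidate shape `(url, declared type)`, the substring match on the declared type, and the fallback to the first candidate are all assumptions; the project's actual matching rules are documented in RFC-003 and may differ.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate shape: (url, declared MIME type or None).
Candidate = Tuple[str, Optional[str]]


def choose_transcript_sketch(
    candidates: Sequence[Candidate], prefer_types: Sequence[str]
) -> Optional[str]:
    """Pick the first candidate matching the earliest preferred type."""
    if not candidates:
        return None
    for wanted in prefer_types:
        wanted = wanted.lower()
        for url, declared_type in candidates:
            # Assumption: a preference matches when it appears in the declared type.
            if declared_type and wanted in declared_type.lower():
                return url
    # No preference matched: fall back to the first discovered candidate.
    return candidates[0][0]
```

Iterating over preferences in the outer loop makes the preference order win over the discovery order of the links.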
Multi-feed corpus (GitHub #440)
When `rss_urls` (or merged feeds) has two or more URLs:
- `cli.main` and `service.run` loop over feeds sequentially (isolated failures per feed).
- Each inner run uses `rss_url` set to one URL and `output_dir` set to `feeds/<stable_id>/` under the corpus parent (see `filesystem.corpus_feed_output_dir`).
- Optional parent-level `corpus_manifest.json`, `corpus_run_summary.json`, and a single vector index pass when configured; see ARCHITECTURE.md and RFC-063.
Append / resume (#444) interacts with episode_id and on-disk metadata; it does not change RSS fetch semantics, only which episodes are skipped after selection.
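The sequential loop with isolated failures can be sketched as follows. This is an illustration only: `feed_output_dir` is a hypothetical stand-in for `filesystem.corpus_feed_output_dir`, the host-plus-hash directory name is an assumed scheme, and `run_one` represents whatever callable performs a single-feed run.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional
from urllib.parse import urlsplit


def feed_output_dir(corpus_dir: Path, feed_url: str) -> Path:
    # Assumption: a stable directory name built from the host and a short URL hash.
    host = urlsplit(feed_url).hostname or "feed"
    digest = hashlib.sha256(feed_url.encode("utf-8")).hexdigest()[:8]
    return corpus_dir / "feeds" / f"{host}_{digest}"


def run_corpus(
    corpus_dir: Path,
    feed_urls: Iterable[str],
    run_one: Callable[[str, Path], None],
) -> Dict[str, Optional[str]]:
    """Run feeds one after another; map each URL to None (ok) or an error message."""
    results: Dict[str, Optional[str]] = {}
    for url in feed_urls:
        try:
            run_one(url, feed_output_dir(corpus_dir, url))
            results[url] = None
        except Exception as exc:  # one failing feed must not stop the others
            results[url] = str(exc)
    return results
```

A per-feed result map like this is the kind of information a parent-level run summary could aggregate.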
Operations and observability
- Logs: Feed title and item counts after fetch/selection; cache hits log at INFO when using `PODCAST_SCRAPER_RSS_CACHE_DIR`.
- Metrics: Stage timings `scraping` and `parsing`; download-resilience counters in `metrics.json` (see Experiment Guide).
- Run artifacts: `failure_summary` in `run.json` aggregates episode failures after the full pipeline; RSS fetch failure aborts before episode processing.
Twelve-factor and deployment
Feed URLs and tuning are normal Config fields; secrets and per-deploy overrides typically
come from the environment. See CONFIGURATION.md — Twelve-factor app alignment (config) and DOCKER_SERVICE_GUIDE for container-style injection.
Testing
- Unit: Parser and URL choice tests under `tests/unit/podcast_scraper/` (RSS / downloader).
- Integration: Real HTTP against a local server in `tests/integration/rss/test_http_integration.py` (`integration_http`); workflow tests often mock `fetch_url` / `fetch_rss_feed_url`.
- E2E: Fixture feeds and mock HTTP server; see E2E Testing Guide.
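For readers unfamiliar with the "real HTTP against a local server" style, the sketch below serves a fixture feed from a throwaway server on an ephemeral port and fetches it with the standard library. It does not reproduce the project's integration tests: the handler, the fixture bytes, and the use of `urllib` instead of `fetch_rss_feed_url` are assumptions made to keep the example self-contained.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

FIXTURE_FEED = b"<rss version='2.0'><channel><title>Fixture</title></channel></rss>"


class FixtureFeedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same fixture bytes for every path.
        self.send_response(200)
        self.send_header("Content-Type", "application/rss+xml")
        self.send_header("Content-Length", str(len(FIXTURE_FEED)))
        self.end_headers()
        self.wfile.write(FIXTURE_FEED)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def fetch_fixture_feed():
    """Start a local server, fetch the feed once, and return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), FixtureFeedHandler)  # port 0 = ephemeral
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
        with opener.open(f"http://127.0.0.1:{port}/feed.xml", timeout=5) as response:
            return response.status, response.read()
    finally:
        server.shutdown()
        server.server_close()
        thread.join(timeout=5)
```

Binding to port 0 lets the operating system pick a free port, so parallel test runs do not collide.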
Related documentation
| Document | Why read it |
|---|---|
| CONFIGURATION.md | All RSS-related fields, multi-feed, episode selection, download resilience |
| CLI.md | Flags for RSS URL(s), episode selection, retries |
| PIPELINE_AND_WORKFLOW.md | Where RSS sits in the full pipeline |
| ARCHITECTURE.md | Multi-feed and orchestration summary |
| RFC-003: Transcript downloads | After Episode exists: choosing and fetching transcript bytes |
| ADR-002: XML security | Parsing threat model |
| ADR-003: Deterministic feed storage | Output path derivation from feed identity |
Version: 1.0
Created: 2026-04-12 — consolidates RSS ingress for future non-RSS sources.