ADR-074: Multi-Feed Corpus Parent Layout and Machine-Readable Manifest
- Status: Accepted
- Date: 2026-04-11
- Authors: Podcast Scraper Team
- Related RFCs: RFC-063, RFC-004
- Related PRDs: (tracked via issues #440, #444)
Context & Problem Statement
Single-feed output layout (per ADR-003 and
ADR-004) does not define how N feeds share one
corpus parent, how append/resume keeps stable per-feed workspaces, or how semantic index,
gi explore, and podcast serve discover metadata when episodes live under
feeds/<stable_feed_id>/… instead of a flat metadata/ tree at the corpus root.
Without a recorded decision, contributors could reintroduce duplicate indexes per feed, break
composite (feed_id, episode_id) keys, or treat manifest files as ad hoc logs rather than
operational inputs for CLI, HTTP (corpus_metrics), and dashboards.
Decision
- Layout A (corpus parent): For multi-feed configurations, the corpus parent is the
authoritative root. It contains feeds/<stable_feed_id>/ subtrees (each preserving existing
pipeline folder semantics) and, when built, a unified search/ index at the parent per
RFC-061 / RFC-063.
- Explicit parent for N > 1: Two or more feeds require an explicit corpus parent
output_dir; the tool must not infer the parent from only the first RSS URL.
- Unified discovery: Indexing, exploration, and server-side corpus routes use recursive
metadata discovery and composite keys so GI, KG, and search operate on all feeds under one
parent without duplicating artifacts.
- Machine-readable summaries: Where implemented, corpus_manifest.json and optional
corpus_run_summary.json at the corpus parent are normative operational artifacts for
throughput and status (documented in CORPUS_MULTI_FEED_ARTIFACTS.md), not informal debug dumps.
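The unified discovery rule can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the assumption that episode metadata lives in JSON files inside a folder named metadata, and the episode_id field inside those files are all hypothetical. Only the feeds/<stable_feed_id>/ layout and the composite (feed_id, episode_id) key come from this ADR.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple

CompositeKey = Tuple[str, str]  # (feed_id, episode_id)


def discover_episode_metadata(
    corpus_parent: Path,
) -> Iterator[Tuple[CompositeKey, Path]]:
    """Yield ((feed_id, episode_id), metadata_path) for every feed under the parent.

    Assumed (hypothetical) conventions: metadata files are *.json files whose
    direct parent folder is named "metadata", and each file is a JSON object
    with an "episode_id" field.
    """
    feeds_root = corpus_parent / "feeds"
    if not feeds_root.is_dir():
        return
    for feed_dir in sorted(p for p in feeds_root.iterdir() if p.is_dir()):
        feed_id = feed_dir.name
        # Recursive search: metadata may sit at any depth below the feed folder.
        for meta_path in sorted(feed_dir.rglob("*.json")):
            if meta_path.parent.name != "metadata":
                continue
            try:
                data = json.loads(meta_path.read_text(encoding="utf-8"))
            except (OSError, json.JSONDecodeError):
                continue  # unreadable or malformed files are skipped in this sketch
            episode_id = data.get("episode_id") if isinstance(data, dict) else None
            if episode_id:
                yield (feed_id, str(episode_id)), meta_path


def build_index(corpus_parent: Path) -> Dict[CompositeKey, Path]:
    """Build one parent-level mapping; the first path seen for a key wins."""
    index: Dict[CompositeKey, Path] = {}
    for key, path in discover_episode_metadata(corpus_parent):
        index.setdefault(key, path)
    return index
```

Because the key includes the feed identifier, two feeds that happen to reuse the same episode_id do not collide, and a single mapping at the parent can stand in for per-feed indexes.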
Rationale
- One corpus, one index: Operators and the viewer expect a single “Corpus root folder” for
serve and search; Layout A preserves that mental model.
- Append safety: Stable per-feed directories under append avoid unbounded run_* churn and
align skip/retry with artifact validation and episode_id.
- HTTP and UI alignment: Aggregates such as GET /api/corpus/stats and Dashboard charts
consume the same parent-level artifacts as the CLI (RFC-071).
Alternatives Considered
- Separate output tree per feed with no shared parent: Rejected; breaks unified search, viewer corpus path, and cross-feed analytics.
- SQLite as primary resume store (this phase): Rejected for v1 of RFC-063; filesystem and
index.json remain the source of truth (RFC-051 remains future).
- Infer corpus parent from first feed only: Rejected; ambiguous with multiple feeds and unsafe for automation.
Consequences
- Positive: Clear boundary for server, indexer, and viewer; manifest files become reviewable contracts in PRs and docs.
- Negative: More paths to test (recursive discovery, composite keys); docs must stay aligned with CORPUS_MULTI_FEED_ARTIFACTS.md.
- Neutral: Single-feed behavior remains backward compatible when N = 1.
Implementation Notes
- Docs: CORPUS_MULTI_FEED_ARTIFACTS.md, RFC-063.
- Server: src/podcast_scraper/server/routes/corpus_metrics.py (parent-relative manifest and
run.json discovery).
- Config / CLI: multi-feed feeds: list, --append, corpus parent validation.
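The corpus parent validation mentioned above could look roughly like the following sketch. The names resolve_corpus_parent and CorpusParentError are hypothetical and do not come from the codebase; the only rule taken from this ADR is that two or more feeds require an explicit output_dir and that the parent is never inferred from the first RSS URL. Single-feed defaults are outside the scope of the sketch, so it returns None in that case to signal that the existing single-feed behavior applies.

```python
from pathlib import Path
from typing import Optional, Sequence


class CorpusParentError(ValueError):
    """Raised when a multi-feed configuration lacks an explicit corpus parent."""


def resolve_corpus_parent(
    feeds: Sequence[str], output_dir: Optional[str]
) -> Optional[Path]:
    """Return the explicit corpus parent, or None when a single feed has none.

    Hypothetical helper: with N > 1 feeds, a missing output_dir is an error
    rather than a cue to derive a path from the first feed URL.
    """
    if not feeds:
        raise CorpusParentError("at least one feed is required")
    if len(feeds) > 1 and not output_dir:
        raise CorpusParentError(
            f"{len(feeds)} feeds configured: an explicit corpus parent "
            "output_dir is required"
        )
    # N = 1 without output_dir: leave the existing single-feed default in place.
    return Path(output_dir) if output_dir else None
```

Failing fast here keeps automation from silently writing several feeds under a directory named after whichever feed happened to be listed first.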