RFC-063: Multi-Feed Corpus, Append/Resume, and Unified Discovery
- Status: Completed (v2.6.0)
- Authors: Marko Dragoljevic
- Stakeholders: Maintainer
- Related PRDs:
- None — GitHub #440 and #444 track scope.
- Related ADRs:
- ADR-074 — Layout A parent, unified discovery, manifest as operational artifact
- ADR-051 — per-episode GI/KG artifacts; union at query time
- ADR-060 — vector index abstraction (RFC-061)
- ADR-061 — FAISS Phase 1 indexing
- Related RFCs:
- RFC-001 — pipeline orchestration
- RFC-004 — output directories and run scoping (extended by this RFC)
- RFC-007 — CLI and validation
- RFC-008 — configuration model
- RFC-049 — GIL artifacts
- RFC-055 — KG artifacts
- RFC-061 — semantic index; recursive corpus discovery and composite keys align here
- RFC-062: viewer and serve; corpus root path semantics
- Related UX specs:
- UXS-001 — GI/KG viewer (corpus root folder)
- Related Documents:
- GitHub #440 — multiple RSS feeds
- GitHub #444 — append / incremental resume
- GitHub #505 — unified corpus parent indexing (recursive metadata, composite keys, search/explore)
- GitHub #506 — corpus manifest, status CLI, machine-readable multi-feed summary
- CORPUS_MULTI_FEED_ARTIFACTS.md: normative corpus_manifest.json / corpus_run_summary.json (#506)
- docs/architecture/ARCHITECTURE.md: multi-feed outer loop documented (GitHub #440)
- docs/architecture/TESTING_STRATEGY.md: new test tiers as needed
- A two-feed feeds document passed with --feeds-spec; USE_FIXTURES=1 with make test-acceptance rewrites external feed URLs to the E2E fixture server when used from merged operator YAML
Abstract
This RFC specifies multi-feed ingestion (N RSS feeds in one config / CLI run), opt-in append/resume
(filesystem + index.json, no SQLite in this phase), a unified corpus layout (parent directory with
feeds/<stable_feed_id>/ subtrees), and corpus-wide discovery for semantic index, search, gi explore,
and serve so KG, GI, and search work across all feeds without duplicating artifacts.
Architecture Alignment: Additive to the modular pipeline: outer orchestration over feeds, inner
run_pipeline behavior and episode parallelism unchanged unless explicitly gated (e.g. vector auto-index).
Extends RFC-004 run/output semantics where append requires stable per-feed workspaces.
Problem Statement
Today the main CLI is built around one optional RSS URL; Config exposes a single rss_url. Large
overnight jobs need many feeds, isolated failures (one feed or episode must not block others), and
resumability after crashes or code fixes. The current run_* timestamped directories and the fact that
_finalize_pipeline (including index.json and maybe_index_corpus) is skipped when processing raises
make naive “rerun” and index-only resume fragile.
Downstream, RFC-061 indexing and RFC-062 viewer assume a corpus root with metadata/ at one level;
nested feeds/<id>/…/metadata/ requires recursive metadata discovery and path resolution relative to
each episode’s feed root (metadata_path.parent.parent).
Use Cases:
- Overnight multi-feed run: Ten feeds, one config; failures logged per feed; next morning, a rerun with --append completes only missing or failed episodes.
- Unified search: One FAISS index under <corpus>/search; filter by feed_id from existing metadata.
- Viewer: Same “Corpus root folder” as serve --output-dir pointing at parent; list/load GI/KG across feeds (artifacts API already uses rglob; search tab needs parent index).
- Fix and resume: After a bugfix, append run skips completed work keyed by episode_id + artifact validation, not only path globs.
Goals
- Multi-feed CLI and config: Backward-compatible single rss; add repeatable --rss and/or @file; config allows a list of feeds without duplicating the full config object.
- Layout A: Corpus parent with feeds/<stable_feed_id>/; each subtree keeps transcripts/, metadata/, manifests, and per-run artifacts consistent with existing pipeline outputs.
- Opt-in append: Stable effective_output_dir per feed under append (no unbounded new run_<timestamp> churn); skip/retry driven by artifact validation + episode_id, with index.json as accelerator.
- Failure isolation: One failing feed does not abort others; episode failures respect fail_fast / max_failures within a feed.
- Unified discovery: index_corpus, search CLI helpers, gi explore metadata maps, and documented serve behavior work with corpus parent + recursive metadata + composite (feed_id, episode_id) keys in the vector index fingerprint map.
- Exit semantics: Non-zero if any feed failed; structured per-feed summary in logs (and optional machine-readable summary; see §7).
Constraints & Assumptions
Constraints:
- Backward compatibility: Single feed, no new flags → behavior matches current releases (including derive_output_dir(rss) when output_dir omitted).
- Explicit parent for N > 1: When two or more feeds are configured, require explicit corpus parent output_dir (do not infer only from “first RSS”).
- Must not break existing acceptance configs; add new configs for multi-feed / append.
Assumptions:
- Maintainer documents single-writer per corpus parent for v1 (optional lockfile later).
- Rate limits and API cost scale with N feeds; callers may throttle externally.
Non-Goals (this phase)
- SQLite or other DB as primary resume store (may follow later; RFC-051 remains complementary).
- Cross-feed parallelism beyond “sequential feeds, parallel episodes per feed” unless trivial.
- Stage-level checkpoints (transcribe vs metadata) — episode-level v1 only.
Design & Implementation
1. Corpus layout (Layout A)
    <corpus_parent>/
      feeds/
        <stable_feed_id_a>/
          transcripts/
          metadata/
          index.json            # per-feed run index (optional location TBD in implementation)
          run_manifest.json     # when written
        <stable_feed_id_b>/...
      search/                   # unified vector index (RFC-061), when built at parent
      corpus_manifest.json      # optional; see §7
Stable feed id: Derived deterministically from feed URL (hash + short slug); exact algorithm left as implementation detail with collision tests.
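One possible shape for such an id, as a minimal sketch only: the function name mirrors stable_feed_id from this RFC, but the host-based slug, the SHA-256 prefix, and the 8-character length are assumptions made for illustration, not the shipped algorithm.

```python
import hashlib
import re
from urllib.parse import urlsplit


def stable_feed_id(feed_url: str, hash_len: int = 8) -> str:
    """Return a deterministic directory name for a feed URL (hypothetical scheme)."""
    normalized = feed_url.strip()
    # Short slug from the host so directories stay human-readable.
    host = urlsplit(normalized).hostname or "feed"
    slug = re.sub(r"[^a-z0-9]+", "-", host.lower()).strip("-") or "feed"
    # Hash prefix of the full URL distinguishes feeds served from the same host.
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:hash_len]
    return f"{slug}_{digest}"
```

The same URL always maps to the same directory, which is what append needs. A short hash prefix can still collide, so the collision tests mentioned above remain necessary.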
Single-feed compatibility: Either keep today’s default output/rss_<host>_<hash>/ without a feeds/
segment when only one feed and no multi-feed mode, or always use feeds/<id>/ under an explicit parent
— open choice (see Open Questions).
2. Multi-feed orchestration
- One Config template (or clone per feed) so workers, providers, and feature flags match across feeds.
- Outer loop: for each feed, set rss_url, set per-feed output_dir to join(corpus_parent, "feeds", stable_feed_id), wrap in try/except so one feed’s exception does not skip remaining feeds.
- Inner: existing run_pipeline(cfg) (or extracted core) unchanged except append/stable-dir and optional skip_auto_vector_index (§5).
- Programmatic API: Prefer a dedicated multi-feed entry point returning per-feed results; keep service.run single-feed until extended (avoid blocking on ServiceResult redesign).
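The outer loop described above might look roughly like the following sketch. FeedResult, run_feeds, and aggregate_exit_code are hypothetical names; run_one stands in for a call that configures and invokes the existing single-feed pipeline, and feed_id_for stands in for the stable feed id function.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List, Optional


@dataclass
class FeedResult:
    """Per-feed outcome (hypothetical type, not the project's ServiceResult)."""
    feed_url: str
    output_dir: Path
    ok: bool
    error: Optional[str] = None


def run_feeds(
    feed_urls: Iterable[str],
    corpus_parent: Path,
    feed_id_for: Callable[[str], str],
    run_one: Callable[[str, Path], None],
) -> List[FeedResult]:
    """Run feeds sequentially; a failing feed is recorded and the loop continues."""
    results: List[FeedResult] = []
    for url in feed_urls:
        out_dir = corpus_parent / "feeds" / feed_id_for(url)
        try:
            run_one(url, out_dir)
        except Exception as exc:  # isolate this feed's failure from the others
            results.append(FeedResult(url, out_dir, ok=False, error=str(exc)))
            continue
        results.append(FeedResult(url, out_dir, ok=True))
    return results


def aggregate_exit_code(results: List[FeedResult]) -> int:
    """Return non-zero if any feed failed."""
    return 0 if all(r.ok for r in results) else 1
```

Passing the runner as a callable keeps the inner pipeline untouched and makes the isolation behavior testable with a stub that raises for one URL.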
3. Append / resume
- Flag: e.g. --append (opt-in). Without it, behavior matches current defaults (including interaction with --skip-existing).
- Stable directory: Under append, reuse the same per-feed workspace; do not create a new timestamped run_* subtree each invocation. This requires coordinated changes with podcast_scraper.utils.filesystem.setup_output_directory and callers.
- Truth source: Filesystem + validation first (episode_id, parseable metadata/transcript where configured); index.json second for speed and last-error strings. Reconcile after crashes because finalize may not run on exception.
- Schema: Extend or version index.json as needed; document migration for old files.
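The "filesystem + validation first" ordering can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, and it assumes each metadata/*.metadata.json file is a JSON object with a top-level episode_id string, which may not match the real metadata schema. Transcript checks and index.json acceleration are omitted.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def completed_episode_ids(feed_root: Path) -> Set[str]:
    """Collect episode ids whose metadata file exists and parses."""
    done: Set[str] = set()
    for meta_path in sorted((feed_root / "metadata").glob("*.metadata.json")):
        try:
            doc = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or truncated file: treat the episode as not done
        # Assumed field name; adjust to the real metadata schema.
        episode_id = doc.get("episode_id") if isinstance(doc, dict) else None
        if isinstance(episode_id, str) and episode_id:
            done.add(episode_id)
    return done


def episodes_to_process(feed_episode_ids: Iterable[str], feed_root: Path) -> List[str]:
    """Return the ids from the feed that still need work, in feed order."""
    done = completed_episode_ids(feed_root)
    return [eid for eid in feed_episode_ids if eid not in done]
```

Because the decision is keyed by episode_id read from validated files, a metadata file left half-written by a crash simply causes a retry rather than a silent skip.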
4. Failure isolation
| Scope | Behavior |
|---|---|
| Feed | Log failure; continue other feeds; record status for append. |
| Episode | Existing fail_fast / max_failures per feed. |
| Viewer | Incomplete subtree must not break listing other feeds’ artifacts. |
5. Unified semantic index (RFC-061 integration)
- Discovery: Under corpus parent, glob **/metadata/*.metadata.json (and yaml variants), respecting existing run_* nesting if still present. If feeds/ exists and the parent has a top-level metadata/ directory, both are included (hybrid layout; GitHub #505 follow-up).
- Path resolution: For each metadata file, episode_root = metadata_path.parent.parent; join grounded_insights.artifact_path, knowledge_graph.artifact_path, and transcript rel paths against episode_root, not the corpus parent alone.
- Keys: Fingerprint / vector row identity uses composite (feed_id, episode_id) (or equivalent string) to avoid GUID collisions across feeds.
- maybe_index_corpus: For multi-feed batch toward one parent index, disable per-feed auto-index during inner runs and invoke index_corpus(corpus_parent) once after the batch (or document incremental per-feed alternative and cost).
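Discovery, path resolution, and composite keys can be illustrated together with pathlib. All function names here are hypothetical, the "::" separator is an assumption (the RFC only requires an equivalent of the (feed_id, episode_id) pair), and yaml variants are left out for brevity.

```python
from pathlib import Path
from typing import List, Tuple


def discover_metadata(corpus_parent: Path) -> List[Path]:
    """Find metadata files at any depth, including a top-level metadata/ directory."""
    return sorted(corpus_parent.glob("**/metadata/*.metadata.json"))


def episode_root(metadata_path: Path) -> Path:
    """The feed (or run) root is two levels above the metadata file."""
    return metadata_path.parent.parent


def resolve_artifact(metadata_path: Path, rel_path: str) -> Path:
    """Join a relative artifact path against the episode root, not the corpus parent."""
    return episode_root(metadata_path) / rel_path


def composite_key(feed_id: str, episode_id: str) -> str:
    """String form of (feed_id, episode_id); assumes feed_id never contains '::'."""
    return f"{feed_id}::{episode_id}"


def split_composite_key(key: str) -> Tuple[str, str]:
    """Split on the first separator so episode ids may contain colons."""
    feed_id, _, episode_id = key.partition("::")
    return feed_id, episode_id
```

Since the "**" pattern also matches zero directories, the same glob covers the hybrid layout where the parent has its own metadata/ next to feeds/.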
KG CLI: Existing rglob for .kg.json already suits parent corpus; verify any flat-metadata-only paths.
6. CLI and config surface
- Positional rss preserved; add --rss URL (repeatable), --feeds-spec for structured corpus feeds.spec.yaml, and/or legacy --rss-file line list (RFC-077).
- Config file: rss as string or list, or dedicated feeds: / rss_urls: list (entries may include per-feed overrides); merge rules documented in CONFIGURATION.md.
- Aggregated exit code: non-zero if any feed failed.
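A minimal argparse sketch of the positional-plus-repeatable shape follows. It is not the project's cli.py: the parser name, the extra_rss destination, and the use of --output-dir as the main CLI's parent-directory flag are assumptions, and --feeds-spec / --rss-file are omitted. It also enforces the "explicit parent for N > 1" constraint from this RFC.

```python
import argparse
from typing import List, Sequence, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="multi-feed-sketch")
    # Legacy positional stays optional so single-feed invocations keep working.
    parser.add_argument("rss", nargs="?", default=None, help="single feed URL")
    # Repeatable flag; each occurrence appends one URL.
    parser.add_argument("--rss", dest="extra_rss", action="append", metavar="URL")
    parser.add_argument("--output-dir", default=None, help="corpus parent directory")
    parser.add_argument("--append", action="store_true", help="opt-in append/resume")
    return parser


def resolve_feeds(argv: Sequence[str]) -> Tuple[List[str], argparse.Namespace]:
    """Merge positional and repeated URLs, de-duplicated in first-seen order."""
    parser = build_parser()
    args = parser.parse_args(argv)
    candidates = ([args.rss] if args.rss else []) + (args.extra_rss or [])
    feeds = list(dict.fromkeys(candidates))
    if len(feeds) > 1 and not args.output_dir:
        # parser.error prints usage and exits with status 2.
        parser.error("--output-dir is required when two or more feeds are given")
    return feeds, args
```

For example, resolve_feeds(["https://a/feed.xml", "--rss", "https://b/feed.xml", "--output-dir", "corpus"]) yields both URLs, while the same call without --output-dir exits with a usage error.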
7. Supplementary artifacts (holistic operations)
These are small, optional additions that pair well with multi-feed corpora. Normative field
tables and operational notes live in
docs/api/CORPUS_MULTI_FEED_ARTIFACTS.md (published under
API → Multi-feed corpus artifacts).
- corpus_manifest.json (at <corpus_parent>/): schema_version, tool_version, corpus_parent, updated_at, and feeds[] with feed_url, stable_feed_dir, last_run_finished_at (per-feed completion time from the runner), ok, error, episodes_processed.
- Corpus status command (corpus-status / podcast corpus-status): print per-feed metadata counts, sample index.json errors, and whether <corpus_parent>/search exists (GitHub #506).
- corpus_run_summary.json: batch summary with finished_at, overall_ok, and feeds[] (feed_url, ok, error, episodes_processed, optional finished_at per row). A structured log line corpus_multi_feed_summary echoes the same payload. service.run returns this document on ServiceResult.multi_feed_summary for multi-feed runs.
Partial batches: Manifest and summary are written even when some feeds fail (overall_ok: false).
With vector_search + FAISS, parent index_corpus still runs after finalize so successful feeds are
searchable; failed feeds add no metadata until a later successful run.
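As a rough illustration of the batch summary shape, the sketch below assembles a dict with the fields named above. build_run_summary is a hypothetical helper, the ISO 8601 UTC timestamp format is an assumption, and the optional per-row finished_at is omitted; the normative field tables remain in CORPUS_MULTI_FEED_ARTIFACTS.md.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Optional


def build_run_summary(
    feed_rows: Iterable[Dict[str, Any]],
    finished_at: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a corpus_run_summary.json-like document from per-feed rows."""
    stamp = (finished_at or datetime.now(timezone.utc)).isoformat()
    feeds = [
        {
            "feed_url": row["feed_url"],
            "ok": bool(row["ok"]),
            "error": row.get("error"),
            "episodes_processed": int(row.get("episodes_processed", 0)),
        }
        for row in feed_rows
    ]
    # The summary is produced even for partial batches; overall_ok reflects them.
    return {
        "finished_at": stamp,
        "overall_ok": all(feed["ok"] for feed in feeds),
        "feeds": feeds,
    }
```

Serializing the result with json.dumps gives one payload that could serve both the file and the structured log line.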
Key Decisions
- Persistence: Filesystem + index.json only for this RFC; DB later if needed.
- Append default: Opt-in --append; non-append preserves legacy semantics for single-feed workflows.
- Resume identity: Prefer episode_id from RSS over filename-only heuristics; document interaction with skip_existing.
- Unified index location: Default <corpus_parent>/search when indexing at parent.
- run_id: Audit and manifest, not a forced new output tree on every append invocation.
Risks and Code-Informed Gaps
- Run directory churn: Timestamped run_* under setup_output_directory fragments output unless append explicitly stabilizes per-feed roots.
- Finalize skipped on exception: If _process_episodes_with_threading raises, _finalize_pipeline does not run, so index.json may lag; reconcile from disk.
- episode.idx in filenames vs GUID episode_id: Feed reordering changes paths; resume must not rely on paths alone.
- Double process / no lock: Advisory exclusive lock .podcast_scraper.lock at the corpus parent during multi-feed CLI and service batches (filelock). Override with PODCAST_SCRAPER_CORPUS_LOCK=0 when a second process needs read-only access to the tree, or for tests.
- Resource use: N sequential run_pipeline calls may reload ML models N times; optional shared provider session later.
- Docs: Update README, viewer README, semantic search guide, and ARCHITECTURE.md when behavior ships.
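The single-writer guard can be approximated with the standard library alone. The sketch below is a stand-in, not the filelock-based implementation named above: it uses exclusive file creation (O_EXCL), so unlike an OS-level lock it leaves a stale lock file behind if the process is killed. corpus_lock and CorpusLockedError are hypothetical names; the lock file name and environment variable come from this RFC.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

LOCK_NAME = ".podcast_scraper.lock"
LOCK_ENV = "PODCAST_SCRAPER_CORPUS_LOCK"


class CorpusLockedError(RuntimeError):
    """Raised when another process already holds the corpus lock."""


@contextmanager
def corpus_lock(corpus_parent: Path) -> Iterator[None]:
    """Hold an exclusive lock file at the corpus parent for the duration of a batch."""
    if os.environ.get(LOCK_ENV) == "0":
        yield  # locking disabled by the override
        return
    lock_path = corpus_parent / LOCK_NAME
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise CorpusLockedError(f"corpus is locked: {lock_path}") from exc
    try:
        yield
    finally:
        os.close(fd)
        lock_path.unlink()
```

A batch would wrap its whole outer loop in "with corpus_lock(corpus_parent):", so a second writer fails fast instead of interleaving output.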
Alternatives Considered
- Per-feed only (no parent index). Pros: No indexer changes. Cons: Conflicts with unified search/viewer goal; N manual index passes.
- Flat global metadata/ (no feeds/). Pros: Simple glob. Cons: Collides with stable feed scoping, migration pain, harder per-feed deletes.
- SQLite-first resume. Pros: Strong consistency. Cons: Out of scope for v1; user chose filesystem + index.json first.
Testing Strategy
- Unit: stable_feed_id, path resolution (episode_root), composite index key formatting.
- Integration: Two feeds, one failing RSS URL; assert other completes; append second run skips completed episodes.
- Regression: Single-feed acceptance configs unchanged.
- Optional slow: Acceptance config for multi-feed overnight profile (tiered per TESTING_STRATEGY.md).
Rollout & Phasing
- Phase 1: Multi-feed orchestration + layout A + explicit parent output_dir + per-feed isolation.
- Phase 2: Append/stable dir + index.json extensions + filesystem reconciliation.
- Phase 3: Unified index_corpus / search / gi explore / docs + composite keys + batch vector index policy.
- Phase 4: corpus_manifest.json, corpus-status CLI, machine-readable batch summary (GitHub #506: manifest + corpus_run_summary.json + structured log line; ServiceResult.multi_feed_summary mirrors the summary JSON; per-feed finished_at / last_run_finished_at; normative contract docs/api/CORPUS_MULTI_FEED_ARTIFACTS.md; hybrid parent metadata/ + feeds/ discovery in #505).
Relationship to Other RFCs
| RFC | Relationship |
|---|---|
| RFC-004 | This RFC extends filesystem/run layout rules for multi-feed and append-stable behavior. |
| RFC-061 | This RFC requires corpus-parent indexing and recursive metadata; composite vector keys. |
| RFC-062 | Viewer corpus root = parent; artifacts API already recursive; search needs parent index. |
| RFC-001 | Orchestration gains an outer multi-feed loop; inner pipeline unchanged. |
| RFC-064 | Performance profiling freeze script assumes single-feed execution; multi-feed profiling (per-feed profiles, aggregate metrics, N model reloads) is a future consideration once both RFCs ship. |
Migration Path
- Ship multi-feed behind explicit parent output_dir and new flags; no change for existing single-feed invocations.
- Document “point serve / search index at corpus parent” once Phase 3 lands.
- Optional ADRs may spin out: stable append layout, composite index keys, resume truth ordering.
Open Questions
- Exact stable_feed_id algorithm (length, collision handling).
- Non-append multi-feed: same flat feeds/<id>/ tree as append, or retain nested run_* for parity with historical single-feed runs.
- Incremental index_corpus(parent) after each feed vs batch-only after all feeds (latency vs cost).
References
- src/podcast_scraper/workflow/orchestration.py: run_pipeline, finalize path
- src/podcast_scraper/utils/filesystem.py: setup_output_directory, derive_output_dir
- src/podcast_scraper/cli.py: _build_config
- src/podcast_scraper/search/indexer.py: index_corpus, maybe_index_corpus
- src/podcast_scraper/server/routes/artifacts.py: recursive GI/KG listing
- web/gi-kg-viewer/README.md: corpus root instructions
Lineage
Informal WIP notes for #440 / #444 were retired (2026-04) in favor of this RFC and CORPUS_MULTI_FEED_ARTIFACTS.md for operational JSON contracts.