RFC-067: Corpus Library — Catalog API & Viewer Integration
- Status: Completed (v2.6.0). Phases 1–3 shipped (/api/corpus/*, Library tab, episode detail, similar episodes)
- Authors: Podcast Scraper Team
- Stakeholders: Core team, viewer users, pipeline maintainers
- Related PRDs:
  - PRD-022: Corpus Library & Episode Browser
  - PRD-004: Per-Episode Metadata
  - PRD-005: Episode Summarization
  - PRD-021: Semantic Corpus Search
  - PRD-023: Corpus Digest & Library Glance
- Related RFCs:
  - RFC-062: GI/KG viewer v2 (FastAPI shell, Vue SPA, artifact loading)
  - RFC-061: Semantic Corpus Search (/api/search handoff)
  - RFC-063: Multi-Feed Corpus (feeds/ layout, discovery)
  - RFC-004: Filesystem Layout (output directory conventions)
  - RFC-068: Corpus Digest (digest API & viewer, PRD-023)
- Related UX specs:
  - UXS-003: Corpus Library (Library tab, Episode subject rail, catalog layout)
  - VIEWER_IA: Viewer information architecture (shell IA: tabs, subject rail, status bar)
  - UXS-001: GI / KG viewer (shared tokens and visual chrome)
- Related Documents:
  - ADR-064: Canonical server layer (route mounting pattern)
Abstract
This RFC specifies the Phase 1–3 technical design for the Corpus Library: new FastAPI
routes that list feeds and episodes from on-disk episode metadata (JSON/YAML),
paginated APIs, an episode detail payload including summary bullets and sibling
GI/KG paths, and Vue integration as a main tab in the existing GI/KG viewer.
Phase 1 is read-only and reuses discover_metadata_files() and the corpus path resolution
already used by /api/artifacts and /api/search. Phase 2 adds scalability (cache or
persisted catalog). Phase 3 adds deeper semantic integration with the vector index.
Phase 4 adds optional corpus-local artwork download plus GET /api/corpus/binary
(see § Phase 4 below).
Architecture alignment: The server remains a consumption layer (RFC-062 / ADR-064):
no pipeline mutations, no new required CLI flags for Phase 1. Catalog logic lives in
src/podcast_scraper/server/ and may call shared helpers in
src/podcast_scraper/search/corpus_scope.py and
src/podcast_scraper/utils/corpus_episode_paths.py.
Problem Statement
The viewer can list graph artifacts and run semantic search, but there is no efficient, podcast-oriented catalog backed by the same metadata files the pipeline already writes. Users need:
- Feed-scoped episode lists with pagination for large corpora.
- Episode detail including summaries without loading full .gi.json / .kg.json.
- Deterministic handoffs to graph (artifact load) and search (query + filters).
Use cases:
- Browse by show: Select feed_id → scroll episodes newest-first.
- Skim summaries: Open detail → read bullets → decide whether to open graph.
- Pivot to search: From an episode, run semantic search with feed pre-set and optional title in the query box.
Goals
- Phase 1: Ship three REST endpoints under /api/corpus/ with stable JSON shapes, path traversal safety, and tests on synthetic corpus layouts (flat + feeds/).
- Phase 1: Add Library UI per UXS-003; reuse corpusPath from the shell store; hand off to artifacts store and search store without duplicating path logic.
- Phase 2: Document and implement catalog caching or materialization so 10k+ episodes remain usable.
- Phase 3: Document vector-assisted features (similar episodes, richer facets) using existing indexer metadata where possible.
Non-Goals (Phase 1)
- Writing or mutating metadata through the catalog routes (metadata is still produced by the pipeline / CLI, not by GET /api/corpus/*).
- Replacing /api/artifacts or /api/search with a single mega-endpoint.
- Generic reverse proxy for arbitrary third-party image URLs without an ingest step: not in scope. Operators who need same-origin art enable Phase 4 (optional download at metadata generation).
Phase 4 — Optional corpus-local artwork (implemented)
Goal: Same-origin thumbnails when images are copied into the corpus at ingest time,
avoiding hotlinking for those assets. Fallback: URL-only behavior is unchanged when
download_podcast_artwork is false (opt out via config or CLI) or when local files are absent.
Pipeline / metadata (PRD-004):
- Config: download_podcast_artwork (bool, default true; CLI --no-download-podcast-artwork to disable). When true with metadata generation (non-dry-run), the workflow downloads feed.image_url / episode.image_url targets into <corpus>/.podcast_scraper/corpus-art/sha256/… and sets optional feed.image_local_relpath / episode.image_local_relpath (corpus-relative POSIX paths).
- Implementation: src/podcast_scraper/utils/corpus_artwork.py, src/podcast_scraper/workflow/metadata_generation.py.
Catalog:
- CatalogEpisodeRow carries feed_image_local_relpath / episode_image_local_relpath only when the path is under .podcast_scraper/corpus-art/ and the file exists on disk (anti-traversal consistent with other corpus routes).
- aggregate_feeds may set image_local_relpath for a feed from the first verified feed-level local path seen.
HTTP API:
- GET /api/corpus/binary. Query: path (corpus root), relpath (required). relpath must start with .podcast_scraper/corpus-art/; returns FileResponse with guessed Content-Type, or 404 / 400. Implementation: src/podcast_scraper/server/routes/corpus_binary.py.
- GET /api/corpus/feeds: each feed may include optional image_local_relpath.
- GET /api/corpus/episodes, GET /api/corpus/episodes/detail, GET /api/corpus/episodes/similar: optional feed_image_local_relpath, episode_image_local_relpath on each item (mirrors digest row shape in RFC-068).
- GET /api/health: includes corpus_binary_api: true when the binary route is mounted (default for current server builds).
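The relpath allowlist for the binary route can be sketched as follows. This is a minimal illustration, not the code in corpus_binary.py: the names check_art_relpath and guess_content_type are hypothetical, and the split between 400 (rejected relpath) and 404 (missing file) is an assumption, since the RFC only says the route returns 404 / 400.

```python
import mimetypes
from pathlib import Path, PurePosixPath
from typing import Optional, Tuple

ART_PREFIX = ".podcast_scraper/corpus-art/"  # allowlisted prefix named in this RFC


def check_art_relpath(corpus_root: Path, relpath: str) -> Tuple[int, Optional[Path]]:
    """Return (status, file): 200 with a path, 400 if rejected, 404 if missing."""
    # Only the artwork directory is served; ".." segments are never allowed.
    if not relpath.startswith(ART_PREFIX) or ".." in PurePosixPath(relpath).parts:
        return 400, None
    root = corpus_root.resolve()
    candidate = (root / relpath).resolve()
    # resolve() follows symlinks, so re-check that the target stays in the corpus.
    if root not in candidate.parents:
        return 400, None
    if not candidate.is_file():
        return 404, None
    return 200, candidate


def guess_content_type(path: Path) -> str:
    """Guess a Content-Type from the file name, with a generic fallback."""
    guessed, _ = mimetypes.guess_type(path.name)
    return guessed or "application/octet-stream"
```

A route handler would typically map the status to an HTTP error and pass the returned path and guessed type to the file response.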
Viewer:
- PodcastCover.vue: when corpusPath is set, prefers episode local → episode URL → feed local → feed URL; local src is /api/corpus/binary?path=${encodeURIComponent(corpusPath)}&relpath=${encodeURIComponent(rel)}.
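The preference order can be illustrated with a small Python analogue of the component logic. The real implementation lives in PodcastCover.vue; cover_src is a hypothetical name, and urllib.parse.quote stands in for encodeURIComponent (the two differ on a few punctuation characters but decode to the same value).

```python
from typing import Optional
from urllib.parse import quote


def cover_src(
    corpus_path: Optional[str],
    *,
    episode_local: Optional[str] = None,
    episode_url: Optional[str] = None,
    feed_local: Optional[str] = None,
    feed_url: Optional[str] = None,
) -> Optional[str]:
    """Pick the first usable source: episode local, episode URL, feed local, feed URL."""

    def local(rel: Optional[str]) -> Optional[str]:
        # Local artwork is only addressable when the corpus path is known.
        if corpus_path and rel:
            return (
                "/api/corpus/binary"
                f"?path={quote(corpus_path, safe='')}&relpath={quote(rel, safe='')}"
            )
        return None

    for candidate in (local(episode_local), episode_url, local(feed_local), feed_url):
        if candidate:
            return candidate
    return None
```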
Tests: Integration coverage in tests/integration/server/test_viewer_corpus_library.py
(binary allowlist, verified local fields on list/detail).
- Also a Phase 1 non-goal: introducing vue-router as a hard dependency (optional later; the MVP uses a third main
tab in App.vue).
Constraints & Assumptions
Constraints:
- Path safety: All relative paths are resolved under the resolved corpus root using the same pattern as artifacts.py in src/podcast_scraper/server/routes/ (.. and absolute segments rejected).
- Corpus root query param name stays path for consistency with /api/artifacts, /api/search, /api/explore (required or optional per endpoint below; mirror existing semantics: omit → app.state.output_dir when set).
Assumptions:
- Episode metadata files match *.metadata.json, *.metadata.yaml, or *.metadata.yml as discovered by discover_metadata_files() in corpus_scope.py in src/podcast_scraper/search/.
- Summary shape matches normalized pipeline output (summary.bullets, optional summary.title) as produced by metadata generation (PRD-005).
Data Model (logical)
Each metadata file yields at minimum:
| Field | Source |
|---|---|
| metadata_relative_path | Path relative to corpus root (POSIX) |
| feed_id | From feed.feed_id in metadata doc, normalized (see normalize_feed_id) |
| episode_id | From episode.episode_id when present |
| episode_title | From episode or RSS fields in metadata (implementation maps common keys) |
| publish_date | ISO date string when parseable; nullable |
| gi_relative_path / kg_relative_path | Sibling paths if files exist (same rules as corpus_episode_paths) |
| summary_title / summary_bullets | From summary object when present |
| feed_image_url | From feed.image_url when present (non-empty string) |
| episode_image_url | From episode.image_url when present |
| duration_seconds | From episode.duration_seconds when present (integer ≥ 0) |
| episode_number | From episode.episode_number when present (integer ≥ 0) |
| feed_image_local_relpath | Verified path from feed.image_local_relpath when file exists (Phase 4) |
| episode_image_local_relpath | Verified path from episode.image_local_relpath when file exists (Phase 4) |
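As an illustration of the mapping in the table, a row builder might look like the sketch below. It is not the project's CatalogEpisodeRow: EpisodeRow and row_from_metadata are hypothetical names, episode.title is an assumed key (the RFC only says common keys are mapped), str.strip() stands in for normalize_feed_id, and publish_date, the GI/KG siblings, and the Phase 4 local paths are left out because their source keys and verification rules are not spelled out here.

```python
from dataclasses import dataclass, field
from typing import Any, List, Mapping, Optional


@dataclass
class EpisodeRow:
    """Subset of the logical catalog row; field names follow the table."""

    metadata_relative_path: str
    feed_id: Optional[str] = None
    episode_id: Optional[str] = None
    episode_title: Optional[str] = None
    summary_title: Optional[str] = None
    summary_bullets: List[str] = field(default_factory=list)
    feed_image_url: Optional[str] = None
    episode_image_url: Optional[str] = None
    duration_seconds: Optional[int] = None
    episode_number: Optional[int] = None


def _text(value: Any) -> Optional[str]:
    """Keep non-empty strings only, trimmed."""
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def _count(value: Any) -> Optional[int]:
    """Keep integers >= 0 only (bool is excluded although it subclasses int)."""
    if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
        return value
    return None


def row_from_metadata(relpath: str, doc: Mapping[str, Any]) -> EpisodeRow:
    feed = doc.get("feed") or {}
    episode = doc.get("episode") or {}
    summary = doc.get("summary") or {}
    return EpisodeRow(
        metadata_relative_path=relpath,
        feed_id=_text(feed.get("feed_id")),  # stand-in for normalize_feed_id
        episode_id=_text(episode.get("episode_id")),
        episode_title=_text(episode.get("title")),  # assumed key name
        summary_title=_text(summary.get("title")),
        summary_bullets=[b for b in summary.get("bullets") or [] if isinstance(b, str)],
        feed_image_url=_text(feed.get("image_url")),
        episode_image_url=_text(episode.get("image_url")),
        duration_seconds=_count(episode.get("duration_seconds")),
        episode_number=_count(episode.get("episode_number")),
    )
```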
Visual metadata (optional, PRD-022): Populated from existing per-episode metadata
documents (PRD-004); no live RSS refetch in the server. Missing keys remain absent or
null in JSON. Feed list includes image_url per feed when any episode row for that
feed_id carries feed.image_url (first non-empty wins during aggregation). Clients should
use the episode image first, then fall back to the feed image for episode-level thumbnails.
When Phase 4 local paths are present and verified, API consumers and PodcastCover
prefer same-origin GET /api/corpus/binary (see Phase 4) before hotlinked URLs.
Sorting key for pagination: Prefer (publish_date desc, metadata_relative_path desc) or
(publish_date desc, metadata_relative_path asc) — pick one and document; stable across
requests.
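A minimal sketch of one of the two candidate orderings, (publish_date desc, metadata_relative_path asc), is shown below. It assumes rows are plain dicts, that publish_date values are ISO strings in one consistent format (so string comparison matches date order), and that rows without a date sort last; sort_rows is a hypothetical name.

```python
from typing import Any, Dict, List


def sort_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order rows by publish_date descending, then metadata_relative_path ascending."""
    # Python's sort is stable, so sort by the tie-breaker first.
    by_path = sorted(rows, key=lambda r: r["metadata_relative_path"])
    # reverse=True keeps the relative order of equal keys, so ties stay path-ascending.
    # The leading boolean pushes rows without a date to the end.
    return sorted(
        by_path,
        key=lambda r: (r.get("publish_date") is not None, r.get("publish_date") or ""),
        reverse=True,
    )
```

Because the relative path is unique per metadata file, the combined key is total, which is what keeps the order stable across requests.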
Phase 1 — API Design
New router module e.g. src/podcast_scraper/server/routes/corpus_library.py, registered
in app.py (src/podcast_scraper/server/app.py) with prefix="/api" (routes below
include /corpus/...).
1. GET /api/corpus/feeds
Query:
- path (optional if server default output_dir is set): corpus root directory.
Response (success):
- path: resolved corpus root string
- feeds: array of { "feed_id": str, "display_title": str | null, "episode_count": int, "image_url": str | null, "image_local_relpath": str | null } (image_url / image_local_relpath optional; omit or null when unknown)
Episode count (Phase 1): May be computed by a single metadata scan or omitted (null)
if the first implementation defers counts to avoid double walks; PRD allows optional counts.
If omitted, document and return episode_count: null or omit field consistently.
Errors:
- 404 / JSON detail when path invalid (match existing CorpusPathRequestError behavior).
- Empty corpus: feeds: [] with 200 (not an error).
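A single-pass aggregation that produces the feeds array could look like this sketch. It is not the project's aggregate_feeds: the function name is hypothetical, rows are assumed to be dicts with the list-row keys, rows without a feed_id are skipped, and output is sorted by feed_id (both assumptions). The RFC states "first non-empty wins" for image_url; applying the same rule to display_title is an assumption here.

```python
from typing import Any, Dict, Iterable, List, Mapping


def aggregate_feeds_sketch(rows: Iterable[Mapping[str, Any]]) -> List[Dict[str, Any]]:
    """Build feed entries with counts from episode rows in one pass."""
    feeds: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        feed_id = row.get("feed_id")
        if not feed_id:
            continue  # assumption: rows without a feed_id are not listed
        entry = feeds.setdefault(
            feed_id,
            {"feed_id": feed_id, "display_title": None, "episode_count": 0, "image_url": None},
        )
        entry["episode_count"] += 1
        # First non-empty value wins for both optional fields.
        if entry["display_title"] is None and row.get("feed_display_title"):
            entry["display_title"] = row["feed_display_title"]
        if entry["image_url"] is None and row.get("feed_image_url"):
            entry["image_url"] = row["feed_image_url"]
    return sorted(feeds.values(), key=lambda f: f["feed_id"])
```

An empty input yields an empty list, which matches the "feeds: [] with 200" case above.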
2. GET /api/corpus/episodes
Query:
- path: corpus root (optional if default set)
- feed_id: optional filter (exact match on normalized id)
- q: optional case-insensitive substring on episode title
- topic_q: optional case-insensitive substring on summary title or any summary bullet
- since: optional YYYY-MM-DD (include episodes published on or after this date, when the publish date is parseable)
- limit: integer, default e.g. 50, max e.g. 200
- cursor: opaque string or absent for first page
Response:
- path, feed_id (echo), items: array of summary rows: { "metadata_relative_path", "feed_id", "feed_display_title", "topics", "episode_id", "episode_title", "publish_date", "feed_image_url"?, "episode_image_url"?, "duration_seconds"?, "episode_number"?, "feed_image_local_relpath"?, "episode_image_local_relpath"? } (feed_display_title from metadata when present, else null; topics = short strings from summary bullets, capped). Optional visual fields mirror the logical catalog row (see Data Model); omit or null when not in metadata or when local files are missing.
- next_cursor: string or null when no more pages
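The query filters above amount to a per-row predicate. The sketch below assumes rows are dicts that also carry summary_title and summary_bullets, that since has already been parsed into a date by the route layer, and that rows without a parseable publish date are excluded when since is given (the RFC leaves that case open); matches is a hypothetical name.

```python
from datetime import date
from typing import Any, Mapping, Optional


def _parse_day(value: Any) -> Optional[date]:
    """Parse the leading YYYY-MM-DD of a string, or return None."""
    if not isinstance(value, str):
        return None
    try:
        return date.fromisoformat(value[:10])
    except ValueError:
        return None


def matches(
    row: Mapping[str, Any],
    *,
    feed_id: Optional[str] = None,
    q: Optional[str] = None,
    topic_q: Optional[str] = None,
    since: Optional[date] = None,
) -> bool:
    """Return True when the row passes every filter that was supplied."""
    if feed_id is not None and row.get("feed_id") != feed_id:
        return False
    if q and q.casefold() not in (row.get("episode_title") or "").casefold():
        return False
    if topic_q:
        haystack = [row.get("summary_title") or "", *(row.get("summary_bullets") or [])]
        if not any(topic_q.casefold() in text.casefold() for text in haystack):
            return False
    if since is not None:
        published = _parse_day(row.get("publish_date"))
        # Assumption: undated rows are dropped once a since filter is active.
        if published is None or published < since:
            return False
    return True
```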
Cursor encoding (recommended): Base64url of JSON
{ "p": publish_sort_key, "rel": metadata_relative_path } or simpler offset only if
sort is fully stable — prefer keyset pagination for large corpora.
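The recommended cursor can be sketched with the standard library. The example assumes rows are dicts already sorted by (publish_date desc, metadata_relative_path asc), that a missing date is represented by an empty sort key, and that limit is at least 1; encode_cursor, decode_cursor, and page are hypothetical names.

```python
import base64
import json
from typing import Any, Dict, List, Optional, Tuple


def encode_cursor(publish_sort_key: str, rel: str) -> str:
    """Base64url-encode the JSON keyset {"p": ..., "rel": ...} without padding."""
    raw = json.dumps({"p": publish_sort_key, "rel": rel}, separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii").rstrip("=")


def decode_cursor(cursor: str) -> Tuple[str, str]:
    """Inverse of encode_cursor; restores the stripped padding first."""
    padded = cursor + "=" * (-len(cursor) % 4)
    data = json.loads(base64.urlsafe_b64decode(padded.encode("ascii")))
    return data["p"], data["rel"]


def _key(row: Dict[str, Any]) -> Tuple[str, str]:
    return row.get("publish_date") or "", row["metadata_relative_path"]


def page(
    sorted_rows: List[Dict[str, Any]], limit: int, cursor: Optional[str] = None
) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Return (items, next_cursor) for rows that come after the cursor key."""
    rows = sorted_rows
    if cursor:
        p, rel = decode_cursor(cursor)
        # "After" in (date desc, path asc) order: older date, or same date and later path.
        rows = [r for r in sorted_rows if _key(r)[0] < p or (_key(r)[0] == p and _key(r)[1] > rel)]
    items = rows[:limit]
    next_cursor = encode_cursor(*_key(items[-1])) if len(rows) > limit else None
    return items, next_cursor
```

Comparing keys rather than remembering an offset means a page boundary still resolves correctly if the row named in the cursor has since been removed.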
3. GET /api/corpus/episodes/detail
Query:
- path: corpus root
- metadata_relpath: relative path to the metadata file (same as listed in items)
Response:
- All list fields plus summary_title, summary_bullets (array of strings), gi_relative_path, kg_relative_path (nullable strings), has_gi, has_kg (booleans), plus optional feed_image_url, episode_image_url, duration_seconds, episode_number, and (when verified on disk, Phase 4) feed_image_local_relpath, episode_image_local_relpath
Errors:
- 400 if metadata_relpath escapes corpus root or is not a metadata file
- 404 if file missing
Error shape consistency
Align with existing JSON error patterns: use detail for HTTP errors. For an invalid path,
prefer HTTP errors to match artifacts. Follow the /api/search style (200 with an error
field) only for soft “no corpus” cases where other routes already return embedded errors.
Phase 1 — Frontend (Vue)
Shell integration
- Extend mainTab union in App.vue (web/gi-kg-viewer/src/App.vue) with 'library'.
- Add a Library button in the main nav next to Graph / Dashboard with the same styling pattern as existing main-view tabs (VIEWER_IA shell, UXS-003 Library).
Components
- New directory web/gi-kg-viewer/src/components/library/:
  - LibraryView.vue: three-region layout (feeds | episodes | detail)
  - Optional subcomponents for list rows and detail card
- Cover art: shared PodcastCover.vue for feed rows, episode rows, detail header, similar-episode list, and Library glance (24h digest strip): lazy-loaded <img>, fixed aspect ratio, @error → placeholder; local binary URL (when corpusPath + verified *_image_local_relpath) before episode URL, then feed local, then feed URL
State & API
- New src/api/corpusLibraryApi.ts: fetchFeeds, fetchEpisodes, fetchEpisodeDetail. Viewer: LibraryView sends limit=20 per page (named constant in the component; server query default may differ when the param is omitted).
- Optional Pinia store stores/corpusLibrary.ts or local state inside LibraryView for MVP; must read corpusPath and health from shell store (web/gi-kg-viewer/src/stores/shell.ts).
Handoff to graph
- Use useArtifactsStore (web/gi-kg-viewer/src/stores/artifacts.ts): load .gi.json and .kg.json via existing GET /api/artifacts/{path}?path=corpusRoot flow (or the store’s existing helper).
- After successful load, set mainTab to 'graph'.
- If only one of GI/KG exists, load what exists; show inline message for missing artifact.
Handoff to semantic search
| Action | Behavior |
|---|---|
| Search in corpus (primary) | Set rightOpen, rightTab = 'search'; set search store feed filter to episode’s feed_id if present; set query to episode title (trimmed) or empty if title missing |
| Search from first bullet (optional Phase 1 stretch) | Pre-fill query with first bullet text truncated to N chars (RFC implementer chooses N ≤ 200) |
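The query pre-fill rules in the table reduce to a few lines. The viewer itself is TypeScript; this Python sketch only illustrates the Phase 1 rule, with search_query_for_episode as a hypothetical name and N fixed at the upper bound of 200.

```python
from typing import List, Optional

MAX_BULLET_CHARS = 200  # the table allows any N <= 200


def search_query_for_episode(
    title: Optional[str],
    bullets: Optional[List[str]] = None,
    *,
    use_first_bullet: bool = False,
) -> str:
    """Build the Search panel query for an episode handoff."""
    if use_first_bullet and bullets:
        # Stretch action: first bullet, trimmed and truncated.
        return bullets[0].strip()[:MAX_BULLET_CHARS]
    # Primary action: trimmed title, or empty when the title is missing.
    return (title or "").strip()
```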
Debounce and keyboard shortcuts: follow RFC-062 patterns for Search panel focus where
applicable (useViewerKeyboard extension optional in a follow-up).
Optional deep links
Reserved query parameters (implement when low cost):
- ?tab=library: open Library on load
- ?library_feed=<feed_id>: preselect feed after feeds load
Document in README or viewer help when implemented.
Phase 2 — Scale (catalog cache / manifest)
Motivation: Full metadata walks on every feeds + episodes request may exceed
latency targets for 10k+ files.
Options (choose one or combine in implementation PR):
- In-process TTL cache keyed by (corpus_root, mtime_bucket) where mtime_bucket is max mtime of metadata files under root (expensive to compute naïvely; may sample or use directory mtime heuristics).
- On-disk manifest e.g. <corpus>/.podcast_scraper/catalog.json written by a new podcast catalog refresh command or lazily by the server on first request (with file lock).
- SQLite sidecar (catalog.sqlite) with tables feeds, episodes and indexes on (feed_id, publish_date).
Invalidation: Manifest includes generated_at and source_fingerprint (e.g. hash of
metadata paths + mtimes); server rejects stale manifest when filesystem diverges.
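One way to compute such a fingerprint is sketched below. Phase 2 is not implemented, so this is purely illustrative: source_fingerprint and manifest_is_stale are hypothetical names, the manifest is assumed to be a dict with a source_fingerprint key, and a plain recursive glob stands in for discover_metadata_files().

```python
import hashlib
from pathlib import Path
from typing import Any, Mapping

METADATA_PATTERNS = ("*.metadata.json", "*.metadata.yaml", "*.metadata.yml")


def source_fingerprint(corpus_root: Path) -> str:
    """Hash the sorted (relative path, mtime) pairs of all metadata files."""
    entries = []
    for pattern in METADATA_PATTERNS:
        for path in corpus_root.rglob(pattern):
            rel = path.relative_to(corpus_root).as_posix()
            entries.append((rel, path.stat().st_mtime_ns))
    digest = hashlib.sha256()
    # Sorting makes the hash independent of directory traversal order.
    for rel, mtime_ns in sorted(entries):
        digest.update(f"{rel}\0{mtime_ns}\n".encode("utf-8"))
    return digest.hexdigest()


def manifest_is_stale(manifest: Mapping[str, Any], corpus_root: Path) -> bool:
    """A manifest is stale when its stored fingerprint no longer matches disk."""
    return manifest.get("source_fingerprint") != source_fingerprint(corpus_root)
```

Note that this still stats every metadata file, so it is cheaper than parsing the documents but not free on very large trees.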
API compatibility: Phase 2 must not break Phase 1 response shapes; only latency and internal storage change.
Phase 3 — Semantic depth
Examples:
- Similar episodes: Given metadata_relpath, issue an internal search using embedding of summary text or first bullet (requires model load; gate behind config or reuse search service).
- Facet from index: Cross-reference feeds_indexed from /api/index/stats with library feed list to show “in vector index” badges.
Dependencies: RFC-061 index present; may reuse metadata fields already embedded in
indexer.py in src/podcast_scraper/search/.
Implementation (2026-04-11):
- GET /api/corpus/episodes/similar: path, metadata_relpath, optional top_k (1–25, default 8). Builds a query from summary.title + summary.bullets (fallback: episode title), runs the same FAISS path as run_corpus_search, dedupes hits by (feed_id, episode_id), drops the source episode, enriches rows with catalog titles/paths via index_rows_by_feed_episode. Returns 200 with items or a soft error (no_index, insufficient_text, embed_failed, …) like /api/search.
- Library UI: Loads GET /api/index/stats when feeds load; shows an Indexed chip on feeds present in feeds_indexed and Similar episodes in the detail column (auto-calls the similar endpoint after episode detail loads; lists peers with scores when the index is available).
- GET /api/index/stats → feeds_indexed: Response list is deduplicated, sorted, and normalize_feed_id-trimmed so Library chips match catalog feed_id strings.
- Search handoff: Search in corpus opens the Search panel with the same summary-derived query as similarity search (not title-only); Feed filter (substring on catalog feed id) matches GET /api/search semantics.
- Code: corpus_similar.py in src/podcast_scraper/search/, route in routes/corpus_library.py, schemas CorpusSimilarEpisodesResponse / CorpusSimilarEpisodeItem.
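The query-building and post-processing steps of the similar endpoint can be sketched without the vector index. This is not the code in corpus_similar.py: build_similarity_query and dedupe_hits are hypothetical names, hits are assumed to be dicts with feed_id and episode_id already ordered best-first, and joining title and bullets with spaces is an assumption.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_similarity_query(
    summary_title: Optional[str], bullets: Optional[List[str]], episode_title: Optional[str]
) -> str:
    """Join summary title and bullets; fall back to the episode title."""
    parts = [p.strip() for p in [summary_title or "", *(bullets or [])] if p and p.strip()]
    return " ".join(parts) or (episode_title or "").strip()


def dedupe_hits(
    hits: List[Dict[str, Any]], source_key: Tuple[str, str], top_k: int
) -> List[Dict[str, Any]]:
    """Keep the first hit per (feed_id, episode_id), skipping the source episode."""
    seen = {source_key}  # seeding with the source drops it from the results
    peers: List[Dict[str, Any]] = []
    for hit in hits:
        key = (hit["feed_id"], hit["episode_id"])
        if key in seen:
            continue
        seen.add(key)
        peers.append(hit)
        if len(peers) == top_k:
            break
    return peers
```

An empty query string from build_similarity_query would correspond to the insufficient_text soft error listed above.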
Security
- Reuse resolve_corpus_path_param from pathutil.py in src/podcast_scraper/server/.
- Validate metadata_relpath with the same segment rules as artifact paths (no ..).
- Reject symlinks escaping corpus if the codebase already centralizes that check; otherwise document follow-up.
Testing
| Layer | Scope |
|---|---|
| Unit | Pagination ordering; cursor encode/decode; metadata parsing edge cases (missing fields) |
| Integration | FastAPI routes with temp dirs: flat layout + feeds/<id>/metadata/... |
| E2E | After UI lands: Playwright flows for tab switch, list load, handoff (update E2E_SURFACE_MAP first) |
Alternatives Considered
- vue-router for /library: cleaner URLs but adds bundle and boot complexity; defer.
- GraphQL or single combined query: unnecessary for local tool; REST stays debuggable.
- Only client-side scan: would require exposing all metadata paths or huge payloads; rejected.
Open Questions
- Should episode_count be mandatory in GET /api/corpus/feeds Phase 1 or deferred to Phase 2 for a single-pass indexer?
- Exact metadata key mapping for episode_title across older runs: document in code as TITLE_KEYS fallback list.
Revision History
| Date | Change |
|---|---|
| 2026-04-10 | Initial draft (Phases 1–3) |
| 2026-04-11 | Phase 1 implemented: corpus_catalog.py, /api/corpus/* routes, Library tab, tests + E2E |
| 2026-04-11 | Phase 2 deferred; decision log appended (no Phase 2 code) |
| 2026-04-11 | Phase 3 implemented: /api/corpus/episodes/similar, index badges + similar UI in Library |
| 2026-04-11 | Follow-up: canonical feeds_indexed in /api/index/stats, Library→Search summary handoff, similar empty state, tests/mocks |
| 2026-04-10 | Phase 4: optional download_podcast_artwork, .podcast_scraper/corpus-art/, GET /api/corpus/binary, catalog verified *_image_local_relpath, health corpus_binary_api, PodcastCover same-origin preference |
Phase 2 status (decision log)
2026-04-11: Phase 2 not implemented (deferred). The catalog cache / manifest / SQLite design in the Phase 2 scale section above remains the planned shape for when scale demands it; no code changes for Phase 2 were made at this time.
What we concluded
- Phase 1 walks all episode metadata on each GET /api/corpus/feeds and GET /api/corpus/episodes (after filters, pagination slices in memory). That is fine for exploration and for small-to-medium corpora.
- Phase 2 trades that simplicity for invalidation, staleness, and optional on-disk artifacts (catalog.json, SQLite) or a CLI command (podcast catalog refresh-style). That cost is not justified until we see measured pain (slow Library loads, very large trees, or aggressive UI refresh) or until metadata layout stabilizes enough that maintaining a cache is low-risk.
When to revisit
- Representative corpora in the 10k+ metadata file range, or clear user feedback that list endpoints are too slow.
- Need for cheap per-request work (e.g. always-fresh episode counts) without rescanning.
Suggested first increment when we return: In-process TTL cache plus a cheap fingerprint (e.g. max mtime or count of metadata paths) keyed by resolved corpus root — before adopting a persisted manifest or SQLite, unless requirements jump straight to offline/catalog CLI workflows.
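A minimal sketch of that first increment follows. Nothing here exists in the codebase: CatalogCache is a hypothetical class, the caller is assumed to supply a cheap fingerprint value (for example a max mtime or a file count), and an entry is reused only while it is both within the TTL and carries the same fingerprint.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class CatalogCache:
    """In-process cache keyed by resolved corpus root (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Hashable, Any]] = {}

    def get_or_build(
        self, corpus_root: str, fingerprint: Hashable, build: Callable[[], Any]
    ) -> Any:
        """Return the cached catalog, rebuilding on TTL expiry or fingerprint change."""
        now = self._clock()
        entry = self._entries.get(corpus_root)
        if entry is not None:
            stored_at, stored_fingerprint, value = entry
            if now - stored_at < self._ttl and stored_fingerprint == fingerprint:
                return value
        value = build()  # full metadata walk happens only here
        self._entries[corpus_root] = (now, fingerprint, value)
        return value
```

For example, a route could call cache.get_or_build(str(root), fingerprint, scan) where scan performs the existing Phase 1 metadata walk, leaving response shapes untouched.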