PRD-021: Semantic Corpus Search¶
- Status: Implemented (v2.6.0)
- Authors: Podcast Scraper Team
- Related RFCs:
- RFC-061 — semantic corpus search, FAISS + CLI (Completed)
- RFC-070 — Qdrant / platform vector backends (Draft)
- RFC-062 — GI/KG viewer v2 (complete)
- RFC-049 — GIL core (complete; prerequisite, indexable artifacts)
- RFC-050 — GIL use cases (open; UC4/UC5 semantics)
- RFC-051 — database projection (open; complementary SQL serving)
- RFC-055 — KG core (complete; indexable KG artifacts)
- RFC-056 — KG use cases (open)
- Related PRDs:
- PRD-017: Grounded Insight Layer (GIL artifacts are the primary search corpus)
- PRD-019: Knowledge Graph Layer (KG topics/entities benefit from semantic matching)
- PRD-018: Database Projection (complementary — SQL for structured, vectors for semantic)
- Related Documents:
- GitHub #466 — GI + KG depth roadmap
- docs/architecture/PLATFORM_ARCHITECTURE_BLUEPRINT.md -- Platform context
- Related UX specs:
- UXS-005: Semantic Search -- search panel layout, advanced filters, result cards, insights modal
- UXS-001: GI/KG Viewer -- shared design system
Summary¶
Semantic Corpus Search adds meaning-based retrieval over the podcast corpus — insights, quotes, summaries, and transcript chunks — using sentence embeddings and a vector index. Users query in natural language and get ranked results with full GIL provenance (supporting quotes, timestamps, grounding status). This transforms GIL and KG from "write-heavy, read-weak" artifact stores into a navigable, question-driven knowledge layer.
Shipped in v2.6.0: FAISS-backed indexing, podcast search / podcast index, semantic
upgrade for gi explore when an index exists (RFC-061),
and local viewer + podcast serve search surfaces (RFC-062).
Not yet shipped: Qdrant-backed backend and platform-scale serving (RFC-070, Draft).
Background & Context¶
The podcast scraper pipeline produces rich structured artifacts: GIL insights with grounded
quotes (gi.json), KG entities and topics (kg.json), summaries with bullets, and full
transcripts. However, consumption is limited to exact-match filtering:
- gi explore --topic "AI Regulation" uses substring matching on insight text, so "Government AI Policy" or "tech oversight" won't match
- gi query maps fixed English patterns to the same explore path and is not semantic
- kg entities and kg topics roll up by exact string, so "Elon Musk" and "Musk" are separate entries
- There is no way to ask "what do my podcasts say about X?" across the corpus
The project already loads sentence-transformers (all-MiniLM-L6-v2) for GIL grounding
evidence (QA + NLI), but this capability is not exposed for user-facing search. The
megasketch (Part A.7) explicitly defers "optional search later."
Why now: Shallow v1 (GIL + KG) is hardening. The artifacts are stable, the CLI commands exist, and the evidence stack works. Semantic search is the highest-leverage "depth" feature because it:
- Immediately makes gi explore and gi query useful at scale (removes the ~100 episode ceiling)
- Unlocks RFC-050 UC4 (Semantic QA), which is explicitly deferred as "post-v1"
- Provides infrastructure for digest clustering (megasketch Part C), topic alignment, and entity resolution
- Reuses the embedding model already in the dependency tree
Goals¶
- Meaning-based retrieval: Users query in natural language and find relevant insights/quotes/summaries/transcript moments regardless of exact wording
- Evidence preservation: Search results carry full GIL provenance — grounding status, supporting quotes, timestamps, transcript spans
- CLI-first: Works as a CLI command with no server process, consistent with the project's "CLI stays first-class" constraint
- Incremental indexing: New episodes are added to the index without full rebuild; existing indexes grow as the corpus grows
- Cross-feed discovery: Find related content across different podcast feeds
- Abstracted backend: Vector store protocol supports FAISS (CLI/v1) and Qdrant (platform/service mode) behind the same interface
Non-Goals¶
- Full-text keyword search (BM25) or hybrid keyword+vector search (defer to later)
- Multi-tenant / hosted SaaS search API (platform product; out of scope). In scope and shipped: local FastAPI + Vue viewer search (RFC-062).
- RAG-style answer generation (results are existing artifacts, not generated text)
- Entity resolution or disambiguation (separate concern; embeddings help but don't solve)
- Real-time / streaming index updates (batch after pipeline run)
- Replacing gi explore or kg entities (search enhances them rather than replacing them)
- Multi-language support (English-only, matching current pipeline)
Personas¶
- Knowledge Worker / Researcher: Subscribes to 10-50 podcasts. Wants to ask "what have my shows said about X?" and get evidence-backed answers without scanning episode by episode.
  - Needs: meaning-based queries, ranked results, links to source material
  - Value: saves hours of manual searching; discovers connections across feeds
- Developer Building on the Corpus: Uses GIL/KG artifacts programmatically. Wants structured search results as JSON for downstream pipelines (RAG, agents, analytics).
  - Needs: CLI + JSON output, SearchResult contract, filter by type/feed/date
  - Value: enables evidence-backed RAG without hallucination
- Power User / Operator: Manages a large corpus. Wants to understand what the corpus covers, find gaps, validate quality across feeds.
  - Needs: index stats, cross-feed topic exploration, temporal filtering
  - Value: corpus intelligence without manual file scanning
User Stories¶
- As a researcher, I can search "AI regulation impact on startups" and get ranked insights with supporting quotes from across my podcast library so that I find relevant content by meaning, not exact keywords.
- As a developer, I can query the vector index programmatically and get structured SearchResult JSON with insight_id, episode_id, grounding_status, and supporting_quotes so that I can build evidence-backed applications.
- As a power user, I can filter search by feed, date range, speaker, and document type (insight/quote/summary/transcript) so that I narrow results to what matters.
- As an operator, I can run podcast index --stats to see how many vectors are indexed, which feeds are covered, and when the index was last updated so that I know my corpus is searchable.
- As a researcher, I can search "where was quantum computing discussed?" and get transcript chunks with timestamps so that I can jump to the exact moment in the audio.
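The developer story names four SearchResult fields. A minimal sketch of how such a payload could be serialized, assuming a flat dataclass: only insight_id, episode_id, grounding_status, and supporting_quotes come from this PRD, while the score field, the class layout, and the to_json helper are hypothetical and may differ from the shipped contract.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SearchResult:
    """Hypothetical shape; only the first four fields are named in the user story."""

    insight_id: str
    episode_id: str
    grounding_status: str
    supporting_quotes: list[str] = field(default_factory=list)
    score: float = 0.0  # assumed field: similarity score for ranking


def to_json(results: list[SearchResult]) -> str:
    """Serialize results as a JSON array for downstream pipelines."""
    return json.dumps([asdict(r) for r in results], indent=2)


# Illustrative values borrowed from the example output in this document.
example = SearchResult(
    insight_id="insight:episode:abc123:a1b2c3d4",
    episode_id="abc123",
    grounding_status="grounded",
    supporting_quotes=["Regulation will lag innovation by 3–5 years."],
    score=0.87,
)
print(to_json([example]))
```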
Functional Requirements¶
FR1: Vector Store Abstraction¶
- FR1.1: Define a VectorStore protocol with upsert(), batch_upsert(), search(), delete(), persist(), and stats() operations
- FR1.2: Implement FaissVectorStore using faiss-cpu for CLI/local use (Phase 1)
- FR1.3: Index persisted as files on disk (vectors.faiss + metadata sidecar)
- FR1.4: Metadata sidecar stores document type, episode_id, feed_id, publish_date, speaker_id, grounding status, and source references per vector
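The protocol in FR1.1 can be sketched with typing.Protocol. The six operation names come from FR1.1; every signature, the metadata key names, and the InMemoryVectorStore class are assumptions for illustration. The toy store is a brute-force stand-in, not FaissVectorStore.

```python
from typing import Any, Protocol, Sequence

Item = tuple[str, Sequence[float], dict[str, Any]]


class VectorStore(Protocol):
    """Operation names follow FR1.1; all signatures are assumed."""

    def upsert(self, doc_id: str, vector: Sequence[float], metadata: dict[str, Any]) -> None: ...
    def batch_upsert(self, items: Sequence[Item]) -> None: ...
    def search(self, query: Sequence[float], top_k: int = 10) -> list[tuple[str, float]]: ...
    def delete(self, doc_id: str) -> None: ...
    def persist(self) -> None: ...
    def stats(self) -> dict[str, Any]: ...


class InMemoryVectorStore:
    """Toy brute-force store that satisfies the protocol structurally."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict[str, Any]]] = {}

    def upsert(self, doc_id, vector, metadata):
        self._rows[doc_id] = (list(vector), dict(metadata))

    def batch_upsert(self, items):
        for doc_id, vector, metadata in items:
            self.upsert(doc_id, vector, metadata)

    def search(self, query, top_k=10):
        # Inner product; equals cosine similarity when vectors are unit-normalized.
        scored = [
            (doc_id, sum(q * v for q, v in zip(query, vec)))
            for doc_id, (vec, _meta) in self._rows.items()
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

    def delete(self, doc_id):
        self._rows.pop(doc_id, None)

    def persist(self):
        pass  # a file-backed store would write vectors and the metadata sidecar here

    def stats(self):
        return {"vector_count": len(self._rows)}


store: VectorStore = InMemoryVectorStore()
store.batch_upsert([
    ("a", [1.0, 0.0], {"doc_type": "insight"}),
    ("b", [0.0, 1.0], {"doc_type": "quote"}),
])
best_id, best_score = store.search([0.9, 0.1], top_k=1)[0]
```

Because the protocol is structural, a Qdrant-backed class would only need to expose the same six methods; callers typed against VectorStore would not change.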
FR2: Embedding and Indexing Pipeline Stage¶
- FR2.1: New pipeline stage runs after GIL (or independently when vector_search: true)
- FR2.2: Embeds and indexes four document types: GIL Insight text, GIL Quote text, summary bullets, and transcript chunks
- FR2.3: Transcript chunking uses sentence-boundary windows (~300 tokens, ~50 token overlap) with char_start, char_end, and timestamp_start_ms per chunk
- FR2.4: Incremental: only embeds new/changed episodes; skips already-indexed episodes (tracked by episode_id + content hash in metadata)
- FR2.5: Uses the same embedding model as GIL grounding (all-MiniLM-L6-v2 default, configurable via gi_embedding_model)
- FR2.6: Index location defaults to <output_dir>/search/ (configurable)
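The sentence-boundary windowing in FR2.3 can be sketched as follows. This is an illustration under stated assumptions, not the shipped chunker: tokens are approximated by whitespace-separated words, sentences are split on ., ! and ?, and timestamp_start_ms is omitted because it depends on segment timing data not modeled here. The names Chunk, sentence_spans, and chunk_transcript are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    char_start: int
    char_end: int


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of sentences, trimmed of whitespace."""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        start, end = match.span()
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            spans.append((start, end))
    return spans


def chunk_transcript(text: str, size_tokens: int = 300, overlap_tokens: int = 50) -> list[Chunk]:
    spans = sentence_spans(text)
    counts = [len(text[s:e].split()) for s, e in spans]  # word count as a token proxy
    chunks: list[Chunk] = []
    i = 0
    while i < len(spans):
        j, total = i, 0
        # Grow the window sentence by sentence; always take at least one sentence.
        while j < len(spans) and (j == i or total + counts[j] <= size_tokens):
            total += counts[j]
            j += 1
        start, end = spans[i][0], spans[j - 1][1]
        chunks.append(Chunk(text[start:end], start, end))
        if j >= len(spans):
            break
        # Step back over trailing sentences to form the overlap, advancing by >= 1.
        k, back = j, 0
        while k - 1 > i and back + counts[k - 1] <= overlap_tokens:
            back += counts[k - 1]
            k -= 1
        i = k
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
demo_chunks = chunk_transcript(sample, size_tokens=6, overlap_tokens=3)
```

Cutting only at sentence boundaries keeps each chunk readable on its own, at the cost of windows that are somewhat shorter or longer than the nominal size.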
FR3: Search CLI Command¶
- FR3.1: podcast search "<query>": encode query, search index, return ranked results
- FR3.2: Filter flags: --type insight|quote|summary|transcript, --feed <name>, --since <date>, --speaker <name>, --grounded-only
- FR3.3: --top-k <n> (default 10) controls result count
- FR3.4: --format json|pretty (default pretty) for output format
- FR3.5: Results include: rank, similarity score, document text, document type, episode title, episode_id, feed, publish_date, and source references
- FR3.6: For Insight results, include grounded status and supporting_quotes summary. Per-quote speaker_id / speaker_name mirror .gi.json and segment/diarization rules (often absent without diarization; see GitHub #541 and Development Guide: GI quote speaker_id).
- FR3.7: For transcript chunk results, include timestamp_start_ms and char_start. Optional RFC-072 lifted enrichment follows the same GI speaker contract for lifted.speaker / lifted.quote (see Semantic Search Guide: quote speaker fields and #541 link above).
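The flags in FR3.2 through FR3.4 map naturally onto an argument parser. The sketch below uses argparse purely for illustration; whether the project's CLI is built on argparse is an assumption, and build_search_parser and the doc_type destination name are hypothetical.

```python
import argparse


def build_search_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags and defaults listed in FR3."""
    parser = argparse.ArgumentParser(prog="podcast search")
    parser.add_argument("query", help="natural-language query")
    parser.add_argument(
        "--type",
        dest="doc_type",
        choices=["insight", "quote", "summary", "transcript"],
    )
    parser.add_argument("--feed", metavar="NAME")
    parser.add_argument("--since", metavar="DATE")
    parser.add_argument("--speaker", metavar="NAME")
    parser.add_argument("--grounded-only", action="store_true")
    parser.add_argument("--top-k", type=int, default=10)
    parser.add_argument("--format", choices=["json", "pretty"], default="pretty")
    return parser


args = build_search_parser().parse_args(
    ["AI regulation impact on startups", "--type", "insight", "--grounded-only"]
)
```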
FR4: Index Management CLI¶
- FR4.1: podcast index --rebuild: full re-index (e.g. after model change)
- FR4.2: podcast index --stats: show vector count, document type breakdown, feeds indexed, last updated timestamp, embedding model, index size on disk
- FR4.3: Auto-detect embedding model mismatch between index metadata and config; warn user and suggest --rebuild
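The mismatch check in FR4.3 amounts to comparing the model name recorded at index time with the configured one. A minimal sketch, assuming the index metadata is a dict with an embedding_model key (the key name and the check_model_mismatch function are hypothetical):

```python
from typing import Optional


def check_model_mismatch(index_meta: dict, configured_model: str) -> Optional[str]:
    """Return a warning string when the index was built with a different model."""
    indexed_model = index_meta.get("embedding_model")  # assumed metadata key
    if indexed_model is not None and indexed_model != configured_model:
        return (
            f"Index was built with '{indexed_model}' but config requests "
            f"'{configured_model}'; run: podcast index --rebuild"
        )
    return None


warning = check_model_mismatch(
    {"embedding_model": "all-MiniLM-L6-v2"}, "bge-base-en-v1.5"
)
```

Vectors from different embedding models are not comparable, which is why a mismatch calls for a rebuild instead of an incremental update.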
FR5: Enhanced gi explore and gi query¶
- FR5.1: When a vector index exists, gi explore --topic <query> uses semantic matching instead of substring matching (transparent upgrade)
- FR5.2: When no index exists, fall back to current substring matching (backward compatible)
- FR5.3: gi query semantic QA uses the vector index for real meaning-based question answering (RFC-050 UC4)
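The transparent upgrade in FR5.1 and FR5.2 is a dispatch on index availability. In this sketch the explore_topic function and the semantic_search callable are hypothetical stand-ins; the callable represents whatever wraps the vector index, and None means no index was found.

```python
from typing import Callable, Optional, Sequence


def explore_topic(
    topic: str,
    insights: Sequence[str],
    semantic_search: Optional[Callable[[str], Sequence[str]]] = None,
) -> list[str]:
    """Use semantic matching when an index-backed search is available."""
    if semantic_search is not None:
        return list(semantic_search(topic))
    # Fallback: case-insensitive substring matching on insight text.
    needle = topic.lower()
    return [text for text in insights if needle in text.lower()]


insights = [
    "AI Regulation will lag innovation",
    "Government AI policy is tightening",
]
substring_hits = explore_topic("AI Regulation", insights)
```

With no index, only the first insight matches; a semantic backend could also return the second, which is the paraphrase case described in Background & Context.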
FR6: Configuration¶
- FR6.1: vector_search: bool (default false); enables embedding and indexing
- FR6.2: vector_backend: faiss | qdrant (default faiss)
- FR6.3: vector_index_path: str | null (auto: <output_dir>/search/)
- FR6.4: vector_index_types: list of document types to index (default: [insights, quotes, summary_bullets, transcript_chunks])
- FR6.5: vector_chunk_size_tokens: int (default 300) and vector_chunk_overlap_tokens: int (default 50) for transcript chunking
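The FR6 fields and defaults can be summarized as a small config object. The field names and defaults below come from FR6; the VectorSearchConfig class, its validation rules, and resolved_index_path are assumptions for illustration and may not match how the project's config model is actually defined.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VectorSearchConfig:
    """Hypothetical container for the FR6 fields with their documented defaults."""

    vector_search: bool = False
    vector_backend: str = "faiss"
    vector_index_path: Optional[str] = None
    vector_index_types: list[str] = field(
        default_factory=lambda: ["insights", "quotes", "summary_bullets", "transcript_chunks"]
    )
    vector_chunk_size_tokens: int = 300
    vector_chunk_overlap_tokens: int = 50

    def __post_init__(self) -> None:
        # Assumed validation: the PRD lists the allowed backends but no explicit rules.
        if self.vector_backend not in ("faiss", "qdrant"):
            raise ValueError(f"unknown vector_backend: {self.vector_backend}")
        if not 0 <= self.vector_chunk_overlap_tokens < self.vector_chunk_size_tokens:
            raise ValueError("overlap must be non-negative and smaller than chunk size")

    def resolved_index_path(self, output_dir: str) -> str:
        """Apply the 'auto' rule from FR6.3 when no explicit path is configured."""
        return self.vector_index_path or f"{output_dir.rstrip('/')}/search/"


cfg = VectorSearchConfig(vector_search=True)
index_path = cfg.resolved_index_path("output")
```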
Success Metrics¶
- podcast search returns semantically relevant results for paraphrased queries (manual validation on 20+ query/corpus pairs)
- Synonym / rephrase queries match content that substring search misses (e.g. "government AI policy" finds insights labeled "AI Regulation")
- Search latency < 100 ms for a 100K-vector corpus on standard hardware
- Index build time < 60 seconds for 500 episodes (incremental: < 5 seconds per new episode)
- Zero regression on existing gi explore / gi query behavior when no index present
- Index size < 500 MB for 500 episodes (with IVF-PQ compression at scale)
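A back-of-envelope check on the size target: an uncompressed (flat) float32 index stores 384 dimensions x 4 bytes = 1,536 bytes per vector for all-MiniLM-L6-v2. The sketch below works through that arithmetic; it ignores the metadata sidecar and any index overhead, and the function names are hypothetical.

```python
DIM = 384               # embedding width of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4


def flat_index_bytes(n_vectors: int) -> int:
    """Raw vector storage for an uncompressed float32 index."""
    return n_vectors * DIM * BYTES_PER_FLOAT32


def max_vectors_per_episode(budget_bytes: int, episodes: int) -> int:
    """Average vectors per episode that still fit the size budget."""
    return budget_bytes // (DIM * BYTES_PER_FLOAT32 * episodes)


size_100k = flat_index_bytes(100_000)                        # 153,600,000 bytes
size_1m = flat_index_bytes(1_000_000)                        # 1,536,000,000 bytes
per_episode = max_vectors_per_episode(500_000_000, 500)      # 651
```

Under these assumptions the 500 MB target holds without compression as long as an episode averages no more than about 650 vectors, while the ~1M-vector scale in the assumptions would need roughly 1.5 GB uncompressed, which is where IVF-PQ compression becomes relevant.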
Dependencies¶
- PRD-017 / RFC-049: GIL artifacts (gi.json) provide Insight and Quote text to index
- PRD-005: Summaries provide bullet text to index
- PRD-001: Transcripts provide text for chunked indexing
- sentence-transformers (already in dependency tree)
- faiss-cpu (new dependency, ~20 MB)
Constraints & Assumptions¶
Constraints:
- Must work without a server process (CLI-first; FAISS in-process)
- Must not increase pipeline runtime by more than 30% when enabled
- Must reuse the existing embedding_loader.py / all-MiniLM-L6-v2 (no new large model downloads unless user explicitly configures a different model)
- Index is append-friendly; full rebuild only required on embedding model change
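The append-friendly constraint pairs with the episode_id + content hash tracking in FR2.4. A minimal sketch of that skip logic, assuming SHA-256 over the episode's indexable texts and a plain dict of previously indexed hashes; the hash choice, the data shapes, and both function names are assumptions.

```python
import hashlib
from typing import Mapping, Sequence


def content_hash(texts: Sequence[str]) -> str:
    """Stable digest over an episode's indexable texts."""
    digest = hashlib.sha256()
    for text in texts:
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab"] and ["a", "b"] hash differently
    return digest.hexdigest()


def episodes_to_index(
    corpus: Mapping[str, Sequence[str]],
    indexed_hashes: Mapping[str, str],
) -> list[str]:
    """Return episode_ids that are new or whose content changed since last indexing."""
    return [
        episode_id
        for episode_id, texts in corpus.items()
        if indexed_hashes.get(episode_id) != content_hash(texts)
    ]


corpus = {"ep1": ["insight a", "quote b"], "ep2": ["insight c"], "ep3": ["insight d"]}
indexed = {"ep1": content_hash(["insight a", "quote b"]), "ep2": content_hash(["old text"])}
pending = episodes_to_index(corpus, indexed)
```

Here ep1 is skipped (unchanged), ep2 is re-embedded (content changed), and ep3 is embedded for the first time.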
Assumptions:
- Corpus scale for CLI use is 10-50 feeds, up to ~5,000 episodes (1M+ vectors with transcript chunks)
- Users have existing GIL and/or summary artifacts before enabling search
- English-only content (matching current pipeline language support)
Design Considerations¶
Vector Backend: FAISS vs Qdrant¶
- Decision: FAISS for Phase 1 (CLI); Qdrant for Phase 2 (platform/service)
- Rationale: FAISS is in-process, no server, minimal dependency. Qdrant adds built-in metadata filtering and native upserts but requires a server for production use. The VectorStore protocol abstracts the choice.
Index Scope: Per-Feed vs Global¶
- Decision: Global corpus index with feed metadata for filtering
- Rationale: Cross-feed discovery is a primary use case. Per-feed indexes would require multi-index search for cross-feed queries.
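One way a single global index can still serve feed-scoped queries is to over-fetch from the vector index and then filter hits against the metadata sidecar. The sketch below assumes the sidecar is a dict keyed by vector id with feed_id and doc_type keys (doc_type is a hypothetical key name); this document does not state whether the shipped FaissVectorStore filters this way.

```python
from typing import Mapping, Optional, Sequence


def filter_hits(
    hits: Sequence[tuple[int, float]],
    metadata: Mapping[int, Mapping[str, str]],
    top_k: int,
    feed: Optional[str] = None,
    doc_type: Optional[str] = None,
) -> list[tuple[int, float]]:
    """Keep the first top_k hits (already sorted best-first) that pass the filters."""
    kept: list[tuple[int, float]] = []
    for vector_id, score in hits:
        meta = metadata[vector_id]
        if feed is not None and meta.get("feed_id") != feed:
            continue
        if doc_type is not None and meta.get("doc_type") != doc_type:
            continue
        kept.append((vector_id, score))
        if len(kept) == top_k:
            break
    return kept


hits = [(0, 0.9), (1, 0.8), (2, 0.7)]
sidecar = {
    0: {"feed_id": "pivot", "doc_type": "insight"},
    1: {"feed_id": "a16z", "doc_type": "insight"},
    2: {"feed_id": "pivot", "doc_type": "quote"},
}
pivot_hits = filter_hits(hits, sidecar, top_k=2, feed="pivot")
```

The trade-off is that a restrictive filter may need a larger over-fetch to fill top_k, which is one reason built-in metadata filtering is listed as a Qdrant advantage.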
Embedding Model: Current vs Upgrade¶
- Decision: Keep all-MiniLM-L6-v2 for Phase 1; evaluate upgrades for Phase 2
- Rationale: The model is already loaded for GIL grounding, so the marginal cost is zero. Its 384-dim vectors are efficient for FAISS. An upgrade path (e.g. bge-base-en-v1.5) is available via config when needed.
Integration with GIL and KG¶
Semantic Corpus Search enhances the GIL and KG layers by adding a discovery dimension:
- GIL Integration: Search results for Insights carry full GIL provenance (grounding, quotes, timestamps). gi explore transparently upgrades to semantic matching when an index exists. UC4 (Semantic QA) becomes functional.
- KG Integration: Topic and entity labels can be semantically matched across episodes, improving the quality of kg entities and kg topics roll-ups without requiring entity resolution.
- Digest Integration: Embedding similarity enables the weekly digest clustering described in the platform megasketch (Part C): group similar Insights, deduplicate cross-feed coverage, rank by novelty.
Example Output¶
$ podcast search "impact of AI regulation on startups"
Search: "impact of AI regulation on startups"
Index: 47,832 vectors | Model: all-MiniLM-L6-v2 | Last updated: 2026-04-01
Results (top 5):
1. [insight] score=0.87 | GROUNDED
"AI regulation will significantly lag behind the pace of innovation"
Episode: AI Regulation (The Journal) | 2026-02-03
Quotes: 2 supporting → "Regulation will lag innovation by 3–5 years." (Sam Altman, 2:00)
→ gi show-insight --id insight:episode:abc123:a1b2c3d4
2. [insight] score=0.82 | GROUNDED
"Startups face disproportionate compliance burden compared to big tech"
Episode: Startup Policy (The Information) | 2026-01-15
Quotes: 1 supporting → "Small companies can't afford compliance teams..." (CEO, 14:30)
3. [summary] score=0.79
"European AI Act creates new requirements for high-risk applications"
Episode: Tech Policy Update (Pivot) | 2026-03-10
4. [transcript] score=0.74
"...the real question is whether startups can even survive the
regulatory overhead. We're seeing companies move to jurisdictions..."
Episode: Innovation vs Regulation (a16z) | 2026-02-20 | timestamp: 23:45
5. [quote] score=0.71 | speaker: Mary Smith
"Every dollar spent on compliance is a dollar not spent on R&D"
Episode: Founder Stories (YC) | 2026-03-01 | timestamp: 8:12
Related Work¶
- Issue #466: Post-v1 depth backlog (NL consumption, richer aggregation)
- PRD-017: Grounded Insight Layer
- PRD-019: Knowledge Graph Layer
- RFC-050: GIL Use Cases (UC4 Semantic QA, UC5 Insight Explorer)
Release Checklist¶
Tracking: #485 (foundation prerequisites), #484 (Phase 1 implementation), #489 (viewer v2 / RFC-062), epic #466
Phase 1 + local viewer (v2.6.0) — satisfied:
- [x] PRD-021 reviewed (this document reflects shipped scope)
- [x] RFC-061 created with technical design
- [x] VectorStore protocol + FaissVectorStore implemented
- [x] Transcript chunker implemented
- [x] Embed-and-index pipeline stage implemented
- [x] podcast search CLI command implemented
- [x] podcast index CLI command implemented
- [x] gi explore semantic upgrade implemented
- [x] Config fields added (vector_search, vector_backend, etc.)
- [x] Unit + integration tests cover search round-trip
- [x] Documentation updated (README, config examples, guides)
- [x] make ci-fast passes with search enabled and disabled
Future: RFC-070 (Qdrant, pgvector, re-ranking), optional BM25/hybrid search — not required to keep this PRD in Implemented status for v2.6.0.