RFC-061: Semantic Corpus Search (FAISS, Shipped)
- Status: Completed (v2.6.0). Shipped: FaissVectorStore, embed-and-index stage, podcast search / podcast index, semantic gi explore --topic when an index exists, VectorStore protocol (ADR-060). Deferred / platform follow-ups (Qdrant, pgvector, re-ranking, digest-vector fusion) live in RFC-070.
- Authors: Podcast Scraper Team
- Stakeholders: Core team, GIL/KG consumers, downstream API/digest users
- Related PRDs:
  - PRD-021: Semantic Corpus Search
  - PRD-017: Grounded Insight Layer
  - PRD-019: Knowledge Graph Layer
- Related RFCs:
  - RFC-070: Semantic corpus search, platform & future: Qdrant, scale, hybrid quality (Draft)
  - RFC-049: GIL core
  - RFC-050: GIL use cases
  - RFC-051: Database projection
  - RFC-055: KG core
  - RFC-056: KG use cases
- Related Documents:
  - GitHub #466
  - GIL ontology
  - Semantic search guide
- Updated: 2026-04-11 (split deferred scope to RFC-070)
Abstract
This RFC defines the shipped technical design for Semantic Corpus Search (Phase 1):
a vector index over GIL insights, quotes, summary bullets, transcript chunks, and (when
enabled) KG surfaces — meaning-based retrieval across the podcast corpus using FAISS
(FaissVectorStore) and JSON sidecars under <output_dir>/search/. It introduces the
VectorStore protocol so other backends can be added later without changing CLI or HTTP
callers (RFC-070 tracks Qdrant, pgvector,
re-ranking, and platform-scale choices). Search results preserve GIL provenance — grounding,
quotes, timestamps, and transcript references where applicable.
Architecture alignment: Additive only. No change to gi.json, kg.json, summaries, or
transcript files. Optional pipeline stage (embed-and-index), podcast search / podcast index,
and transparent semantic upgrade for gi explore --topic when an index is present. FastAPI
/api/search (viewer) is a thin wrapper over the same VectorStore.search() path.
Problem Statement
GIL and KG produce rich, structured artifacts per episode, but consumption is limited to exact-match and substring filtering:
- gi explore --topic "AI Regulation" does key in insight_text.lower(), so it misses "Government AI Policy," "tech oversight," "regulatory impact"
- gi query maps regex patterns to the same substring path, so it is not semantic
- kg entities / kg topics match by exact string, so "Elon Musk" and "Musk" are separate
- No cross-corpus question like "what do my podcasts say about X?"
- RFC-050 explicitly defers UC4 (Semantic QA) as "post-v1, after Insight Explorer validated"
- gi explore hits a ~100 episode performance ceiling (file scan)
The project already loads sentence-transformers (all-MiniLM-L6-v2) and has
embedding_loader.py with encode() and cosine_similarity() — but these are only used
for GIL grounding and eval metrics, not user-facing search.
Use Cases:
- Cross-Corpus Semantic Search: "What do my podcasts say about supply chain disruptions?" — finds insights about logistics, shipping delays, port congestion across feeds
- Transcript Deep-Dive: "Where was quantum computing discussed?" — returns timestamped chunks even if the speaker said "qubits" or "quantum advantage"
- Evidence-Backed Discovery: Search returns GIL Insight nodes with their full provenance chain (Insight → Quotes → transcript spans → timestamps) — not generated text
- Semantic gi explore Upgrade: gi explore --topic "climate" matches insights about "global warming," "carbon emissions," "net zero" without explicit topic labels
- Digest Clustering: Weekly digest groups similar Insight embeddings to find themes and deduplicate cross-feed coverage
Goals
- VectorStore protocol: Backend-agnostic interface; FAISS implementation shipped in this RFC; additional backends are tracked in RFC-070
- Embed-and-index pipeline stage: Produces and maintains a vector index as part of the pipeline, incremental by default
- podcast search CLI: Meaning-based corpus queries with filtering and structured output
- Transparent gi explore upgrade: Semantic matching when index available, substring fallback when not
- Reuse existing infrastructure: Same embedding model, same embedding_loader.py, same output directory conventions
Constraints & Assumptions
Constraints:
- CLI-first: no server process required (FAISS in-process for Phase 1)
- Must not break existing behavior when vector_search is disabled (default: false)
- Pipeline runtime increase < 30% when indexing is enabled
- Index files live alongside corpus outputs (no external database for Phase 1)
- Must work with the existing all-MiniLM-L6-v2 model (384-dim)
Assumptions:
- Corpus scale: 10-50 feeds, up to ~5,000 episodes (~1.2M vectors with transcript chunks)
- GIL and/or summary artifacts exist before search is enabled
- English-only content
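As a rough sanity check on the upper-bound assumption, the sketch below estimates raw vector storage for an uncompressed index. It assumes float32 values (4 bytes each) and ignores ID-map and metadata sidecar overhead; estimate_flat_index_bytes is a hypothetical helper, not part of the codebase.

```python
def estimate_flat_index_bytes(
    num_vectors: int, dim: int = 384, bytes_per_value: int = 4
) -> int:
    """Raw storage for uncompressed vectors (no ID map, no metadata)."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    # ~1.2M vectors at 384 dims, the upper bound assumed above.
    upper_bound = estimate_flat_index_bytes(1_200_000)
    print(f"{upper_bound / 1e9:.2f} GB")  # prints "1.84 GB"
```

At that size an exact flat index is still feasible in memory on many machines, which is consistent with reserving compressed index types for the largest corpora.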
Design & Implementation
1. VectorStore Protocol
A minimal protocol implemented by FaissVectorStore in v2.6; future backends (e.g. Qdrant)
must satisfy the same contract (ADR-060,
RFC-070):
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class SearchResult:
    doc_id: str
    score: float
    metadata: dict


@dataclass
class IndexStats:
    total_vectors: int
    doc_type_counts: dict[str, int]
    feeds_indexed: list[str]
    embedding_model: str
    embedding_dim: int
    last_updated: str
    index_size_bytes: int


class VectorStore(Protocol):
    def upsert(
        self, doc_id: str, embedding: list[float], metadata: dict
    ) -> None: ...

    def batch_upsert(
        self, doc_ids: list[str], embeddings: list[list[float]],
        metadata_list: list[dict]
    ) -> None: ...

    def search(
        self, query_embedding: list[float], top_k: int = 10,
        filters: dict | None = None
    ) -> list[SearchResult]: ...

    def delete(self, doc_ids: list[str]) -> None: ...

    def persist(self) -> None: ...

    def stats(self) -> IndexStats: ...
Key design decisions:
- doc_id is a string like insight:<episode_id>:<hash> or chunk:<episode_id>:<index>: stable, deterministic, and aligned with GIL/KG ID conventions
- metadata is a flat dict with known keys (doc_type, episode_id, feed_id, publish_date, speaker_id, grounded, char_start, char_end, timestamp_start_ms, source_id)
- filters is a dict of metadata field → value (or list of values) for pre-/post-filtering
- batch_upsert for efficient bulk indexing (FAISS benefits from batched adds)
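To make the contract concrete, here is a minimal brute-force store of the kind that could back unit tests. It is a sketch, not the shipped FaissVectorStore: InMemoryVectorStore and its helpers are hypothetical names, it implements only upsert, batch_upsert, search, and delete (persist and stats are omitted), and it assumes filters use equality or list-membership semantics.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class SearchResult:
    doc_id: str
    score: float
    metadata: dict


def _normalize(vec: list[float]) -> list[float]:
    """Scale to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def _matches(metadata: dict, filters: dict) -> bool:
    """Each filter value is a single allowed value or a list of allowed values."""
    for key, wanted in filters.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if metadata.get(key) not in allowed:
            return False
    return True


class InMemoryVectorStore:
    """Hypothetical brute-force store; covers a subset of the protocol."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, doc_id: str, embedding: list[float], metadata: dict) -> None:
        # Re-using a doc_id overwrites the old row, so re-indexing is idempotent.
        self._rows[doc_id] = (_normalize(embedding), dict(metadata))

    def batch_upsert(
        self, doc_ids: list[str], embeddings: list[list[float]],
        metadata_list: list[dict],
    ) -> None:
        if not (len(doc_ids) == len(embeddings) == len(metadata_list)):
            raise ValueError("doc_ids, embeddings and metadata_list differ in length")
        for doc_id, embedding, metadata in zip(doc_ids, embeddings, metadata_list):
            self.upsert(doc_id, embedding, metadata)

    def search(
        self, query_embedding: list[float], top_k: int = 10,
        filters: dict | None = None,
    ) -> list[SearchResult]:
        query = _normalize(query_embedding)
        hits = []
        for doc_id, (vec, metadata) in self._rows.items():
            if filters and not _matches(metadata, filters):
                continue
            score = sum(q * v for q, v in zip(query, vec))
            hits.append(SearchResult(doc_id, score, metadata))
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits[:top_k]

    def delete(self, doc_ids: list[str]) -> None:
        for doc_id in doc_ids:
            self._rows.pop(doc_id, None)
```

Because callers depend only on the method signatures, a store like this can stand in for the FAISS-backed one in tests without touching CLI code.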
2. FaissVectorStore Implementation
src/podcast_scraper/search/
    __init__.py
    protocol.py      # VectorStore protocol + SearchResult + IndexStats
    faiss_store.py   # FaissVectorStore implementation
    chunker.py       # Transcript chunking
    indexer.py       # Embed-and-index pipeline logic
Index structure on disk:
<output_dir>/search/
    vectors.faiss     # FAISS IndexIDMap wrapping IndexFlatIP (or IVF-PQ at scale)
    metadata.json     # doc_id → metadata mapping (or .sqlite for large corpora)
    index_meta.json   # embedding_model, dim, created_at, last_updated, version
FAISS index type selection:
| Corpus size | Index type | Notes |
|---|---|---|
| < 100K vectors | IndexFlatIP wrapped in IndexIDMap | Exact search; fast enough |
| 100K - 1M vectors | IndexIVFFlat (nlist=256) + IndexIDMap | ~10x faster, slight accuracy loss |
| > 1M vectors | IndexIVFPQ (nlist=1024, m=48) | Compressed; ~20x faster |
Auto-selection based on vector count at persist time. Rebuild threshold configurable.
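The selection rule in the table can be sketched as a small pure function. select_index and IndexChoice are hypothetical names, and the handling of the exact boundaries (100K goes to IVFFlat, 1M stays on IVFFlat) is an assumption, since the table does not say which side they fall on.

```python
from typing import NamedTuple, Optional


class IndexChoice(NamedTuple):
    index_type: str
    nlist: Optional[int] = None  # number of IVF cells, if applicable
    m: Optional[int] = None      # number of PQ sub-quantizers, if applicable


def select_index(num_vectors: int) -> IndexChoice:
    """Map a vector count to the index type listed in the table above."""
    if num_vectors < 100_000:
        return IndexChoice("IndexFlatIP")
    if num_vectors <= 1_000_000:
        return IndexChoice("IndexIVFFlat", nlist=256)
    return IndexChoice("IndexIVFPQ", nlist=1024, m=48)
```

Note that m=48 divides the 384-dimensional embedding evenly (8 dimensions per sub-quantizer), which product quantization requires.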
Metadata filtering (FAISS):
FAISS has no built-in metadata filtering. Strategy:
- FAISS search returns top k * 3 candidates (over-fetch)
- Post-filter by metadata predicates (type, feed, date, speaker, grounded)
- Return top k from filtered set
- If fewer than k results after filtering, warn user
This is simple and sufficient for CLI-scale corpora. Native payload filtering (e.g. Qdrant) is out of scope here — see RFC-070.
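The over-fetch and post-filter strategy can be sketched without FAISS by treating the raw index search as a callable. Everything here is illustrative: filtered_search and raw_search are hypothetical names, hits are modeled as (doc_id, score, metadata) tuples already sorted by score, and only equality or list-membership predicates are shown (a date filter such as --since would need a range comparison instead).

```python
from __future__ import annotations

import warnings
from typing import Callable

Hit = tuple[str, float, dict]  # (doc_id, score, metadata)


def _matches(metadata: dict, filters: dict) -> bool:
    for key, wanted in filters.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if metadata.get(key) not in allowed:
            return False
    return True


def filtered_search(
    raw_search: Callable[[list[float], int], list[Hit]],
    query_embedding: list[float],
    top_k: int = 10,
    filters: dict | None = None,
    overfetch: int = 3,
) -> list[Hit]:
    """Over-fetch top_k * overfetch candidates, post-filter, keep top_k."""
    candidates = raw_search(query_embedding, top_k * overfetch)
    if filters:
        candidates = [hit for hit in candidates if _matches(hit[2], filters)]
    results = candidates[:top_k]
    if filters and len(results) < top_k:
        warnings.warn(
            f"only {len(results)} of {top_k} requested results matched the filters"
        )
    return results
```

The trade-off is visible in the code: a selective filter can discard most of the over-fetched candidates, which is why the warning exists and why native filtering is left to a different backend.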
3. Transcript Chunker
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TranscriptChunk:
    text: str
    chunk_index: int
    char_start: int
    char_end: int
    timestamp_start_ms: int | None
    timestamp_end_ms: int | None


def chunk_transcript(
    text: str,
    target_tokens: int = 300,
    overlap_tokens: int = 50,
    timestamps: list[dict] | None = None,
) -> list[TranscriptChunk]:
    """Split transcript into overlapping sentence-boundary chunks."""
    ...
Strategy:
- Split text into sentences (simple regex: (?<=[.!?])\s+ with fallback on \n)
- Group sentences into chunks targeting target_tokens (estimated via whitespace split)
- Overlap: carry last ~overlap_tokens worth of sentences from previous chunk
- Track char_start / char_end per chunk
- If timestamps provided (from Whisper segments), interpolate timestamp_start_ms / timestamp_end_ms per chunk based on character position alignment
Why sentence boundaries: Avoids splitting mid-sentence, which degrades embedding quality. Simpler and more predictable than recursive splitting. No external dependency.
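The strategy above can be sketched in plain Python as follows. This is an illustration under assumptions, not the shipped chunker.py: chunk_transcript_sketch, Chunk, and sentence_spans are hypothetical names, timestamp interpolation is left out, and a single sentence longer than target_tokens is kept whole rather than split.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Sentence boundary: whitespace after ., ! or ?, with newlines as a fallback.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n+")


@dataclass
class Chunk:
    text: str
    chunk_index: int
    char_start: int
    char_end: int


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character spans of non-empty sentences."""
    spans, start = [], 0
    for match in _BOUNDARY.finditer(text):
        if text[start:match.start()].strip():
            spans.append((start, match.start()))
        start = match.end()
    if text[start:].strip():
        spans.append((start, len(text)))
    return spans


def chunk_transcript_sketch(
    text: str, target_tokens: int = 300, overlap_tokens: int = 50
) -> list[Chunk]:
    spans = sentence_spans(text)
    # Token counts are estimated by whitespace splitting, as in the strategy.
    counts = [len(text[s:e].split()) for s, e in spans]
    chunks: list[Chunk] = []
    i = 0
    while i < len(spans):
        # Grow the window [i, j); always take at least one sentence.
        j, total = i, 0
        while j < len(spans) and (j == i or total + counts[j] <= target_tokens):
            total += counts[j]
            j += 1
        start, end = spans[i][0], spans[j - 1][1]
        chunks.append(Chunk(text[start:end], len(chunks), start, end))
        if j >= len(spans):
            break
        # Step back over trailing sentences worth up to overlap_tokens,
        # but never back to i, so every iteration makes progress.
        k, carried = j, 0
        while k - 1 > i and carried + counts[k - 1] <= overlap_tokens:
            k -= 1
            carried += counts[k]
        i = k
    return chunks
```

Because each chunk is a contiguous slice of the input, text[chunk.char_start:chunk.char_end] always reproduces chunk.text, which is what makes later timestamp alignment by character position possible.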
4. Embed-and-Index Pipeline Stage
When it runs:
Existing: RSS → download → transcribe → metadata → summarize → GIL → (KG)
New stage: → embed & index
Runs after GIL and KG (if enabled), or after summarization if GIL is not enabled.
Trigger: vector_search: true in config.
What gets indexed:
| Document type | Source | doc_id pattern | Requires |
|---|---|---|---|
| Insight | gi.json → insight.properties.text | insight:<episode_id>:<hash> | generate_gi: true |
| Quote | gi.json → quote.properties.text | quote:<episode_id>:<hash> | generate_gi: true |
| Summary bullet | SummarySchema.bullets[i] | bullet:<episode_id>:<i> | generate_summaries: true |
| Transcript chunk | Transcript file → chunked | chunk:<episode_id>:<i> | Transcript on disk |
Incremental logic:
- Load index_meta.json → get set of already-indexed episode IDs + content hashes
- For each episode in the output directory:
  a. Compute content hash (SHA-256 of gi.json + summary + transcript paths)
  b. If hash matches → skip (already indexed)
  c. If new or changed → delete old vectors for that episode, embed new content, upsert
- Persist updated index
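The skip-or-reindex decision can be sketched as two small functions. This assumes the hash is taken over artifact bytes and that index_meta.json yields a mapping of episode ID to stored hash; both are assumptions for illustration, and content_hash and plan_incremental_update are hypothetical names.

```python
from __future__ import annotations

import hashlib
from typing import Iterable


def content_hash(parts: Iterable[bytes]) -> str:
    """SHA-256 over several artifacts, length-prefixed to avoid ambiguity."""
    digest = hashlib.sha256()
    for part in parts:
        # Without the prefix, (b"ab", b"c") and (b"a", b"bc") would collide.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


def plan_incremental_update(
    indexed_hashes: dict[str, str], current_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (episodes to re-index, episodes to skip), both sorted."""
    to_reindex = sorted(
        episode_id
        for episode_id, digest in current_hashes.items()
        if indexed_hashes.get(episode_id) != digest
    )
    unchanged = sorted(set(current_hashes) - set(to_reindex))
    return to_reindex, unchanged
```

For each episode in the first list, the caller would delete its old vectors, embed the new content, and upsert, then persist once at the end.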
Embedding: Uses embedding_loader.encode() with the configured model (default
all-MiniLM-L6-v2). Batch encoding for efficiency (~14K sentences/sec on GPU).
5. Search CLI Command
podcast search "<query>" [options]
Options:
| Flag | Type | Default | Description |
|---|---|---|---|
| --type | insight\|quote\|summary\|transcript | all | Filter by document type |
| --feed | string | all | Filter by feed name |
| --since | date | none | Filter by publish date |
| --speaker | string | none | Filter quotes/insights by speaker |
| --grounded-only | flag | false | Only grounded insights |
| --top-k | int | 10 | Number of results |
| --format | json\|pretty | pretty | Output format |
| --index-path | path | <output_dir>/search/ | Index location |
Implementation flow:
- Load FaissVectorStore from --index-path
- Encode query using same embedding model
- Search with over-fetch → post-filter by metadata → return top-k
- For each result, resolve full context:
  - Insight → load gi.json, resolve supporting quotes
  - Quote → load gi.json, resolve parent insight and speaker
  - Summary bullet → load metadata, resolve episode
  - Transcript chunk → resolve episode and timestamps
- Format and output
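One way the flags in the options table could translate into a filters dict is sketched below with argparse. The parser layout and the mapping from flags to metadata keys (for example --feed to feed_id and --speaker to speaker_id) are assumptions for illustration; build_search_parser and filters_from_args are hypothetical names, and the real CLI wiring may differ.

```python
from __future__ import annotations

import argparse


def build_search_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table."""
    parser = argparse.ArgumentParser(prog="podcast search")
    parser.add_argument("query")
    parser.add_argument(
        "--type", choices=["insight", "quote", "summary", "transcript"]
    )
    parser.add_argument("--feed")
    parser.add_argument("--since")
    parser.add_argument("--speaker")
    parser.add_argument("--grounded-only", action="store_true")
    parser.add_argument("--top-k", type=int, default=10)
    parser.add_argument("--format", choices=["json", "pretty"], default="pretty")
    parser.add_argument("--index-path")
    return parser


def filters_from_args(args: argparse.Namespace) -> dict:
    """Build equality filters; unset flags mean no restriction."""
    filters: dict = {}
    if args.type:
        filters["doc_type"] = args.type
    if args.feed:
        filters["feed_id"] = args.feed
    if args.speaker:
        filters["speaker_id"] = args.speaker
    if args.grounded_only:
        filters["grounded"] = True
    # --since is a range predicate on publish_date and is handled separately.
    return filters
```

Keeping the flag-to-filter translation in one function means the CLI and any HTTP wrapper can share the same search path with identical filter semantics.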
6. Index Management CLI
podcast index --rebuild [--output-dir ./output]
podcast index --stats [--output-dir ./output]
- --rebuild: Delete existing index, re-index all episodes from scratch
- --stats: Print IndexStats (vector count, type breakdown, feeds, model, size)
7. Enhanced gi explore and gi query
gi explore upgrade:
In explore.py, modify _insight_matches_topic():
def _insight_matches_topic(
    artifact, insight_id, insight_text, topic, vector_store=None
):
    if not topic or not topic.strip():
        return True
    if vector_store is not None:
        return _semantic_topic_match(insight_id, topic, vector_store)
    # Existing substring fallback
    key = topic.strip().lower()
    ...
When gi explore starts, check if a vector index exists at the default path. If yes,
load it and pass to matching functions. If no, use existing substring logic. Zero behavior
change when no index is present.
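The body of _semantic_topic_match is not shown in this RFC. One plausible shape, sketched under assumptions, is to search once per topic and test membership per insight. Here semantic_topic_ids, the encode callable, and the top_k and min_score values are all hypothetical, and the sketch assumes an insight's ID is the same string as its doc_id in the index.

```python
from __future__ import annotations

from typing import Callable


def semantic_topic_ids(
    topic: str,
    vector_store,
    encode: Callable[[str], list[float]],
    top_k: int = 50,
    min_score: float = 0.3,
) -> set[str]:
    """Return doc_ids of insights whose similarity to the topic clears min_score."""
    hits = vector_store.search(
        encode(topic), top_k=top_k, filters={"doc_type": "insight"}
    )
    return {hit.doc_id for hit in hits if hit.score >= min_score}


def insight_matches_topic(insight_id: str, matching_ids: set[str]) -> bool:
    """Per-insight check against the precomputed set (one search per topic)."""
    return insight_id in matching_ids
```

Computing the set once avoids issuing a vector search for every insight in the scan, which matters because the matching function is called per insight.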
gi query upgrade:
run_uc4_semantic_qa currently maps patterns to run_uc5_insight_explorer. With a
vector index, it can:
- Encode the user's question as a vector
- Search for top-K matching insights
- Return the same ExploreOutput contract, just with better results
8. Configuration
New fields in Config (all optional, search disabled by default):
vector_search: bool = False
vector_backend: Literal["faiss", "qdrant"] = "faiss"  # qdrant reserved, see RFC-070
vector_index_path: Optional[str] = None  # auto: <output_dir>/search/
vector_index_types: list[str] = [
    "insights", "quotes", "summary_bullets", "transcript_chunks"
]
vector_chunk_size_tokens: int = 300
vector_chunk_overlap_tokens: int = 50
CLI flags: --vector-search, --vector-backend, --vector-index-path,
--vector-chunk-size, --vector-chunk-overlap.
Key Decisions
- FAISS for shipped Phase 1; other backends deferred
  - Decision: Ship FaissVectorStore only in v2.6; vector_backend: qdrant is reserved in config until RFC-070 implements it.
  - Rationale: FAISS is in-process, no separate service, sufficient for CLI-scale corpora with auto IVF / IVFPQ upgrade at large N. Qdrant / pgvector / re-ranking are platform-scale concerns, documented in RFC-070.
- Global corpus index, not per-feed
  - Decision: One index for the entire corpus with feed metadata for filtering
  - Rationale: Cross-feed discovery is a primary use case. Per-feed would require multi-index coordination.
- Post-filter metadata (FAISS)
  - Decision: Over-fetch from FAISS, post-filter by metadata
  - Rationale: Simple, no external dependency. Sufficient for CLI corpora at shipped scale. Native filtering is tracked in RFC-070.
- Sentence-boundary chunking
  - Decision: Sentence-boundary windows, not fixed-token or paragraph-based
  - Rationale: Preserves semantic coherence; predictable chunk sizes; no external dependency (simple regex splitter).
- Transparent gi explore upgrade
  - Decision: Detect index existence; use semantic matching when available; substring fallback when not
  - Rationale: Zero breaking change. Users who don't enable search get identical behavior. Users who do get better results from the same command.
- faiss-cpu only (no GPU variant in default deps)
  - Decision: faiss-cpu in [project.dependencies]; faiss-gpu as optional
  - Rationale: CPU is sufficient for CLI-scale corpora. GPU variant has CUDA dependency that complicates installation.
Alternatives Considered (Phase 1)
For shipped CLI-first scope we chose FAISS + VectorStore protocol over Qdrant-only,
ChromaDB, raw FAISS without a protocol, and Postgres-only pgvector for the initial local
index. A fuller comparison table and platform-era revisits (Qdrant, pgvector + RFC-051,
re-ranking) live in RFC-070.
Testing Strategy
Test Coverage:
- Unit tests: VectorStore protocol, FaissVectorStore CRUD, chunker, metadata sidecar, search with filters, incremental indexing logic
- Integration tests: Full round-trip: embed sample artifacts → build index → search → verify results include correct GIL provenance
- E2E tests: Pipeline run with vector_search: true → podcast search → verify results; gi explore with and without index
Test Organization:
- tests/unit/podcast_scraper/search/: unit tests for search module
- tests/integration/test_search_integration.py: index + query round-trip
- tests/e2e/test_search_cli_e2e.py: CLI end-to-end
Test Execution:
- Unit + integration: make ci-fast
- E2E: make ci (full suite)
- Fixtures: small gi.json + transcript + summary fixtures for deterministic tests
- Mock embedding_loader.encode() in unit tests (return fixed vectors)
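A deterministic stand-in for the encoder might look like the sketch below. fake_encode is a hypothetical helper, and its signature (list of strings in, list of float lists out) is an assumption; the real embedding_loader.encode() may accept or return different types, so a test would need to adapt the shape accordingly.

```python
from __future__ import annotations

import hashlib


def fake_encode(texts: list[str], dim: int = 384) -> list[list[float]]:
    """Map each text to a fixed pseudo-random vector derived from SHA-256."""
    vectors = []
    for text in texts:
        raw = b""
        counter = 0
        # Chain digests until there are enough bytes for one value per dimension.
        while len(raw) < dim:
            raw += hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
            counter += 1
        vectors.append([byte / 255.0 for byte in raw[:dim]])
    return vectors
```

A function like this could be supplied as the side_effect of a unittest.mock patch over the encoder, so index and search tests run without loading a model. The vectors carry no semantic meaning, so it suits plumbing tests, not relevance tests.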
Rollout & Monitoring
Rollout Plan (Option A — post-hardening slice):
- Step 1: VectorStore protocol + FaissVectorStore + unit tests: done
- Step 2: Transcript chunker + unit tests: done
- Step 3: Embed-and-index pipeline stage + unit tests (test_indexer.py); integration-style coverage in search unit tests: done
- Step 4: podcast search CLI + podcast index CLI + E2E tests: done
- Step 5: Config fields + YAML support (config.py, config/examples/config.example.yaml): done
- Step 6: gi explore semantic upgrade (transparent: <output_dir>/search/vectors.faiss + --topic): done
- Step 7: Documentation update (README, Development Guide, docs/guides/SEMANTIC_SEARCH_GUIDE.md, MkDocs nav): done
Deferred (platform / scale — RFC-070):
QdrantVectorStore, native payload filters, optional pgvector, re-ranking, digest-vector fusion — see RFC-070 (Draft).
Monitoring:
- Index build time per episode (logged as vector_index_sec in JSONL metrics)
- Search latency (logged per query)
- Index size on disk (reported by podcast index --stats)
Success Criteria:
- podcast search returns semantically relevant results for paraphrased queries
- gi explore --topic produces better results with index than without (manual eval)
- Zero regression when vector_search: false (default)
- make ci-fast passes with search module included
- Index build + search round-trip works in integration tests
Relationship to Other RFCs
This RFC (RFC-061) is part of the GIL/KG depth initiative (#466):
RFC-049 (GIL Core) → artifacts to index
RFC-050 (GIL Use Cases) → UC4/UC5 that search enables
↓
RFC-061 (this RFC) → FAISS semantic search + CLI + protocol (Completed)
↓
RFC-070 (Draft) → Qdrant / pgvector / scale / quality extras
↓
RFC-051 (DB Projection) → complementary structured serving
Platform megasketch → multi-tenant search, remote vector DB
Key distinction:
- RFC-049/050: What GIL extracts and how it's consumed (structured)
- RFC-061: Meaning-based discovery over GIL + KG + summary + transcript (FAISS, shipped)
- RFC-070: Platform and backend extensions (Draft)
- RFC-051: SQL-based serving (complementary)
Together, RFC-061 (vectors, local) and RFC-051 (SQL) are complementary; RFC-070 is the planned hook for remote / filtered vector backends when needed.
Benefits
- Unlocks UC4 (Semantic QA): The explicitly deferred RFC-050 use case becomes functional
- Removes scale ceiling: gi explore goes from ~100 episode file scan to vector index
- Cross-feed discovery: "What do all my podcasts say about X?" becomes answerable
- Preserves GIL provenance: Search results carry grounding, quotes, and timestamps, not hallucinated text
- Minimal new dependencies: faiss-cpu (~20 MB); everything else already in the tree
- Foundation for platform features: Same index and protocol support viewer /api/search; digest clustering / Qdrant / re-ranking are tracked in RFC-070
Migration Path
N/A — this is a new additive feature. Existing behavior is unchanged when
vector_search: false (default). No artifacts, schemas, or CLI commands are modified.
Open Questions
- Metadata sidecar format: Resolved for Phase 1 with JSON sidecars (metadata.json, id_map.json, index_meta.json) as implemented under search/. SQLite or backend-native storage at extreme scale is tracked in RFC-070.
- Near-dedup across episodes: Deferred. The index stores rows; dedup belongs to digest / clustering work in RFC-070.
- Cross-encoder re-ranking: Deferred to RFC-070.