Semantic corpus search (RFC-061)¶
Meaning-based retrieval over Grounded Insights (insights and quotes), summary
bullets, transcript chunks, and Knowledge Graph Topic / Entity nodes
(when kg.json is present) in a pipeline output directory. The shipped implementation
uses a local FAISS index (faiss-cpu) and the same embedding stack as GIL evidence
(embedding_loader, sentence-transformers). Qdrant, remote services, and other
platform-scale options are not implemented — see Draft RFC-070.
When to use it¶
- Cross-episode questions: “What do my podcasts say about X?” without exact keywords.
gi explore --topic: when<output_dir>/search/vectors.faissexists, topic matching uses the vector index first (metadata is scanned to findgi.jsonpaths; only matching episodes load full artifacts), then falls back to substring/topic-label matching if the index is missing, fails, or returns no hits.- Ad hoc CLI search:
podcast searchwith filters (--type,--feed,--since,--speaker,--grounded-only,--top-k,--format).
Enable indexing¶
Config (YAML or CLI)¶
In your config file (see config/examples/config.example.yaml):
vector_search: true
# Optional overrides:
# vector_index_path: search # default relative dir under output_dir
# vector_embedding_model: minilm-l6
# vector_chunk_size_tokens: 300
# vector_chunk_overlap_tokens: 50
vector_search: true runs the embed-and-index step after pipeline finalize (when
metadata and content are available). Default index directory is output_dir/search.
Multi-feed corpus parent (GitHub #505)¶
When output_dir is a corpus parent with a feeds/ directory (two or more feeds in
one run), the indexer discovers episode metadata under each
feeds/<stable_id>/…/metadata/ tree (including nested run_* folders). Artifact paths
in metadata are resolved relative to each episode’s workspace (the directory that contains
that feed’s metadata/). Vector row ids and fingerprint keys use a composite
(feed_id, episode_id) scope when feed_id is present in metadata, so GUID collisions
across feeds cannot clobber FAISS rows.
Per-feed pipeline runs set skip_auto_vector_index: true internally when
vector_search and FAISS are enabled; after all feeds complete, the CLI or service builds
one index at <corpus_parent>/search.
Manual index / rebuild¶
python -m podcast_scraper.cli index --output-dir /path/to/run
python -m podcast_scraper.cli index --output-dir /path/to/run --rebuild
python -m podcast_scraper.cli index --output-dir /path/to/run --stats
For a multi-feed tree, --output-dir should be the corpus parent (the directory that
contains feeds/), not an individual feed subdirectory.
Index upgrade / stale rows (multi-feed)¶
If you change embedding model, chunking settings, or composite fingerprint scope (feed + episode keys — GitHub #505), existing FAISS rows may no longer match metadata on disk. Rebuild the parent index with:
python -m podcast_scraper.cli index --output-dir /path/to/corpus_parent --rebuild
Until you rebuild, search may return hits that no longer map cleanly to current files, or miss new episodes. See CORPUS_MULTI_FEED_ARTIFACTS.md for manifest / summary contracts.
Corpus topic clustering (RFC-075)¶
After kg_topic rows exist in the FAISS index, you can build search/topic_clusters.json
with topic-clusters (see CLI reference — section Topic clusters).
Optional --merge-cil-overrides writes merged topic_id_aliases into
cil_lift_overrides.json (see Optional corpus overrides below). The GI/KG viewer loads
clusters via GET /api/corpus/topic-clusters when serving the same corpus root. Details:
RFC-075.
New CLI runs emit schema_version: "2" with distinct ids: graph_compound_parent_id
(tc:…, viewer TopicCluster parents only) and cil_alias_target_topic_id (topic:…,
topic_id_aliases merge target). Older topic_clusters.json files may still use v1
keys (cluster_id, canonical_topic_id); readers accept both.
Search responses: Membership is not copied into FAISS metadata at index time. On each
GET /api/search (and CLI search), the server joins search/topic_clusters.json
by metadata.source_id for doc_type: kg_topic rows and attaches
metadata.topic_cluster with graph_compound_parent_id, canonical_label, and
optional cil_alias_target_topic_id when present. If the JSON file is missing or the topic
is not in any cluster, the field is omitted.
Id spaces (mental model)¶
The stack uses several parallel id systems; they stay aligned by convention, not by a single global ontology:
| Lens | Role |
|---|---|
| Embedding / FAISS | Similarity search over chunk vectors; kg_topic rows use topic:{slug} as metadata.source_id (episode-local KG identity). |
topic:… (KG) |
Canonical Topic node ids in *.kg.json and in the index; CIL aliases can merge variants for lift and timelines. |
tc:… (TopicCluster) |
Viewer-only compound parent ids from graph_compound_parent_id in topic_clusters.json — grouping in Cytoscape, not a FAISS key and not the same field as cil_alias_target_topic_id. |
| Overrides | cil_lift_overrides.json (and optional topic_id_aliases from topic-clusters --merge-cil-overrides) tune identity without rewriting KG files. |
Details and field renames (v1 vs v2): RFC-075 § Artifact schema.
Query¶
python -m podcast_scraper.cli search "your question" --output-dir /path/to/run
python -m podcast_scraper.cli search "your question" --output-dir /path/to/run --format json --top-k 20
Use --index-path if the index is not under <output_dir>/search.
Web UI (GI / KG Viewer v2)¶
The Vue viewer can run the same corpus search as the CLI via GET /api/search when
the FastAPI server is up (pip install -e ".[dev]" + [ml] for FAISS/embeddings) and
the index exists under <corpus>/search/. Set Corpus root in the sidebar to your
pipeline output directory. See
web/gi-kg-viewer/README.md
(Semantic search) and README.md
(GI / KG Viewer).
Response hits are JSON objects with doc_id, score, metadata, text, optional
supporting_quotes (insight rows), optional lifted on transcript rows when
RFC-072 lift applies (see below), and optional metadata.topic_cluster on kg_topic
rows when topic_clusters.json exists (RFC-075 join; see Corpus topic clustering above).
Quote speaker fields (GitHub #541): On
insight hits, each entry in supporting_quotes may include speaker_id /
speaker_name mirroring .gi.json; both are often absent when transcription segments
lack diarization labels. On transcript hits, lifted.speaker / lifted.topic
carry display_name from bridge.json when the graph id resolves; the matched
lifted.quote may still have audio timestamps while speaker display is missing. Canonical
rules: Development Guide — GI quote speaker_id.
Chunk-to-Insight lift and offset verification (RFC-072 / #528)¶
Lift: For hits whose metadata has doc_type: transcript, the server may attach a
lifted dictionary: insight (id, text, grounded, optional insight_type /
position_hint), speaker and topic (canonical id + display_name from
bridge.json when present), and quote timestamps from the matched Quote node.
Matching uses half-open overlap between the chunk’s char_start / char_end and a
Quote span, then walks SUPPORTED_BY, ABOUT, and SPOKEN_BY edges in gi.json.
No FAISS rebuild is required; this is query-time enrichment.
Response counters: On successful search, the JSON may include lift_stats with
transcript_hits_returned (rows in that response with doc_type: transcript) and
lift_applied (those rows carrying a non-null lifted object). Counts apply to
the paginated top_k slice after server-side KG surface dedupe.
Optional corpus overrides: A file cil_lift_overrides.json at the corpus root
(same directory you pass as path / --output-dir) can tune lift without reindexing:
transcript_char_shift(integer) — added to each transcript hit’schar_start/char_endbefore overlap with Quote spans (fixed offset when index and GI use different normalised text bases).entity_id_aliases/topic_id_aliases— string-to-string maps; speaker / topic ids from the graph are resolved through the map beforebridge.jsondisplay_namelookup and before emittinglifted.speaker.id/lifted.topic.id. After RFC-075 clustering,topic-clusters --merge-cil-overridescan appendtopic_id_aliasesderived fromtopic_clusters.json; keys already in the file keep their values (hand overrides win).
Prerequisite: Quote offsets and index transcript chunk offsets must refer to the same normalised transcript text. Validate on a real indexed corpus with:
python -m podcast_scraper.cli verify-gil-chunk-offsets --output-dir /path/to/corpus --strict --min-overlap-rate 0.95
or make verify-gil-offsets-strict (override GIL_OFFSET_VERIFY_DIR). The verifier
walks discovered *.metadata.json paths (including feed-nested feeds/.../metadata/)
so multi-feed outputs are included.
Verifier JSON: The tool prints verdict, overlap_rate, quotes_total (all Quote nodes
in scanned GI), quotes_verifiable_against_index (quotes in episodes that have at least one
transcript chunk in the index), quotes_skipped_no_transcript_index, per-episode breakdown, and
warnings. overlap_rate is only over verifiable quotes (episodes with no transcript
vectors in the index are skipped for the ratio so partial indexing does not count as misalignment).
Useful labels:
aligned— Overlap rate at least 0.99 on the verifiable subset (episodes without indexed transcript chunks may still appear in the report withquotes_skipped_no_transcript_indexon the row).mostly_aligned— Overlap rate at least 0.85 but not meetingaligned.divergent— Overlap rate below 0.85; treat as blocking for lift until offsets or indexing are fixed.no_quotes— No Quote nodes in the GI files that were scanned (empty or missing GIL).no_indexed_transcript_for_quotes— GI has Quotes but no transcript chunk rows in the index for those episodes (strict mode passes; nothing to compare).
If transcript text normalisation (BOM, newlines, Unicode) differs between indexing and GIL generation, overlap drops; use the report to align pipelines (RFC-072 Known Limitations).
Flags: --index-path DIR (default <output-dir>/search), --strict (non-zero exit when
overlap is below --min-overlap-rate or verdict is divergent; no_quotes and
no_indexed_transcript_for_quotes exit zero),
--max-samples N (cap sample quote ids listed per episode when there is no overlap). For
make verify-gil-offsets-strict, set GIL_OFFSET_VERIFY_DIR for the corpus root and
optionally GIL_OFFSET_MIN_RATE (default 0.95).
See also: GIL / KG / CIL cross-layer, RFC-072 (Phase 5 / Known Limitations).
gi explore and gi query¶
gi explore --topic "…"— If a FAISS index is present atoutput_dir/search(or the path implied by your pipeline layout), insights are ranked by embedding similarity to the topic string. Otherwise behavior is unchanged (substring + Topic labels).gi query— Questions that map to topic/speaker filters use the samerun_uc5_insight_explorerpath, so they benefit from the index when present.
Requirements¶
pip install -e ".[ml]"(or equivalent) for sentence-transformers and FAISS.- Embedding models must be available locally (or cached); search uses
allow_download=False— run the pipeline orindexonce with network if needed.
Related docs¶
- RFC-061 (FAISS, shipped): RFC-061
- RFC-072 (CIL, bridge, lift): RFC-072
- RFC-070 (platform / future, Draft): RFC-070
- PRD-021: PRD-021
- GIL CLI: Grounded Insights Guide
- Cross-layer map: GIL / KG / CIL cross-layer