GIL, KG, and cross-layer (CIL)
This guide ties together the Grounded Insight Layer (GIL), the Knowledge Graph (KG),
and the Canonical Identity Layer (CIL): shared person: / topic: / org: identities,
per-episode bridge.json, cross-episode HTTP queries, and semantic search lift
from transcript chunks to structured insights. Use it as a map; layer-specific behaviour
stays in the linked guides and RFCs.
How the pieces fit
| Piece | Role | Primary doc |
|---|---|---|
| GIL (`gi.json`) | Evidence-backed insights and verbatim Quote nodes (`char_start` / `char_end`, timestamps, optional speaker). | Grounded Insights Guide, GIL ontology |
| KG (`*.kg.json`) | Entities, topics, and relationships for navigation and linking. | Knowledge Graph Guide, KG ontology |
| `bridge.json` (per episode) | Joins GI and KG surfaces under one canonical id per real-world person/topic/org; `display_name` for UI. Emitted next to `gi.json` / `kg.json` stems. | RFC-072 |
| CIL HTTP API | Read-only queries over on-disk bridge + gi + kg (position arc, person profile, topic timeline, id lists). | Server Guide (`/api/persons/*`, `/api/topics/*`) |
| Semantic search lift | For transcript FAISS hits, optional `lifted` block (insight, speaker, topic, quote times) when chunk spans overlap a Quote and `bridge.json` resolves names. | Semantic Search Guide |
| Corpus topic clustering (RFC-075) | Optional `search/topic_clusters.json`: viewer TopicCluster compound parents (`tc:` `graph_compound_parent_id`); optional CIL `topic_id_aliases` from `cil_alias_target_topic_id`. Does not replace `bridge.json`. | RFC-075, Semantic Search Guide |
| Offset verification | Confirms Quote char ranges overlap transcript chunk metadata in the index (same coordinate space). | Semantic Search Guide (lift and verification) |
Viewer: The GI/KG SPA loads artifacts via GET /api/artifacts and merges GI+KG (and bridge-backed dedupe where implemented). See Development Guide — GI / KG browser viewer, RFC-062, UXS-001.
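The overlap condition behind the search lift and offset verification rows can be sketched as plain interval arithmetic. This is a minimal illustration, not the project's implementation: the `Span` class and the function names are hypothetical, and it assumes half-open `[char_start, char_end)` ranges measured in the same transcript coordinate space.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Half-open character range [char_start, char_end) (assumed convention)."""
    char_start: int
    char_end: int


def overlap_chars(a: Span, b: Span) -> int:
    """Return how many characters two spans share (0 when they are disjoint)."""
    return max(0, min(a.char_end, b.char_end) - max(a.char_start, b.char_start))


def first_overlapping_quote(chunk: Span, quotes: List[Span]) -> Optional[Span]:
    """Return the first Quote span that overlaps the chunk, or None."""
    for quote in quotes:
        if overlap_chars(chunk, quote) > 0:
            return quote
    return None


def overlap_rate(chunks: List[Span], quotes: List[Span]) -> float:
    """Fraction of quotes overlapping at least one chunk (1.0 if no quotes)."""
    if not quotes:
        return 1.0
    hits = sum(1 for q in quotes if any(overlap_chars(c, q) > 0 for c in chunks))
    return hits / len(quotes)
```

A lift would only be attempted for chunks where `first_overlapping_quote` returns a span; an aggregate such as `overlap_rate` is the kind of number a strict gate could compare against a threshold.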
Artifacts on disk
Typical episode workspace (paths vary for multi-feed; see CORPUS_MULTI_FEED_ARTIFACTS):
- `*.metadata.json`: episode row, provenance paths to `gi.json` / `kg.json`.
- `*.gi.json`: GIL graph.
- `*.kg.json`: KG graph (when `generate_kg` ran).
- `*.bridge.json`: CIL identity map for that episode (when the pipeline emits it).
Path rules in code: src/podcast_scraper/builders/bridge_artifact_paths.py (bridge next to metadata; GI/KG siblings of bridge; bridge next to gi.json). GIL edge type comparisons: src/podcast_scraper/gi/edge_normalization.py.
The vector index lives at <corpus_root>/search/ (vectors.faiss, metadata.json, …) when vector_search / index has run at the corpus parent for multi-feed trees.
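As a rough illustration of the sibling layout, the sketch below looks up the KG and bridge files that share a stem with a given `*.gi.json`. It assumes the simplest case (all three files in one directory with a common stem); the function name is hypothetical, and the authoritative rules remain those in `bridge_artifact_paths.py`.

```python
from pathlib import Path
from typing import Dict, Optional

# Artifact kinds that share a stem with the GI file in this simplified layout.
ARTIFACT_KINDS = ("gi", "kg", "bridge")


def episode_artifacts(gi_path: Path) -> Dict[str, Optional[Path]]:
    """Map each artifact kind to its sibling path, or None if the file is absent.

    Assumes <stem>.gi.json, <stem>.kg.json and <stem>.bridge.json sit side by side.
    """
    name = gi_path.name
    if not name.endswith(".gi.json"):
        raise ValueError(f"not a GI artifact: {gi_path}")
    stem = name[: -len(".gi.json")]
    found: Dict[str, Optional[Path]] = {}
    for kind in ARTIFACT_KINDS:
        candidate = gi_path.with_name(f"{stem}.{kind}.json")
        # KG and bridge are optional outputs, so a missing file is not an error.
        found[kind] = candidate if candidate.is_file() else None
    return found
```

Returning `None` for missing kinds mirrors the point that the KG and bridge files exist only when the corresponding pipeline stages ran.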
When the corpus is re-built
Core artifacts (*.gi.json, *.kg.json, *.bridge.json) and corpus-level helpers such
as search/topic_clusters.json are outputs of the last pipeline run, not an
immutable snapshot of history.
- Re-running GIL/KG extraction, bridge emission, indexing, or topic clustering can change canonical `person:` / `topic:` / `org:` strings, graph layout, and RFC-075 `graph_compound_parent_id` (`tc:`) values.
- RFC-073 enrichers never overwrite core files, but their derived outputs can become misaligned until enrichers run again on the new core layer.
Practical stance for APIs and the viewer: load and join using paths from the current catalog; treat CIL and digest topic pills as projections of today’s bridge + index. For reproducible research, record tool versions and session / run identifiers in metadata alongside the corpus path.
Normative detail: RFC-072 — Operational note: re-pipeline, enrichment, and read-path stance.
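One way to follow the reproducibility advice is to write a small provenance record next to the corpus after each run. The sketch below is only an illustration: the record schema, the field names, and the function names are hypothetical and not part of the project's metadata format.

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def build_run_record(corpus_root: Path, run_id: str, tool_versions: Dict[str, str]) -> dict:
    """Assemble a provenance record for one pipeline run (hypothetical schema)."""
    return {
        "corpus_path": str(corpus_root),
        "run_id": run_id,
        "tool_versions": dict(tool_versions),
        "python_version": platform.python_version(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_run_record(record: dict, target: Path) -> None:
    """Serialize the record as stable, human-readable JSON."""
    target.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
```

Citing the `run_id` together with the corpus path lets a later reader tell which pipeline run produced the canonical ids and `tc:` values they are looking at.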
CLI and Make targets
| Command | Purpose |
|---|---|
| `python -m podcast_scraper.cli verify-gil-chunk-offsets --output-dir <corpus>` | JSON report: Quote vs transcript chunk overlap (RFC-072 Phase 5 gate). Supports feed-nested metadata (`feeds/.../metadata/`) via discovered metadata files. |
| `make verify-gil-offsets-strict` | Same verifier with `--strict` and `--min-overlap-rate` (default 0.95). Override corpus: `GIL_OFFSET_VERIFY_DIR=/path/to/run`. |
| `python -m podcast_scraper.cli search …` / `index …` | Semantic search and FAISS maintenance (Semantic Search Guide). |
| `python -m podcast_scraper.cli gi …` / `kg …` | GIL and KG CLIs (Grounded Insights Guide, Knowledge Graph Guide). |
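For scripted checks, the verifier invocation from the table can be assembled in Python. This is a sketch under assumptions: the helper name is hypothetical, and it assumes the `--strict` and `--min-overlap-rate` flags used by the Make target are accepted directly by the subcommand.

```python
import sys
from typing import List, Optional


def verify_offsets_command(
    corpus_dir: str,
    strict: bool = False,
    min_overlap_rate: Optional[float] = None,
) -> List[str]:
    """Build the argv list for the offset verifier without running it."""
    cmd = [
        sys.executable, "-m", "podcast_scraper.cli",
        "verify-gil-chunk-offsets", "--output-dir", corpus_dir,
    ]
    if strict:
        cmd.append("--strict")
    if min_overlap_rate is not None:
        # Assumed to be passed through as a plain decimal string.
        cmd += ["--min-overlap-rate", str(min_overlap_rate)]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=False)`. Where the JSON report ends up (stdout or a file) is not stated here, so confirm with the CLI's `--help` before parsing it.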
HTTP API (summary)
- `GET /api/health`: includes `cil_queries_api` when CIL routes are mounted.
- `GET /api/search`: corpus search; transcript hits may include `lifted` (dict) per hit when alignment and graph edges allow.
- `GET /api/persons/{id}/positions|brief|topics`, `GET /api/topics/{id}/timeline|persons`: CIL cross-layer queries (Server Guide).
OpenAPI: /docs when the server is running.
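A client for the CIL routes mostly needs to build the path correctly. The sketch below is illustrative only: the helper name, the base URL, and the example id are hypothetical, and it assumes the full canonical id (prefix included) goes into the path, percent-encoded so that the colon cannot be misread.

```python
from urllib.parse import quote

# Views listed in the route summary above, grouped by resource kind.
CIL_VIEWS = {
    "persons": ("positions", "brief", "topics"),
    "topics": ("timeline", "persons"),
}


def cil_url(base_url: str, kind: str, canonical_id: str, view: str) -> str:
    """Build a CIL query URL such as /api/persons/{id}/positions."""
    if kind not in CIL_VIEWS:
        raise ValueError(f"unknown resource kind: {kind!r}")
    if view not in CIL_VIEWS[kind]:
        raise ValueError(f"unknown view {view!r} for {kind!r}")
    # safe="" also encodes ':' and '/', keeping the id inside one path segment.
    encoded_id = quote(canonical_id, safe="")
    return f"{base_url.rstrip('/')}/api/{kind}/{encoded_id}/{view}"
```

For example, `cil_url("http://localhost:8000", "persons", "person:jane-doe", "positions")` yields `http://localhost:8000/api/persons/person%3Ajane-doe/positions`, which can then be fetched with `urllib.request.urlopen` and decoded as JSON. The response schema is best read from `/docs`.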
Testing (where coverage lives)
| Area | Layer | Location / command |
|---|---|---|
| GIL pipeline, schema, CLI | Unit + integration + E2E | tests/unit/gi/, tests/integration/, tests/e2e/test_gi_kg_cli_subprocess_e2e.py (gi validate); see Testing Strategy — GIL Testing |
| KG | Unit + E2E | tests/unit/kg/, tests/e2e/test_gi_kg_cli_subprocess_e2e.py (kg validate, kg inspect) |
| Bridge builder | Unit | tests/unit/builders/test_bridge_builder.py |
| CIL query logic | Unit | tests/unit/podcast_scraper/server/test_cil_queries.py |
| CIL HTTP | Integration | tests/integration/server/test_cil_api.py |
| Search lift + offset verify | Unit | tests/unit/podcast_scraper/search/test_transcript_chunk_lift.py, test_gil_chunk_offset_verify.py |
| Corpus topic clusters (RFC-075) | Unit | tests/unit/podcast_scraper/search/test_topic_clusters.py |
| GET /api/corpus/topic-clusters | Integration | tests/integration/server/test_corpus_topic_clusters.py |
| Viewer topic cluster fetch + overlay | Vitest + Playwright | make test-ui (corpusTopicClustersApi, topicClustersOverlay); e2e/search-to-graph-mocks.spec.ts (mocked API) |
| Bridge integration (pipeline-shaped) | Integration | tests/integration/test_bridge_integration.py |
| FastAPI viewer (search, health, library, …) | Integration | tests/integration/server/test_server_api.py, test_viewer_search.py, … |
| Viewer TS merge / bridge types | Vitest | make test-ui |
| Viewer UX | Playwright | make test-ui-e2e |
Quality gates: make quality-metrics-ci (fixtures), optional make gil-quality-metrics / make kg-quality-metrics on a real run. Offset strict gate: make verify-gil-offsets-strict when you have an indexed corpus path.
CI and acceptance workflows
- Pytest jobs already cover unit/integration/E2E for GIL, KG, server, and search helpers.
- Vitest / Playwright cover the Vue viewer (`make test-ui`, `make test-ui-e2e`).
- Main / release CI (`.github/workflows/python-app.yml`, job `test-acceptance-fixtures`) runs `make test-acceptance-fixtures-fast`, then `make verify-gil-offsets-after-acceptance` on every acceptance `run_*` that has `search/metadata.json`. That is not the same as scheduled nightly (`nightly.yml` runs `make test-nightly`, not the acceptance matrix).
- Default `make ci-fast` still skips full acceptance and offset gates; local checks use `make verify-gil-offsets-strict` with `GIL_OFFSET_VERIFY_DIR` when you have an indexed corpus.
See Testing Guide — GIL, KG, CIL, and semantic search and scripts/acceptance/README.md (canonical acceptance + CI wording).
Specifications and ADRs
- RFC-072: CIL, `bridge.json`, cross-layer query patterns, semantic search lift path (`RFC-072-canonical-identity-layer-cross-layer-bridge.md`).
- ADR-052: Separate GIL and KG artifacts.
- ADR-053: Grounding contract for evidence-backed insights.
- RFC-061 / PRD-021: Semantic corpus search (FAISS).
- RFC-062: Viewer v2 (FastAPI + Vue).
- PRD-017 / PRD-019: GIL and KG product requirements.