GIL, KG, and cross-layer (CIL)

This guide ties together the Grounded Insight Layer (GIL), the Knowledge Graph (KG), and the Canonical Identity Layer (CIL): shared person: / topic: / org: identities, per-episode bridge.json, cross-episode HTTP queries, and semantic search lift from transcript chunks to structured insights. Use it as a map; layer-specific behaviour stays in the linked guides and RFCs.


How the pieces fit

| Piece | Role | Primary doc |
| --- | --- | --- |
| GIL (gi.json) | Evidence-backed insights and verbatim Quote nodes (char_start / char_end, timestamps, optional speaker). | Grounded Insights Guide, GIL ontology |
| KG (*.kg.json) | Entities, topics, and relationships for navigation and linking. | Knowledge Graph Guide, KG ontology |
| bridge.json (per episode) | Joins GI and KG surfaces under one canonical id per real-world person/topic/org; display_name for UI. Emitted next to gi.json / kg.json stems. | RFC-072 |
| CIL HTTP API | Read-only queries over on-disk bridge + gi + kg (position arc, person profile, topic timeline, id lists). | Server Guide (/api/persons/*, /api/topics/*) |
| Semantic search lift | For transcript FAISS hits, optional lifted block (insight, speaker, topic, quote times) when chunk spans overlap a Quote and bridge.json resolves names. | Semantic Search Guide |
| Corpus topic clustering (RFC-075) | Optional search/topic_clusters.json: viewer TopicCluster compound parents (tc: graph_compound_parent_id); optional CIL topic_id_aliases from cil_alias_target_topic_id. Does not replace bridge.json. | RFC-075, Semantic Search Guide |
| Offset verification | Confirms Quote char ranges overlap transcript chunk metadata in the index (same coordinate space). | Semantic Search Guide: lift and verification |
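
The lift and the offset verification both depend on one test: does a Quote's character range overlap a transcript chunk's range in the same coordinate space? The sketch below illustrates that test only. The Span type, the function names, and the half-open [start, end) convention are assumptions made for illustration, not the project's actual API or index schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Character range in transcript coordinates (assumed half-open: [start, end))."""

    start: int
    end: int


def spans_overlap(a: Span, b: Span) -> bool:
    # Two half-open ranges overlap when each one starts before the other ends.
    return a.start < b.end and b.start < a.end


def chunks_for_quote(quote: Span, chunks: List[Span]) -> List[int]:
    """Return the indexes of the chunks whose span overlaps the quote span."""
    return [i for i, chunk in enumerate(chunks) if spans_overlap(quote, chunk)]
```

A quote that straddles a chunk boundary matches both neighbouring chunks, while a quote that lies outside every chunk matches none; the second case is the kind of misalignment the offset verifier is meant to surface.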

Viewer: The GI/KG SPA loads artifacts via GET /api/artifacts and merges GI+KG (and bridge-backed dedupe where implemented). See Development Guide — GI / KG browser viewer, RFC-062, UXS-001.


Artifacts on disk

Typical episode workspace (paths vary for multi-feed; see CORPUS_MULTI_FEED_ARTIFACTS):

  • *.metadata.json — episode row, provenance paths to gi.json / kg.json.
  • *.gi.json — GIL graph.
  • *.kg.json — KG graph (when generate_kg ran).
  • *.bridge.json — CIL identity map for that episode (when the pipeline emits it).

Path rules in code: src/podcast_scraper/builders/bridge_artifact_paths.py (bridge next to metadata; GI/KG siblings of bridge; bridge next to gi.json). GIL edge type comparisons: src/podcast_scraper/gi/edge_normalization.py.
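
As a rough illustration of the sibling naming above, the sketch below swaps one artifact suffix for another while keeping the directory and stem. It assumes all artifacts for an episode sit in the same directory, which may not hold for every layout; sibling_artifact and KNOWN_SUFFIXES are hypothetical names, and the authoritative rules remain in bridge_artifact_paths.py.

```python
from pathlib import Path

# Suffixes named in this guide. The stripping rule itself is an assumption.
KNOWN_SUFFIXES = (".metadata.json", ".gi.json", ".kg.json", ".bridge.json")


def sibling_artifact(path: Path, target_suffix: str) -> Path:
    """Swap one known artifact suffix for another, keeping directory and stem."""
    name = path.name
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            stem = name[: -len(suffix)]
            return path.with_name(stem + target_suffix)
    raise ValueError(f"not a recognised artifact file: {path}")
```

For example, sibling_artifact(Path("run/ep01.gi.json"), ".bridge.json") yields run/ep01.bridge.json under these assumptions.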

The vector index lives at <corpus_root>/search/ (vectors.faiss, metadata.json, …) once vector_search / index has run. For multi-feed trees, <corpus_root> is the corpus parent.


When the corpus is re-built

Core artifacts (*.gi.json, *.kg.json, *.bridge.json) and corpus-level helpers such as search/topic_clusters.json are outputs of the last pipeline run, not an immutable snapshot of history.

  • Re-running GIL/KG extraction, bridge emission, indexing, or topic clustering can change canonical person: / topic: / org: strings, graph layout, and RFC-075 graph_compound_parent_id (tc:) values.
  • RFC-073 enrichers never overwrite core files, but their derived outputs can become misaligned until enrichers run again on the new core layer.

Practical stance for APIs and the viewer: load and join using paths from the current catalog; treat CIL and digest topic pills as projections of today’s bridge + index. For reproducible research, record tool versions and session / run identifiers in metadata alongside the corpus path.
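
One way to follow the reproducibility advice is to write a small run record next to the corpus. The sketch below is a minimal example; the field names and the build_run_record helper are hypothetical and do not reflect the project's metadata schema.

```python
import json
import platform
from datetime import datetime, timezone


def build_run_record(corpus_path: str, run_id: str, tool_version: str) -> dict:
    """Collect the identifiers needed to tie results to one pipeline run."""
    return {
        "corpus_path": corpus_path,
        "run_id": run_id,  # session / run identifier supplied by the caller
        "tool_version": tool_version,  # version string supplied by the caller
        "python_version": platform.python_version(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def run_record_json(record: dict) -> str:
    """Serialise the record with stable key order so diffs stay readable."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Storing such a record alongside query results makes it possible to tell later which bridge and index state a given CIL answer was projected from.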

Normative detail: RFC-072 — Operational note: re-pipeline, enrichment, and read-path stance.


CLI and Make targets

| Command | Purpose |
| --- | --- |
| python -m podcast_scraper.cli verify-gil-chunk-offsets --output-dir <corpus> | JSON report: Quote vs transcript chunk overlap (RFC-072 Phase 5 gate). Supports feed-nested metadata (feeds/.../metadata/) via discovered metadata files. |
| make verify-gil-offsets-strict | Same verifier with --strict and --min-overlap-rate (default 0.95). Override corpus: GIL_OFFSET_VERIFY_DIR=/path/to/run. |
| python -m podcast_scraper.cli search … / index … | Semantic search and FAISS maintenance (Semantic Search Guide). |
| python -m podcast_scraper.cli gi … / kg … | GIL and KG CLIs (Grounded Insights Guide, Knowledge Graph Guide). |
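
For scripted checks, the verifier invocation can be assembled in Python. The sketch below only builds the argument list from the subcommand and flags named above. It assumes --min-overlap-rate takes its value as the following argument, and build_verify_command is a hypothetical helper, not part of the project.

```python
import sys
from typing import List, Optional


def build_verify_command(
    corpus_dir: str,
    strict: bool = False,
    min_overlap_rate: Optional[float] = None,
) -> List[str]:
    """Build the argv for the offset verifier without running it."""
    cmd = [
        sys.executable,
        "-m",
        "podcast_scraper.cli",
        "verify-gil-chunk-offsets",
        "--output-dir",
        corpus_dir,
    ]
    if strict:
        cmd.append("--strict")
    if min_overlap_rate is not None:
        # Assumed form: the flag is followed by its value as a separate argument.
        cmd += ["--min-overlap-rate", str(min_overlap_rate)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True); how the exit code relates to a strict failure is not stated in this guide, so confirm it against the CLI help before relying on it.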

HTTP API (summary)

  • GET /api/health — includes cil_queries_api when CIL routes are mounted.
  • GET /api/search — corpus search; transcript hits may include lifted (dict) per hit when alignment and graph edges allow.
  • GET /api/persons/{id}/positions|brief|topics, GET /api/topics/{id}/timeline|persons — CIL cross-layer queries (Server Guide).

OpenAPI: /docs when the server is running.
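
Canonical ids contain a colon (person:, topic:, org:), so a client may want to percent-encode the id before placing it in a path segment. The sketch below builds URLs for the CIL routes listed above. The base URL and example id are made up, cil_url is a hypothetical helper, and whether the server requires the encoded form is an assumption to check against /docs.

```python
from urllib.parse import quote

# Views per resource, taken from the route summary in this guide.
CIL_VIEWS = {
    "persons": {"positions", "brief", "topics"},
    "topics": {"timeline", "persons"},
}


def cil_url(base_url: str, kind: str, entity_id: str, view: str) -> str:
    """Build a CIL query URL, percent-encoding the canonical id."""
    if view not in CIL_VIEWS.get(kind, set()):
        raise ValueError(f"unsupported CIL route: {kind}/{view}")
    # safe="" also encodes ":" and "/" so the id stays a single path segment.
    encoded_id = quote(entity_id, safe="")
    return f"{base_url.rstrip('/')}/api/{kind}/{encoded_id}/{view}"
```

For instance, cil_url("http://localhost:8000", "persons", "person:jane-doe", "positions") produces http://localhost:8000/api/persons/person%3Ajane-doe/positions.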


Testing (where coverage lives)

| Area | Layer | Location / command |
| --- | --- | --- |
| GIL pipeline, schema, CLI | Unit + integration + E2E | tests/unit/gi/, tests/integration/, tests/e2e/test_gi_kg_cli_subprocess_e2e.py (gi validate); see Testing Strategy: GIL Testing |
| KG | Unit + E2E | tests/unit/kg/, tests/e2e/test_gi_kg_cli_subprocess_e2e.py (kg validate, kg inspect) |
| Bridge builder | Unit | tests/unit/builders/test_bridge_builder.py |
| CIL query logic | Unit | tests/unit/podcast_scraper/server/test_cil_queries.py |
| CIL HTTP | Integration | tests/integration/server/test_cil_api.py |
| Search lift + offset verify | Unit | tests/unit/podcast_scraper/search/test_transcript_chunk_lift.py, test_gil_chunk_offset_verify.py |
| Corpus topic clusters (RFC-075) | Unit | tests/unit/podcast_scraper/search/test_topic_clusters.py |
| GET /api/corpus/topic-clusters | Integration | tests/integration/server/test_corpus_topic_clusters.py |
| Viewer topic cluster fetch + overlay | Vitest + Playwright | make test-ui (corpusTopicClustersApi, topicClustersOverlay); e2e/search-to-graph-mocks.spec.ts (mocked API) |
| Bridge integration (pipeline-shaped) | Integration | tests/integration/test_bridge_integration.py |
| FastAPI viewer (search, health, library, …) | Integration | tests/integration/server/test_server_api.py, test_viewer_search.py, … |
| Viewer TS merge / bridge types | Vitest | make test-ui |
| Viewer UX | Playwright | make test-ui-e2e |

Quality gates: make quality-metrics-ci (fixtures), optional make gil-quality-metrics / make kg-quality-metrics on a real run. Offset strict gate: make verify-gil-offsets-strict when you have an indexed corpus path.


CI and acceptance workflows

  • Pytest jobs already cover unit/integration/E2E for GIL, KG, server, and search helpers.
  • Vitest / Playwright cover the Vue viewer (make test-ui, make test-ui-e2e).
  • Main / release CI (.github/workflows/python-app.yml, job test-acceptance-fixtures) runs make test-acceptance-fixtures-fast, then make verify-gil-offsets-after-acceptance on every acceptance run_* that has search/metadata.json. This is separate from the scheduled nightly workflow (nightly.yml runs make test-nightly, not the acceptance matrix).
  • Default make ci-fast still skips full acceptance and offset gates; local checks use make verify-gil-offsets-strict with GIL_OFFSET_VERIFY_DIR when you have an indexed corpus.
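
The selection rule for the post-acceptance offset gate (every run_* directory that has search/metadata.json) can be approximated locally. The sketch below assumes the run_* directories sit directly under one root; indexed_runs is a hypothetical helper, not the logic used by the Make target.

```python
from pathlib import Path
from typing import List


def indexed_runs(acceptance_root: Path) -> List[Path]:
    """Return run_* directories that contain a search/metadata.json file."""
    return sorted(
        run_dir
        for run_dir in acceptance_root.glob("run_*")
        if run_dir.is_dir() and (run_dir / "search" / "metadata.json").is_file()
    )
```

Each returned path is a candidate value for GIL_OFFSET_VERIFY_DIR when running make verify-gil-offsets-strict by hand.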

See Testing Guide — GIL, KG, CIL, and semantic search and scripts/acceptance/README.md (canonical acceptance + CI wording).


Specifications and ADRs

  • RFC-072 — CIL, bridge.json, cross-layer query patterns, semantic search lift path.
    RFC-072-canonical-identity-layer-cross-layer-bridge.md
  • ADR-052 — Separate GIL and KG artifacts.
  • ADR-053 — Grounding contract for evidence-backed insights.
  • RFC-061 / PRD-021 — Semantic corpus search (FAISS).
  • RFC-062 — Viewer v2 (FastAPI + Vue).
  • PRD-017 / PRD-019 — GIL and KG product requirements.