Grounded Insight Layer Ontology (v1)¶

Status¶

v1 (implementation-ready)

Shipping note (GIL backlog / Issue #460): The extractor currently emits Episode, Insight, and Quote nodes plus SUPPORTED_BY edges. Topic nodes and ABOUT edges are defined in this ontology for forward compatibility and for gi explore when artifacts are enriched, but the default pipeline does not produce them automatically yet. Quote.speaker_id is set when transcription segments (e.g. .segments.json) include a speaker or speaker_id field aligned with the transcript; otherwise it is null. When speaker_id is set from diarization, the pipeline also emits a Speaker node and a SPOKEN_BY edge (Quote → Speaker), so gi explore can show graph-backed speaker names even when quote properties omit a human-readable label.

Purpose¶

Define the canonical ontology contract for the Grounded Insight Layer (GIL):

Node & edge types
Required properties
Identity (ID) rules
Grounding contract (the 2025 moat)
Provenance & evidence requirements

This document is the source of truth for contributors. All gi.json outputs MUST conform to this ontology and the companion schema (docs/architecture/gi/gi.schema.json).

Design Principles¶

Insight-centric: Focus on takeaways (Insights) and evidence (Quotes), not just claims
Evidence first-class: Quotes are nodes, not metadata—enables trust and navigation
Explicit grounding: Every Insight must declare grounded=true/false
Stable IDs: Global concepts must have stable identifiers across episodes
Entities deferred: Entity extraction deferred to v1.1 to focus on core value

The Grounding Contract (Critical)¶

The grounding contract is what makes GIL trustworthy:

Hard Rules (Invariants)¶

Every Quote MUST be verbatim
  Quote.text must exactly match transcript[char_start:char_end]
  No paraphrasing, no summarization, no rewording
  Timestamps must correspond to the quoted span
Every Insight MUST have explicit grounding status
  grounded=true: Insight has ≥1 SUPPORTED_BY edge to a Quote
  grounded=false: Insight is extracted but lacks a supporting Quote (rare, but honest)
SUPPORTED_BY edges are evidence links
  An Insight can have multiple supporting Quotes
  Each Quote provides evidence for the Insight's validity
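
The hard rules above lend themselves to a mechanical check. The following is a minimal sketch, not the project's validator: the function name check_grounding is hypothetical, it assumes the transcript is available as a single string, and it treats char_end as exclusive, as the slice notation transcript[char_start:char_end] suggests. It does not verify timestamps, which would require the segment data.

```python
from typing import Any


def check_grounding(gi: dict[str, Any], transcript: str) -> list[str]:
    """Return human-readable violations of the grounding contract (hypothetical helper)."""
    problems: list[str] = []
    nodes = {n["id"]: n for n in gi["nodes"]}

    # Rule 1: every Quote must be a verbatim slice of the transcript.
    for node in gi["nodes"]:
        if node["type"] != "Quote":
            continue
        p = node["properties"]
        if transcript[p["char_start"]:p["char_end"]] != p["text"]:
            problems.append(f"{node['id']}: text is not verbatim")

    # Rules 2 and 3: the grounded flag must agree with SUPPORTED_BY edges to Quotes.
    supported = {
        e["from"]
        for e in gi["edges"]
        if e["type"] == "SUPPORTED_BY"
        and nodes.get(e["to"], {}).get("type") == "Quote"
    }
    for node in gi["nodes"]:
        if node["type"] != "Insight":
            continue
        grounded = node["properties"].get("grounded")
        if not isinstance(grounded, bool):
            problems.append(f"{node['id']}: grounded is not set explicitly")
        elif grounded != (node["id"] in supported):
            problems.append(f"{node['id']}: grounded flag disagrees with edges")
    return problems
```

An empty list means the artifact satisfies the checks covered here; anything else names the offending node.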

Why This Matters¶

Trust: Users know exactly which Insights have evidence
Quality Metrics: System can measure % insights grounded and quote validity rate
RAG Applications: Downstream systems can filter for grounded-only Insights
Debugging: Ungrounded Insights are visible, not hidden
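
To make the quality metrics and the grounded-only filter concrete, here is a small sketch. The function names and the exact metric definitions (share of Insights with grounded=true, share of Quotes whose text matches the transcript slice) are assumptions for illustration, not definitions taken from the pipeline.

```python
from typing import Any, Optional


def grounding_metrics(gi: dict[str, Any], transcript: str) -> dict[str, Optional[float]]:
    """Compute illustrative quality metrics for one gi.json artifact."""
    insights = [n for n in gi["nodes"] if n["type"] == "Insight"]
    quotes = [n for n in gi["nodes"] if n["type"] == "Quote"]
    grounded = [n for n in insights if n["properties"].get("grounded") is True]
    valid = [
        q for q in quotes
        if transcript[q["properties"]["char_start"]:q["properties"]["char_end"]]
        == q["properties"]["text"]
    ]
    return {
        # None signals "no data" rather than a misleading 0.
        "pct_insights_grounded": 100.0 * len(grounded) / len(insights) if insights else None,
        "quote_validity_rate": len(valid) / len(quotes) if quotes else None,
    }


def grounded_only(gi: dict[str, Any]) -> list[dict[str, Any]]:
    """Return only the Insight nodes a downstream RAG system might keep."""
    return [
        n for n in gi["nodes"]
        if n["type"] == "Insight" and n["properties"].get("grounded") is True
    ]
```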

Top-Level Concepts¶

Node Types (v1)¶

Node Type	Description
Podcast	A podcast feed
Episode	A single podcast episode
Speaker	A person speaking (optional if no diarization)
Topic	An abstract subject discussed (lightweight)
Insight	A key takeaway / conclusion extracted from content
Quote	Verbatim transcript span used as evidence

Node Types (v1.1 - Deferred)¶

Node Type	Description
Entity	Person, company, product, place (deferred to v1.1)

Edge Types (v1)¶

Edge	From -> To	Description
HAS_EPISODE	Podcast -> Episode	Podcast contains episode
SPOKE_IN	Speaker -> Episode	Speaker participated
HAS_INSIGHT	Episode -> Insight	Episode contains insight
SUPPORTED_BY	Insight -> Quote	Quote provides evidence for insight
SPOKEN_BY	Quote -> Speaker	Speaker said the quote
ABOUT	Insight -> Topic	Insight is about topic
RELATED_TO	Topic <-> Topic	Semantic relationship (optional)

Common Properties¶

Required on all nodes¶

id (string) - Unique identifier
type (enum) - Node type
properties (object) - Type-specific properties

Required on all edges¶

type (enum) - Edge type
from (node id) - Source node
to (node id) - Target node
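
The required keys above can be checked with a few lines of code before any deeper validation. This is a sketch only; the companion schema (docs/architecture/gi/gi.schema.json) remains the authoritative check, and the helper name structural_errors is hypothetical. The type sets mirror the v1 tables in this document.

```python
from typing import Any

# v1 node and edge types as listed in this ontology.
NODE_TYPES = {"Podcast", "Episode", "Speaker", "Topic", "Insight", "Quote"}
EDGE_TYPES = {
    "HAS_EPISODE", "SPOKE_IN", "HAS_INSIGHT",
    "SUPPORTED_BY", "SPOKEN_BY", "ABOUT", "RELATED_TO",
}


def structural_errors(gi: dict[str, Any]) -> list[str]:
    """List missing required keys and unknown type values (hypothetical helper)."""
    errors: list[str] = []
    for i, node in enumerate(gi.get("nodes", [])):
        for key in ("id", "type", "properties"):
            if key not in node:
                errors.append(f"nodes[{i}]: missing '{key}'")
        if "type" in node and node["type"] not in NODE_TYPES:
            errors.append(f"nodes[{i}]: unknown node type {node['type']!r}")
    for i, edge in enumerate(gi.get("edges", [])):
        for key in ("type", "from", "to"):
            if key not in edge:
                errors.append(f"edges[{i}]: missing '{key}'")
        if "type" in edge and edge["type"] not in EDGE_TYPES:
            errors.append(f"edges[{i}]: unknown edge type {edge['type']!r}")
    return errors
```

The sketch deliberately does not flag edges whose endpoints are absent from the file, since an episode artifact may reference global nodes introduced elsewhere.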

Provenance (required for ML-derived content)¶

Any node produced by ML extraction SHOULD include:

confidence (0.0–1.0) - Extraction certainty (not factual truth)

The root gi.json file MUST include:

model_version - Model identifier used for extraction
prompt_version - Prompt version used (enables A/B testing)

Identity Rules (IDs)¶

Shipped pipeline (GI + KG alignment)¶

Episode node: episode:{episode_id} — same episode_id string as the artifact root field and as KG (RSS GUID family).
Insight node: insight:{16-hex} — SHA-256 over (episode_id, index, insight_text prefix); properties.episode_id anchors the episode.
Quote node: quote:{16-hex} — SHA-256 over (episode_id, quote_index, text prefix, char_start, char_end); properties.episode_id anchors the episode.
Speaker node: speaker:{slug(name)} — global by normalized name slug (merged across episodes in combined graphs).

Topic / ABOUT (when enriched): topic:{slug(label)} — global, same family as KG topics.
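
The ID shapes above could be produced along the following lines. This is an illustrative sketch, not the shipped implementation: the field separator, the text prefix length (TEXT_PREFIX_LEN), taking the first 16 hex characters of the digest, and the slug rule are all assumptions, so IDs generated here will not necessarily match those emitted by the pipeline.

```python
import hashlib
import re

TEXT_PREFIX_LEN = 200  # assumption: the real prefix length is not stated here


def _short_hash(*parts: object) -> str:
    """SHA-256 over the joined parts, truncated to 16 hex characters."""
    payload = "\x1f".join(str(p) for p in parts)  # assumed separator
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def slug(value: str) -> str:
    """Lowercase and collapse non-alphanumeric runs to '-' (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def episode_node_id(episode_id: str) -> str:
    return f"episode:{episode_id}"


def insight_id(episode_id: str, index: int, text: str) -> str:
    return "insight:" + _short_hash(episode_id, index, text[:TEXT_PREFIX_LEN])


def quote_id(episode_id: str, quote_index: int, text: str,
             char_start: int, char_end: int) -> str:
    return "quote:" + _short_hash(
        episode_id, quote_index, text[:TEXT_PREFIX_LEN], char_start, char_end
    )


def speaker_id(name: str) -> str:
    return f"speaker:{slug(name)}"


def topic_id(label: str) -> str:
    return f"topic:{slug(label)}"
```

Because every input is deterministic, re-running extraction on the same episode yields the same Insight and Quote IDs, while Speaker and Topic IDs stay stable across episodes as long as names and labels normalize to the same slug.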

Node Definitions¶

Podcast¶

Definition: A podcast feed.

Required properties:

title (string)
rss_url (string)

Optional:

publisher (string)

Episode¶

Definition: A single podcast episode.

Required properties:

podcast_id (string)
title (string)
publish_date (ISO date-time string)

Optional:

audio_url (string)
duration_ms (integer)

Speaker¶

Definition: A person speaking in the episode.

Required properties:

name (string)

Optional:

aliases (string[])

Topic¶

Definition: An abstract subject discussed. Lightweight and mergeable.

Required properties:

label (string)

Optional:

aliases (string[])

Insight (NEW)¶

Definition: A key takeaway or conclusion extracted from episode content.

Unlike traditional "claims," Insights:

Focus on what users want to know (takeaways)
Have explicit grounding status (grounded=true/false)
Link to supporting Quote nodes for evidence

Required properties:

text (string) - The insight statement (can be rephrased for clarity)
episode_id (string) - Episode where insight was extracted
grounded (boolean) - Whether insight has ≥1 supporting quote

Optional:

confidence (number, 0.0-1.0) - Extraction confidence

Quote (NEW)¶

Definition: A verbatim transcript span that serves as evidence.

Making Quote a first-class node enables:

Evidence-backed retrieval (Insight → Quote → timestamp)
Trust verification (users can check Quote against transcript)
Quality metrics (quote validity rate)
Speaker attribution when available

Required properties:

text (string) - Verbatim text from transcript (no paraphrasing!)
episode_id (string) - Episode containing the quote
char_start (integer) - Character start in transcript text
char_end (integer) - Character end in transcript text
timestamp_start_ms (integer) - Timestamp start (milliseconds)
timestamp_end_ms (integer) - Timestamp end (milliseconds)
transcript_ref (string) - Reference to transcript artifact

Optional:

speaker_id (string, nullable) - Speaker who said the quote (if diarization available)
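
Evidence-backed retrieval (Insight → Quote → timestamp) amounts to following SUPPORTED_BY edges from an Insight and reading the Quote properties. A minimal sketch, with the hypothetical helper name evidence_for and an assumed ordering by start timestamp:

```python
from typing import Any


def evidence_for(gi: dict[str, Any], insight_id: str) -> list[dict[str, Any]]:
    """Collect the Quotes supporting one Insight, ordered by start time."""
    nodes = {n["id"]: n for n in gi["nodes"]}
    evidence = []
    for edge in gi["edges"]:
        if edge["type"] != "SUPPORTED_BY" or edge["from"] != insight_id:
            continue
        quote = nodes.get(edge["to"])
        if quote is None or quote["type"] != "Quote":
            continue  # skip edges whose target is missing from this artifact
        p = quote["properties"]
        evidence.append({
            "quote_id": quote["id"],
            "text": p["text"],
            "timestamp_start_ms": p["timestamp_start_ms"],
            "timestamp_end_ms": p["timestamp_end_ms"],
            "speaker_id": p.get("speaker_id"),  # None without diarization
        })
    return sorted(evidence, key=lambda item: item["timestamp_start_ms"])
```

A player or UI could use timestamp_start_ms to jump straight to the moment the quote was spoken.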

Entity (v1.1 - Deferred)¶

Definition: A real-world named entity. Deferred to v1.1.

Required properties:

name (string)
entity_type (enum: person, company, product, place, org, event, other)

Optional:

external_ids (object, e.g. { "wikidata": "Q..." })

Edge Definitions¶

HAS_EPISODE (Podcast → Episode)¶

Required properties: none

SPOKE_IN (Speaker → Episode)¶

Required properties: none

HAS_INSIGHT (Episode → Insight) (NEW)¶

Required properties: none

SUPPORTED_BY (Insight → Quote) (NEW)¶

Definition: Links an Insight to a Quote that provides evidence for it.

Required properties: none (Quote already carries evidence/provenance)

Semantics: An Insight is grounded=true if and only if it has ≥1 SUPPORTED_BY edge to a Quote.

SPOKEN_BY (Quote → Speaker) (NEW)¶

Definition: Links a Quote to the Speaker who said it.

Required properties: none

Note: Only present if speaker diarization is available.

ABOUT (Insight → Topic)¶

Definition: Links an Insight to a Topic it discusses.

Optional properties:

confidence (number, 0.0-1.0)

RELATED_TO (Topic ↔ Topic)¶

Definition: Semantic relationship between topics.

Optional properties:

confidence (number, 0.0-1.0)

Required Output Artifact: gi.json¶

Each episode output folder contains a gi.json capturing nodes/edges for the episode.

Root-level fields¶

schema_version (string, required) - Schema version (e.g., "1.0")
model_version (string, required) - Model used for extraction
prompt_version (string, required) - Prompt version used
episode_id (string, required) - Episode identifier
nodes (array, required) - All nodes
edges (array, required) - All edges

Guidance¶

Episode-local Insight and Quote nodes live here
Global nodes (Topic, Speaker) may be referenced or introduced
The logical full GIL is the union of all episode gi.json files
Every Insight must have grounded field set explicitly
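
Since the logical full GIL is the union of all episode files, a combined graph can be built by merging nodes on id and edges on (type, from, to). The sketch below is one possible approach under stated assumptions: the helper names are hypothetical, episode folders are assumed to be direct children of a root directory, and when a global node appears in several files the first definition wins.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def load_artifacts(root: str) -> list[dict[str, Any]]:
    """Load every <root>/<episode folder>/gi.json (assumed layout)."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(root).glob("*/gi.json"))
    ]


def merge_artifacts(artifacts: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Union nodes by id and edges by (type, from, to); first definition wins."""
    nodes: dict[str, dict[str, Any]] = {}
    edges: dict[tuple[str, str, str], dict[str, Any]] = {}
    for gi in artifacts:
        for node in gi["nodes"]:
            nodes.setdefault(node["id"], node)
        for edge in gi["edges"]:
            edges.setdefault((edge["type"], edge["from"], edge["to"]), edge)
    return {"nodes": list(nodes.values()), "edges": list(edges.values())}
```

Global nodes such as Speaker and Topic collapse into a single node because their IDs are stable across episodes, while episode-local Insight and Quote nodes are carried over unchanged.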

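Minimal Example¶

The IDs in this example are shortened, human-readable placeholders for illustration; real artifacts follow the Identity Rules above (for example, insight:{16-hex} and quote:{16-hex}).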

{
  "schema_version": "1.0",
  "model_version": "gpt-4.1-mini-2026-01-xx",
  "prompt_version": "v2.1",
  "episode_id": "episode:abc123",
  "nodes": [
    {
      "id": "podcast:the-journal",
      "type": "Podcast",
      "properties": {
        "title": "The Journal",
        "rss_url": "https://feeds.example.com/the-journal"
      }
    },
    {
      "id": "episode:abc123",
      "type": "Episode",
      "properties": {
        "podcast_id": "podcast:the-journal",
        "title": "AI Regulation",
        "publish_date": "2026-02-03T00:00:00Z"
      }
    },
    {
      "id": "speaker:sam-altman",
      "type": "Speaker",
      "properties": {
        "name": "Sam Altman"
      }
    },
    {
      "id": "topic:ai-regulation",
      "type": "Topic",
      "properties": {
        "label": "AI Regulation"
      }
    },
    {
      "id": "insight:episode:abc123:a1b2c3d4",
      "type": "Insight",
      "properties": {
        "text": "AI regulation will significantly lag behind the pace of innovation",
        "episode_id": "episode:abc123",
        "grounded": true
      },
      "confidence": 0.85
    },
    {
      "id": "quote:episode:abc123:e5f6g7h8",
      "type": "Quote",
      "properties": {
        "text": "Regulation will lag innovation by 3–5 years. That's my prediction.",
        "episode_id": "episode:abc123",
        "speaker_id": "speaker:sam-altman",
        "char_start": 10234,
        "char_end": 10300,
        "timestamp_start_ms": 120000,
        "timestamp_end_ms": 135000,
        "transcript_ref": "transcript.json"
      }
    },
    {
      "id": "quote:episode:abc123:i9j0k1l2",
      "type": "Quote",
      "properties": {
        "text": "We'll see laws that are already outdated when they pass.",
        "episode_id": "episode:abc123",
        "speaker_id": "speaker:sam-altman",
        "char_start": 10890,
        "char_end": 10946,
        "timestamp_start_ms": 142000,
        "timestamp_end_ms": 148000,
        "transcript_ref": "transcript.json"
      }
    }
  ],
  "edges": [
    {
      "type": "HAS_EPISODE",
      "from": "podcast:the-journal",
      "to": "episode:abc123"
    },
    {
      "type": "SPOKE_IN",
      "from": "speaker:sam-altman",
      "to": "episode:abc123"
    },
    {
      "type": "HAS_INSIGHT",
      "from": "episode:abc123",
      "to": "insight:episode:abc123:a1b2c3d4"
    },
    {
      "type": "SUPPORTED_BY",
      "from": "insight:episode:abc123:a1b2c3d4",
      "to": "quote:episode:abc123:e5f6g7h8"
    },
    {
      "type": "SUPPORTED_BY",
      "from": "insight:episode:abc123:a1b2c3d4",
      "to": "quote:episode:abc123:i9j0k1l2"
    },
    {
      "type": "SPOKEN_BY",
      "from": "quote:episode:abc123:e5f6g7h8",
      "to": "speaker:sam-altman"
    },
    {
      "type": "SPOKEN_BY",
      "from": "quote:episode:abc123:i9j0k1l2",
      "to": "speaker:sam-altman"
    },
    {
      "type": "ABOUT",
      "from": "insight:episode:abc123:a1b2c3d4",
      "to": "topic:ai-regulation",
      "properties": {
        "confidence": 0.79
      }
    }
  ]
}

Grounding Example¶

The above example shows a grounded Insight:

Insight: "AI regulation will significantly lag behind the pace of innovation"
  grounded: true
  confidence: 0.85

  SUPPORTED_BY → Quote 1: "Regulation will lag innovation by 3–5 years..."
  SUPPORTED_BY → Quote 2: "We'll see laws that are already outdated..."

An ungrounded Insight would look like:

{
  "id": "insight:episode:abc123:x9y8z7w6",
  "type": "Insight",
  "properties": {
    "text": "European approach may become the global standard",
    "episode_id": "episode:abc123",
    "grounded": false
  },
  "confidence": 0.45
}

Note: No SUPPORTED_BY edges exist for this Insight, and grounded=false is explicit.

Version History¶

Version	Date	Changes
1.0	2026-02-06	Initial v1: Insight + Quote model, grounding contract