# Grounded Insights Guide
This guide explains grounded insights: structured takeaways from podcast episodes that are linked to verbatim quotes as evidence. It covers how to enable the feature, where output is written, how to use the CLI, and how to validate and troubleshoot.
## Summaries, KG, and grounded insights (how they fit together)
Episode summaries, the Knowledge Graph (KG), and grounded insights (GIL) address different jobs. A practical way to use them together:
| Layer | Role |
|---|---|
| Summaries (PRD-005: Episode summarization) | Consume quickly: skim what an episode is about without reading the transcript. |
| KG (Knowledge Graph Guide) | Navigate a large library: entities, themes, and how episodes connect. |
| Grounded insights (GIL, this guide) | Focus on value and trust: short takeaways with optional verbatim quotes when grounding succeeds. |
Summaries compress nuance and are not a full substitute for primary sources when stakes are high. GI adds traceability to the transcript where grounding works. KG does not replace summaries or GI; it helps you find where to read or drill down (see the same stack from the KG side in Knowledge Graph Guide § How KG fits with summaries and grounded insights).
## What Are Grounded Insights?
Grounded insights are key takeaways extracted from episode content, each with an explicit grounding status and optional supporting quotes from the transcript.
- Insight: A short, clear statement (e.g. "AI regulation will lag behind innovation by several years").
- Quote: A verbatim span from the transcript used as evidence. Quotes are first-class: they have character offsets, timestamps, and optional speaker attribution (`speaker_id`). When `.segments.json` exists next to the transcript and segments include `speaker` or `speaker_id` (e.g. diarized pipelines), GIL fills `speaker_id` from the segment overlapping the quote span; otherwise it is `null` (see ontology). UC2 / `gi explore --speaker` are best effort when diarization is missing; see Recorded product decisions (v1, issue 460) below.
- Grounding contract: Every insight declares `grounded=true` (has ≥1 supporting quote) or `grounded=false` (extracted but no quote linked). This makes it clear which insights are evidence-backed.
This evidence-first design supports trust, quality metrics, and downstream use (e.g. RAG, search, citation).
## Enabling Grounded Insights
Grounded insights are produced by the Grounded Insight Layer (GIL) pipeline stage. Enable it in config or via CLI:
- `generate_gi`: Set to `true` in your config file to run GIL extraction. Default: `false`. You can also pass `--generate-gi` on the command line when running the main pipeline (same effect as `generate_gi: true` in config).
- `gi_insight_source`: Source of insight texts. One of:
    - `stub` (default): Single placeholder insight (original behaviour).
    - `summary_bullets`: Use the first N bullets from the episode summary as insight texts (requires `generate_summaries` and summary metadata with bullets).
    - `provider`: Call the summarization provider's optional `generate_insights(transcript, ...)` to extract key takeaways. LLM providers that implement it (OpenAI, Ollama, Anthropic, Gemini, Mistral, Grok, DeepSeek) return a list of short statements; ML/hybrid providers return an empty list, so GIL falls back to stub.
- `gi_max_insights`: Maximum number of insights when using `provider` or `summary_bullets` (default: `20`, range `1`–`50`). Capped in code for safety.
You can set `--gi-insight-source` and `--gi-max-insights` on the command line to override config.
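A minimal config sketch combining the keys above (the key names are from this guide; the values are illustrative, not recommendations):

```yaml
generate_gi: true
gi_insight_source: summary_bullets   # stub | summary_bullets | provider
gi_max_insights: 10                  # capped at 50 in code
generate_summaries: true             # required when using summary_bullets
```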
### ML summarization and GIL insight wording
If `summary_provider` is `transformers` or `hybrid_ml`, treat `generate_insights` as unsupported for meaningful free-form takeaways: those providers do not populate insight wording via the LLM path. For real insight text with ML summaries, set `gi_insight_source: summary_bullets`, enable `generate_summaries`, and ensure the summary pipeline produces bullets (your map/reduce or provider config must emit them). Alternatively, switch `summary_provider` to an LLM and use `gi_insight_source: provider` (or still use `summary_bullets` if you prefer bullets as the insight list).
The default `gi_insight_source: stub` is appropriate for tests and smoke runs only. When `generate_gi` is on and `gi_insight_source` is `stub`, the CLI logs a warning (outside pytest) so production configs are less likely to ship with placeholder insights by accident.
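For an ML-summaries run with real insight wording, the combination described above might look like this (a sketch using the documented keys):

```yaml
summary_provider: transformers
generate_summaries: true
generate_gi: true
gi_insight_source: summary_bullets   # `provider` would fall back to stub with this summary_provider
```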
### ML bullets, `gi_require_grounding`, and local NLI
With `summary_provider: transformers` (or `hybrid_ml`) and `gi_insight_source: summary_bullets`, insight text comes from map/reduce (or heuristic) bullets, not a JSON bullet schema. `gi_require_grounding: true` is still best-effort: grounded quotes appear only when extractive QA + NLI scores clear the configured thresholds, which is hard on noisy transcripts (e.g. Whisper output). If you see zero grounded quotes, check logs for evidence-step failures; local CrossEncoder NLI models return multi-class logits, and the pipeline maps them to P(entailment) using the model's `id2label`. For stricter or cleaner bullets, prefer an LLM `summary_provider`, or relax thresholds / disable `gi_require_grounding` for exploratory ML runs.
### Topic nodes, ABOUT edges, and `gi explore --topic`
When `<output_dir>/search/vectors.faiss` exists (semantic corpus index, RFC-061), `gi explore --topic` first ranks Insight rows by embedding similarity to the topic string; if the index is missing, errors, or yields no hits after filters, it falls back to the behavior below. See Semantic Search Guide.
The GIL ontology includes Topic nodes and ABOUT (Insight → Topic) edges, but current pipeline output does not yet emit them; artifacts contain Episode, Insight, and Quote nodes plus SUPPORTED_BY edges. Therefore `gi explore --topic` matches (1) Topic labels, if present in a hand-edited or future-enriched artifact, and (2) always performs a case-insensitive substring match on insight text. Until Topic/ABOUT are produced automatically, treat `--topic` as "search insights (and optional Topic labels)", not as a dedicated topic-model facet. A follow-up change may add Topic extraction to `gi.json` without changing these filter semantics. The formal v1 milestone wording for this row is in § Recorded product decisions (v1, issue 460) below.
## Recorded product decisions (v1, issue 460)
GitHub #460 tracks closing ambiguity for the shallow v1 scope, not implementing every GIL enhancement in one release. This table is the operator-facing record of what v1 promises.
| Decision area | v1 choice |
|---|---|
| Insight wording + ML | `transformers` / `hybrid_ml` do not supply meaningful free-form text via `generate_insights`. For real insight wording with ML summaries, use `gi_insight_source: summary_bullets` with `generate_summaries` and summary bullets, or switch to an LLM `summary_provider` and use `provider` or `summary_bullets`. `stub` is for tests and smoke only. The CLI emits a warning when `generate_gi` and `stub` run together outside pytest. Details: ML summarization and GIL insight wording. |
| Topic nodes + `gi explore --topic` | The pipeline does not yet emit Topic nodes and ABOUT edges automatically. `gi explore --topic` matches substring on insight text and any Topic labels already in the artifact. Richer topic graphing is post–v1 (see GitHub #466). |
| Speaker / UC2 | `speaker_id` on quotes comes from overlapping segments when `speaker` / `speaker_id` exists; otherwise `null`. `gi explore --speaker` and `gi query` speaker phrasing match graph-backed speaker names or `speaker_id` substrings when the graph has that data; expect weak or empty results if the run had no speaker signals. |
| Scale / SQL | Consumption stays file-based (`gi export`, `gi explore`, `gi query`). Postgres projection (PRD-018, RFC-051) is separate work; track with a dedicated issue if none exists. |
| Consumption CLI | `gi explore` / `gi query` implement RFC-050-style use cases with deterministic pattern mapping, not open-ended LLM question answering. Scope: GitHub #439. |
KG (same release): If you also enable `generate_kg`, see Knowledge Graph Guide § Recorded product decisions (v1, KG shallow) for extraction modes, CLI scope, and shared Postgres deferral.
Providers that support `generate_insights` (optional on the summarization protocol) produce real insight text; the pipeline then runs the evidence stack (QA + NLI) per insight to find grounded quotes. If a provider does not implement it, or the call fails, GIL falls back to a single stub insight.
Pipeline order: The main pipeline runs scrape → transcribe → (optional) summarization → write metadata. GIL runs as an optional next step after metadata for each episode. So you can run a pipeline that "ends with summaries" (and metadata) and treat GI as an optional add-on in the same run, with no separate stage: when `generate_gi` is true, each episode gets its metadata file written first, then `gi.json` is written alongside it. GI runs after summaries (and after the metadata file that contains them), not in parallel.
Evidence stack config (used when GIL is enabled) is already available:
- `embedding_model`: Model for sentence embeddings (e.g. `sentence-transformers/all-MiniLM-L6-v2` or alias `minilm-l6`).
- `embedding_device`: Device for the embedding model (`cpu`, `cuda`, `mps`, or `null` for auto).
- `extractive_qa_model`: Model for extractive question answering (e.g. `deepset/roberta-base-squad2` or alias `roberta-squad2`).
- `extractive_qa_device`: Device for the QA model.
- `nli_model`: Model for entailment scoring (e.g. `cross-encoder/nli-deberta-v3-base` or alias `nli-deberta-base`).
- `nli_device`: Device for the NLI model.
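Putting the evidence-stack keys together, a local-models sketch might look like this (aliases as listed above; the device choices are examples):

```yaml
embedding_model: minilm-l6            # alias for sentence-transformers/all-MiniLM-L6-v2
embedding_device: null                # auto-select
extractive_qa_model: roberta-squad2   # alias for deepset/roberta-base-squad2
extractive_qa_device: cpu
nli_model: nli-deberta-base           # alias for cross-encoder/nli-deberta-v3-base
nli_device: cpu
```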
Models load lazily when GIL (or another feature that uses them) is first used.
## Provider-based evidence (QA + NLI)
GIL grounding always goes through `find_grounded_quotes_via_providers` in `gi/pipeline.py`: for each insight it calls `extract_quotes` then `score_entailment` on the configured backends. `quote_extraction_provider` and `entailment_provider` (config enums, same set as `summary_provider`: `transformers`, `hybrid_ml`, `openai`, `gemini`, `mistral`, `grok`, `deepseek`, `anthropic`, `ollama`) choose which implementations run. The default, `transformers`, uses local extractive QA and NLI (aligned with `gi_qa_model` / `gi_nli_model` and the embedding/QA/NLI settings above); LLM backends use chat-style quote and entailment calls.
- `quote_extraction_provider`: Backend for quote extraction (QA): find a span that supports the insight.
- `entailment_provider`: Backend for entailment (NLI): score premise (quote) vs hypothesis (insight).
The workflow usually passes provider instances into `build_artifact`. If either instance is omitted, `create_gil_evidence_providers(cfg, summary_provider=...)` in `gi/deps.py` constructs clients from config and reuses `summary_provider` when the configured types match (so callers do not need to wire four objects by hand).
For each insight the pipeline then:

1. `extract_quotes(transcript, insight_text)` → list of candidate spans (`char_start`, `char_end`, `text`, `qa_score`).
2. For each candidate, `score_entailment(premise=quote.text, hypothesis=insight_text)` → float in [0, 1].
3. Candidates that pass the QA and NLI thresholds become `GroundedQuote` nodes and `SUPPORTED_BY` edges.
Tests and advanced use: `find_grounded_quotes` in `gi/grounding.py` calls the local QA/NLI loaders directly without a provider wrapper; it is not used inside `build_artifact`.
Quote spans (LLM extractors): The model returns `quote_text` JSON; the pipeline maps it to `char_start` / `char_end` in the transcript with `resolve_llm_quote_span` (`gi/grounding.py`): exact match, apostrophe normalization (`'` vs `’`), a whitespace-tolerant token regex, and the longest contiguous sub-phrase when the model drops a leading word (e.g. "And"). If nothing matches, that candidate is dropped (no fake offsets). Among equal-length regex matches, the earliest occurrence in the transcript wins.
Retries: Config `gi_evidence_extract_retries` (default 1) adds extra `extract_quotes` calls when the first attempt yields no candidate list. Later attempts append a short "verbatim substring" reminder to the insight text passed to the extractor only; NLI still uses the original insight string.
Thresholds (config): `gi_qa_score_min` (default 0.3) and `gi_nli_entailment_min` (default 0.5) gate candidates before and after NLI. Lower them slightly for noisy ML `summary_bullets` + Whisper transcripts if you want more grounded quotes at the cost of precision.
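As a config fragment (the uncommented values are the documented defaults; the comments are illustrative tweaks only):

```yaml
gi_qa_score_min: 0.3          # lower (e.g. 0.2) for noisy Whisper + ML bullets
gi_nli_entailment_min: 0.5    # raising it trades recall for precision
```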
Windowed local QA (long transcripts): When `gi_qa_window_chars > 0` and the transcript is longer than the window, local extractive QA runs on overlapping windows of that size and keeps the best-scoring span (then NLI as usual). `gi_qa_window_overlap_chars` controls overlap (must be < the window size). Windowing is on by default (1800 / 300); set `gi_qa_window_chars: 0` to restore a single QA pass over the full transcript (often weak on very long episodes).
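The windowing idea can be sketched as follows (illustrative; function name and return shape are assumptions, and the real pipeline additionally runs QA per window and keeps the best-scoring span):

```python
from typing import Iterator

def qa_windows(
    transcript: str,
    window_chars: int = 1800,   # mirrors the documented default
    overlap_chars: int = 300,   # must be < window_chars
) -> Iterator[tuple[int, str]]:
    """Yield (offset, text) windows so QA spans can map back to transcript coords."""
    if window_chars <= 0 or len(transcript) <= window_chars:
        yield 0, transcript  # single QA pass over the full transcript
        return
    step = window_chars - overlap_chars
    for start in range(0, len(transcript), step):
        yield start, transcript[start : start + window_chars]
        if start + window_chars >= len(transcript):
            break  # last window already reaches the end
```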
Resilience and degraded artifacts:
- Strict grounding (`GILGroundingUnsatisfiedError`): When `gi_require_grounding` is true and the stack produced zero grounded quotes while there are insights to ground, the pipeline logs a warning and sets `gi_grounding_degraded` on metrics when present. If `gi_fail_on_missing_grounding` is true, it raises `GILGroundingUnsatisfiedError` and the episode fails.
- Provider failures: `GILGroundingUnsatisfiedError` is always re-raised if it escapes the evidence stack. Any other exception from the stack is caught, logged at debug, and the pipeline emits a stub or multi-insight artifact with empty quote lists (insights ungrounded) so the run can continue when strict mode is off.
- Malformed returns: Provider return types are validated. If `extract_quotes` returns a non-list (e.g. dict or null), that pass yields no candidates and does not crash. Each candidate must have `char_start`, `char_end`, `text`, and `qa_score`; candidates missing these are skipped. If `score_entailment` fails for a candidate, that candidate is skipped and the rest are still processed.
Defaults vs API summaries: Raw defaults for `quote_extraction_provider` and `entailment_provider` are `transformers` (local extractive QA + CrossEncoder NLI; install `.[ml]` / `sentence-transformers`). When `gil_evidence_match_summary_provider` is true (default) and `generate_gi` is on, `Config` rewrites both fields from `transformers` to `summary_provider` if `summary_provider` is an API LLM (`openai`, `gemini`, `anthropic`, `mistral`, `deepseek`, `grok`, `ollama`) or `hybrid_ml`, so a typical "OpenAI + summary bullets" run uses OpenAI for grounding without extra keys. Set `gil_evidence_match_summary_provider: false` to keep local evidence with API summaries (an intentional hybrid). If you override only one evidence field to an API and leave the other on `transformers`, the CLI logs a WARNING about local NLI deps.
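A sketch of the intentional hybrid described above (API summaries with local evidence), using the documented keys:

```yaml
summary_provider: openai
generate_gi: true
gil_evidence_match_summary_provider: false
# Explicit pins (already the raw defaults) make the choice deliberate:
quote_extraction_provider: transformers
entailment_provider: transformers
```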
## GIL evidence: capability × provider matrix
Each `summary_provider` / evidence backend implements `extract_quotes` and `score_entailment` for GIL (see `gi/grounding.py` and the provider modules). `generate_insights` is separate: it is used only when `gi_insight_source: provider`.
| Backend | `extract_quotes` | `score_entailment` | `generate_insights` (insight wording) | Notes |
|---|---|---|---|---|
| `transformers` | Yes (local QA) | Yes (CrossEncoder NLI) | No (returns `[]`) | Requires `.[ml]`; `gi_qa_model` / `gi_nli_model`. |
| `hybrid_ml` | Yes | Yes | No | Same local evidence stack; map/reduce summaries. |
| `openai` | Yes (LLM JSON span) | Yes (LLM score) | Yes | API key; optional `openai_insight_model` for insights only. |
| `gemini` | Yes | Yes | Yes | `GEMINI_API_KEY`. |
| `anthropic` | Yes | Yes | Yes | `ANTHROPIC_API_KEY`. |
| `mistral` | Yes | Yes | Yes | `MISTRAL_API_KEY`. |
| `deepseek` | Yes | Yes | Yes | `DEEPSEEK_API_KEY`. |
| `grok` | Yes | Yes | Yes | `GROK_API_KEY`. |
| `ollama` | Yes | Yes | Yes | Local Ollama server. |
Mixing backends (e.g. OpenAI summaries + local entailment) is supported; use `gil_evidence_match_summary_provider` and explicit `quote_extraction_provider` / `entailment_provider` to make the choice deliberate. See CONFIGURATION — GIL evidence providers.
## Output Artifact: `gi.json`
When GIL is enabled, each episode output directory receives a `gi.json` file alongside the existing metadata document (e.g. `metadata/<episode>.metadata.json`). The metadata document is extended for consistency: it includes an optional `grounded_insights` section with provenance only (artifact path, insight count, `generated_at`, `schema_version`). The full GI graph stays in `gi.json`; the metadata model stays the single place that describes all episode artifacts (summary + GI index). For full insight content and quotes, use `gi.json`; for a quick index and discovery, use the metadata file's `grounded_insights` field.
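A hypothetical shape of that metadata section, for orientation only: the field names other than `generated_at` and `schema_version` are illustrative guesses from the provenance list above, and the artifact itself is authoritative.

```json
{
  "grounded_insights": {
    "artifact_path": "ep001.gi.json",
    "insight_count": 7,
    "generated_at": "2024-01-01T00:00:00Z",
    "schema_version": "1.0.0"
  }
}
```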
The `gi.json` file contains:

- Schema and provenance: `schema_version`, `model_version`, `prompt_version`, `episode_id`.
- Nodes: Podcast, Episode, Speaker, Topic, Insight, Quote.
- Edges: e.g. `HAS_INSIGHT` (Episode → Insight), `SUPPORTED_BY` (Insight → Quote), `ABOUT` (Insight → Topic).
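An abbreviated, illustrative shape of the graph (not schema-authoritative; node/edge field names here are assumptions, so consult `gi.schema.json` for the required properties):

```json
{
  "schema_version": "1.0.0",
  "episode_id": "ep001",
  "nodes": [
    {"id": "insight:abc", "type": "Insight", "text": "...", "grounded": true},
    {"id": "quote:def", "type": "Quote", "text": "...", "char_start": 120, "char_end": 180}
  ],
  "edges": [
    {"type": "SUPPORTED_BY", "source": "insight:abc", "target": "quote:def"}
  ]
}
```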
### Insight text provenance in `gi.json` (`model_version`)
There is no separate `gi_insight_model` config field. The top-level `model_version` records which model lineage produced the insight strings for that artifact, derived at build time by `podcast_scraper.gi.provenance.resolve_gil_artifact_model_version`:
- `gi_insight_source: stub` → `"stub"`.
- `summary_bullets` → the summarization model (from the live `summary_provider.summary_model` when present, otherwise from `Config`, e.g. `summary_model` / `openai_summary_model` per provider).
- `provider` → the model used for `generate_insights`. For OpenAI, that is `openai_insight_model` when set, otherwise `openai_summary_model` (the same instance exposes `insight_model` on the provider after init).
Use this field for audits and metrics, not a second hand-maintained model id in YAML.
The file is co-located with the transcript and summary. The logical "full" set of grounded insights across the project is the union of all per-episode `gi.json` files (no global store in v1).
## Schema and Validation
- Ontology: GIL Ontology — node/edge types, required properties, grounding contract, ID rules.
- JSON Schema: `gi.schema.json` — machine-readable validation. Generated `gi.json` files must conform to the schema; validation utilities (e.g. in the `gi` package) can be used in tests and CI.
- CLI: `podcast_scraper gi validate <paths…> [--strict]` — same idea as `kg validate` (see CLI API).
- Makefile / script: `make validate-gi-schema [ARTIFACTS_DIR=path]` or `scripts/tools/validate_gi_schema.py` (always strict) for batch checks in CI.
- PRD-017 metrics: `make gil-quality-metrics DIR=<run_root>` or `scripts/tools/gil_quality_metrics.py` — aggregation over `.gi.json` (grounding rate, quote span/timestamp validity, per-episode density). Use `--enforce` and `--min-*` flags to gate releases.
Two views of "quote validity": The file-based quality script checks schema (spans, `transcript_ref`, timestamps) and does not load transcript files. During a pipeline run, `Metrics` / `record_gi_success_counts` can also compute verbatim agreement between the quote text and the transcript slice when the transcript is available. Use the same view when comparing numbers; they are related but not identical.
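The run-time verbatim check can be approximated like this (an illustrative sketch, not the `record_gi_success_counts` implementation; the quote dict fields follow the candidate shape described earlier in this guide):

```python
def verbatim_agreement(transcript: str, quotes: list[dict]) -> float:
    """Fraction of quotes whose text exactly matches the transcript slice
    at their recorded character offsets."""
    if not quotes:
        return 1.0  # vacuously perfect agreement with no quotes
    hits = sum(
        1 for q in quotes
        if transcript[q["char_start"] : q["char_end"]] == q["text"]
    )
    return hits / len(quotes)
```

The file-based quality script never loads transcripts, so it cannot compute this number; compare like with like.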
## ML vs LLM evidence — outcome benchmark (v1)
For comparing outcomes on the same episodes (not YAML threshold parity), use a fixed episode set and two runs, e.g. local `transformers` evidence vs OpenAI `extract_quotes` / `score_entailment` with the same `summary_provider` and `gi_insight_source`. Pair configs live under `config/manual/gil_paired_benchmark_*.yaml`. After the runs, compare with `make compare-gil-runs REF=<ref_run_root> CAND=<cand_run_root>` or `python scripts/tools/compare_gil_runs.py <ref_run_root> <cand_run_root>`, which prints per-episode quote/grounded counts and a short agreement summary.
Methodology and limits: WIP: GIL ML vs OpenAI outcome benchmark.
## CLI
Available commands (RFC-050):
- `gi validate`: Validate one or more `.gi.json` files or directories (recursive). Options: `--strict` (full JSON Schema), `-q` / `--quiet` (only failures). Exit 3 if no artifacts are found; 1 if any file fails validation (aligned with `kg validate`).
- `gi export`: Export all `*.gi.json` under `--output-dir` as `--format ndjson` (default) or `merged` (single `gi_corpus_bundle` JSON). Options: `--out PATH` or stdout, `--strict` (fail on schema errors). Mirrors `kg export` (RFC-056 symmetry).
- `gi inspect`: Inspect a single episode's grounded insights. Use `--episode-path <path>` to point to a `.gi.json` file, or `--output-dir <dir>` and `--episode-id <id>` to locate the artifact. Options: `--format pretty|json`, `--show` (full insight text and quotes), `--stats` / `--no-stats`, `--strict` (strict schema validation).
- `gi show-insight`: Show one insight by ID with its quotes and evidence spans. Use `--id INSIGHT_ID` (required). Locate the episode via `--episode-path <path>` or `--output-dir <dir>` (scans for the artifact containing that insight). Options: `--format pretty|json`, `--context-chars N` (transcript context around quotes).
- `gi explore`: Cross-episode query (RFC-050 UC5). Use `--output-dir <path>` (required). Optional `--topic <label>` (Topic label or substring in insight text) and `--speaker <substring>` (substring match on quote `speaker_id` from diarization, or on graph-backed `speaker_name` resolved from SPOKEN_BY → Speaker when the quote has no id). Options: `--limit N`, `--grounded-only`, `--min-confidence`, `--sort confidence|time`, `--format pretty|json`, `--out <path>`, `--strict`. JSON output follows the RFC-050 Insight Explorer shape (`topic` object when `--topic` is set, nested `episode` and `supporting_quotes`, `top_speakers`). Exit codes: 0 success, 2 invalid args, 3 no artifacts / bad output dir, 4 no matching insights when `--topic` or `--speaker` is set, 5 `--strict` validation failed on an artifact.
- `gi query`: RFC-050 UC4: maps a fixed set of English question patterns (not open-ended NL interpretation) to `gi explore` or a topic leaderboard, then prints an envelope (`question`, `answer`, `explanation`). Use `--output-dir <path>` and `--question "<text>"` (required). Optional `--limit N`, `--format pretty|json` (default json), `--strict`. Supported patterns include topic phrasing (`What insights about …?`, `What insights are there about …?`, `What are insights about …?`, `Insights about …`, `Show me insights about …`, `Tell me about insights on …`), speaker phrasing (`What did <name> say?`), combined speaker + topic (`What did <name> say about <topic>?`), and topic ranking (`Which topics have the most insights?`, `Top topics`, etc.). Leaderboard answers use `answer.topics` (insight counts per topic label) instead of the explore `insights` list. Exit codes: same as explore for load/validation, plus 2 when the question does not match any supported pattern (or required args are missing). See Recorded product decisions (v1, issue 460).
Entrypoint examples: `podcast_scraper gi validate ./output/metadata --strict`, `gi export --output-dir /path/to/output --format ndjson`, `gi inspect --episode-path /path/to/ep.gi.json`, `gi show-insight --id insight:<id-from-gi.json> --output-dir /path/to/output`, `gi explore --output-dir /path/to/output [--topic "AI"] [--speaker "HOST"]`, or `gi query --output-dir /path/to/output --question 'What insights about inflation?'`.
## Browser visualization (prototype)
To explore `gi.json` (and optionally `kg.json`) as an interactive graph with filters, metrics, Chart.js bars, and either vis-network or Cytoscape.js, run `make serve-gi-kg-viz` from the repo root and open `http://127.0.0.1:8765/`. Prefer this over opening the HTML files as `file://`, so browser CDNs load reliably. Full usage and layout of the static app: Development Guide — GI / KG browser viewer and `web/gi-kg-viz/README.md`.
## Use Cases
- Topic research: Find all insights (and evidence) for a topic across episodes.
- Speaker mapping: See which speakers said which quotes and which insights they support.
- Evidence-backed retrieval: Filter for grounded-only insights; jump from insight to quote to timestamp in the transcript.
- Insight Explorer: Browse insights by topic with quotes and episode context (RFC-050).
## Troubleshooting
- No `gi.json`: Ensure `generate_gi` is `true` and the pipeline ran past the GIL stage. Check that the episode has a transcript and that the GIL stage did not error (logs).
- Validation errors: Run artifact validation against `docs/architecture/gi/gi.schema.json`; fix any missing required fields or invalid types (see ontology for rules).
- Ungrounded insights: Some insights may be `grounded=false` if no quote passed the evidence thresholds. This is intentional transparency; you can tune QA/NLI thresholds in config when available.
- Evidence stack / model load: If embedding, QA, or NLI models fail to load, check the device settings (e.g. `embedding_device`, `nli_device`) and that `sentence-transformers` and `transformers` are installed (e.g. `pip install -e ".[ml]"`).
- Provider-based evidence: If you set `quote_extraction_provider` or `entailment_provider` to an LLM (e.g. `openai`), ensure that provider is initialized (same as for summarization) and the required API key is set. LLM providers implement `extract_quotes` and `score_entailment` via their chat API; ML providers use the local QA and NLI models (see `gi_qa_model`, `gi_nli_model`).
## Related Documentation
- Recorded product decisions (v1, issue 460) — shallow v1 scope table (ML, topics, speakers, SQL deferral, CLI).
- PRD-005: Episode summarization — summaries as the fast consumption layer above transcripts.
- Knowledge Graph Guide — KG as a separate navigation layer (`kg` vs `gi`).
- GIL Ontology — full ontology and grounding contract.
- GIL Schema — JSON schema for `gi.json`.
- PRD-019: Knowledge Graph Layer (KG) — separate feature from GIL (`kg` vs `gi`; entities/linking, not evidence-first insights).
- Pipeline and Workflow Guide — where GIL fits in the pipeline.
- Development Guide — GI / KG browser viewer — optional local UI for `gi.json` / `kg.json`.
- Architecture — GIL extraction and artifact layout.
- Provider Configuration Quick Reference — config keys and provider options.
- RFC-049 — GIL core concepts and evidence stack; includes an implementation note on provider-based QA/NLI (`quote_extraction_provider`, `entailment_provider`).