PRD-018: Database Projection for GIL and KG¶
- Status: Draft
- Authors: Podcast Scraper Team
- Related RFCs:
- RFC-044 (Model Registry – prerequisite)
- RFC-042 (Hybrid ML Platform – prerequisite)
- RFC-049 (Core GIL Concepts – GIL dependency)
- RFC-050 (Use Cases & Insight Explorer – GIL)
- RFC-051 (Database Projection – technical design for GIL and KG)
- RFC-055 (Knowledge Graph – artifact model for KG projection)
- RFC-056 (Knowledge Graph – use cases and consumption patterns for KG queries)
- Related Issues: #31, #40, #50
- Related PRDs (in scope):
- PRD-017: Grounded Insight Layer – source gi.json and GIL semantics
- PRD-019: Knowledge Graph Layer – source KG artifacts per RFC-055 (kg export and tables are in scope here; implementation may ship after GIL export when KG artifacts exist)
- Related Documents:
- docs/architecture/gi/ontology.md, docs/architecture/gi/gi.schema.json – GIL ontology and schema
- docs/architecture/kg/ontology.md, docs/architecture/kg/kg.schema.json – KG ontology and schema (RFC-055)
Summary¶
This PRD defines database projection for both the Grounded Insight Layer (GIL) and the Knowledge Graph Layer (KG). It enables fast, queryable access by projecting file-based artifacts into a relational database (Postgres). Files remain the canonical, auditable source of truth; the database is a derived serving layer for SQL, indexing, and incremental upserts.
GIL projection covers gi.json (insights, quotes, grounding, topics, speakers) and supports UC1–UC5 / Insight Explorer–style queries (RFC-050) without scanning every episode directory.
KG projection covers per-episode KG artifacts defined in RFC-055 (entities, topics, typed edges, episode anchoring). It supports cross-episode linking and discovery queries that are not the same as GIL's evidence-first contract. GIL and KG use separate tables and provenance columns – no merging of the two layers into a single JSON blob or a single undifferentiated "graph" table without a typed source (see RFC-051).
Together, these projections give analysts, notebook users, and integrators a far better tooling surface, while preserving on-disk layouts and independent feature flags (generate_gi vs KG generation per RFC-055).
Background & Context¶
The Grounded Insight Layer (PRD-017) produces structured gi.json files per episode, enabling evidence-backed insight retrieval. However, file-based access has limitations:
- Scalability: Scanning N folders and parsing N JSON files becomes slow at scale (50 episodes is fine; 5,000 is painful)
- Query Performance: No indexing, filtering, or pagination for global queries
- Integration Friction: Downstream tools (CLI, notebooks, web UIs, agents) must implement file scanning logic
- Iteration Speed: Re-running extraction or comparing runs requires manual file operations
The Core Problem: Files are canonical and auditable, but they're not optimized for fast queries and tooling.
The Solution: Export to Postgres as a "projection layer" that preserves provenance while enabling SQL-based queries, indexing, and incremental updates. Files remain the source of truth; the database becomes the serving layer.
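The projection flow described above can be sketched end to end. This is a minimal illustration only: it assumes a flat output layout (one gi.json per episode directory), a single stand-in insights table, and sqlite3 in place of Postgres; the real export targets the full RFC-051 schema.

```python
import json
import sqlite3
from pathlib import Path


def project_gi_files(output_dir: str, conn: sqlite3.Connection) -> int:
    """Scan per-episode directories and upsert gi.json contents into the DB.

    Illustrative sketch: the real export targets Postgres and the full
    RFC-051 schema; here a single 'insights' table stands in.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS insights ("
        "  id TEXT PRIMARY KEY,"
        "  episode_id TEXT NOT NULL,"
        "  text TEXT NOT NULL,"
        "  gi_path TEXT NOT NULL)"  # provenance: source file for every row
    )
    count = 0
    for gi_path in Path(output_dir).glob("*/gi.json"):
        doc = json.loads(gi_path.read_text())
        for insight in doc.get("insights", []):
            # Upsert keeps re-exports idempotent (files stay canonical)
            conn.execute(
                "INSERT INTO insights (id, episode_id, text, gi_path) "
                "VALUES (?, ?, ?, ?) "
                "ON CONFLICT(id) DO UPDATE SET text = excluded.text",
                (insight["id"], doc["episode_id"], insight["text"], str(gi_path)),
            )
            count += 1
    conn.commit()
    return count
```

Because rows carry the source gi_path, any result can be traced back to its canonical file on disk.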
How it relates to existing features:
- Grounded Insight Layer (PRD-017): Database export projects GIL data from gi.json files into tables
- Metadata Generation (PRD-004): Episode metadata is also exported for complete episode context
- Transcript Pipeline (PRD-001): Transcript paths are stored for evidence resolution
- GIL Use Cases (RFC-050): Database export makes UC1–UC5 queries fast and ergonomic
- Knowledge Graph (PRD-019, RFC-055): When KG artifacts exist, database export can project KG nodes and edges for fast entity/topic/relationship queries across episodes – same projection product as GIL, separate tables (see RFC-051)
Goals¶
- Enable Fast Queries: Transform file scanning into indexed SQL queries for UC1–UC5 (GIL) and for cross-episode KG traversals defined in RFC-055
- Support Notebook Research Workflows: Enable power users to build topic dossiers, speaker profiles, and insight timelines (GIL); enable entity-centric and relationship queries (KG) where implemented
- Provide Stable Integration Interface: Give downstream tools (CLI, notebooks, web UIs, agents) a consistent SQL interface for both layers
- Preserve Provenance: Every database row is traceable back to source files (gi.json, KG artifact path, transcript evidence) and extraction metadata
- Enable Incremental Growth: Support upserting new episodes without reprocessing historical data
- Keep GIL and KG Semantically Separate: The same Postgres instance may host both projections; schemas and CLI entrypoints (gi export vs kg export, or a documented combined mode) must not conflate grounding contracts with KG linking rules
Non-Goals (v1)¶
- Replacing file-based outputs as the source of truth (files remain canonical)
- Collapsing GIL and KG into a single on-disk artifact or a single undifferentiated graph table (both may coexist in the DB with clear source/table naming)
- Graph database as a hard dependency (relational projection is sufficient for v1; an optional graph store remains a future evaluation per RFC-055)
- End-user UI or dashboards (deferred to post-v1)
- Real-time or streaming ingestion (batch export only)
- Storing full transcript text in database (pointers only for v1)
- Advanced analytics (trends, sentiment) – deferred to post-v1
- Cross-run GIL merging logic (episode-scoped in v1)
- Guaranteed same release for GIL export and KG export (KG projection may follow once kg.schema.json and pipeline outputs are stable)
Personas¶
- Power Users (Researchers, Analysts, Founders): Need instant topic/speaker exploration, insight retrieval with supporting quotes, and the ability to build research workflows in notebooks
- Builders (Developers, Agent Builders): Need a stable SQL interface for integrating podcast data into CLI tools, web UIs, and RAG pipelines
- Iterators (ML Engineers, Pipeline Developers): Need to re-run extraction, re-export, and compare runs without manual file operations
User Stories¶
- As a researcher, I can query all episodes discussing a topic in milliseconds so that I can build topic dossiers without waiting for file scans.
- As an analyst, I can retrieve insights with supporting quotes using SQL so that I can cite findings with confidence.
- As a developer, I can build a CLI tool on top of Postgres so that I can provide instant topic/speaker queries with evidence.
- As a power user, I can build notebook workflows that query Postgres so that I can create speaker profiles, insight timelines, and quality audits.
- As an ML engineer, I can re-export after model changes so that I can compare extraction quality across runs.
Functional Requirements¶
FR1: Export Command & Interface¶
- FR1.1: Provide gi export and kg export CLI commands (or one orchestrated command with explicit layer flags) that scan the output directory and export to Postgres
- FR1.2: Support --target postgres with a --dsn connection string
- FR1.3: Support --output-dir to specify the root output directory
- FR1.4: Support a --rebuild flag to drop and recreate all tables
- FR1.5: Support incremental mode (default) that only ingests new episodes
FR2: Episode Metadata Export¶
- FR2.1: Export episode-level records (id, title, publish_date, podcast_id, file paths)
- FR2.2: Store paths to metadata.json, transcript.json, summary.json, gi.json
- FR2.3: Include schema_version and ingestion_run_id for lineage tracking
FR3: GIL Node Export¶
- FR3.1: Export global nodes (speakers, topics) with deduplicated, slug-based IDs
- FR3.2: Export episode-scoped nodes (insights, quotes) with episode_id foreign key
- FR3.3: Preserve all GIL node properties (text, label, confidence, grounded status, etc.)
- FR3.4: Include confidence scores and provenance metadata (model_version, prompt_version)
FR4: GIL Edge Export¶
- FR4.1: Export relationship tables (insight_topics, insight_support)
- FR4.3: Support filtering by confidence thresholds in queries
FR5: Evidence & Provenance Tracking¶
- FR5.1: Store transcript references (path, char_start, char_end) for quotes
- FR5.2: Store timestamp ranges (timestamp_start_ms, timestamp_end_ms) for quotes
- FR5.3: Include model_version and prompt_version for ML-derived data
- FR5.4: Ensure every row is traceable back to its gi.json path and episode_id
FR6: Query Support for UC1–UC5¶
- FR6.1: Support UC1 (Cross-Podcast Topic Research) via topics + insight_topics + insights + episodes joins
- FR6.2: Support UC2 (Speaker-Centric Insight Mapping) via speakers → quotes → insight_support → insights joins
- FR6.3: Support UC3 (Evidence-backed Quote/Insight Retrieval) via insights + insight_support + quotes joins
- FR6.4: Support UC4 (Semantic Question Answering) via topic → insight filtering
- FR6.5: Support UC5 (Insight Explorer) via topics → insight_topics → insights → insight_support → quotes joins
FR7: Incremental Updates & Rebuilds¶
- FR7.1: Support incremental ingestion (only new episodes by default)
- FR7.2: Support full rebuild (drop and recreate all tables)
- FR7.3: Handle schema version changes gracefully (warn or fail on mismatch)
- FR7.4: Support run comparison via ingestion_run_id tracking
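The FR7 semantics boil down to idempotent per-episode upserts plus an optional full rebuild. A minimal sketch, again with sqlite3 standing in for Postgres and an illustrative table shape (the real schema lives in RFC-051):

```python
import sqlite3


def upsert_episode(conn: sqlite3.Connection, episode_id: str, title: str,
                   run_id: str, rebuild: bool = False) -> None:
    """Idempotent episode upsert: incremental by default (FR7.1),
    full rebuild on request (FR7.2). Columns are illustrative only."""
    if rebuild:
        # Full rebuild: drop and recreate, then re-ingest from disk
        conn.execute("DROP TABLE IF EXISTS episodes")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "  id TEXT PRIMARY KEY,"
        "  title TEXT NOT NULL,"
        "  ingestion_run_id TEXT NOT NULL)"  # run lineage (FR7.4)
    )
    # Re-running the same export converges on the same rows
    conn.execute(
        "INSERT INTO episodes (id, title, ingestion_run_id) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title, "
        "  ingestion_run_id = excluded.ingestion_run_id",
        (episode_id, title, run_id),
    )
    conn.commit()
```

Keeping ingestion_run_id on every row is what makes run comparison (FR7.4) a plain SQL filter rather than a file diff.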
FR8: KG Artifact Export (RFC-055)¶
- FR8.1: Provide kg export (or an equivalent documented command) that scans the output directory for per-episode KG artifacts and projects them into the Postgres tables defined in RFC-051
- FR8.2: Store kg_path (and optionally kg_schema_version) on episode records when KG data is present; rows must be traceable to the source KG JSON file
- FR8.3: Export KG nodes and KG edges (or normalized equivalents) with episode scope and stable IDs per RFC-055 – separate from insights/quotes/GIL join tables
- FR8.4: Support incremental and rebuild modes analogous to GIL export; idempotent upserts per episode
- FR8.5: Validate against docs/architecture/kg/kg.schema.json when present (mirroring the make validate-gi-schema story)
FR9: Coexistence and Independence¶
- FR9.1: Exporting GIL only, KG only, or both must be supported when the corresponding artifacts exist (no requirement to enable both pipeline features to run export)
- FR9.2: Optional cross-layer links (e.g. a KG node referencing a GIL insight_id) remain out of scope for v1 unless added by a follow-up RFC; the database schema should not assume such links
Success Metrics¶
- Query Performance: UC1–UC5 queries complete in < 100ms for datasets up to 5,000 episodes; KG relational queries meet targets agreed in RFC-051 for comparable corpus sizes
- Export Speed: Export completes in < 5 minutes for 1,000 episodes (GIL and/or KG, within an order of magnitude of each other for similar row counts)
- Provenance Accuracy: 100% of database rows are traceable to source gi.json and/or KG artifact files
- Notebook Workflow Adoption: Positive feedback from power users on research workflow speed
- Integration Success: CLI tools and notebooks can query Postgres without file scanning
Dependencies¶
- PRD-017: Grounded Insight Layer (GIL data to export)
- PRD-019 / RFC-055: Knowledge Graph artifact model and schema (KG data to export when implemented)
- RFC-044: Model Registry (model metadata tracked in provenance fields)
- RFC-042: Hybrid ML Platform (produces pipeline outputs, including gi.json when GIL is enabled)
- RFC-049: Core GIL Concepts & Data Model (defines the ontology and gi.json schema)
- RFC-050: GIL Use Cases & Insight Explorer (defines UC1–UC5 query patterns)
- RFC-051: Database Projection (technical implementation details for GIL and KG tables)
Constraints & Assumptions¶
Constraints:
- Must not replace file-based outputs as source of truth
- Must preserve full provenance (traceable to gi.json and transcript evidence)
- Must support incremental updates without reprocessing historical data
- Must be idempotent and rebuildable from disk
- Must not require graph database as hard dependency
Assumptions:
- Users have access to Postgres instance (local or remote)
- Database schema can be created/managed by export command
- Transcript text can remain on disk (pointers only in v1)
- Users understand SQL or can use provided query examples
- GIL extraction (RFC-049) produces valid, schema-conforming gi.json files when GIL export is used
- KG pipeline (RFC-055) produces valid KG artifacts when KG export is used
Design Considerations¶
Transcript Text Storage¶
- Option A: Store only pointers (transcript path + spans) – DECIDED for v1
- Pros: Simpler, smaller DB, canonical stays on disk, easier migration
- Cons: Quote viewer must read disk files (acceptable for v1)
- Decision: Option A for v1. If query latency for transcript resolution exceeds 100ms per quote, add a quote_text denormalized column in v1.1.
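Under Option A, the quote viewer resolves text at query time from the stored pointers. A minimal sketch, assuming char offsets index directly into the raw transcript file contents (the actual offset semantics are defined by the GIL schema):

```python
from pathlib import Path


def resolve_quote_text(transcript_ref: str, char_start: int, char_end: int) -> str:
    """Resolve a quote from its transcript pointer (Option A: pointers only).

    Assumes char_start/char_end are offsets into the raw file text;
    transcript_ref is the path stored on the quotes row (FR5.1).
    """
    text = Path(transcript_ref).read_text(encoding="utf-8")
    return text[char_start:char_end]
```

If this per-quote file read becomes the bottleneck, the v1.1 escape hatch above (a denormalized quote_text column) removes the disk round-trip without changing the canonical files.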
Database Target Priority¶
- Option A: Postgres primary, others future – DECIDED
- Pros: Structured joins, strong constraints, simple ops, covers UC1–UC5
- Cons: May need columnar store for aggregations later
- Decision: Option A (Postgres v1, graph database evaluated at ~1000+ episodes)
Incremental vs Full Rebuild¶
- Option A: Support both modes – DECIDED
- Pros: Fast updates for new episodes, full control for schema changes
- Cons: More complex export logic
- Decision: Option A (incremental default, rebuild flag for schema changes)
Integration with Existing Features¶
Database export enhances GIL and KG by:
- GIL File Output: Projects gi.json files into queryable tables without changing the file format
- KG File Output: Projects KG artifacts (RFC-055) into queryable tables without merging them into gi.json
- Episode Metadata: Combines episode records with GIL and/or KG paths for complete context
- Transcript Evidence: Stores pointers to transcript files for quote resolution (GIL); KG may reference spans per RFC-055
- Use Case Queries: Makes UC1–UC5 queries fast and ergonomic via SQL (GIL); supports KG discovery queries per RFC-051
Example Use Cases¶
UC1: Cross-Podcast Topic Research (Postgres)¶
Question: "Show me all episodes discussing AI Regulation since Jan 2026"
SELECT e.id, e.title, e.publish_date, e.podcast_id
FROM episodes e
JOIN insight_topics it ON e.id = it.episode_id
JOIN topics t ON it.topic_id = t.id
WHERE t.label = 'AI Regulation'
AND e.publish_date >= '2026-01-01'
ORDER BY e.publish_date DESC;
Benefit: Instant topic-wide views, fast filters by date/podcast/confidence, easy pagination
UC2: Speaker-Centric Insight Mapping (Postgres)¶
Question: "What topics does Sam Altman cover, and what are his top insights?"
SELECT t.label,
COUNT(DISTINCT i.id) AS insight_count
FROM speakers s
JOIN quotes q ON s.id = q.speaker_id
JOIN insight_support isup ON q.id = isup.quote_id
JOIN insights i ON isup.insight_id = i.id
JOIN insight_topics it ON i.id = it.insight_id
JOIN topics t ON it.topic_id = t.id
WHERE s.name = 'Sam Altman'
GROUP BY t.label
ORDER BY insight_count DESC
LIMIT 10;
Benefit: Speaker profiles become one query, easy ranking by confidence or recency
UC3: Evidence-backed Quote/Insight Retrieval¶
Question: "Give me the supporting quotes + timestamps for insight X"
SELECT i.text AS insight_text,
q.text AS quote_text,
q.timestamp_start_ms,
q.timestamp_end_ms,
q.transcript_ref,
q.char_start,
q.char_end,
s.name AS speaker_name
FROM insights i
JOIN insight_support isup ON i.id = isup.insight_id
JOIN quotes q ON isup.quote_id = q.id
LEFT JOIN speakers s ON q.speaker_id = s.id
WHERE i.id = 'insight:episode:abc123:a1b2c3d4';
Benefit: Fetch insights with evidence without opening JSON, build evidence viewer UI/CLI
UC5: Insight Explorer (Postgres)¶
Question: "What are the top insights about AI Regulation with supporting evidence?"
SELECT i.text AS insight,
i.confidence,
i.grounded,
q.text AS supporting_quote,
q.timestamp_start_ms,
q.timestamp_end_ms,
s.name AS speaker,
e.title AS episode_title
FROM topics t
JOIN insight_topics it ON t.id = it.topic_id
JOIN insights i ON it.insight_id = i.id
JOIN insight_support isup ON i.id = isup.insight_id
JOIN quotes q ON isup.quote_id = q.id
LEFT JOIN speakers s ON q.speaker_id = s.id
JOIN episodes e ON i.episode_id = e.id
WHERE t.label = 'AI Regulation'
AND i.grounded = true
ORDER BY i.confidence DESC
LIMIT 20;
Benefit: Full Insight Explorer query – insights, quotes, speakers, episodes in one result set
Notebook Research Workflows¶
Topic Dossier:
import pandas as pd
import sqlalchemy
# Connect to Postgres
engine = sqlalchemy.create_engine("postgresql://...")
# Get insights + quotes + speakers about a topic
query = """
SELECT e.title, e.publish_date,
i.text AS insight, i.confidence,
q.text AS quote,
q.timestamp_start_ms,
s.name AS speaker
FROM topics t
JOIN insight_topics it ON t.id = it.topic_id
JOIN insights i ON it.insight_id = i.id
JOIN insight_support isup ON i.id = isup.insight_id
JOIN quotes q ON isup.quote_id = q.id
LEFT JOIN speakers s ON q.speaker_id = s.id
JOIN episodes e ON i.episode_id = e.id
WHERE t.label = 'AI Regulation'
ORDER BY i.confidence DESC
LIMIT 50;
"""
df = pd.read_sql(query, engine)
# Build topic dossier with insights, quotes, evidence
Speaker Profile:
# What does a speaker discuss most, with top insights
query = """
SELECT t.label,
COUNT(DISTINCT i.id) AS insight_count,
AVG(i.confidence) AS avg_confidence
FROM speakers s
JOIN quotes q ON s.id = q.speaker_id
JOIN insight_support isup ON q.id = isup.quote_id
JOIN insights i ON isup.insight_id = i.id
JOIN insight_topics it ON i.id = it.insight_id
JOIN topics t ON it.topic_id = t.id
WHERE s.name = 'Sam Altman'
GROUP BY t.label
ORDER BY insight_count DESC;
"""
Quality Audit:
# Show ungrounded insights or low-confidence quotes
query = """
SELECT i.id, i.text, i.confidence, i.grounded,
e.title
FROM insights i
JOIN episodes e ON i.episode_id = e.id
WHERE i.grounded = false
OR i.confidence < 0.5
ORDER BY i.confidence ASC;
"""
Resolved Questions¶
All product questions have been resolved. Decisions are recorded here for traceability.
- Transcript Text Storage: Should transcript text be stored verbatim or referenced only? Pointers only for v1. The quotes table stores char_start, char_end, timestamp_start_ms, timestamp_end_ms, and transcript_ref. Full text is resolved at query time by reading the transcript file. If latency exceeds 100ms per quote, add a quote_text denormalized column in v1.1.
- Entity/Topic Normalization: How aggressively should normalization occur before export? Slug-based normalization for v1. Topics are normalized to lowercase slugs (e.g., "Machine Learning" → machine-learning). Semantic dedup is deferred to v1.1, when sentence embeddings (RFC-042) are available.
- Graph-Native Store: When does a graph-native store become justified? Not for v1; trigger at ~1,000 episodes. Postgres with proper indexes handles the v1 query patterns. Graph databases will be evaluated when multi-hop traversals or graph algorithms become common queries.
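The slug normalization decision above can be sketched as a small helper. The exact rules (character classes, separator handling) ship with the export implementation; this version is an illustrative assumption:

```python
import re


def topic_slug(label: str) -> str:
    """Normalize a topic label to a lowercase slug (v1 decision).

    Illustrative sketch: lowercase, collapse runs of non-alphanumerics
    into single hyphens, and trim leading/trailing hyphens.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower())
    return slug.strip("-")
```

Slug-based IDs make the topics table deduplicate "Machine Learning" and "machine learning" into one row, which is all UC1-style joins need until semantic dedup arrives in v1.1.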
Related Work¶
- PRD-017: Grounded Insight Layer – defines the GIL data that this PRD exports
- PRD-019: Knowledge Graph Layer – defines the KG data that this PRD exports (RFC-055 artifact contract)
- RFC-044: Model Registry – model metadata tracked in provenance fields
- RFC-042: Hybrid ML Platform – produces the gi.json files
- RFC-049: Core GIL Concepts & Data Model
- RFC-050: GIL Use Cases & Insight Explorer
- RFC-051: Database Projection – technical design
- Issue #31: Metadata persistence & structure
- Issue #40: Data storage / DB integration
- Issue #50: Querying & downstream usage
Release Checklist¶
- [ ] PRD reviewed and approved
- [ ] RFC-051 finalized with technical design (GIL + KG)
- [ ] Database schema designed (GIL: insights, quotes, insight_support, insight_topics; KG: per RFC-051)
- [ ] Export command implemented (gi export; kg export or documented combined mode)
- [ ] Tests cover export, incremental updates, and rebuilds (GIL; KG when artifacts available)
- [ ] UC1–UC5 queries validated against Postgres (GIL)
- [ ] KG query examples validated when KG projection ships
- [ ] Notebook examples created and documented
- [ ] CLI integration verified (gi explore, etc.; KG CLI per RFC-055 when applicable)
- [ ] Documentation updated (README, query examples)
- [ ] Integration with GIL pipeline verified; KG pipeline verified when KG export ships