RFC-075: Corpus Topic Clustering Layer¶
- Status: Draft
- Authors: Engineering (podcast_scraper)
- Stakeholders: Viewer / server maintainers, search and CIL owners
- Related PRDs:
docs/prd/PRD-021-semantic-corpus-search.md(embeddings source)- Related RFCs:
docs/rfc/RFC-061-semantic-corpus-search.md(FAISS,kg_topicindexing)docs/rfc/RFC-072-canonical-identity-layer-cross-layer-bridge.md(topic_id_aliases, CIL)docs/rfc/RFC-068-corpus-digest-api-viewer.md(digest surfaces topics)- Related UX specs:
docs/uxs/UXS-004-graph-exploration.md(graph rendering)- Related Documents:
docs/architecture/ARCHITECTURE.mdtests/fixtures/search/topic_clusters_validation.example.yaml(minimal schema example for unit tests only; not a maintained dataset)
GitHub tracking:
- #551 — Phase 0: validation dataset
- #552 — Phase 1: clustering module + CLI + tests
- #553 — Phase 2: auto
topic_id_aliases - #554 — Phase 3: API + viewer compound nodes
- #555 — Phase 4 (optional): KG prompt nudges
- #556 — Post-implementation: 3-pass holistic review
Abstract¶
This RFC defines a Corpus Topic Clustering Layer: an optional, well-bounded post-processing step that groups semantically similar KG topic nodes across episodes using embeddings already produced for semantic search. It writes a small corpus-level artifact (topic_clusters.json), can feed auto-generated topic_id_aliases for CIL, and lets the GI/KG viewer render Cytoscape compound parent nodes so users see topic families without losing per-episode labels.
Architecture alignment: The layer sits beside per-episode KG/GI extraction and beside CIL: it reads KG topic labels and search-time vectors, does not replace KG or FAISS, and does not require a database. Evolving clustering (threshold, algorithm, secondary signals) should be possible by changing this layer and its artifact only.
Multiple id lenses: The design is pragmatic, not a single unified topic ontology. You still have embedding space (FAISS), topic:… ids in KG and index metadata, tc:… compound parents for viewer layout (graph_compound_parent_id), and CIL merge targets (cil_alias_target_topic_id) plus hand overrides — they stay aligned by convention and tooling. A concise table lives in the Semantic Search Guide — Id spaces (mental model).
Problem Statement¶
KG topics use topic:{slug} derived from LLM-chosen labels. The same real-world subject often appears under different slugs and labels across episodes and feeds. Example from a 40-episode two-feed corpus: topic:strait-of-hormuz-conflict, topic:strait-of-hormuz, topic:strait-of-hormuz-control, and topic:marine-traffic-disruption all relate to Hormuz shipping and tension, but the merged graph shows them as unrelated nodes. CIL timelines and search lift that depend on exact topic: ids under-count cross-episode continuity.
Use cases:
- Viewer graph: Operator loads a multi-episode corpus and sees topic variants grouped visually (compound nodes) with a shared canonical label.
- CIL topic timeline: Queries keyed to a canonical topic id find more episodes once aliases map variants to that id.
- Offline operation: Clustering runs at build or index time; no external API calls at query or view time.
Goals¶
- Bounded module: Implement clustering behind a clear interface (inputs: topic id, label, optional description, embedding vector, contributing episode ids; output: cluster assignments and
topic_clusters.json). - Filesystem artifact: Emit
<corpus>/search/topic_clusters.jsonalongside existing search index files; no new storage backend. - Optional validation YAML: Operators may pass any path to
--validate-configwith expected-same pairs and expected-distinct pairs while tuning clustering; there is no committed canonical validation file inconfig/. Unit tests use a minimal fixture undertests/fixtures/search/for YAML shape only. - Integration hooks: CLI command to (re)build clusters; optional HTTP route to serve the JSON; viewer consumes it as an overlay (compound parents), without mutating raw
*.kg.json. - CIL bridge: Optional generation of
topic_id_aliasesentries from clusters; document merge rules with hand-authored overrides.
Constraints and assumptions¶
Constraints:
- No query-time calls to external embedding APIs; reuse the same local model as indexing (e.g.
sentence-transformers/all-MiniLM-L6-v2per RFC-061). - Backward compatible when
topic_clusters.jsonis absent (viewer and APIs degrade gracefully). - Corpus layout remains filesystem-first; artifact size should stay small (on the order of kilobytes for hundreds of topics).
Assumptions:
kg_topic(or equivalent) vectors exist in the FAISS store for topics to cluster, or labels can be re-embedded with the same model for a standalone clustering run.- Operators accept that pure embedding similarity can confuse geography-qualified labels (e.g. Cuban vs Iranian economic crisis); an optional ad-hoc validation YAML can encode must-not-merge cases while tuning.
Design and implementation¶
1. Layer placement and boundaries¶
The Corpus Topic Clustering Layer is not "semantic search" itself; it consumes embeddings produced for search (and KG topic text from artifacts). Position in the stack:
- Below: KG extraction (
*.kg.json), FAISS indexing (kg_topicrows). - Peer: CIL (
cil_lift_overrides.json,topic_id_aliases) for cross-episode identity. - Above: Server routes and viewer graph merge, which read
topic_clusters.jsonas read-only overlay data.
Isolation contract: Core clustering logic should avoid importing FastAPI or Vue; adapters load FAISS / walk *.kg.json and call the pure function or class that returns clusters.
2. Algorithm and parameters¶
- Similarity: Cosine similarity on topic embedding vectors (same space as semantic search).
- Clustering: Greedy average-linkage merging with a configurable cosine threshold (starting point 0.75; tune against your corpus using optional
--validate-configYAML). Repeatedly merge the two clusters whose mean pairwise similarity between members is highest, while that mean remains at or above the threshold. This avoids single-linkage chaining (weak A–C links via B). - Canonical label per cluster: The member whose embedding has the highest average cosine similarity to all other members in the cluster (centroid-closest label), not merely the shortest string (short labels can be too narrow, e.g. "ICE Deployment").
Validation YAML (operator-authored, any path): Uses expected_merge_pairs (exactly two topic: ids that must co-cluster) for CLI checks, because full human “story” groups often fail pairwise embedding thresholds at 0.75 (e.g. Supreme Court vs tariff refunds in the same episode). expected_distinct lists pairs that must not co-cluster. Optional per-row episode_sources maps each topic: id in that row to one or more episode_id strings (same as metadata / FAISS kg_topic metadata) where that Topic appears in *.kg.json; clustering ignores episode_sources (human backtracking only).
Known risk: Geography-qualified topics ("Cuban Economic Crisis" vs "Economic Crisis in Iran") may score high on similarity. Mitigations: encode as expected-distinct in validation; if needed later, add a secondary signal (e.g. named-entity overlap) without changing the artifact schema version in a breaking way.
3. Artifact schema (topic_clusters.json)¶
Versioned JSON written under <corpus>/search/topic_clusters.json. Current writers emit schema_version: "2". Readers accept v1 payloads (legacy field names below) for older files on disk.
Two different ids per cluster (do not conflate):
| Field (v2) | Legacy (v1) | Prefix | Role |
|---|---|---|---|
graph_compound_parent_id |
cluster_id |
tc: |
Viewer / Cytoscape only: id of the compound TopicCluster parent node. Never used as a CIL merge target. |
cil_alias_target_topic_id |
canonical_topic_id |
topic: |
CIL / topic_id_aliases: the topic id non-centroid members should alias to when --merge-cil-overrides runs. Never used as the graph compound parent id. |
Human-facing label (unchanged across versions): canonical_label — display string for the cluster (centroid-closest member label), not a graph or CIL id.
Illustrative v2 shape:
{
"schema_version": "2",
"model": "sentence-transformers/all-MiniLM-L6-v2",
"threshold": 0.75,
"clusters": [
{
"graph_compound_parent_id": "tc:strait-of-hormuz",
"cil_alias_target_topic_id": "topic:strait-of-hormuz",
"canonical_label": "Strait of Hormuz",
"member_count": 4,
"members": [
{
"topic_id": "topic:strait-of-hormuz-conflict",
"label": "Strait of Hormuz Conflict",
"similarity_to_centroid": 0.95,
"episode_ids": ["episode-id-or-stem"]
}
]
}
],
"singletons": 155,
"topic_count": 194,
"cluster_count": 12
}
- Members: List actual
topic:…ids present in KG graphs; viewer sets Cytoscapeparenttograph_compound_parent_idwhen both cluster parent and member exist in the merged graph.
4. CIL: auto-generated topic_id_aliases¶
Optional post-step: map each non-centroid member topic_id to the cluster’s cil_alias_target_topic_id (slug from centroid-closest label). podcast_scraper.cli topic-clusters --merge-cil-overrides writes topic_clusters.json, derives aliases, then merges into cil_lift_overrides.json.
Precedence: Existing keys in the file’s topic_id_aliases win over freshly derived auto entries (hand edits and prior merged runs stay in control). New topic ids only in the auto map are appended. Other top-level keys (transcript_char_shift, entity_id_aliases) are preserved.
5. Viewer: Cytoscape compound nodes¶
- Load
topic_clusters.json(via API or static path) when loading a corpus. - After normal GI+KG merge, inject a parent node per cluster with
type: TopicCluster(or agreed equivalent) and setdata.parenton member topic nodes that appear in the graph. - Style compound parents in
cyGraphStylesheet.ts(e.g. rounded rectangle, low-opacity fill, dashed border, canonical label). - Do not rewrite raw KG files; overlay only.
6. HTTP API¶
GET /api/corpus/topic-clusters: returnssearch/topic_clusters.jsonfor the resolved corpus root (pathquery or server default). 404 JSON body{ "detail": "...", "available": false }when the file is absent.
7. CLI¶
topic-clusters: rebuildtopic_clusters.jsonfor a corpus root (--threshold, optional--validate-config, optional--merge-cil-overrides).
Key decisions¶
- Compound nodes vs collapsing ids: Use Cytoscape compound parents so per-episode labels remain visible; collapsing everything to one
topic:id is optional via aliases for CIL/search but not required for the graph overlay. - Centroid-closest canonical label: Reduces risk of overly narrow canonical strings compared to "shortest label only."
- Draft RFC not in index: Per project convention, Draft RFCs are not listed in
docs/rfc/index.mduntil promoted.
Alternatives considered¶
- Rewrite all topic ids in merge: Simplifies CIL but loses distinct labels in the graph; rejected for viewer clarity.
- Only manual
topic_id_aliases: Does not scale with corpus growth; rejected as sole solution. - New database for clusters: Rejected; filesystem JSON is sufficient at current scale.
Testing strategy¶
- Unit tests: Pure clustering on small synthetic embedding matrices; canonical label selection.
- Integration / manual: Run
topic-clusters --validate-config <your.yaml>on a corpus whose index contains the listedtopic:ids; there is no required repo-wide validation dataset. - Regression: When
topic_clusters.jsonis missing, server and viewer paths must not error.
Rollout and monitoring¶
Phases:
- Clustering module + CLI + tests (fixture YAML for shape only).
- Auto-alias generation (
topic-clusters --merge-cil-overrides). - API route (
GET /api/corpus/topic-clusters) + viewer compoundTopicClusterparents (Cytoscapeparent). - KG extraction prompts nudge short, stable topic headings (about 2–8 words; detail in
description) to improve cross-episode alignment (kg/llm_extract.pysystem prompts + shared Jinja user prompts).
Success criteria:
- When using optional validation YAML locally, constraints pass under chosen threshold (or documented waiver for specific pairs).
topic_clusters.jsonsize and generation time acceptable on a 40-episode reference corpus.- Viewer shows compound grouping for at least one multi-episode Hormuz-style cluster in manual QA.
Post-implementation holistic review¶
Completed four passes (data flow, edge cases, documentation, second consistency audit). The detailed review notes were folded into the unified quality plan during the post-2.6 cleanup; see the implementation notes inline below for what survived as durable guidance.
Autoresearch Findings (2026-04-18, PR feat/pipeline-validation-591)¶
Threshold sweep on production corpus (1178 unique topics, 10 feeds)¶
| Threshold | Clusters | Singletons | Singleton% |
|---|---|---|---|
| 0.60 | 208 | 504 | 43% |
| 0.65 | 177 | 655 | 56% |
| 0.70 | 129 | 809 | 69% |
| 0.75 (old) | 88 | 929 | 79% |
| 0.80 | 56 | 1019 | 87% |
Default lowered from 0.75 → 0.70: +47% more clusters (88 → 129) with manageable max cluster size (19). Manual inspection confirms all new clusters are legitimate merges (quantum topics cluster, geopolitical topics cluster, etc.).
KG label quality is the biggest lever for clustering¶
Issue #580 showed Gemini produces 156-char sentence labels → 90% singletons. Root cause: sentence-shaped labels have low pairwise embedding similarity. KG v2 prompt (#590) + label enforcement (#587, 50-char cap) address this. Recommend re-sweep after deploying v2 prompt on production corpus — optimal threshold may shift back to 0.75 with cleaner labels.
Open questions¶
Viewer / search (cluster parents vs hits): Phase 1 shows Topic cluster: context (canonical
label) in the graph node rail when topic_clusters.json is loaded and the selected node is a
member Topic; selection stays on the Topic. Optional follow-ups (search-hit emphasis on
compounds, Show on graph camera to cluster) remain product decisions — see
UXS-004 and UXS-005.
Data (FAISS vs JSON): Canonical cluster membership stays in topic_clusters.json (and
HTTP passthrough) unless a future feature requires denormalized cluster ids on kg_topic FAISS
rows.
References¶
docs/rfc/RFC-061-semantic-corpus-search.mddocs/rfc/RFC-072-canonical-identity-layer-cross-layer-bridge.mddocs/architecture/ARCHITECTURE.mdtests/fixtures/search/topic_clusters_validation.example.yaml(test-only minimal example)docs/wip/wip-topic-clusters-validation-reference.yaml(optional WIP copy of a richer validation set for local--validate-config; not used by CI)