ADR-062: Sentence-Boundary Transcript Chunking

Status: Accepted
Date: 2026-04-03
Authors: Podcast Scraper Team
Related RFCs: RFC-061
Related PRDs: PRD-021

Context & Problem Statement

Semantic search (RFC-061) indexes transcript chunks as one of four document types. The chunking strategy directly affects embedding quality, search relevance, and index size. Chunks that split mid-sentence produce poor embeddings. Chunks that are too long dilute semantic signal. The system needs a chunking approach that balances embedding quality and simplicity without adding external dependencies.

Decision

We adopt sentence-boundary chunking with configurable overlap:

Sentence splitting: A simple regex, (?<=[.!?])\s+, with a fallback to splitting on \n. No external tokenizer dependency.
Target chunk size: ~300 tokens (estimated via whitespace split). Sentences are grouped until the target is reached.
Overlap: ~50 tokens of trailing sentences carried from the previous chunk to preserve cross-chunk context.
Character tracking: Each chunk records char_start and char_end offsets into the original transcript.
Timestamp interpolation: When Whisper segment timestamps are available, timestamp_start_ms and timestamp_end_ms are interpolated from character position alignment.
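The grouping steps above can be sketched as follows. This is a minimal illustration, not the project's chunker: the Chunk class stands in for TranscriptChunk, the helper names are hypothetical, and two behaviors are assumptions the ADR does not spell out (the \n fallback is applied only when the regex finds at most one sentence, and a chunk stops before the sentence that would push it past the target).

```python
import re
from dataclasses import dataclass

# Pattern from the decision above: split on whitespace that follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
NEWLINES = re.compile(r"\n+")


@dataclass
class Chunk:  # hypothetical stand-in for TranscriptChunk
    text: str
    char_start: int
    char_end: int


def _spans(pattern, text):
    """Return (start, end) offsets of the pieces of text between separator matches."""
    spans, pos = [], 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            spans.append((pos, match.start()))
        pos = match.end()
    if pos < len(text):
        spans.append((pos, len(text)))
    return spans


def split_sentences(text):
    spans = _spans(SENTENCE_BOUNDARY, text)
    if len(spans) <= 1:  # assumed fallback: unpunctuated text is split on newlines
        spans = _spans(NEWLINES, text)
    return spans


def chunk_sentences(text, target_tokens=300, overlap_tokens=50):
    spans = split_sentences(text)
    # Token counts are estimated with a whitespace split, as in the decision.
    counts = [len(text[s:e].split()) for s, e in spans]
    chunks = []
    first = 0  # index of the first sentence in the current chunk
    fresh = 0  # index of the first sentence not yet covered by any chunk
    while fresh < len(spans):
        j, total = first, 0
        # Always include at least one uncovered sentence, then fill up to the target.
        while j < len(spans) and (j <= fresh or total + counts[j] <= target_tokens):
            total += counts[j]
            j += 1
        start, end = spans[first][0], spans[j - 1][1]
        chunks.append(Chunk(text[start:end], start, end))
        fresh = j
        # Carry trailing sentences worth at most overlap_tokens into the next chunk.
        k, carried = j, 0
        while k - 1 > first and carried + counts[k - 1] <= overlap_tokens:
            k -= 1
            carried += counts[k]
        first = k
    return chunks
```

Because overlap is carried in whole sentences, the effective overlap can be smaller than overlap_tokens when the trailing sentence is long, and each chunk's text is always exactly text[char_start:char_end].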

Rationale

Embedding quality: Sentence boundaries preserve semantic coherence. Mid-sentence splits degrade embedding vectors measurably.
No external dependency: Regex splitting avoids making the chunker depend on spaCy sentence segmentation or NLTK punkt. The project already uses spaCy for NER but not for chunking, which keeps that dependency isolated.
Predictable sizing: Target + overlap parameters produce consistent chunk sizes across episodes, making index behavior predictable.
Configurable: vector_chunk_size_tokens and vector_chunk_overlap_tokens in config allow tuning without code changes.
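One way the two settings might be grouped and validated is sketched below. The ChunkingConfig class is hypothetical (this ADR does not show the project's config class); only the two field names and their defaults come from this document, and the validation rule is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:  # hypothetical container, not the project's actual config class
    vector_chunk_size_tokens: int = 300
    vector_chunk_overlap_tokens: int = 50

    def __post_init__(self):
        if self.vector_chunk_size_tokens <= 0:
            raise ValueError("vector_chunk_size_tokens must be positive")
        # Assumed rule: overlap must be smaller than the chunk size, otherwise
        # consecutive chunks could never advance through the transcript.
        if not 0 <= self.vector_chunk_overlap_tokens < self.vector_chunk_size_tokens:
            raise ValueError("vector_chunk_overlap_tokens must be in [0, chunk size)")
```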

Alternatives Considered

Fixed-token chunking (no sentence awareness): Rejected; frequently splits mid-sentence, degrading embedding quality.
Paragraph-based chunking: Rejected; podcast transcripts often lack paragraph structure, so this would produce wildly variable chunk sizes.
Recursive character splitting (LangChain-style): Rejected; adds a dependency, is more complex than needed, and the recursive fallback logic is unnecessary when sentence splitting works well for transcripts.
spaCy sentence segmentation: Rejected; makes chunking depend on a heavy library when regex splitting is sufficient for English transcripts.

Consequences

Positive: Consistent, high-quality chunks. No new dependencies. Configurable parameters. Timestamp tracking enables "jump to audio" in the viewer.
Negative: Regex sentence splitting may fail on edge cases (abbreviations, URLs, decimal numbers). This is acceptable for podcast transcripts, which are conversational text.
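A quick check of the regex from the Decision section shows how these edge cases behave. The sample sentences are made up for illustration.

```python
import re

# Pattern quoted in the Decision section.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text wherever whitespace follows '.', '!' or '?'."""
    return SENTENCE_BOUNDARY.split(text)


print(split_sentences("Dr. Smith joined us."))      # ['Dr.', 'Smith joined us.']
print(split_sentences("The U.S. economy grew."))    # ['The U.S.', 'economy grew.']
print(split_sentences("It costs 3.50 dollars."))    # ['It costs 3.50 dollars.']
print(split_sentences("Visit example.com today."))  # ['Visit example.com today.']
```

With this pattern, abbreviations followed by a space are the clearest failure. Decimals and URLs stay intact as long as no whitespace follows the dot, so they break only when the transcription itself inserts a space there.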
Neutral: Overlap increases index size by ~15-20% vs non-overlapping chunks. Trade-off for better cross-chunk retrieval.
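The overlap figure can be sanity-checked with a back-of-the-envelope model. It assumes a long transcript split into full-size chunks, so each chunk contributes target minus overlap new tokens; the function name is hypothetical.

```python
def index_growth(target_tokens=300, overlap_tokens=50):
    """Fractional increase in indexed tokens relative to non-overlapping chunks."""
    stride = target_tokens - overlap_tokens  # new tokens contributed by each chunk
    return target_tokens / stride - 1


print(f"{index_growth(300, 50):.1%}")  # 20.0%
print(f"{index_growth(300, 40):.1%}")  # 15.4%
```

Under this model the default settings give about 20%, and an effective overlap nearer 40 tokens (plausible, since overlap is carried in whole sentences) gives about 15%, which matches the stated range.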

Implementation Notes

Module: src/podcast_scraper/search/chunker.py
Function: chunk_transcript(text, target_tokens=300, overlap_tokens=50, timestamps=None) -> list[TranscriptChunk]
Config: vector_chunk_size_tokens: int = 300, vector_chunk_overlap_tokens: int = 50
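The timestamp interpolation named in the Decision section could look roughly like the sketch below. It assumes each Whisper segment has already been aligned to character offsets in the transcript and that time advances linearly within a segment; the Segment class and both helper names are hypothetical, and the ADR does not specify how the alignment itself is produced.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Segment:  # hypothetical: a Whisper segment aligned to transcript offsets
    char_start: int
    char_end: int
    start_ms: int
    end_ms: int


def char_to_ms(offset, segments):
    """Linearly interpolate a timestamp for a character offset.

    segments must be sorted by char_start and must not overlap.
    """
    starts = [seg.char_start for seg in segments]
    idx = max(bisect_right(starts, offset) - 1, 0)
    seg = segments[idx]
    span = seg.char_end - seg.char_start
    if span <= 0:
        return seg.start_ms
    # Clamp so offsets in gaps between segments map to the nearest segment edge.
    frac = min(max((offset - seg.char_start) / span, 0.0), 1.0)
    return round(seg.start_ms + frac * (seg.end_ms - seg.start_ms))


def chunk_timestamps(char_start, char_end, segments):
    """Return (timestamp_start_ms, timestamp_end_ms), or None without segments."""
    if not segments:
        return None
    return char_to_ms(char_start, segments), char_to_ms(char_end, segments)


segments = [Segment(0, 10, 0, 1000), Segment(11, 21, 1000, 3000)]
print(chunk_timestamps(5, 16, segments))  # (500, 2000)
```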