RFC-006: Whisper Screenplay Formatting
- Status: Completed
- Authors: GPT-5 Codex (initial documentation)
- Stakeholders: Maintainers, operators formatting transcripts for dialog review
- Related PRD: `docs/prd/PRD-002-whisper-fallback.md`
Abstract
This RFC describes the algorithm that converts Whisper transcription segments into screenplay-style dialog with speaker attribution, including its configuration hooks and formatting guarantees.
Problem Statement
When Whisper is used to fill missing transcripts, some users prefer dialog-formatted output with alternating speaker labels. We must provide deterministic formatting driven by user-configurable speaker settings and silence gaps.
Constraints & Assumptions
- Whisper segment data includes `start`, `end`, and `text` fields; we operate on the sorted list of segments.
- Users may supply explicit speaker names or fall back to enumerated labels (`SPEAKER 1`, etc.).
- Formatting should remain optional and degrade gracefully if segments are missing metadata.
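To make the input assumptions concrete, the sketch below shows one way a single segment could be read while tolerating missing metadata. The names `WhisperSegment` and `read_segment` are hypothetical, and defaulting a missing `end` to `start` is an assumption of this example; only the `start`, `end`, and `text` field names and the `0.0` default for `start` come from this RFC.

```python
from typing import Any, Optional, TypedDict


class WhisperSegment(TypedDict, total=False):
    """Hypothetical type name; every key is optional to model missing metadata."""

    start: float
    end: float
    text: str


def read_segment(raw: Any) -> Optional[tuple[float, float, str]]:
    """Return (start, end, text), or None when the segment has no usable text."""
    if not isinstance(raw, dict):
        return None
    text = str(raw.get("text") or "").strip()
    if not text:
        # Empty or whitespace-only segments are skipped.
        return None
    try:
        start = float(raw.get("start", 0.0))
    except (TypeError, ValueError):
        start = 0.0  # a missing or unparsable start falls back to 0.0
    try:
        end = float(raw.get("end", start))
    except (TypeError, ValueError):
        end = start  # assumption: treat the segment as zero-length
    return start, end, text
```

A caller could filter out the `None` results and sort the remaining tuples by their first element before formatting.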
Design & Implementation
- Segment normalization
  - Sort segments by `start` time, defaulting missing values to `0.0`.
  - Skip empty or whitespace-only segments.
- Speaker alternation
  - Maintain `current_speaker_idx`, advancing when the gap between one segment's end and the next segment's start exceeds `cfg.screenplay_gap_s`.
  - Wrap the speaker index modulo `max(config.MIN_NUM_SPEAKERS, cfg.screenplay_num_speakers)`.
- Line aggregation
  - Consecutive segments assigned to the same speaker are concatenated with spaces.
  - Preserve order in a list of `(speaker_idx, text)` tuples.
- Label resolution
  - Map indices to user-provided `cfg.screenplay_speaker_names` when available; fall back to `SPEAKER <n>`.
- Output
  - Join lines with newline separators and append a trailing newline for POSIX-friendly files.
  - When formatting fails (invalid input), fall back to plain Whisper text to avoid data loss.
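The steps above can be sketched in Python as follows. This is a minimal illustration, not the code in `podcast_scraper/whisper_integration.py`: the `ScreenplayConfig` dataclass, the function name `format_screenplay_sketch`, the default gap of 1.25 seconds, the `LABEL: text` line layout, and the empty-string result for empty input are all assumptions made for the example. `MIN_NUM_SPEAKERS = 2` follows the two-speaker minimum noted under Key Decisions.

```python
from dataclasses import dataclass, field
from typing import Any

MIN_NUM_SPEAKERS = 2  # assumption: stands in for config.MIN_NUM_SPEAKERS


@dataclass
class ScreenplayConfig:
    """Hypothetical stand-in for the real configuration object."""

    screenplay_gap_s: float = 1.25  # illustrative default, not from the RFC
    screenplay_num_speakers: int = 2
    screenplay_speaker_names: list[str] = field(default_factory=list)


def _as_float(value: Any, default: float = 0.0) -> float:
    """Coerce a timestamp to float, using `default` when missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def format_screenplay_sketch(segments: list[dict], cfg: ScreenplayConfig) -> str:
    # Segment normalization: drop blank text, then sort by start time.
    usable = [s for s in segments if str(s.get("text") or "").strip()]
    usable.sort(key=lambda s: _as_float(s.get("start")))

    num_speakers = max(MIN_NUM_SPEAKERS, cfg.screenplay_num_speakers)
    lines: list[tuple[int, str]] = []
    current_speaker_idx = 0
    prev_end = None

    for seg in usable:
        start = _as_float(seg.get("start"))
        end = _as_float(seg.get("end"), default=start)  # assumption for missing end
        text = str(seg["text"]).strip()

        # Speaker alternation: a long enough silence hands over to the next speaker.
        if prev_end is not None and start - prev_end > cfg.screenplay_gap_s:
            current_speaker_idx = (current_speaker_idx + 1) % num_speakers

        # Line aggregation: extend the previous line if the speaker is unchanged.
        if lines and lines[-1][0] == current_speaker_idx:
            lines[-1] = (current_speaker_idx, f"{lines[-1][1]} {text}")
        else:
            lines.append((current_speaker_idx, text))
        prev_end = end

    if not lines:
        return ""

    # Label resolution: prefer configured names, else enumerated labels.
    names = cfg.screenplay_speaker_names

    def label(idx: int) -> str:
        return names[idx] if idx < len(names) else f"SPEAKER {idx + 1}"

    # Output: newline-joined lines with a trailing newline.
    return "\n".join(f"{label(i)}: {t}" for i, t in lines) + "\n"
```

With a single configured name such as `["Host"]`, the first speaker is labeled `Host` and the second falls back to `SPEAKER 2`, which shows how partial name lists mix with enumerated labels in this sketch.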
Key Decisions
- Gap-based alternation rather than lexical cues keeps implementation simple and deterministic.
- Speaker count minimum ensures at least two speakers when screenplay mode is enabled, aligning with CLI validation.
- Graceful fallback prioritizes delivering usable transcripts even when formatting inputs are malformed.
Alternatives Considered
- Speaker diarization: Not implemented due to complexity and external dependencies.
- Timestamps in output: Deferred; could be added as optional metadata in future iterations.
Testing Strategy
- Unit tests feed synthetic segment lists to validate gap handling, speaker rotation, and aggregation.
- Integration tests toggle screenplay flags to confirm CLI wiring and fallback behavior.
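As an illustration of what a synthetic segment list might look like, the helper below builds segments from `(gap_before_s, text)` pairs so that a test can state the silence gaps directly. The helper names `make_segments` and `gaps_between` and the fixed one-second duration are hypothetical and are not taken from the project's test suite.

```python
def make_segments(spec, duration=1.0):
    """Build segment dicts from (gap_before_s, text) pairs.

    Each segment lasts `duration` seconds and starts `gap_before_s`
    seconds after the previous segment ends.
    """
    segments = []
    cursor = 0.0
    for gap_before, text in spec:
        start = cursor + gap_before
        end = start + duration
        segments.append({"start": start, "end": end, "text": text})
        cursor = end
    return segments


def gaps_between(segments):
    """Return the silence between each segment's end and the next start."""
    return [nxt["start"] - cur["end"] for cur, nxt in zip(segments, segments[1:])]


# Two short pauses and one long one: only the long pause should exceed
# a threshold such as 1.25 seconds and trigger a speaker change.
synthetic = make_segments([(0.0, "Hello"), (0.5, "there."), (2.0, "Hi!")])
```

A unit test can then compare `gaps_between(synthetic)` against the configured gap threshold to decide where it expects the speaker index to advance.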
Rollout & Monitoring
- Logging warns when formatting fails and plain text is used instead.
- Future enhancements (e.g., custom templates) can extend this RFC while maintaining backward compatibility.
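The warn-and-fall-back behavior could look roughly like the sketch below. The wrapper name `screenplay_or_plain`, the log message wording, the set of caught exception types, and the choice to also fall back on empty output are assumptions for illustration; the formatter is passed in as a callable so the example stays independent of the real implementation.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def screenplay_or_plain(
    segments: Sequence[dict],
    plain_text: str,
    formatter: Callable[[Sequence[dict]], str],
) -> str:
    """Return formatted dialog, or the plain transcript if formatting fails."""
    try:
        formatted = formatter(segments)
    except (AttributeError, KeyError, TypeError, ValueError) as exc:
        # Malformed input: warn and keep the plain Whisper text.
        logger.warning("Screenplay formatting failed; using plain text: %s", exc)
        return plain_text
    if not formatted.strip():
        # Assumption: an empty result is treated like a failure.
        logger.warning("Screenplay formatting produced no output; using plain text")
        return plain_text
    return formatted
```

Operators can then watch for these warnings in the logs to spot inputs that repeatedly defeat the formatter.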
References
- Source: `podcast_scraper/whisper_integration.py` (`format_screenplay_from_segments`)
- CLI configuration: `docs/rfc/RFC-007-cli-interface.md`
- Configuration schema: `docs/rfc/RFC-008-config-model.md`