Skip to content

RFC-006: Whisper Screenplay Formatting

  • Status: Completed
  • Authors: GPT-5 Codex (initial documentation)
  • Stakeholders: Maintainers, operators formatting transcripts for dialog review
  • Related PRD: docs/prd/PRD-002-whisper-fallback.md

Abstract

Describe the algorithm that converts Whisper transcription segments into screenplay-style dialog with speaker attribution, including configuration hooks and formatting guarantees.

Problem Statement

When Whisper is used to fill missing transcripts, some users prefer dialog-formatted output with alternating speaker labels. We must provide deterministic formatting driven by user-configurable speaker settings and silence gaps.

Constraints & Assumptions

  • Whisper segment data includes start, end, and text fields; we operate on the sorted list of segments.
  • Users may supply explicit speaker names or fall back to enumerated labels (SPEAKER 1, etc.).
  • Formatting should remain optional and degrade gracefully if segments are missing metadata.

Design & Implementation

  1. Segment normalization
  2. Sort segments by start time, defaulting missing values to 0.0.
  3. Skip empty or whitespace-only segments.
  4. Speaker alternation
  5. Maintain current_speaker_idx, advancing when the gap between segment end and next start exceeds cfg.screenplay_gap_s.
  6. Wrap speaker index modulo max(config.MIN_NUM_SPEAKERS, cfg.screenplay_num_speakers).
  7. Line aggregation
  8. Consecutive segments assigned to same speaker are concatenated with spaces.
  9. Preserve order in a list of (speaker_idx, text) tuples.
  10. Label resolution
  11. Map indices to user-provided cfg.screenplay_speaker_names when available; fallback to SPEAKER <n>.
  12. Output
  13. Join lines with newline separators and append trailing newline for POSIX-friendly files.
  14. When formatting fails (invalid input), fall back to plain Whisper text to avoid data loss.

Key Decisions

  • Gap-based alternation rather than lexical cues keeps implementation simple and deterministic.
  • Speaker count minimum ensures at least two speakers when screenplay mode is enabled, aligning with CLI validation.
  • Graceful fallback prioritizes delivering usable transcripts even when formatting inputs are malformed.

Alternatives Considered

  • Speaker diarization: Not implemented due to complexity and external dependencies.
  • Timestamps in output: Deferred; could be added as optional metadata in future iterations.

Testing Strategy

  • Unit tests feed synthetic segment lists to validate gap handling, speaker rotation, and aggregation.
  • Integration tests toggle screenplay flags to confirm CLI wiring and fallback behavior.

Rollout & Monitoring

  • Logging warns when formatting fails and plain text is used instead.
  • Future enhancements (e.g., custom templates) can extend this RFC while maintaining backward compatibility.

References

  • Source: podcast_scraper/whisper_integration.py (format_screenplay_from_segments)
  • CLI configuration: docs/rfc/RFC-007-cli-interface.md
  • Configuration schema: docs/rfc/RFC-008-config-model.md