Skip to content

Release v2.1.0 - Automatic Speaker Detection & Metadata Generation Design

Release Date: November 13, 2025 Type: Minor Release

Summary

v2.1.0 introduces automatic speaker name detection using Named Entity Recognition (NER) and establishes the design foundation for per-episode metadata generation. This release significantly enhances the transcript pipeline with intelligent host and guest identification, while laying the groundwork for comprehensive metadata documents.

What's New

✨ RFC-010: Automatic Speaker Name Detection (Implemented)

Automatic host and guest identification from episode metadata:

  • Named Entity Recognition (NER): Uses spaCy to automatically extract person names from episode titles and descriptions
  • RSS Author Tag Support: Prioritizes RSS <author>, <itunes:author>, and <itunes:owner> tags for host identification
  • Host/Guest Distinction: Intelligently distinguishes recurring hosts from episode-specific guests
  • Host Validation: Validates detected hosts using the first episode's metadata
  • Pattern-Based Heuristics: Analyzes sample episodes to learn title patterns (position, prefixes, suffixes) for better guest selection
  • Confidence Scoring: Uses overlap detection and confidence scores to select the best guest candidate
  • Name Sanitization: Removes punctuation, normalizes whitespace, and deduplicates extracted names
  • Graceful Fallback: Falls back to manual speaker names if automatic detection fails
  • Language-Aware: Single language configuration drives both Whisper model selection and NER processing

Configuration:

  • --auto-speakers / auto_speakers: Enable automatic speaker detection (default: true)
  • --language / language: Language for both Whisper and NER (default: "en")
  • --ner-model / ner_model: spaCy model to use (default: "en_core_web_sm")
  • --cache-detected-hosts / cache_detected_hosts: Cache host detection across episodes (default: true)
  • --speaker-names / speaker_names: Manual fallback names (first = host, second = guest)

Features:

  • Auto-downloads spaCy models when needed (similar to Whisper)
  • Clear logging of detected hosts and guests at INFO level
  • Detailed extraction details available at DEBUG level
  • Works seamlessly in dry-run mode
  • Pattern analysis from first few episodes improves accuracy

📋 PRD-004 & RFC-011: Metadata Generation Design

Comprehensive design documents for per-episode metadata generation:

  • PRD-004: Product Requirements Document defining metadata schema, use cases, and functional requirements
  • RFC-011: Technical design document with:
  • Pydantic model definitions for type-safe metadata structure
  • Database integration design (PostgreSQL, MongoDB, Elasticsearch, ClickHouse)
  • ID generation strategy (feed_id, episode_id, content IDs)
  • Unified JSON format with snake_case field names and ISO 8601 date serialization
  • Database loading examples for all target databases

Key Design Decisions:

  • Opt-in feature (default false for backwards compatibility)
  • JSON (default) and YAML format support
  • Database-friendly schema for direct ingestion without transformation
  • Stable, deterministic ID generation suitable for primary keys
  • Schema versioning strategy (semantic versioning starting at 1.0.0)

Note: Implementation will follow in a future release. This release establishes the design foundation.

Improvements

RSS Parser Enhancements

  • Author Tag Extraction: Extracts RSS <author>, <itunes:author>, and <itunes:owner> tags from feed channel
  • HTML Stripping: Removes HTML tags and decodes entities from episode descriptions
  • Improved Episode Description Extraction: Better handling of HTML content in RSS descriptions

Code Quality & Tooling

  • CI/CD Alignment: GitHub Actions workflow now uses make ci directly, ensuring local and CI checks match exactly
  • Markdown Linting: Added markdownlint to CI pipeline, catching formatting issues early
  • Complexity Management: Added complexity warnings to .flake8 per-file-ignores for appropriate modules
  • Code Cleanup: Removed unused variables and imports

Documentation Updates

  • ARCHITECTURE.md: Updated to reflect speaker detection integration and metadata generation design
  • TESTING_STRATEGY.md: Expanded with speaker detection and metadata generation testing requirements
  • PRD-002 & PRD-003: Updated to reflect RFC-010 speaker detection features
  • RFC-010: Expanded with final implementation details (RSS author tags, host validation, pattern heuristics)
  • RFC-011: Comprehensive technical design for metadata generation

Technical Details

New Dependencies

  • spacy>=3.7.0: Required dependency for Named Entity Recognition

Module Changes

  • speaker_detection.py (new): Complete NER-based speaker detection implementation
  • rss_parser.py: Enhanced with author tag extraction and HTML stripping
  • workflow.py: Integrated speaker detection into pipeline
  • episode_processor.py: Updated to use detected speaker names
  • whisper_integration.py: Renamed from whisper.py to avoid naming conflicts
  • models.py: Added authors field to RssFeed dataclass

Configuration Changes

New configuration fields:

language: "en"                    # Language for Whisper and NER
auto_speakers: true               # Enable automatic speaker detection
ner_model: "en_core_web_sm"       # spaCy model name
cache_detected_hosts: true        # Cache host detection
```yaml

- Closes #21: Implement NER as per RFC-010
- Closes #15: Create PRD and RFC for generating metadata document per episode
- Related to #28: CI/CD improvements

## Documentation

- Speaker Detection: [RFC-010](../rfc/RFC-010-speaker-name-detection.md)
- Metadata Generation Design: [PRD-004](../prd/PRD-004-metadata-generation.md), [RFC-011](../rfc/RFC-011-metadata-generation.md)
- Architecture: [ARCHITECTURE.md](../architecture/ARCHITECTURE.md)
- Testing Strategy: [TESTING_STRATEGY.md](../architecture/TESTING_STRATEGY.md)

## Migration Notes

### For Users Upgrading from v2.0.1

1. **New Required Dependency**: Install spaCy models automatically or manually:

   ```bash
   python -m spacy download en_core_web_sm
  1. Automatic Speaker Detection: Enabled by default (auto_speakers: true). To disable:
auto_speakers: false
  1. Manual Speaker Names: Now used as fallback only if automatic detection fails. First name = host, second = guest.

  2. Language Configuration: Single language field now controls both Whisper and NER:

language: "en" # Used for both Whisper model selection and NER
  1. No Breaking Changes: All existing functionality preserved. New features are additive.

Configuration Example

rss: "https://example.com/feed.xml"
output_dir: "./transcripts"
language: "en"                    # New: Controls Whisper + NER
auto_speakers: true               # New: Enable automatic detection
ner_model: "en_core_web_sm"       # New: spaCy model
cache_detected_hosts: true        # New: Cache hosts across episodes
speaker_names: "Host, Guest"      # Fallback if detection fails

```text

- 63 tests passing
- Comprehensive coverage for speaker detection:
  - RSS author tag extraction
  - NER model loading and validation
  - Host detection from feed metadata
  - Guest detection from episode metadata
  - Pattern analysis and heuristics
  - Name sanitization and deduplication
  - Fallback to manual names
  - Dry-run mode support

## Contributors

- Marko Dragoljevic (@chipi)

## Next Steps

- Implement metadata generation module (`podcast_scraper/metadata.py`) per RFC-011
- Integrate metadata generation into workflow pipeline
- Add configuration fields and CLI flags for metadata generation
- Generate metadata documents alongside transcripts

**Full Changelog**: <https://github.com/chipi/podcast_scraper/compare/v2.0.1...v2.1.0>