# RFC-021: Modularization Refactoring Plan

- Status: Completed (Historical Reference)
- Authors:
- Stakeholders: Maintainers, developers implementing OpenAI providers
- Related ADRs:
  - ADR-020: Protocol-Based Provider Discovery
- Related RFCs:
  - `docs/rfc/RFC-013-openai-provider-implementation.md` - OpenAI provider implementation (built on this plan)
  - `docs/rfc/RFC-016-modularization-for-ai-experiments.md` - Provider system architecture
  - `docs/rfc/RFC-017-prompt-management.md` - Prompt management (aligned with this plan)
- Related PRDs:
  - `docs/prd/PRD-006-openai-provider-integration.md` - OpenAI provider product requirements

## Overview

Note: This RFC documents the modularization refactoring plan that was implemented to enable OpenAI provider integration. The refactoring is complete, and this document is kept as a historical reference for understanding the architecture decisions.
This document outlines the refactoring plan to modularize the podcast scraper architecture, enabling easy integration of OpenAI API as a replacement for on-device AI/ML components.
North Star Goal: Easily plug in OpenAI API as replacement for on-device AI/ML (speaker detection, transcription, summarization) without major refactoring.
Scope: This refactoring focuses on three key areas:
- Speaker Detection (NER → OpenAI API)
- Transcription (Whisper local → OpenAI Whisper API)
- Summarization (Local transformers → OpenAI API)
Out of Scope: RSS feed source abstraction (can be addressed separately later)

## Current Architecture Assessment

### Modularity Scores

| Area | Score | Status | Priority |
|---|---|---|---|
| Speaker Detection | 5/10 | Moderate | HIGH |
| Transcription | 6/10 | Moderate | HIGH |
| Summarization | 4/10 | Low | HIGH |
| RSS Feed Source | 2/10 | Low | IGNORED |

## 1. Speaker Detection Abstraction

### Current State

**Moderate Coupling Points:**

1. **`workflow.py`** directly imports speaker detection:

```python
from . import speaker_detection
nlp = speaker_detection.get_ner_model(cfg)  # NER-specific
hosts = speaker_detection.detect_hosts_from_feed(...)  # NER-specific
```

2. **`config.py`** has NER-specific fields:

```python
ner_model: Optional[str] = Field(default=None, alias="ner_model")
auto_speakers: bool = Field(default=True, alias="auto_speakers")
```

3. **`speaker_detection.py`** is tightly coupled to spaCy:
   - Direct `spacy.load()` calls
   - NER-specific entity extraction
   - No abstraction layer
   - Large functions (`detect_speaker_names()` ~250 lines, `extract_person_entities()` ~170 lines)

**Current Modularity Score: 5/10**

- Speaker detection is isolated in its own module
- Functions are reasonably abstract (`detect_speaker_names()`)
- Hardcoded to NER/spaCy implementation
- Cannot easily swap to OpenAI API or other services
- Config tied to NER model names
- Large functions with multiple responsibilities

### Target Architecture

**Protocol-Based Provider System:**

```python
# podcast_scraper/speaker_detectors/base.py
from typing import Protocol, List, Set, Optional, Dict, Any, Tuple

from .. import config, models


class SpeakerDetector(Protocol):
    """Protocol for speaker detection providers."""

    def detect_hosts(
        self,
        feed_title: str,
        feed_description: Optional[str],
        feed_authors: Optional[List[str]],
    ) -> Set[str]:
        """Detect hosts from feed metadata."""
        ...

    def detect_speakers(
        self,
        episode_title: str,
        episode_description: Optional[str],
        known_hosts: Set[str],
    ) -> Tuple[List[str], Set[str], bool]:
        """Detect speakers for an episode."""
        ...

    def analyze_patterns(
        self,
        episodes: List[models.Episode],
        known_hosts: Set[str],
    ) -> Optional[Dict[str, Any]]:
        """Analyze episode patterns for heuristics."""
        ...
```

```python
# podcast_scraper/speaker_detectors/factory.py
from typing import Optional

from .. import config
from .base import SpeakerDetector


class SpeakerDetectorFactory:
    """Factory for creating speaker detectors."""

    @staticmethod
    def create(cfg: config.Config) -> Optional[SpeakerDetector]:
        if not cfg.auto_speakers:
            return None

        # 'ner', 'openai', etc. (renamed from speaker_detector_type)
        detector_type = cfg.speaker_detector_provider

        # Import lazily so unused providers do not pull in their dependencies
        if detector_type == 'ner':
            from .ner_detector import NERSpeakerDetector
            return NERSpeakerDetector(cfg)
        elif detector_type == 'openai':
            from .openai_detector import OpenAISpeakerDetector
            return OpenAISpeakerDetector(cfg)

        return None
```

1. **Add provider type field to `config.py`:**

```python
speaker_detector_provider: Literal["ner", "openai"] = Field(default="ner")  # Renamed from speaker_detector_type
# Keep ner_model for backward compatibility
ner_model: Optional[str] = Field(default=None, alias="ner_model")
```

2. **Create protocol definitions:**
   - Create `podcast_scraper/speaker_detectors/` package
   - Define `SpeakerDetector` protocol in `base.py`
   - No implementation changes yet

3. **Create factory function:**
   - Create `factory.py` with `SpeakerDetectorFactory.create()`
   - Returns current NER implementation wrapped in protocol

### Phase 2: Refactor Current Implementation - Speaker Detection

1. **Refactor `speaker_detection.py` → `speaker_detectors/ner_detector.py`:**
   - Extract helper functions from large functions:
     - `_calculate_heuristic_score()` - Extract from `detect_speaker_names()`
     - `_build_guest_candidates()` - Process title/description guests
     - `_select_best_guest()` - Select guest with highest score
     - `_extract_entities_from_text()` - Core NER extraction
     - `_extract_entities_from_segments()` - Segment-based fallback
     - `_pattern_based_fallback()` - Pattern matching fallback
   - Implement `SpeakerDetector` protocol
   - Wrap existing functions as methods

2. **Update `workflow.py`:**

```python
from .speaker_detectors import SpeakerDetectorFactory

detector = SpeakerDetectorFactory.create(cfg)
if detector:
    hosts = detector.detect_hosts(feed.title, feed.description, feed.authors)
```

### Phase 3: Add OpenAI Provider - Speaker Detection (Future)
1. **Create `speaker_detectors/openai_detector.py`:**
- Implement `SpeakerDetector` protocol
- Use OpenAI API for entity extraction
- Map OpenAI responses to expected format
2. **Update factory:**
- Add OpenAI detector to factory
- Update config with OpenAI options
**Benefits:**
- Easy to add OpenAI API for speaker detection
- Can use multiple detectors (fallback chain)
- Testable with mock detectors
- Backward compatible (NER remains default)
- Better code organization (smaller functions)
**Effort:** Medium (2-3 days)
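
The "fallback chain" benefit can be sketched as a composite detector that itself satisfies the protocol. This is a minimal illustration under stated assumptions, not the project's implementation: `HostDetector`, `StaticDetector`, and `FallbackSpeakerDetector` are hypothetical names, and only `detect_hosts()` from the `SpeakerDetector` protocol is modelled.

```python
from typing import List, Optional, Protocol, Sequence, Set


class HostDetector(Protocol):
    """Reduced view of the SpeakerDetector protocol (hypothetical name)."""

    def detect_hosts(
        self,
        feed_title: str,
        feed_description: Optional[str],
        feed_authors: Optional[List[str]],
    ) -> Set[str]: ...


class StaticDetector:
    """Hypothetical stand-in for a real detector; returns a fixed answer."""

    def __init__(self, hosts: Set[str]) -> None:
        self._hosts = hosts

    def detect_hosts(self, feed_title, feed_description, feed_authors) -> Set[str]:
        return set(self._hosts)


class FallbackSpeakerDetector:
    """Hypothetical composite: tries each detector in order until one finds hosts."""

    def __init__(self, detectors: Sequence[HostDetector]) -> None:
        self._detectors = list(detectors)

    def detect_hosts(self, feed_title, feed_description, feed_authors) -> Set[str]:
        for detector in self._detectors:
            hosts = detector.detect_hosts(feed_title, feed_description, feed_authors)
            if hosts:
                return hosts
        return set()  # no detector produced a result


# The first detector finds nothing, so the chain falls through to the second.
chain = FallbackSpeakerDetector([StaticDetector(set()), StaticDetector({"Alice"})])
hosts = chain.detect_hosts("Example Show", None, None)
```

Because the composite exposes the same method as its members, a workflow written against the protocol would not need to know whether it holds one detector or a chain.
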
---
## 2. Transcription Abstraction
### Current State (Transcription)
**Moderate Coupling Points:**
1. **`workflow.py`** directly imports Whisper:
```python
from . import whisper_integration as whisper
whisper_model = whisper.load_whisper_model(cfg) # Whisper-specific
result, elapsed = whisper.transcribe_with_whisper(...) # Whisper-specific
```
2. **`episode_processor.py`** has Whisper-specific code:
```python
from . import whisper_integration as whisper
result, tc_elapsed = whisper.transcribe_with_whisper(whisper_model, temp_media, cfg)
```
3. **`config.py`** has Whisper-specific fields:
```python
whisper_model: str = Field(default="base", alias="whisper_model")
transcribe_missing: bool = Field(default=False, alias="transcribe_missing")
```
4. **`_TranscriptionResources`** has Whisper model hardcoded:
```python
class _TranscriptionResources(NamedTuple):
    whisper_model: Any  # Whisper-specific type
```
**Current Modularity Score: 6/10**
- Transcription logic is isolated in `whisper_integration.py`
- Functions are reasonably abstract (`transcribe_with_whisper()`)
- Hardcoded to Whisper library
- Cannot easily swap to OpenAI Whisper API or other services
- Config tied to Whisper model names
- Resource management assumes local model loading
### Target Architecture (Transcription)
**Protocol-Based Provider System:**
```python
# podcast_scraper/transcription/base.py
from typing import Protocol, Dict, Optional, Tuple, Any

from .. import config


class TranscriptionProvider(Protocol):
    """Protocol for transcription providers."""

    def initialize(self, cfg: config.Config) -> Optional[Any]:
        """Initialize provider (load model, setup API client, etc.).

        Returns:
            Provider-specific resource object or None if initialization fails
        """
        ...

    def transcribe(
        self,
        media_path: str,
        cfg: config.Config,
        resource: Any,  # Provider-specific resource
    ) -> Tuple[Dict[str, Any], float]:
        """Transcribe media file.

        Returns:
            Tuple of (result_dict, elapsed_seconds)
            result_dict should have 'text' and optionally 'segments'
        """
        ...

    def cleanup(self, resource: Any) -> None:
        """Cleanup provider resources."""
        ...
```

```python
# podcast_scraper/transcription/factory.py
from typing import Optional

from .. import config
from .base import TranscriptionProvider


class TranscriptionProviderFactory:
    """Factory for creating transcription providers."""

    @staticmethod
    def create(cfg: config.Config) -> Optional[TranscriptionProvider]:
        if not cfg.transcribe_missing:
            return None

        provider_type = cfg.transcription_provider  # 'whisper', 'openai', etc.

        # Import lazily so unused providers do not pull in their dependencies
        if provider_type == 'whisper':
            from .whisper_provider import WhisperTranscriptionProvider
            return WhisperTranscriptionProvider()
        elif provider_type == 'openai':
            from .openai_provider import OpenAITranscriptionProvider
            return OpenAITranscriptionProvider(cfg)

        return None
```

1. **Add provider type field to `config.py`:**

```python
transcription_provider: Literal["whisper", "openai"] = Field(default="whisper")
# Keep whisper_model for backward compatibility
whisper_model: str = Field(default="base", alias="whisper_model")
```

2. **Create protocol definitions:**
   - Create `podcast_scraper/transcription/` package
   - Define `TranscriptionProvider` protocol in `base.py`
   - No implementation changes yet

3. **Create factory function:**
   - Create `factory.py` with `TranscriptionProviderFactory.create()`
   - Returns current Whisper implementation wrapped in protocol

### Phase 2: Refactor Current Implementation - Transcription

1. **Refactor `whisper_integration.py` → `transcription/whisper_provider.py`:**
   - Keep all current logic
   - Implement `TranscriptionProvider` protocol
   - Wrap existing functions as methods
   - Extract helper functions from `transcribe_media_to_text()`:
     - `_format_transcript_if_needed()` - Screenplay formatting logic
     - `_save_transcript_file()` - File writing logic
     - `_cleanup_temp_media()` - Cleanup logic

2. **Update `workflow.py`:**

```python
from .transcription import TranscriptionProviderFactory

provider = TranscriptionProviderFactory.create(cfg)
if provider:
    resource = provider.initialize(cfg)
    result, elapsed = provider.transcribe(media_path, cfg, resource)
```

3. **Update `_TranscriptionResources`:**

```python
class _TranscriptionResources(NamedTuple):
    provider: Optional[TranscriptionProvider]
    resource: Any  # Provider-specific resource
    temp_dir: Optional[str]
    transcription_jobs: List[models.TranscriptionJob]
    # ... rest
```

### Phase 3: Add OpenAI Provider - Transcription (Future)

1. **Create `transcription/openai_provider.py`:**
   - Implement `TranscriptionProvider` protocol
   - Use OpenAI Whisper API
   - Handle API authentication and rate limiting
   - Map OpenAI responses to expected format

2. **Update factory:**
   - Add OpenAI provider to factory
   - Update config with OpenAI options

**Benefits:**

- Easy to add OpenAI Whisper API
- Can support both local (Whisper) and cloud (API) providers
- Testable with mock providers
- Backward compatible (Whisper remains default)
- Better resource management (provider-specific cleanup)

**Effort:** Medium-High (3-4 days)
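
The `initialize()` / `transcribe()` / `cleanup()` lifecycle implied by the protocol can be sketched as follows. This is an illustrative sketch only: `FakeTranscriptionProvider` and `transcribe_one()` are hypothetical names, and `cfg` is passed as `None` because the real `Config` model is not reproduced here.

```python
import time
from typing import Any, Dict, Optional, Tuple


class FakeTranscriptionProvider:
    """Hypothetical in-memory provider exposing the protocol's three methods."""

    def __init__(self) -> None:
        self.cleaned_up = False

    def initialize(self, cfg: Any) -> Optional[Any]:
        return {"client": "fake"}  # stands in for a loaded model or API client

    def transcribe(
        self, media_path: str, cfg: Any, resource: Any
    ) -> Tuple[Dict[str, Any], float]:
        start = time.perf_counter()
        result = {"text": f"transcript of {media_path}", "segments": []}
        return result, time.perf_counter() - start

    def cleanup(self, resource: Any) -> None:
        self.cleaned_up = True


def transcribe_one(provider: Any, media_path: str, cfg: Any) -> Optional[str]:
    """Run initialize -> transcribe -> cleanup, releasing the resource even on error."""
    resource = provider.initialize(cfg)
    if resource is None:  # the protocol allows None when initialization fails
        return None
    try:
        result, _elapsed = provider.transcribe(media_path, cfg, resource)
        return result["text"]
    finally:
        provider.cleanup(resource)


provider = FakeTranscriptionProvider()
text = transcribe_one(provider, "episode.mp3", cfg=None)
```

Putting `cleanup()` in a `finally` block is one way to get the "provider-specific cleanup" benefit regardless of whether the resource is a local model or an API client.
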

## 3. Summarization Abstraction

### Current State (Summarization)

**Tight Coupling Points:**

1. **`workflow.py`** directly imports summarizer:

```python
from . import summarizer
model_name = summarizer.select_summary_model(cfg)  # Local model-specific
summary_model = summarizer.SummaryModel(...)  # Local model-specific
```

2. **`metadata.py`** has summarization logic:

```python
from . import summarizer
summary_metadata = _generate_episode_summary(...)  # Uses local models
```

3. **`config.py`** has local model-specific fields:

```python
summary_model: Optional[str] = Field(default=None, alias="summary_model")
summary_provider: Literal["local"] = Field(default="local")
generate_summaries: bool = Field(default=False, alias="generate_summaries")
```

4. **`summarizer.py`** is tightly coupled to HuggingFace transformers:
   - Direct `AutoModelForSeq2SeqLM.from_pretrained()` calls
   - Local model loading and caching
   - No abstraction layer

**Current Modularity Score: 4/10**

- Summarization logic is isolated in `summarizer.py`
- Hardcoded to local HuggingFace models
- Cannot easily swap to OpenAI API
- Config tied to HuggingFace model names
- Resource management assumes local model loading
- Large `metadata.py` functions (`generate_episode_metadata()` ~200 lines)

### Target Architecture (Summarization)

**Protocol-Based Provider System:**

```python
# podcast_scraper/summarization/base.py
from typing import Protocol, Optional, Dict, Any, List

from .. import config


class SummarizationProvider(Protocol):
    """Protocol for summarization providers."""

    def initialize(self, cfg: config.Config) -> Optional[Any]:
        """Initialize provider (load model, setup API client, etc.).

        Returns:
            Provider-specific resource object or None if initialization fails
        """
        ...

    def summarize(
        self,
        text: str,
        cfg: config.Config,
        resource: Any,  # Provider-specific resource
        max_length: Optional[int] = None,
        min_length: Optional[int] = None,
    ) -> Dict[str, Any]:
        """Summarize text.

        Args:
            text: Text to summarize
            cfg: Configuration object
            resource: Provider-specific resource (model, client, etc.)
            max_length: Maximum summary length
            min_length: Minimum summary length
        """
        ...

    def summarize_chunks(
        self,
        chunks: List[str],
        cfg: config.Config,
        resource: Any,
    ) -> List[str]:
        """Summarize multiple text chunks (MAP phase)."""
        ...

    def combine_summaries(
        self,
        summaries: List[str],
        cfg: config.Config,
        resource: Any,
    ) -> str:
        """Combine multiple summaries into final summary (REDUCE phase)."""
        ...

    def cleanup(self, resource: Any) -> None:
        """Cleanup provider resources."""
        ...
```

```python
# podcast_scraper/summarization/factory.py
from typing import Optional

from .. import config
from .base import SummarizationProvider


class SummarizationProviderFactory:
    """Factory for creating summarization providers."""

    @staticmethod
    def create(cfg: config.Config) -> Optional[SummarizationProvider]:
        if not cfg.generate_summaries:
            return None

        provider_type = cfg.summary_provider  # 'transformers', 'openai', etc.

        # Import lazily so unused providers do not pull in their dependencies
        if provider_type == 'transformers':
            from .transformers_provider import TransformersSummarizationProvider
            return TransformersSummarizationProvider(cfg)
        elif provider_type == 'openai':
            from .openai_provider import OpenAISummarizationProvider
            return OpenAISummarizationProvider(cfg)

        return None
```

1. **Add provider type field to `config.py`:**

```python
summary_provider: Literal["transformers", "openai"] = Field(default="transformers")
# Keep summary_model for backward compatibility
summary_model: Optional[str] = Field(default=None, alias="summary_model")
```

2. **Create protocol definitions:**
   - Create `podcast_scraper/summarization/` package
   - Define `SummarizationProvider` protocol in `base.py`
   - No implementation changes yet

3. **Create factory function:**
   - Create `factory.py` with `SummarizationProviderFactory.create()`
   - Returns current local implementation wrapped in protocol

### Phase 2: Refactor Current Implementation - Summarization

1. **Extract Preprocessing to Shared Module:**
   - Create `podcast_scraper/preprocessing.py` module
   - Move `clean_transcript()`, `remove_sponsor_blocks()`, `clean_for_summarization()` from `summarizer.py`
   - These functions are provider-agnostic and should be called BEFORE provider selection
   - Update `metadata.py` to use shared preprocessing module

2. **Refactor `summarizer.py` → `summarization/transformers_provider.py`:**
   - Keep all current logic
   - Implement `SummarizationProvider` protocol
   - Wrap existing `SummaryModel` class as provider
   - Remove preprocessing functions (moved to shared module)
   - Extract helper functions from `metadata.py`:
     - `_build_feed_metadata()` - Construct FeedMetadata
     - `_build_episode_metadata()` - Construct EpisodeMetadata
     - `_build_content_metadata()` - Construct ContentMetadata
     - `_build_processing_metadata()` - Construct ProcessingMetadata

3. **Update `workflow.py`:**

```python
from .summarization import SummarizationProviderFactory

provider = SummarizationProviderFactory.create(cfg)
if provider:
    resource = provider.initialize(cfg)
```

### Phase 3: Add OpenAI Provider - Summarization (Future)

1. **Create `summarization/openai_provider.py`:**
   - Implement `SummarizationProvider` protocol
   - Use OpenAI API for summarization
   - Handle chunking for long texts (MAP phase)
   - Combine summaries (REDUCE phase)
   - Handle API authentication and rate limiting

2. **Update factory:**
   - Add OpenAI provider to factory
   - Update config with OpenAI options

**Benefits:**

- Easy to add OpenAI API for summarization
- Can support both local (transformers) and cloud (API) providers
- Testable with mock providers
- Backward compatible (local remains default)
- Better resource management
- Cleaner `metadata.py` (smaller functions)

**Effort:** Medium-High (3-4 days)
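
The MAP and REDUCE phases named in the protocol can be sketched with a toy provider. Everything here is an assumption made for illustration: `split_into_chunks()`, `FirstWordsProvider`, and `summarize_long_text()` are hypothetical names, the chunker splits on words (real chunking may be token-based), and the "summary" is simply the first two words of each chunk.

```python
from typing import Any, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Hypothetical word-based chunker."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


class FirstWordsProvider:
    """Toy provider: 'summarizes' a chunk by keeping its first two words."""

    def summarize_chunks(self, chunks: List[str], cfg: Any, resource: Any) -> List[str]:
        return [" ".join(chunk.split()[:2]) for chunk in chunks]

    def combine_summaries(self, summaries: List[str], cfg: Any, resource: Any) -> str:
        return " / ".join(summaries)


def summarize_long_text(
    provider: Any, text: str, cfg: Any, resource: Any, max_words: int
) -> str:
    chunks = split_into_chunks(text, max_words)
    partials = provider.summarize_chunks(chunks, cfg, resource)  # MAP phase
    return provider.combine_summaries(partials, cfg, resource)   # REDUCE phase


summary = summarize_long_text(
    FirstWordsProvider(),
    "one two three four five six seven",
    cfg=None,
    resource=None,
    max_words=3,
)
# summary == "one two / four five / seven"
```

Keeping both phases on the provider lets a local model and an API-backed provider each choose their own chunk handling while the calling code stays the same.
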

## Implementation Priority & Timeline

### Recommended Order

1. **Phase 1: Transcription Abstraction (Highest Impact)**
   - Already has good isolation
   - Most likely to need alternatives (OpenAI Whisper API)
   - Medium complexity
   - Effort: 3-4 days

2. **Phase 2: Speaker Detection Abstraction (Medium Impact)**
   - Good isolation already
   - May want OpenAI API for better accuracy
   - Medium complexity
   - Effort: 2-3 days

3. **Phase 3: Summarization Abstraction (High Impact)**
   - Most tightly coupled currently
   - OpenAI API would be most valuable here
   - Medium-High complexity
   - Effort: 3-4 days

**Total Estimated Effort:** 8-11 days for all three abstractions

### Quick Wins (Can Do Now - No Breaking Changes)

1. **Add Provider Type Fields to Config:**

```python
# config.py
speaker_detector_provider: Literal["ner", "openai"] = Field(default="ner")  # Renamed from speaker_detector_type
transcription_provider: Literal["whisper", "openai"] = Field(default="whisper")
summary_provider: Literal["transformers", "openai"] = Field(default="transformers")
```

2. **Create Protocol/ABC Definitions:**
   - Define interfaces now
   - Implement later when needed
   - Makes intent clear
   - Enables type checking

3. **Extract Provider Factories:**
   - Create factory functions that return current implementations
   - Use factories in workflow
   - Makes swapping easier later

**Effort:** 1 day
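
To illustrate why these fields are non-breaking, the sketch below shows defaults preserving the behaviour of a configuration that predates them. It is an assumption-laden stand-in: the project's `Config` uses pydantic-style `Field(...)` declarations, whereas this example uses a plain standard-library dataclass, and `ProviderSettings`, `ALLOWED_PROVIDERS`, and `load_provider_settings()` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Allowed values copied from the Literal[...] annotations in the snippet above.
ALLOWED_PROVIDERS = {
    "speaker_detector_provider": ("ner", "openai"),
    "transcription_provider": ("whisper", "openai"),
    "summary_provider": ("transformers", "openai"),
}


@dataclass(frozen=True)
class ProviderSettings:
    """Hypothetical stand-in for the provider fields on the real Config model."""

    speaker_detector_provider: str = "ner"
    transcription_provider: str = "whisper"
    summary_provider: str = "transformers"


def load_provider_settings(raw: Dict[str, Any]) -> ProviderSettings:
    """Pick provider fields out of a raw config dict, validating any that are set."""
    selected = {}
    for name, allowed in ALLOWED_PROVIDERS.items():
        if name not in raw:
            continue  # field absent: fall back to the dataclass default
        if raw[name] not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {raw[name]!r}")
        selected[name] = raw[name]
    return ProviderSettings(**selected)


# A config written before the new fields existed still resolves to local providers.
legacy = load_provider_settings({"whisper_model": "base"})
# Opting in to one cloud provider leaves the other two on their defaults.
opted_in = load_provider_settings({"summary_provider": "openai"})
```
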

## File Structure (Proposed)

```text
podcast_scraper/
├── preprocessing.py # NEW: Provider-agnostic preprocessing utilities
│ # - clean_transcript() (timestamp removal, speaker normalization)
│ # - remove_sponsor_blocks() (ad removal)
│ # - clean_for_summarization() (combined cleaning)
│ # Called BEFORE provider selection in metadata.py/workflow.py
├── speaker_detectors/
│ ├── __init__.py
│ ├── base.py # SpeakerDetector protocol
│ ├── factory.py # SpeakerDetectorFactory
│ ├── ner_detector.py # Current NER implementation (refactored)
│ └── openai_detector.py # Future OpenAI implementation
├── transcription/
│ ├── __init__.py
│ ├── base.py # TranscriptionProvider protocol
│ ├── factory.py # TranscriptionProviderFactory
│ ├── whisper_provider.py # Current Whisper implementation (refactored)
│ └── openai_provider.py # Future OpenAI Whisper API implementation
├── summarization/
│ ├── __init__.py
│ ├── base.py # SummarizationProvider protocol
│ ├── factory.py # SummarizationProviderFactory
│ ├── transformers_provider.py # Current HuggingFace transformers implementation (refactored)
│ └── openai_provider.py # Future OpenAI API implementation
├── workflow.py # Uses factories, calls preprocessing
├── metadata.py # Refactored (smaller functions), calls preprocessing BEFORE providers
├── config.py # Has provider type fields
└── ...
```

1. **Backward Compatibility First**
- Default to current implementations
- Add new fields as optional
- Don't break existing code
- Keep existing config fields for compatibility
2. **Protocols Over Inheritance**
- Use `Protocol` for flexibility
- Easier to mock and test
- No forced inheritance hierarchy
- Enables duck typing
3. **Factory Pattern**
- Centralized provider creation
- Easy to swap implementations
- Configuration-driven
- Single point of change
4. **Gradual Migration**
- Can implement incrementally
- Test at each step
- No big-bang refactoring
- Each phase delivers value
5. **Testability**
- Mock providers for unit tests
- Integration tests with real providers
- Backward compatibility tests
- Provider-specific tests
6. **North Star: OpenAI API Ready**
- All abstractions designed with OpenAI API in mind
- Easy to add OpenAI providers after refactoring
- No additional refactoring needed for OpenAI integration
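
The "Protocols Over Inheritance" principle can be demonstrated with a reduced example. It assumes a single-method protocol named `Cleanable` (hypothetical) and uses `typing.runtime_checkable`, which the plan itself does not mention; note that a runtime `isinstance()` check against a protocol only verifies that the method exists, not its signature.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Cleanable(Protocol):
    """Reduced, hypothetical protocol with a single method."""

    def cleanup(self, resource: Any) -> None: ...


class LocalModelProvider:
    """No base class: having a matching cleanup() method is enough."""

    def cleanup(self, resource: Any) -> None:
        pass


class NotAProvider:
    """Lacks cleanup(), so it does not satisfy the protocol."""


is_provider = isinstance(LocalModelProvider(), Cleanable)
is_not_provider = isinstance(NotAProvider(), Cleanable)
```

Since conformance is structural, a mock written for a test needs no import from the provider package beyond what the test itself uses.
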
---
## Migration Strategy
### Backward Compatibility
- Keep all existing config fields
- Default to current implementations
- Add new fields as optional
- Deprecate old fields gradually (if needed)
### Testing Strategy
- Create mock providers for testing
- Test workflow with different providers
- Ensure backward compatibility tests pass
- Test provider switching
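
A provider-switching test along these lines might look like the sketch below. It is not taken from the project's test suite: `MockTranscriptionProvider` and `create_provider()` are hypothetical, with `create_provider()` acting as a simplified stand-in for `TranscriptionProviderFactory.create()`, and `SimpleNamespace` substitutes for the real `Config` object.

```python
import unittest
from types import SimpleNamespace


class MockTranscriptionProvider:
    """Hypothetical test double that tags output with its provider name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def initialize(self, cfg):
        return object()  # opaque placeholder resource

    def transcribe(self, media_path, cfg, resource):
        return {"text": f"[{self.name}] {media_path}"}, 0.0

    def cleanup(self, resource) -> None:
        pass


def create_provider(cfg):
    """Simplified stand-in for TranscriptionProviderFactory.create()."""
    if not cfg.transcribe_missing:
        return None
    if cfg.transcription_provider in ("whisper", "openai"):
        return MockTranscriptionProvider(cfg.transcription_provider)
    return None


class ProviderSwitchingTest(unittest.TestCase):
    def test_disabled_transcription_yields_no_provider(self):
        cfg = SimpleNamespace(transcribe_missing=False, transcription_provider="whisper")
        self.assertIsNone(create_provider(cfg))

    def test_config_selects_provider(self):
        for name in ("whisper", "openai"):
            cfg = SimpleNamespace(transcribe_missing=True, transcription_provider=name)
            provider = create_provider(cfg)
            resource = provider.initialize(cfg)
            result, _elapsed = provider.transcribe("ep.mp3", cfg, resource)
            self.assertEqual(result["text"], f"[{name}] ep.mp3")


# Run the two tests programmatically so the example is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ProviderSwitchingTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```
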
### Rollout Plan
1. **Week 1:** Quick wins + Transcription abstraction
2. **Week 2:** Speaker detection abstraction
3. **Week 3:** Summarization abstraction
4. **Week 4:** Testing, documentation, cleanup
---
## Success Criteria
Can add OpenAI API providers without modifying core workflow
All existing functionality preserved
Tests pass with both old and new providers
Config remains backward compatible
Code is more maintainable (smaller functions, clearer structure)
Ready for OpenAI API integration as next step
---
## Next Steps After Refactoring
Once this refactoring is complete, adding OpenAI API providers will be straightforward:
1. **OpenAI Speaker Detection:**
- Create `speaker_detectors/openai_detector.py`
- Implement `SpeakerDetector` protocol
- Use OpenAI API for entity extraction
- Add to factory
2. **OpenAI Whisper Transcription:**
- Create `transcription/openai_provider.py`
- Implement `TranscriptionProvider` protocol
- Use OpenAI Whisper API
- Add to factory
3. **OpenAI Summarization:**
- Create `summarization/openai_provider.py`
- Implement `SummarizationProvider` protocol
- Use OpenAI API for summarization
- Handle MAP/REDUCE phases
- Add to factory
**Estimated Effort for OpenAI Integration:** 3-5 days (after refactoring)
---
## Notes
- All refactoring should maintain existing functionality
- Add tests for extracted functions and new providers
- Update docstrings to reflect new structure
- Consider backward compatibility for public APIs
- Document provider interfaces clearly
- Provide examples for each provider type