RFC-029: Provider Refactoring Consolidation¶
- Status: ✅ Completed
- Authors: Maintainers
- Stakeholders: Developers working on provider system, test maintainers
- Related PRDs:
docs/prd/PRD-006-openai-provider-integration.md - Related ADRs:
- ADR-024: Unified Provider Pattern
- ADR-025: Technology-Based Provider Naming
- Related RFCs:
docs/rfc/RFC-013-openai-provider-implementation.md- Original OpenAI provider designdocs/rfc/RFC-016-modularization-for-ai-experiments.md- Provider system architecture- Related Documents:
docs/architecture/ARCHITECTURE.md- System architecturedocs/guides/PROVIDER_IMPLEMENTATION_GUIDE.md- Provider implementation guidedocs/wip/PROVIDER_REFACTORING_OPPORTUNITIES.md- Refactoring opportunities (consolidated here)docs/wip/UNIFIED_PROVIDERS_STATUS.md- Status tracking (consolidated here)docs/wip/UNIFIED_ML_PROVIDER_STATUS.md- ML provider status (consolidated here)docs/wip/ML_PROVIDER_UNIFIED_DESIGN.md- Design decisions (consolidated here)docs/wip/PROVIDER_INTEGRATION_STATUS.md- Integration status (consolidated here)docs/wip/PROVIDER_NAMING_CONSISTENCY.md- Naming analysis (consolidated here)docs/wip/PROVIDER_TEST_STRATEGY.md- Test strategy (consolidated here)docs/wip/PROVIDER_TEST_COVERAGE_EXPANSION.md- Test coverage (consolidated here)docs/wip/TEST_VS_PRODUCTION_CONFIGURATION.md- Test vs production config (consolidated here)docs/wip/MODULARITY_ARCHITECTURE_STRENGTHENING.md- Modularity architecture (consolidated here)docs/wip/NEXT_STEPS_ROADMAP.md- Next steps roadmap (consolidated here)
Abstract¶
This RFC consolidates all provider refactoring documentation and provides a comprehensive view of the unified provider architecture. The project has successfully implemented unified ML and OpenAI providers that implement all three protocols (TranscriptionProvider, SpeakerDetector, SummarizationProvider), replacing the previous separate provider classes. This RFC documents:
- The completed unified provider architecture
- Current integration status across the codebase
- Remaining refactoring opportunities
- Test strategy and coverage expansion
- Naming consistency considerations
- Test vs production configuration
- Modularity architecture strengthening
- Next steps roadmap for completing the refactoring
Architecture Alignment: This RFC builds upon RFC-013 (OpenAI Provider Implementation) and RFC-016 (Modularization for AI Experiments), documenting the completed unified provider pattern and remaining work.
Problem Statement¶
The original provider architecture had separate provider classes for each capability:
WhisperTranscriptionProviderfor transcriptionNERSpeakerDetectorfor speaker detectionTransformersSummarizationProviderfor summarization- Separate OpenAI providers for each capability
This architecture had several issues:
- Duplication: ML providers shared underlying libraries (Whisper, spaCy, Transformers) but were separate classes
- Inconsistency: Different patterns for ML vs OpenAI providers
- Complexity: Multiple provider classes to maintain
- Test Complexity: Tests needed to import and test multiple separate classes
The unified provider architecture addresses these issues by:
- Creating
MLProviderthat implements all three protocols using shared ML libraries - Creating
OpenAIProviderthat implements all three protocols using shared OpenAI client - Using factory pattern for provider creation
- Maintaining protocol-based interfaces for modularity
Goals¶
- Unified Architecture: Single provider class per provider type (MLProvider, OpenAIProvider)
- Protocol Compliance: All providers implement standard protocols
- Factory Pattern: Providers created via factories, not direct instantiation
- Backward Compatibility: Existing workflow continues to work
- Test Consistency: Tests verify protocol compliance, not class names
- Documentation Accuracy: All docs reflect unified provider architecture
- Code Consistency: Remove references to old separate provider classes
Current Architecture¶
Unified Provider Structure¶
### Factory Pattern
All providers are created via factories:
```python
# Transcription
from podcast_scraper.transcription.factory import create_transcription_provider
provider = create_transcription_provider(cfg) # Returns MLProvider or OpenAIProvider
# Speaker Detection
from podcast_scraper.speaker_detectors.factory import create_speaker_detector
detector = create_speaker_detector(cfg) # Returns MLProvider or OpenAIProvider
# Summarization
from podcast_scraper.summarization.factory import create_summarization_provider
provider = create_summarization_provider(cfg) # Returns MLProvider or OpenAIProvider
```yaml
## Provider Mapping
| Capability | Provider Option | Returns | Implementation |
| ------------ | ---------------- | --------- | ---------------- |
| Transcription | `"whisper"` | `MLProvider` | Whisper library |
| Transcription | `"openai"` | `OpenAIProvider` | OpenAI Whisper API |
| Speaker Detection | `"ner"` | `MLProvider` | spaCy NER |
| Speaker Detection | `"openai"` | `OpenAIProvider` | OpenAI GPT API |
| Summarization | `"local"` | `MLProvider` | Transformers/PyTorch |
| Summarization | `"openai"` | `OpenAIProvider` | OpenAI GPT API |
## Integration Status
### ✅ Completed Integration
#### Configuration (`src/podcast_scraper/config.py`)
- Provider fields validated: `transcription_provider`, `speaker_detector_provider`, `summary_provider`
- API key validation for OpenAI providers
- All provider options properly typed
#### Workflow (`src/podcast_scraper/workflow/orchestration.py`)
- Uses factories to create providers
- Calls provider methods via protocols
- Handles provider cleanup
- Uses `getattr()` for modularity (not `isinstance()`)
#### Factories
- `transcription/factory.py`: Returns MLProvider for "whisper", OpenAIProvider for "openai"
- `speaker_detectors/factory.py`: Returns MLProvider for "ner", OpenAIProvider for "openai"
- `summarization/factory.py`: Returns MLProvider for "local", OpenAIProvider for "openai"
#### Episode Processor (`src/podcast_scraper/workflow/episode_processor.py`)
- Uses transcription provider via protocol
- Checks method existence with `hasattr()`
#### Metadata Generation (`src/podcast_scraper/workflow/metadata_generation.py`)
- Uses summarization provider via protocol
- Supports backward compatibility for parallel processing
#### Service API (`src/podcast_scraper/service.py`)
- Delegates to workflow which uses providers
#### CLI (`src/podcast_scraper/cli.py`)
- Supports provider selection via CLI arguments
- Maps CLI arguments to Config object
### ⏳ Remaining Work
#### Test Updates (High Priority)
Many test files still import and test old separate provider classes directly:
**Files Needing Updates:**
- `tests/unit/podcast_scraper/transcription/test_transcription_provider.py` - 9 imports of `WhisperTranscriptionProvider`
- `tests/unit/test_transcription_provider.py` - 9 imports of `WhisperTranscriptionProvider`
- `tests/unit/podcast_scraper/summarization/test_summarization_provider.py` - 8 imports of `TransformersSummarizationProvider`
- `tests/unit/test_summarization_provider.py` - 8 imports of `TransformersSummarizationProvider`
- `tests/unit/test_speaker_detector_provider.py` - 7 imports of `NERSpeakerDetector`
- `tests/unit/podcast_scraper/speaker_detectors/test_speaker_detector_provider.py` - 7 imports of `NERSpeakerDetector`
- `tests/unit/podcast_scraper/test_openai_providers.py` - Multiple imports of old OpenAI providers
- `tests/integration/test_openai_providers.py` - Multiple imports of old OpenAI providers
- `tests/e2e/test_whisper_e2e.py` - Import of `WhisperTranscriptionProvider`
- `tests/e2e/test_ml_models_e2e.py` - Imports of `NERSpeakerDetector`, `TransformersSummarizationProvider`
- `tests/e2e/test_e2e_server.py` - Imports of old OpenAI providers
- `tests/integration/test_parallel_summarization.py` - Import of `TransformersSummarizationProvider`
- `tests/integration/test_provider_integration.py` - Checks for old class names
**Required Changes:**
- Update imports to use factories instead of direct imports
- Update test assertions to check protocol compliance, not class names
- Create unified provider test files where appropriate
**Example Fix:**
```python
# Before
from podcast_scraper.transcription.whisper_provider import WhisperTranscriptionProvider
provider = WhisperTranscriptionProvider(cfg)
# After
from podcast_scraper.transcription.factory import create_transcription_provider
provider = create_transcription_provider(cfg)
# Assert protocol compliance, not class name
assert hasattr(provider, 'transcribe')
assert hasattr(provider, 'initialize')
Documentation Updates (Medium Priority)¶
Factory Documentation:
src/podcast_scraper/transcription/factory.py- Mentions "WhisperTranscriptionProvider"src/podcast_scraper/speaker_detectors/factory.py- Mentions "NERSpeakerDetector"src/podcast_scraper/summarization/factory.py- Mentions "TransformersSummarizationProvider"
Workflow Comments:
src/podcast_scraper/workflow/orchestration.py- Line 526 mentions "NERSpeakerDetector"
Required Changes:
- Update docstrings to mention unified providers
- Update comments to reference protocols or unified providers
Refactoring Opportunities¶
Phase 1: Critical Fixes (Do Now)¶
- Update Test Files (High Priority)
- Update all test imports to use factories
- Update test assertions to check protocol compliance
-
Create unified provider test files where appropriate
-
Update Factory Documentation (Medium Priority)
- Update docstrings to reflect unified providers
-
Remove references to old class names
-
Update Workflow Comments (Medium Priority)
- Update comments to reference protocols/unified providers
Phase 2: Improvements (Do Soon)¶
- Protocol Compliance Testing (Medium Priority)
- Create protocol compliance test suite
- Verify all providers implement required methods
- Verify method signatures match protocols
-
Run for all providers
-
Standardize Error Messages (Medium Priority)
- Review and standardize error messages across providers
- Ensure consistent user experience
Phase 3: Polish (Do Later)¶
- Extract Common Patterns (Low Priority)
- Evaluate if base class/mixins are worth it
- Only if duplication becomes significant
-
Potential shared patterns:
- Initialization state tracking
- Error handling for missing config
- Protocol compliance verification
- Cleanup patterns
-
Add Deprecation Warnings (Low Priority)
- Only if keeping old classes for transition period
-
Add warnings to old provider classes
-
Type Hints Consistency (Low Priority)
- Review and align type hints with protocol definitions
Test Strategy¶
Test Organization¶
The test strategy follows a three-tier pyramid:
/E2E\ ← Few, realistic end-to-end tests
/------\
/Integration\ ← Moderate, focused integration tests
/------------\
/ Unit \ ← Many, fast unit tests
/----------------\
Unit Tests: Standalone Provider Tests¶
Location: tests/unit/podcast_scraper/ml/ and tests/unit/podcast_scraper/openai/
Purpose: Test the provider classes themselves in isolation
What They Test:
- Provider creation and initialization
- Protocol method implementation (transcribe, detect_speakers, summarize)
- Error handling and edge cases
- Cleanup and resource management
- Configuration validation
- Internal state management
Mocking Strategy:
- MLProvider: Mock Whisper library, spaCy models, Transformers models
- OpenAIProvider: Mock OpenAI API client
Unit Tests: Factory Tests¶
Location: tests/unit/podcast_scraper/transcription/, tests/unit/podcast_scraper/speaker_detectors/, tests/unit/podcast_scraper/summarization/
Purpose: Test that factories create correct providers and verify protocol compliance
What They Test:
- Factory creates correct provider type (MLProvider vs OpenAIProvider)
- Factory handles invalid provider types
- Providers returned by factories implement protocols correctly
- Protocol method signatures match expectations
Integration Tests: Provider Integration¶
Location: tests/integration/
Purpose: Test providers working with other components in the app
What They Test:
- Providers work with Config objects
- Providers work with workflow components
- Provider factory integration
- Multiple providers working together
- Protocol compliance in component context
- Provider switching via configuration
- Error handling in workflow context
Mocking Strategy:
- Real Providers: Use actual provider implementations (MLProvider, OpenAIProvider)
- Mocked External Services: Mock HTTP APIs, ML model loading (for speed)
- Real Internal Components: Use real Config, real workflow logic
E2E Tests: Full Workflow Tests¶
Location: tests/e2e/
Purpose: Test providers in complete user workflows
What They Test:
- Providers work in full pipeline (CLI → workflow → providers → output)
- Providers work with real HTTP client (E2E server mock endpoints)
- Providers work with real ML models (for local providers)
- Multiple providers work together in full pipeline
- Error scenarios (API failures, rate limits, model loading failures)
- Complete user journeys
Test Assertions: Protocol Compliance vs Class Names¶
❌ Old Pattern (Class Name Checks):
# BAD: Checking class names
self.assertEqual(provider.__class__.__name__, "WhisperTranscriptionProvider")
GOOD: Checking protocol compliance¶
self.assertEqual(provider.class.name, "MLProvider") # Unified provider self.assertTrue(hasattr(provider, "transcribe")) # Protocol method self.assertTrue(hasattr(provider, "initialize")) # Protocol method self.assertTrue(hasattr(provider, "cleanup")) # Protocol method
## Test Coverage Expansion
**New Test Files Created:**
- `tests/unit/podcast_scraper/ml/test_ml_provider_lifecycle.py` - Lifecycle and edge cases
- `tests/unit/podcast_scraper/openai/test_openai_provider_lifecycle.py` - Lifecycle and edge cases
- `tests/integration/test_provider_factory_error_handling.py` - Factory error handling
- `tests/integration/test_provider_mixed_configurations.py` - Mixed provider configurations
**Coverage Areas Strengthened:**
1. **Lifecycle Management**: Multiple initialize/cleanup calls, partial initialization, error recovery
2. **Thread Safety Attribute**: `_requires_separate_instances` behavior, workflow pattern
3. **Factory Error Handling**: Invalid provider types, missing API keys, initialization failures
4. **Mixed Provider Configurations**: Different providers for different capabilities
## Naming Consistency
### Current Naming
| Capability | Provider Options | Current Choice | Rationale |
| ------------ | ------------------ | ---------------- | ----------- |
| **Transcription** | `whisper`, `openai` | `whisper` | ✅ Specific library name |
| **Speaker Detection** | `ner`, `openai` | `ner` | ⚠️ Technique name (uses spaCy) |
| **Summarization** | `local`, `openai` | `local` | ❌ Generic term (uses Transformers/PyTorch) |
### Problem
The naming is inconsistent:
- **`whisper`**: Specific library name (clear)
- **`ner`**: Technique name (uses spaCy, but NER is the technique)
- **`local`**: Generic term (uses Transformers/PyTorch, but doesn't specify)
**Issue**: "local" is vague and doesn't tell users what technology is being used.
### Options
#### Option 1: Keep "local" (Current)
**Pros:**
- Emphasizes local vs API distinction
- Generic enough for future local implementations
- Already implemented and in use
**Cons:**
- Vague (doesn't specify Transformers/PyTorch)
- Inconsistent with `whisper` naming
#### Option 2: Change to "transformers" (Original RFC design)
**Pros:**
- ✅ Specific library name (consistent with `whisper`)
- ✅ Clear what technology is used
- ✅ Matches original RFC-013 design
**Cons:**
- ❌ Breaking change (requires migration)
- ❌ Less generic (ties to specific library)
### Recommendation
**Option 2: Change to "transformers"** for consistency with `whisper` naming pattern.
**Rationale:**
- Matches original RFC-013 design
- Consistent with `whisper` (both are library names)
- More descriptive than "local"
- Users can still understand it's local (vs API) from context
**Migration Path:**
1. Add `"transformers"` as alias for `"local"` (backward compatibility)
2. Update documentation to prefer `"transformers"`
3. Deprecate `"local"` with warning
4. Remove `"local"` in future major version
## Key Decisions
1. **Unified Provider Pattern**
- **Decision**: Single provider class per provider type (MLProvider, OpenAIProvider)
- **Rationale**: Matches user's vision of "one ML provider that uses Whisper, spaCy, and torch for different functionality, same as we have with OpenAI." Simplifies architecture and reduces duplication.
2. **Factory Pattern**
- **Decision**: All providers created via factories, not direct instantiation
- **Rationale**: Enables provider switching via configuration, maintains abstraction, supports protocol-based design.
3. **Protocol-Based Testing**
- **Decision**: Tests verify protocol compliance, not class names
- **Rationale**: Maintains modularity, allows provider implementation changes without breaking tests.
4. **Modularity Pattern**
- **Decision**: Use `getattr(provider, "_requires_separate_instances", False)` instead of `isinstance()` checks
- **Rationale**: Maintains abstraction, avoids direct dependencies on concrete implementations.
5. **Thread Safety Attribute**
- **Decision**: Use `_requires_separate_instances` attribute for thread-safety checks
- **Rationale**: MLProvider requires separate instances for parallel processing, OpenAIProvider does not.
## Alternatives Considered
1. **Keep Separate Provider Classes**
- **Description**: Maintain separate classes for each capability
- **Pros**: More modular, easier to understand individual capabilities
- **Cons**: Duplication, inconsistent patterns, more classes to maintain
- **Why Rejected**: Unified pattern matches user's vision and reduces complexity
2. **Shared ML Initialization, Separate Provider Classes**
- **Description**: Keep separate provider classes but share initialization logic
- **Pros**: Maintains protocol separation, models only loaded when needed
- **Cons**: More complex architecture, doesn't match "one ML provider" vision
- **Why Rejected**: User wanted single unified provider class
3. **Base Class for Common Patterns**
- **Description**: Extract common initialization and error handling to base class
- **Pros**: Reduces duplication, consistent patterns
- **Cons**: Premature optimization, adds abstraction layer
- **Why Rejected**: Current duplication is acceptable, can revisit later if needed
## Testing Strategy
### Test Coverage
**Unit Tests:**
- Standalone provider tests (MLProvider, OpenAIProvider)
- Factory tests (verify correct provider creation)
- Protocol compliance tests
- Lifecycle and edge case tests
**Integration Tests:**
- Provider integration with workflow
- Factory error handling
- Mixed provider configurations
- Protocol compliance in component context
**E2E Tests:**
- Full pipeline with providers
- Real ML models (for local providers)
- E2E server mock endpoints (for API providers)
- Complete user workflows
### Test File Organization
Rollout Plan¶
Phase 1: Critical Fixes (Immediate)¶
- Update Test Files (High Priority)
- Update all test imports to use factories
- Update test assertions to check protocol compliance
- Create unified provider test files where appropriate
-
Timeline: 1-2 weeks
-
Update Factory Documentation (Medium Priority)
- Update docstrings to reflect unified providers
- Remove references to old class names
-
Timeline: 1 week
-
Update Workflow Comments (Medium Priority)
- Update comments to reference protocols/unified providers
- Timeline: 1 day
Phase 2: Improvements (Soon)¶
- Protocol Compliance Testing (Medium Priority)
- Create protocol compliance test suite
- Run for all providers
-
Timeline: 1 week
-
Standardize Error Messages (Medium Priority)
- Review and standardize error messages
- Timeline: 1 week
Phase 3: Polish (Later)¶
- Extract Common Patterns (Low Priority)
- Evaluate if base class/mixins are worth it
- Only if duplication becomes significant
-
Timeline: TBD
-
Add Deprecation Warnings (Low Priority)
- Only if keeping old classes for transition period
-
Timeline: TBD
-
Type Hints Consistency (Low Priority)
- Review and align type hints
- Timeline: TBD
Success Criteria¶
- ✅ All test files use factories instead of direct imports
- ✅ All test assertions check protocol compliance, not class names
- ✅ All factory documentation reflects unified providers
- ✅ All workflow comments reference protocols/unified providers
- ✅ Protocol compliance test suite exists and passes
- ✅ Error messages are standardized across providers
- ✅ No references to old separate provider classes in codebase (except deprecation warnings if applicable)
Benefits¶
- Consistent Architecture: Single provider class per provider type simplifies architecture
- Reduced Duplication: ML providers share underlying libraries, OpenAI providers share client
- Better Testability: Protocol-based testing maintains modularity
- Easier Maintenance: Fewer classes to maintain, clearer patterns
- Improved Documentation: Accurate docs reflect current architecture
- Better User Experience: Consistent error messages and clearer naming
Migration Path¶
For Tests¶
- Update Imports: Change from direct provider imports to factory imports
- Update Assertions: Change from class name checks to protocol compliance checks
- Run Tests: Verify all tests pass with unified providers
For Documentation¶
- Update Factory Docstrings: Replace old class names with unified provider names
- Update Workflow Comments: Reference protocols or unified providers
- Update Guides: Ensure provider implementation guide reflects unified architecture
For Naming (If Changing "local" to "transformers")¶
- Add Alias: Add
"transformers"as alias for"local"in config validation - Update Documentation: Prefer
"transformers"in docs - Add Deprecation Warning: Warn when
"local"is used - Remove in Future: Remove
"local"in future major version
Test vs Production Configuration¶
Overview¶
Provider names are the same for tests and production, but model names differ to optimize for speed (tests) vs quality (production).
Provider Names: Same for Tests and Production¶
Provider names are consistent:
- Transcription:
"whisper"(default) - Speaker Detection:
"spacy"(default, deprecated:"ner") - Summarization:
"transformers"(default, deprecated:"local")
Why? Provider names indicate which library/technology is used, not the model size. Both tests and production use the same libraries.
Model Names: Different for Tests vs Production¶
Model names differ between tests and production:
Whisper Models¶
- Tests:
TEST_DEFAULT_WHISPER_MODEL = "tiny.en"(smallest, fastest) - Production:
whisper_model = "base"(default, better quality)
Transformers Models¶
- Tests:
TEST_DEFAULT_SUMMARY_MODEL = "facebook/bart-base"(~500MB, fast) - Production:
summary_model = None→ auto-selects"bart-large-cnn"(better quality)
spaCy Models¶
- Tests:
DEFAULT_NER_MODEL = "en_core_web_sm"(same as production) - Production:
DEFAULT_NER_MODEL = "en_core_web_sm"(same as tests)
Test Configuration¶
create_test_config() Helper¶
The tests/conftest.py file provides create_test_config() which creates a Config object with test-friendly defaults:
def create_test_config(**overrides):
defaults = {
"rss_url": TEST_FEED_URL,
"whisper_model": config.TEST_DEFAULT_WHISPER_MODEL, # "tiny.en"
# ... other defaults ...
# NOTE: Provider names are NOT set here - they use Config defaults
}
defaults.update(overrides)
return config.Config(**defaults)
```yaml
**Key Point**: `create_test_config()` does NOT override provider names. Tests use the same provider defaults as production (`"spacy"`, `"transformers"`).
### Configuration Summary
| Aspect | Tests | Production |
| -------- | ------- | ------------ |
| **Provider Names** | `"spacy"`, `"transformers"` | `"spacy"`, `"transformers"` (same) |
| **Whisper Model** | `"tiny.en"` (fast) | `"base"` (quality) |
| **Summary Model** | `"facebook/bart-base"` (fast) | `"bart-large-cnn"` (quality) |
| **spaCy Model** | `"en_core_web_sm"` | `"en_core_web_sm"` (same) |
**Key Takeaway**: Provider names are consistent between tests and production. Only model sizes differ (fast for tests, quality for production).
## Modularity Architecture Strengthening
### Goal
Strengthen the modular architecture with clear separation of concerns, well-defined interfaces, and independent, testable components. With two clear provider types (OpenAI and ML), we can establish strict boundaries and modularity principles.
### Architecture Principles
1. **Protocol-Based Design**
- All providers implement protocols
- Core code depends on protocols, not implementations
- Protocols define contracts, not implementations
2. **Factory Pattern**
- Factories are the single point of provider creation
- Core code uses factories, not direct instantiation
- Factories handle provider selection logic
3. **Dependency Inversion**
- High-level modules (workflow) depend on abstractions (protocols)
- Low-level modules (providers) implement abstractions
- Both depend on abstractions, not each other
4. **Separation of Concerns**
- Each provider handles one capability
- Providers don't know about each other
- Workflow orchestrates, providers execute
5. **Testability**
- Tests mock protocols, not implementations
- Each provider has isolated test suite
- Integration tests verify protocol compliance
### Strengthening Plan
#### Phase 1: Protocol Strengthening
**Goal**: Ensure protocols are complete, clear, and well-documented.
- Review all three protocols (`TranscriptionProvider`, `SpeakerDetector`, `SummarizationProvider`)
- Ensure all required methods are defined
- Add comprehensive docstrings with examples
- Document error conditions and exceptions
- Add type hints for all parameters and return values
#### Phase 2: Remove Concrete Dependencies
**Goal**: Workflow and core code should only depend on protocols, not concrete implementations.
**Current Status**: ✅ Fixed workflow.py line 2070 - Removed direct `TransformersSummarizationProvider` import
- **Solution**: Added `_requires_separate_instances` attribute to providers
- **Implementation**: Workflow now uses `getattr(summary_provider, "_requires_separate_instances", False)` instead of `isinstance()` check
- **Benefits**: Workflow no longer depends on concrete provider classes, maintains modularity
**Remaining Work**:
- Audit all imports: Find and fix all concrete provider imports in core code
- Protocol documentation: Add comprehensive docstrings to all protocols
- Factory documentation: Document factory pattern and usage
#### Phase 3: Provider Independence
**Goal**: Ensure providers are completely independent and don't share state.
- Verify providers don't use module-level state
- Verify providers don't share caches
- Ensure each provider instance is independent
- Verify providers don't import each other
- Verify providers don't depend on workflow or core modules
- Providers should only depend on: `config`, `models`, protocols
#### Phase 4: Test Independence
**Goal**: Each provider has isolated, independent test suite.
**Test Principles**:
- **Mock Protocols, Not Implementations**: Tests should mock protocol interfaces
- **No Cross-Provider Dependencies**: Tests for one provider shouldn't depend on another
- **Isolated Fixtures**: Each provider test has its own fixtures
- **Protocol Compliance Tests**: Test that providers implement protocols correctly
**Good Test Pattern**:
```python
from unittest.mock import Mock
from podcast_scraper.transcription.base import TranscriptionProvider
def test_workflow_uses_provider():
"""Test workflow uses provider via protocol."""
# Mock the protocol, not concrete class
mock_provider = Mock(spec=TranscriptionProvider)
mock_provider.transcribe.return_value = "transcribed text"
# Use in workflow
result = workflow_function(mock_provider)
assert result == "transcribed text"
```python
#### Phase 5: Factory Pattern Strengthening
**Goal**: Factories are the single point of provider creation.
- Factories are the ONLY place that import concrete provider classes
- Factories return protocol types, not concrete types
- Factories handle provider creation errors gracefully
- Factories validate configuration before creating providers
### Modularity Success Criteria
1. ✅ No direct provider class imports in workflow or core code
2. ✅ All providers implement protocols correctly
3. ✅ Tests mock protocols, not implementations
4. ✅ Each provider has isolated, independent test suite
5. ✅ Factories are the only place that create providers
6. ✅ Clear documentation of all interfaces
7. ✅ Providers are completely independent (no shared state)
### Modularity Benefits
1. **Swappability**: Providers can be swapped without modifying core code
2. **Testability**: Easy to mock and test components independently
3. **Maintainability**: Clear boundaries make code easier to understand and modify
4. **Extensibility**: New providers can be added without changing existing code
5. **Reliability**: Clear interfaces reduce bugs and integration issues
## Next Steps Roadmap
### Current Status Summary
#### ✅ Recently Completed
1. **Provider Naming Consistency**
- Changed `"ner"` → `"spacy"` for speaker detection
- Changed `"local"` → `"transformers"` for summarization
- Added backward compatibility with deprecation warnings
- Updated all code, tests, documentation, and examples
2. **Unified Providers**
- `MLProvider` implements all three protocols (Whisper, spaCy, Transformers)
- `OpenAIProvider` implements all three protocols (OpenAI API)
- Factories updated to return unified providers
- Workflow integration complete
- Modularity improvements (no `isinstance()` checks)
3. **Test Coverage Expansion**
- Standalone unit tests for MLProvider and OpenAIProvider
- Integration tests for unified providers
- Lifecycle and error handling tests
- Mixed configuration tests
### Immediate Next Steps (Priority Order)
#### 1. 🔴 **Run Tests and Fix Any Issues** (Critical)
**Goal**: Verify everything works with the new provider naming
**Actions**:
```bash
# Run all tests
make test
# Or run specific test suites
make test-unit
make test-integration
make test-e2e
- Some tests might need updates for new defaults
- Backward compatibility tests should verify deprecated names work
2. 🟡 Clean Up Old Provider Files (Medium Priority)¶
Goal: Mark old separate provider files as deprecated
Files to Deprecate:
src/podcast_scraper/transcription/whisper_provider.py(empty, logic moved to MLProvider)src/podcast_scraper/speaker_detectors/ner_detector.py(empty, logic moved to MLProvider)src/podcast_scraper/summarization/local_provider.py(empty, logic moved to MLProvider)src/podcast_scraper/transcription/openai_provider.py(empty, logic moved to OpenAIProvider)src/podcast_scraper/speaker_detectors/openai_detector.py(empty, logic moved to OpenAIProvider)src/podcast_scraper/summarization/openai_provider.py(empty, logic moved to OpenAIProvider)
Actions:
- Add deprecation warnings to these files
- Add
__all__ = []to prevent accidental imports - Add docstrings explaining they're deprecated
- Update any remaining imports to use factories instead
Note: Don't delete yet - keep for backward compatibility, remove in future major version
3. 🟡 Update Remaining Test Files (Medium Priority)¶
Goal: Ensure all tests use factories and check protocol compliance
Verification:
# Search for direct imports of old providers
grep -r "from.*whisper_provider import" tests/
grep -r "from.*ner_detector import" tests/
grep -r "from.*local_provider import" tests/
grep -r "WhisperTranscriptionProvider\|NERSpeakerDetector\|TransformersSummarizationProvider" tests/
- Update assertions to check protocol compliance
- Use
hasattr()or protocol checks instead ofisinstance()
4. 🟢 Documentation Cleanup (Low Priority)¶
Goal: Ensure all documentation reflects current state
Files to Review:
docs/guides/PROVIDER_IMPLEMENTATION_GUIDE.mddocs/architecture/ARCHITECTURE.mddocs/guides/DEVELOPMENT_GUIDE.mddocs/guides/TESTING_GUIDE.md- All RFCs and PRDs
Testing Checklist¶
Before considering this work complete, verify:
- [ ] All unit tests pass
- [ ] All integration tests pass
- [ ] All E2E tests pass
- [ ] Backward compatibility tests pass (deprecated names work)
- [ ] No deprecation warnings in test output (unless testing deprecation)
- [ ] Provider switching works (e.g., Whisper → OpenAI)
- [ ] Mixed provider configurations work (e.g., Whisper + OpenAI summarization)
- [ ] Default configurations work correctly
- [ ] CLI arguments work with new names
- [ ] Config files work with new names
- [ ] Library API works with new names
Roadmap Success Criteria¶
This work is complete when:
- ✅ All tests pass with new provider names
- ✅ Deprecated names still work (with warnings)
- ✅ No direct imports of old provider classes in tests
- ✅ All documentation reflects current state
- ✅ Old provider files are marked as deprecated
- ✅ Codebase is ready for future deprecation removal
Notes¶
- Provider names are now consistent:
whisper,spacy,transformers(all library names) - Backward compatibility is maintained: Old names work but emit warnings
- Unified providers are the standard: All factories return
MLProviderorOpenAIProvider - Modularity is maintained: No
isinstance()checks, uses protocols and attributes
Open Questions¶
- Naming Consistency: Should we change
"local"to"transformers"for consistency? - Deprecation Strategy: Should we add deprecation warnings to old provider classes?
- Base Class Extraction: Is the duplication significant enough to warrant a base class?
- Test File Organization: Should we consolidate test files or keep them separate?
References¶
- Related RFC:
docs/rfc/RFC-013-openai-provider-implementation.md- Original OpenAI provider design - Related RFC:
docs/rfc/RFC-016-modularization-for-ai-experiments.md- Provider system architecture - Source Code:
src/podcast_scraper/providers/ml/ml_provider.py- Unified ML providersrc/podcast_scraper/providers/openai/openai_provider.py- Unified OpenAI providersrc/podcast_scraper/transcription/factory.py- Transcription factorysrc/podcast_scraper/speaker_detectors/factory.py- Speaker detector factorysrc/podcast_scraper/summarization/factory.py- Summarization factory- Provider Implementation Guide:
docs/guides/PROVIDER_IMPLEMENTATION_GUIDE.md