Provider Implementation Guide¶
This comprehensive guide explains how to implement new providers for the podcast scraper. It consolidates information from multiple guides and uses OpenAI as a complete example throughout.
Overview¶
The podcast scraper uses a protocol-based provider system where each capability (transcription, speaker detection, summarization) has a protocol interface that all providers must implement.
This design allows:
- Pluggable implementations: Swap providers via configuration
- Type safety: Protocols ensure consistent interfaces
- Easy testing: Mock providers for testing
- Extensibility: Add new providers without modifying core code
Architecture¶
Provider Types¶
- TranscriptionProvider: Converts audio to text
- SpeakerDetector: Detects speaker names from episode metadata
- SummarizationProvider: Generates episode summaries
Unified Provider Pattern¶
As of v2.4.0, the project follows a Unified Provider pattern where a single class implementation handles multiple protocols using shared libraries or API clients.
- MLProvider: Unified local implementation using Whisper, spaCy, and Transformers.
- HybridMLProvider: Combines a local ML MAP phase with an LLM REDUCE phase.
- OpenAIProvider: Unified API implementation using OpenAI's various endpoints.
- GeminiProvider: Google Gemini API (transcription + summarization).
- AnthropicProvider: Anthropic Claude API (summarization only).
- MistralProvider: Mistral API (summarization only).
- GrokProvider: Grok/xAI API (summarization only).
- DeepSeekProvider: DeepSeek API (summarization only).
- OllamaProvider: Local self-hosted LLMs (transcription, speaker detection, summarization).
File Structure:
src/podcast_scraper/
├── providers/
│ ├── ml/
│ │ ├── ml_provider.py # Unified Local ML implementation
│ │ ├── hybrid_ml_provider.py # Hybrid MAP-REDUCE implementation
│ │ ├── whisper_utils.py # Whisper transcription utilities
│ │ ├── speaker_detection.py # spaCy NER speaker detection
│ │ └── summarizer.py # Transformers summarization
│ ├── openai/
│ │ └── openai_provider.py # Unified OpenAI API implementation
│ ├── gemini/
│ │ └── gemini_provider.py # Gemini API implementation
│ ├── anthropic/
│ │ └── anthropic_provider.py # Anthropic API implementation
│ ├── mistral/
│ │ └── mistral_provider.py # Mistral API implementation
│ ├── grok/
│ │ └── grok_provider.py # Grok API implementation
│ ├── deepseek/
│ │ └── deepseek_provider.py # DeepSeek API implementation
│ └── ollama/
│ └── ollama_provider.py # Ollama local LLM implementation
├── transcription/
│ ├── base.py # Protocol definition
│ └── factory.py # Factory logic
├── speaker_detectors/
│ ├── base.py # Protocol definition
│ └── factory.py # Factory logic
└── summarization/
├── base.py # Protocol definition
└── factory.py # Factory logic
Step-by-Step Implementation¶
Step 1: Understand the Protocol¶
First, examine the protocol interface in {capability}/base.py. For example, TranscriptionProvider:
from typing import Protocol


class TranscriptionProvider(Protocol):
    def initialize(self) -> None:
        """Initialize provider (load models, connect to API, etc.)."""
        ...

    def transcribe(
        self,
        audio_path: str,
        language: str | None = None,
    ) -> str:
        """Transcribe audio file to text."""
        ...
Step 2: Implement the Provider Class¶
Create a new file for your provider. If your provider handles multiple capabilities, consider a unified structure like openai/ or ml/.
Reference Implementation: src/podcast_scraper/providers/openai/openai_provider.py
1. Configuration Validation¶
Check required config fields in __init__(). API keys should be validated here.
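A sketch of fail-fast validation in `__init__()`. The exception class is a local stand-in for `podcast_scraper.exceptions.ProviderConfigError`, and the `openai_api_key` config field name is an assumption for illustration:

```python
# Stand-in for podcast_scraper.exceptions.ProviderConfigError (illustrative only).
class ProviderConfigError(Exception):
    pass


class OpenAIProviderSketch:
    """Hypothetical sketch: validate required config fields in __init__."""

    def __init__(self, cfg) -> None:
        # Fail fast on missing credentials instead of at the first API call.
        api_key = getattr(cfg, "openai_api_key", None)
        if not api_key:
            raise ProviderConfigError("openai_api_key is required")
        self._api_key = api_key
```

Raising at construction time surfaces configuration mistakes immediately, before any episodes are queued for processing.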
2. Thread Safety¶
Define _requires_separate_instances based on your implementation:
- True: For local ML models (HuggingFace/Whisper) that are not thread-safe.
- False: For API clients (OpenAI) that handle concurrent requests internally.
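A sketch of how a worker pool might consult this flag (the class names and the `instances_needed` helper are hypothetical; the actual pool logic lives in core code):

```python
class LocalWhisperSketch:
    """Hypothetical local ML provider: model objects are not thread-safe."""
    _requires_separate_instances = True


class OpenAIClientSketch:
    """Hypothetical API provider: the HTTP client handles concurrency."""
    _requires_separate_instances = False


def instances_needed(provider_cls, workers: int) -> int:
    # A pool can share one instance unless the provider opts out.
    return workers if provider_cls._requires_separate_instances else 1
```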
3. Initialization Lifecycle¶
- __init__: Store configuration and initialize lightweight clients.
- initialize(): Load heavy resources (ML models) or perform network handshakes. This method must be idempotent.
- Lazy loading: Call initialize() inside protocol methods if not already initialized.
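The lifecycle above can be sketched as follows (the class and its `load_count` counter are hypothetical, used here only to make idempotence observable):

```python
class LazyProviderSketch:
    """Hypothetical sketch of the initialization lifecycle."""

    def __init__(self) -> None:
        # __init__ stays cheap: no models loaded, no network calls.
        self._model = None
        self.load_count = 0

    def initialize(self) -> None:
        # Idempotent: a second call is a no-op.
        if self._model is not None:
            return
        self._model = object()  # placeholder for a heavy model load
        self.load_count += 1

    def transcribe(self, audio_path: str) -> str:
        # Lazy loading: protocol methods ensure initialization themselves.
        self.initialize()
        return f"transcript of {audio_path}"
```

Because initialize() is both idempotent and called lazily, callers never need to track whether setup has already happened.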
4. Error Handling¶
Use typed exceptions from podcast_scraper.exceptions:
- ProviderConfigError: For invalid configuration.
- ProviderDependencyError: For missing packages or models.
- ProviderRuntimeError: For API failures or inference errors.
- ProviderNotInitializedError: If a method is called before initialize().
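A sketch of mapping a low-level failure onto one of these typed exceptions (the exception class here is a local stand-in for the one in podcast_scraper.exceptions, and `load_backend` is a hypothetical helper):

```python
import importlib


# Stand-in for podcast_scraper.exceptions.ProviderDependencyError (illustrative).
class ProviderDependencyError(Exception):
    pass


def load_backend(module_name: str):
    """Hypothetical helper: translate an ImportError into a typed error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original exception so the root cause stays in the traceback.
        raise ProviderDependencyError(f"{module_name} is not installed") from exc
```

Catching narrow exceptions and re-raising typed ones lets the pipeline decide centrally whether a failure is retryable, a user error, or a missing optional dependency.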
5. Prompt Store (for LLMs)¶
Use the centralized prompt_store for LLM prompts:
from ..prompts.store import render_prompt, get_prompt_metadata
# Render a versioned prompt
system_prompt = render_prompt("summarization/system_v1")
Step 3: Register in Factory¶
Update the factory functions in {capability}/factory.py to include your new provider.
def create_transcription_provider(cfg: config.Config) -> TranscriptionProvider:
    # ...
    if provider_type == "whisper":
        from ..providers.ml.ml_provider import MLProvider

        return MLProvider(cfg)
    elif provider_type == "openai":
        from ..providers.openai.openai_provider import OpenAIProvider

        return OpenAIProvider(cfg)
    # ...
Testing Your Provider¶
E2E Server Mock Endpoints¶
For API providers, you must add mock endpoint handlers to the E2E test server (tests/e2e/fixtures/e2e_http_server.py). This allows tests to run without real API keys or internet access.
Testing Checklist¶
- [ ] Unit Tests: Test logic in isolation, mock all external dependencies.
- [ ] Integration Tests: Test provider with the real E2E server mock endpoints.
- [ ] E2E Tests: Test provider in the full pipeline context.
- [ ] Resource Management: Verify cleanup() properly unloads models or closes connections.
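For the unit-test item above, a sketch of mocking the external dependency and asserting on the interaction (the SummarizerSketch class and its injected `client.complete` interface are hypothetical, not the project's real API):

```python
from unittest.mock import MagicMock


class SummarizerSketch:
    """Hypothetical provider wrapper around an injected API client."""

    def __init__(self, client) -> None:
        self._client = client

    def summarize(self, text: str) -> str:
        response = self._client.complete(prompt=text)
        return response["summary"]


def test_summarize_uses_client() -> None:
    # Unit test: mock the external client, assert on the call and the result.
    client = MagicMock()
    client.complete.return_value = {"summary": "short version"}
    provider = SummarizerSketch(client)
    assert provider.summarize("long text") == "short version"
    client.complete.assert_called_once_with(prompt="long text")
```

Injecting the client in `__init__` is what makes this kind of isolated test possible without patching module globals.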
Related Documentation¶
- Protocol Extension Guide - How to extend protocols
- ML Provider Reference - Details on local ML models
- Development Guide - Development workflow