# PRD-006: OpenAI Provider Integration
- Status: ✅ Implemented (v2.4.0)
- Related RFCs: RFC-013, RFC-017, RFC-021, RFC-022, RFC-029
## Summary
Add OpenAI API as an optional provider for speaker detection (NER), transcription, and summarization capabilities, enabling users to choose between local on-device processing and cloud-based API services. This builds on the modularization refactoring (RFC-021) to provide seamless provider switching without changing end-user experience or workflow behavior.
## Background & Context
Currently, the podcast scraper uses on-device AI/ML models:
- Speaker Detection: spaCy NER models running locally
- Transcription: OpenAI Whisper models running locally
- Summarization: Hugging Face transformer models (BART, LED) running locally
While on-device processing provides privacy and predictable costs, some users may prefer:
- Higher accuracy: OpenAI's models may provide better results for speaker detection and summarization
- Reduced resource usage: Offload processing to cloud, reducing on-device CPU/GPU/memory requirements
- Faster processing: API-based transcription can be faster than local Whisper for some users
- Better quality: OpenAI's GPT models may produce higher-quality summaries than local transformers
This PRD addresses the need to add OpenAI as a provider option while maintaining backward compatibility and zero changes to end-user experience when using default (local) providers.
## Goals
- Add OpenAI API as provider option for speaker detection, transcription, and summarization
- Maintain 100% backward compatibility - no changes to existing behavior with default (local) providers
- Provide secure API key management via environment variables and `.env` files (never in source code)
- Use `python-dotenv` for convenient environment variable management per environment
- Enable per-capability provider selection (can mix local and OpenAI providers)
- Maintain existing parallelism and performance characteristics where applicable
- Support both development and production environments for API key storage
## OpenAI Model Selection and Cost Analysis
### Configuration Fields
Add to `config.py` when implementing OpenAI providers:
```python
# OpenAI Model Selection
openai_speaker_model: str = Field(
    default="gpt-4o-mini",
    description="OpenAI model for speaker detection (entity extraction)",
)
openai_summary_model: str = Field(
    default="gpt-4o-mini",
    description="OpenAI model for summarization",
)
openai_transcription_model: str = Field(
    default="whisper-1",
    description="OpenAI Whisper API model version",
)

# OpenAI API Configuration
openai_api_key: Optional[str] = Field(
    default=None,
    description="OpenAI API key (prefer OPENAI_API_KEY environment variable or .env file)",
)
openai_temperature: float = Field(
    default=0.3,
    ge=0.0,
    le=2.0,
    description="Temperature for OpenAI generation (0.0-2.0, lower = more deterministic)",
)
openai_max_tokens: Optional[int] = Field(
    default=None,
    description="Max tokens for OpenAI generation (None = model default)",
)
```
### Legacy Model Pricing Reference
| Model | Input Cost | Output Cost | Context Window | Best For |
| ----- | ---------- | ----------- | -------------- | -------- |
| **gpt-4o** | $2.50 / 1M tokens | $10.00 / 1M tokens | 128k tokens | Highest quality |
| **gpt-4o-mini** | $0.15 / 1M tokens | $0.60 / 1M tokens | 128k tokens | **Recommended** (balanced) |
| **gpt-4-turbo** | $10.00 / 1M tokens | $30.00 / 1M tokens | 128k tokens | Maximum quality |
| **gpt-3.5-turbo** | $0.50 / 1M tokens | $1.50 / 1M tokens | 16k tokens | Budget option |
| **whisper-1** | $0.006 / minute | N/A | N/A | Audio transcription |
**Note:** Prices subject to change. Check [OpenAI Pricing](https://openai.com/api/pricing/) for current rates.
### Cost Comparison: Local vs OpenAI (Per 100 Episodes)
| Component | Local (Transformers) | OpenAI (gpt-4o-mini) | Difference |
| --------- | ------------------- | -------------------- | ---------- |
| **Speaker Detection** | Free (spaCy NER) | $0.14 | +$0.14 |
| **Transcription** | Free (local Whisper) | $36.00 | +$36.00 |
| **Summarization** | Free (local BART/LED) | $0.41 | +$0.41 |
| **Total API Costs** | $0 | **$36.55** | +$36.55 |
**Infrastructure Costs (Local):**
- Hardware: ~$2,000-$4,000 (GPU, high RAM)
- Electricity: ~$0.50-$1.00/hour (GPU usage)
- Maintenance: Developer time for model updates
#### Recommended Hybrid Strategies

**Cost-Optimized Hybrid ($0.55/100 episodes):**

```yaml
speaker_detector_provider: openai  # $0.14/100 (minimal cost)
transcription_provider: whisper    # Free (local)
summary_provider: openai           # $0.41/100 (high value)
```

**Privacy-First Hybrid:**

```yaml
speaker_detector_provider: ner   # Free (local, private)
transcription_provider: whisper  # Free (local, private)
summary_provider: openai         # $0.41/100 (convenience)
```

**Default Generation Parameters:**
- Temperature: 0.3 (deterministic, factual)
- Max Tokens: None (model default)
## OpenAI Model Pricing
⚠️ **Pricing Information:** The pricing table below is provided for reference only and may be outdated. Always check [OpenAI Pricing](https://openai.com/api/pricing/) for current rates before making cost decisions.
### Model Pricing (per million tokens)
| Model | Input | Output | Context | Notes |
|---|---|---|---|---|
| GPT-5 | $1.25 | $10.00 | 256k | Best quality, 65% fewer hallucinations |
| GPT-5 Mini | $0.25 | $2.00 | 128k | Recommended for production |
| GPT-5 Nano | $0.05 | $0.40 | 64k | Recommended for dev/test |
| GPT-4o | ~$5.00 | ~$15.00 | 128k | Legacy, still supported |
| GPT-4o-mini | ~$0.15 | ~$0.60 | 128k | Legacy budget option |
| GPT-4 | $30.00 | $60.00 | 128k | Legacy, expensive |
| whisper-1 | $0.006/min | N/A | N/A | Audio transcription |
### Recommended Model Defaults
| Purpose | Test Default | Production Default | Rationale |
|---|---|---|---|
| Speaker Detection | gpt-5-nano | gpt-5-mini | Names extraction is simple |
| Summarization | gpt-5-nano | gpt-5 | Full GPT-5 for quality summaries |
| Transcription | whisper-1 | whisper-1 | Only OpenAI option |
### Cost Estimate Per Episode
Assuming ~10,000 words per episode transcript (~13,000 tokens):
| Task | Test (Nano) | Production |
|---|---|---|
| Speaker Detection | ~$0.001 | ~$0.004 (Mini) |
| Summarization | ~$0.002 | ~$0.02 (GPT-5) |
| Transcription (60 min) | $0.36 | $0.36 |
| Total per episode | ~$0.36 | ~$0.38 |
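As a sanity check on these per-episode figures, here is a back-of-envelope calculation. The rates are the GPT-5 Nano figures from the pricing table above; the token counts are rough assumptions, not measured values.

```python
# Back-of-envelope cost check for the per-episode estimates above.
# Rates: GPT-5 Nano from the pricing table; token counts are assumptions.
INPUT_RATE = 0.05 / 1_000_000   # $ per input token (GPT-5 Nano)
OUTPUT_RATE = 0.40 / 1_000_000  # $ per output token (GPT-5 Nano)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single completion at the assumed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Speaker detection: full transcript in (~13k tokens), short name list out.
speaker_cost = request_cost(13_000, 200)  # consistent with "~$0.001" above
```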
### Cost Comparison: Local vs OpenAI (Per 100 Episodes)
| Component | Local | OpenAI (GPT-5 Mini) | OpenAI (GPT-5) |
|---|---|---|---|
| Speaker Detection | Free | $0.40 | $1.60 |
| Summarization | Free | $0.80 | $4.00 |
| Transcription | Free | $36.00 | $36.00 |
| Total | $0 | $37.20 | $41.60 |
### GPT-5 vs GPT-4 Advantages
- 95% cheaper than GPT-4 for input tokens
- 83% cheaper for output tokens
- 65% fewer hallucinations - more reliable outputs
- 50-80% more efficient - needs fewer tokens for same quality
- 94.6% math accuracy (relevant for numerical content)
## Hybrid Strategy Recommendations
**Cost-Optimized (GPT-5 Nano + Local Whisper):**

```yaml
speaker_detector_provider: openai
transcription_provider: whisper  # Free (local)
summary_provider: openai
openai_speaker_model: gpt-5-nano
openai_summary_model: gpt-5-nano
```

Cost: ~$0.003/episode (excluding transcription)

**Quality-Optimized (GPT-5 + OpenAI Whisper):**

```yaml
speaker_detector_provider: openai
transcription_provider: openai
summary_provider: openai
openai_speaker_model: gpt-5-mini
openai_summary_model: gpt-5
```

Cost: ~$0.38/episode
## Non-Goals
- Changing default behavior (transformers/local providers remain default)
- Supporting other cloud providers in this PRD (OpenAI only)
- Changing end-user workflow or CLI interface
- Modifying existing local provider implementations
- Adding new features beyond provider selection
- API key management UI or interactive configuration
## Personas
- Quality Seeker Quinn: Wants highest-quality summaries and speaker detection, willing to pay for OpenAI API
- Resource-Constrained Rachel: Has limited local compute resources, prefers cloud processing
- Hybrid User Henry: Wants local transcription (privacy) but OpenAI summarization (quality)
- Developer Devin: Needs to test with OpenAI API during development
- Privacy-First Pat: Continues using local providers exclusively (no changes to their workflow)
## User Stories
- As Quality Seeker Quinn, I can configure OpenAI API for summarization to get higher-quality summaries without changing my workflow.
- As Resource-Constrained Rachel, I can use OpenAI API for all capabilities to reduce local resource usage.
- As Hybrid User Henry, I can use local transcription but OpenAI summarization by configuring providers separately.
- As Developer Devin, I can set my OpenAI API key in environment variables and test API integration.
- As Privacy-First Pat, I can continue using the software exactly as before with no changes (local providers remain default).
- As any operator, I can see which provider was used for each capability in logs and metadata.
- As any operator, I can switch providers via configuration without code changes.
## Functional Requirements
### FR1: Provider Selection
- FR1.1: Add `speaker_detector_provider` config field with values `"ner"` (default), `"openai"` (Note: `speaker_detector_type` is deprecated but still supported for backward compatibility)
- FR1.2: Add `transcription_provider` config field with values `"whisper"` (default), `"openai"`
- FR1.3: Add `summary_provider` config field with values `"transformers"` (default), `"openai"`
- FR1.4: Provider selection is independent per capability (can mix providers)
- FR1.5: Default values maintain current behavior (local providers)
- FR1.6: Invalid provider values result in clear error messages
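A minimal sketch of FR1.4-FR1.6, using a stdlib dataclass in place of the project's actual pydantic config. The field names match FR1.1-FR1.3; the validation style is illustrative, not the real `config.py`.

```python
from dataclasses import dataclass, fields

# Allowed values per capability, mirroring FR1.1-FR1.3
_ALLOWED = {
    "speaker_detector_provider": {"ner", "openai"},
    "transcription_provider": {"whisper", "openai"},
    "summary_provider": {"transformers", "openai"},
}

@dataclass
class ProviderConfig:
    # Defaults preserve current local-only behavior (FR1.5)
    speaker_detector_provider: str = "ner"
    transcription_provider: str = "whisper"
    summary_provider: str = "transformers"

    def __post_init__(self):
        # Clear, per-field error message for invalid values (FR1.6)
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in _ALLOWED[f.name]:
                raise ValueError(
                    f"Invalid {f.name}={value!r}; "
                    f"expected one of {sorted(_ALLOWED[f.name])}"
                )
```

Mixing providers (FR1.4) is then just `ProviderConfig(summary_provider="openai")` with everything else left at its local default.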
### FR2: API Key Management
- FR2.1: Support `OPENAI_API_KEY` environment variable for API authentication
- FR2.2: Support `.env` file via `python-dotenv` for convenient per-environment configuration
- FR2.3: API key is never stored in source code, config files, or committed files
- FR2.4: `.env` file automatically loaded when `config.py` module is imported
- FR2.5: `examples/.env.example` template file provided (safe to commit) with placeholder values
- FR2.6: Missing API key when OpenAI provider is selected results in clear error message
- FR2.7: API key validation occurs at provider initialization (fail fast)
- FR2.8: Support for development and production environments via separate `.env` files
- FR2.9: Environment variable priority: config file > system env > `.env` file
- FR2.10: Future-proof design for additional API keys
### FR3: Speaker Detection with OpenAI
- FR3.1: OpenAI provider uses the configured GPT model (`openai_speaker_model`) for entity extraction
- FR3.2: Maintains same interface as NER provider (detect_hosts, detect_speakers, analyze_patterns)
- FR3.3: Returns results in same format as NER provider (no workflow changes)
- FR3.4: Handles API rate limits gracefully (retry with backoff)
- FR3.5: Logs provider type used in detection logs
### FR4: Transcription with OpenAI
- FR4.1: OpenAI provider uses Whisper API for transcription
- FR4.2: Maintains same interface as local Whisper provider (transcribe method)
- FR4.3: Returns results in same format (text, segments) as local provider
- FR4.4: Supports same configuration options (language, model selection)
- FR4.5: Handles file uploads and API responses correctly
- FR4.6: Maintains parallelism where applicable (API calls can be parallelized)
### FR5: Summarization with OpenAI
- FR5.1: OpenAI provider uses the configured GPT model (`openai_summary_model`) for summarization
- FR5.2: Maintains same interface as local provider (summarize, summarize_chunks, combine_summaries)
- FR5.3: Leverages large context window (128k tokens) - can process full transcripts without chunking
- FR5.4: Returns results in same format as local provider
- FR5.5: Simplified processing - single API call for most transcripts (no MAP/REDUCE needed)
- FR5.6: Falls back to chunking only for extremely long transcripts (rare)
- FR5.7: More cost-efficient than local provider (single API call vs multiple model inferences)
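FR5.3 and FR5.6 reduce to a single size check before dispatch. A sketch, assuming a rough 4-characters-per-token heuristic and a 128k-token window; `needs_chunking` is a hypothetical helper, not existing project code.

```python
def needs_chunking(transcript: str,
                   context_tokens: int = 128_000,
                   chars_per_token: int = 4,
                   reserve_tokens: int = 8_000) -> bool:
    """True only when the transcript (plus room reserved for the prompt
    and generated summary) would not fit in a single API call."""
    estimated_tokens = len(transcript) // chars_per_token
    return estimated_tokens > context_tokens - reserve_tokens

# A ~10,000-word episode (~65,000 chars, ~16k tokens) fits in one call,
# so the MAP/REDUCE path is only taken for extremely long transcripts.
```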
### FR6: Logging and Observability
- FR6.1: Log which provider is used for each capability at initialization
- FR6.2: Log provider type in episode processing logs
- FR6.3: Include provider information in metadata documents
- FR6.4: Log API usage (calls, errors, rate limits) for debugging
- FR6.5: No sensitive information (API keys) in logs
### FR7: Error Handling
- FR7.1: API errors result in clear error messages with provider context
- FR7.2: Rate limit errors include retry information
- FR7.3: Network errors are handled gracefully with retries
- FR7.4: Invalid API key errors are clear and actionable
- FR7.5: Fallback behavior (if any) is documented and predictable
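A sketch of the retry behavior behind FR7.2/FR7.3. `RateLimitError` here is a stand-in; real code would catch the OpenAI SDK's rate-limit and network exceptions instead.

```python
import time

class RateLimitError(Exception):
    """Stand-in for an API rate-limit error."""

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # surface the error after the final attempt (FR7.1)
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```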
### FR8: Performance and Parallelism
- FR8.1: OpenAI API calls can be parallelized for batch processing
- FR8.2: Rate limiting is respected (no overwhelming API)
- FR8.3: Parallelism works correctly with API providers (no blocking)
- FR8.4: Performance characteristics are documented (API vs local)
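FR8.1-FR8.3 together mean "parallel, but capped". One way to sketch that is a thread pool gated by a semaphore; the pool and cap sizes below are illustrative, not tuned values from the project.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4                 # illustrative cap on concurrent API calls
_gate = threading.Semaphore(MAX_IN_FLIGHT)

def _call_with_limit(fn, item):
    with _gate:                   # never more than MAX_IN_FLIGHT calls at once
        return fn(item)

def process_batch(items, fn, workers: int = 8):
    """Run fn over items in parallel while respecting the concurrency cap,
    so batch processing stays fast without overwhelming the API (FR8.2)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda it: _call_with_limit(fn, it), items))
```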
## Technical Requirements
### TR1: Architecture
- TR1.1: Build on modularization refactoring (provider abstraction already in place)
- TR1.2: Implement OpenAI providers following existing protocol interfaces
- TR1.3: No changes to workflow.py logic (uses factory pattern)
- TR1.4: Provider selection via factory pattern (already designed)
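TR1.3/TR1.4 keep `workflow.py` provider-agnostic: it asks a factory for a provider by name and never imports concrete classes. A hypothetical sketch (the registry and the two stub classes are illustrative stand-ins):

```python
class NerSpeakerDetector:          # stand-in for the local spaCy provider
    pass

class OpenAISpeakerDetector:       # stand-in for the OpenAI provider
    pass

_SPEAKER_DETECTORS = {
    "ner": NerSpeakerDetector,
    "openai": OpenAISpeakerDetector,
}

def create_speaker_detector(provider: str = "ner"):
    """Factory entry point: workflow code passes the configured provider
    name; adding a provider means registering one more dict entry."""
    try:
        return _SPEAKER_DETECTORS[provider]()
    except KeyError:
        raise ValueError(
            f"Unknown speaker_detector_provider: {provider!r}"
        ) from None
```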
### TR2: Dependencies
- TR2.1: Add `openai` Python package as optional dependency
- TR2.2: OpenAI dependency only required when OpenAI provider is used
- TR2.3: Lazy import OpenAI client (only when provider is selected)
- TR2.4: Version pinning for OpenAI package
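TR2.2/TR2.3 can be met with a deferred import that fails with an actionable message when the optional package is absent. A generic stdlib sketch; `require_optional` is a hypothetical helper, not project code.

```python
import importlib

def require_optional(module_name: str, install_hint: str):
    """Import an optional dependency on first use, with a clear error if
    it is missing (only OpenAI providers ever trigger the openai import)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is required for this provider; {install_hint}"
        ) from exc

# e.g. inside the OpenAI provider's constructor:
# openai = require_optional("openai", "install it with: pip install openai")
```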
### TR3: Configuration
- TR3.1: Add provider type fields to `config.py` (already planned in refactoring)
- TR3.2: Support environment variable and `.env` file for API key (via `python-dotenv`)
- TR3.3: Config validation ensures provider + API key consistency
- TR3.4: Backward compatible defaults (local providers)
### TR4: Testing
- TR4.1: Unit tests for OpenAI providers (with mocked API)
- TR4.2: Integration tests with real API (optional, requires API key)
- TR4.3: Tests verify same interface as local providers
- TR4.4: Tests verify error handling and rate limiting
- TR4.5: Backward compatibility tests (default behavior unchanged)
## Success Criteria
- ✅ Users can select OpenAI provider for any capability via configuration
- ✅ Default behavior (local providers) remains unchanged
- ✅ API keys are managed securely via environment variables and `.env` files (using `python-dotenv`)
- ✅ OpenAI providers implement same interfaces as local providers
- ✅ No changes required to `workflow.py` or end-user code
- ✅ Parallelism works correctly with API providers
- ✅ Error handling is clear and actionable
- ✅ Documentation explains provider selection and API key setup (including `.env` file usage)
## Out of Scope
- Other cloud providers (AWS, etc.) - future PRDs
- API key management UI or interactive setup
- Cost tracking or usage monitoring
- Provider fallback chains (use provider A, fallback to provider B)
- Real-time provider switching during execution
## Dependencies
- Prerequisite: Modularization refactoring (RFC-021) - ✅ Completed
- External: OpenAI API access and API key
- Internal: Provider abstraction interfaces (from refactoring)
## Risks & Mitigations
- Risk: API costs can be high for large batches
- Mitigation: Document costs, make local providers default, provide cost estimates
- Risk: API rate limits may slow processing
- Mitigation: Implement retry logic with backoff, respect rate limits, document limits
- Risk: API availability issues
- Mitigation: Clear error messages, retry logic, fallback documentation
- Risk: Breaking changes if provider interfaces change
- Mitigation: Protocol-based interfaces, comprehensive tests, backward compatibility
## Extensibility & Public API
### Extension Points (Public API)
The provider system is designed to be extensible by external contributors:
- **Protocol Interfaces (Public API):**
  - `SpeakerDetector` protocol - Public interface for speaker detection providers
  - `TranscriptionProvider` protocol - Public interface for transcription providers
  - `SummarizationProvider` protocol - Public interface for summarization providers
  - These protocols define the contract that all providers must implement
- **Factory Registration (Public API):**
  - Factories can be extended to support custom providers
  - Contributors can register their own provider implementations
  - Provider selection via configuration remains the same
- **Configuration Extensions (Public API):**
  - Config fields for provider selection are public
  - Custom providers can add their own config fields
  - Backward compatible with existing configurations
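A sketch of what one of these public protocols could look like. The method names come from FR3.2; the exact signatures and the `RegexSpeakerDetector` example are assumptions for illustration.

```python
from typing import List, Protocol, runtime_checkable

@runtime_checkable
class SpeakerDetector(Protocol):
    """Public contract: any object with these methods is a valid provider."""
    def detect_hosts(self, text: str) -> List[str]: ...
    def detect_speakers(self, text: str) -> List[str]: ...
    def analyze_patterns(self, text: str) -> dict: ...

class RegexSpeakerDetector:
    """External providers need no inheritance - matching methods suffice."""
    def detect_hosts(self, text: str) -> List[str]:
        return []
    def detect_speakers(self, text: str) -> List[str]:
        return [w for w in text.split() if w.istitle()]
    def analyze_patterns(self, text: str) -> dict:
        return {"speaker_count": len(self.detect_speakers(text))}
```

Because `typing.Protocol` uses structural typing, contributors' classes are type-checked against the contract without importing or subclassing anything from this project.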
### Internal Implementations
The providers shipped with the project are internal implementations of the protocols:
- `NERSpeakerDetector` - Internal implementation (spaCy-based)
- `WhisperTranscriptionProvider` - Internal implementation (local Whisper)
- `LocalSummarizationProvider` - Internal implementation (Hugging Face transformers)
- `OpenAISpeakerDetector` - Internal implementation (OpenAI API)
- `OpenAITranscriptionProvider` - Internal implementation (OpenAI Whisper API)
- `OpenAISummarizationProvider` - Internal implementation (OpenAI GPT API)
These are provided as reference implementations and defaults, but the architecture supports external implementations.
### Contributor Implementations
We expect and encourage contributors to create their own provider implementations:
- Custom Speaker Detectors: AWS Comprehend, Google Cloud NLP, custom NER models
- Custom Transcription Providers: Deepgram, AssemblyAI, Azure Speech Services
- Custom Summarization Providers: Google Gemini, custom LLMs
Requirements for Contributor Implementations:
- Must implement the protocol interface (type checking enforced)
- Must pass all protocol tests
- Must include unit tests
- Must be documented with examples
- Must handle errors gracefully
- Must respect rate limits (for API providers)
## Testing Strategy
Generic Pipeline Testing:
- Test workflow with mock providers (verify protocol compliance)
- Test factory selection logic
- Test provider switching
- Test error handling and fallbacks
- Test configuration validation
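The mock-provider approach above might look like this; `MockSummarizer` and `run_summary_step` are illustrative names, not the project's real ones.

```python
class MockSummarizer:
    """Deterministic stand-in that satisfies the summarization contract
    without loading a model or calling an API."""
    def summarize(self, text: str) -> str:
        return text[:50]

def run_summary_step(summarizer, transcript: str) -> str:
    # Stand-in for the workflow's summarization step: it relies only on the
    # protocol method, so any compliant provider can be swapped in for tests.
    return summarizer.summarize(transcript)
```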
Implementation Testing:
- Each provider must have comprehensive unit tests
- Test protocol interface compliance
- Test error scenarios
- Test edge cases
- Integration tests with real providers (optional, requires API keys in `.env` file or environment)
Testing Requirements:
- All providers must pass protocol interface tests
- All providers must pass generic pipeline tests
- Internal implementations must have 80%+ test coverage
- External implementations should follow same testing standards
## Documentation & Examples
**New Extensibility Section (`docs/EXTENSIBILITY.md`):**
- **Architecture Overview:**
  - How the provider system works
  - Protocol-based design
  - Factory pattern usage
- **Creating Custom Providers:**
  - Step-by-step guide for each provider type
  - Protocol interface documentation
  - Example implementations
- **Testing Custom Providers:**
  - How to test protocol compliance
  - Mock provider examples
  - Integration testing guide
- **Contributing Providers:**
  - Code organization
  - Naming conventions
  - Documentation requirements
  - Pull request process
- **Examples:**
  - Minimal provider implementation
  - Full-featured provider implementation
  - Provider with custom configuration
  - Provider with error handling
## Future Considerations
- Support for other providers (AWS services)
- Provider fallback chains (try OpenAI, fallback to local)
- Cost tracking and usage monitoring
- Provider performance comparison tools
- Hybrid processing (use API for some episodes, local for others)
- Provider plugin system (external packages can register providers)