
PRD-006: OpenAI Provider Integration

  • Status: ✅ Implemented (v2.4.0)
  • Related RFCs: RFC-013, RFC-017, RFC-021, RFC-022, RFC-029

Summary

Add OpenAI API as an optional provider for speaker detection (NER), transcription, and summarization capabilities, enabling users to choose between local on-device processing and cloud-based API services. This builds on the modularization refactoring (RFC-021) to provide seamless provider switching without changing end-user experience or workflow behavior.

Background & Context

Currently, the podcast scraper uses on-device AI/ML models:

  • Speaker Detection: spaCy NER models running locally
  • Transcription: OpenAI Whisper models running locally
  • Summarization: Hugging Face transformer models (BART, LED) running locally

While on-device processing provides privacy and predictable costs, some users may prefer:

  • Higher accuracy: OpenAI's models may extract speaker names more reliably than local spaCy NER
  • Reduced resource usage: Offload processing to cloud, reducing on-device CPU/GPU/memory requirements
  • Faster processing: API-based transcription can be faster than local Whisper for some users
  • Better quality: OpenAI's GPT models may produce higher-quality summaries than local transformers

This PRD addresses the need to add OpenAI as a provider option while maintaining backward compatibility and zero changes to end-user experience when using default (local) providers.

Goals

  • Add OpenAI API as provider option for speaker detection, transcription, and summarization
  • Maintain 100% backward compatibility - no changes to existing behavior with default (local) providers
  • Provide secure API key management via environment variables and .env files (never in source code)
  • Use python-dotenv for convenient environment variable management per environment
  • Enable per-capability provider selection (can mix local and OpenAI providers)
  • Maintain existing parallelism and performance characteristics where applicable
  • Support both development and production environments for API key storage

OpenAI Model Selection and Cost Analysis

Configuration Fields

Add to config.py when implementing OpenAI providers:

```python
# Fields on the config model in config.py; imports shown for context
# (assuming Pydantic, which matches the Field(...) signature used here).
from typing import Optional

from pydantic import Field

# OpenAI Model Selection

openai_speaker_model: str = Field(
    default="gpt-4o-mini",
    description="OpenAI model for speaker detection (entity extraction)"
)

openai_summary_model: str = Field(
    default="gpt-4o-mini",
    description="OpenAI model for summarization"
)

openai_transcription_model: str = Field(
    default="whisper-1",
    description="OpenAI Whisper API model version"
)

# OpenAI API Configuration

openai_api_key: Optional[str] = Field(
    default=None,
    description="OpenAI API key (prefer OPENAI_API_KEY environment variable or .env file)"
)

openai_temperature: float = Field(
    default=0.3,
    ge=0.0,
    le=2.0,
    description="Temperature for OpenAI generation (0.0-2.0, lower = more deterministic)"
)

openai_max_tokens: Optional[int] = Field(
    default=None,
    description="Max tokens for OpenAI generation (None = model default)"
)
```
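
For reference, a minimal sketch of how FR2.4's automatic .env loading could look at the top of config.py, assuming python-dotenv (placement and surrounding code are illustrative):

```python
# Sketch (illustrative): top of config.py
import os

from dotenv import load_dotenv

# Loads a .env file if present. By default load_dotenv() does NOT override
# variables already set in the system environment, which preserves the
# "system env > .env" part of the FR2.9 priority order.
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")  # None if unset; validated at provider init (FR2.7)
```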

## Legacy Model Pricing Reference

| Model | Input Cost | Output Cost | Context Window | Best For |
| ----- | ---------- | ----------- | -------------- | -------- |
| **gpt-4o** | $2.50 / 1M tokens | $10.00 / 1M tokens | 128k tokens | Highest quality |
| **gpt-4o-mini** | $0.15 / 1M tokens | $0.60 / 1M tokens | 128k tokens | **Recommended** (balanced) |
| **gpt-4-turbo** | $10.00 / 1M tokens | $30.00 / 1M tokens | 128k tokens | Maximum quality |
| **gpt-3.5-turbo** | $0.50 / 1M tokens | $1.50 / 1M tokens | 16k tokens | Budget option |
| **whisper-1** | $0.006 / minute | N/A | N/A | Audio transcription |

**Note:** Prices subject to change. Check [OpenAI Pricing](https://openai.com/api/pricing/) for current rates.

## Legacy Cost Comparison: Local vs OpenAI (Per 100 Episodes)

| Component | Local (Transformers) | OpenAI (gpt-4o-mini) | Difference |
| --------- | ------------------- | -------------------- | ---------- |
| **Speaker Detection** | Free (spaCy NER) | $0.14 | +$0.14 |
| **Transcription** | Free (local Whisper) | $36.00 | +$36.00 |
| **Summarization** | Free (local BART/LED) | $0.41 | +$0.41 |
| **Total API Costs** | $0 | **$36.55** | +$36.55 |

**Infrastructure Costs (Local):**

- Hardware: ~$2,000-$4,000 (GPU, high RAM)
- Electricity: ~$0.50-$1.00/hour (GPU usage)
- Maintenance: Developer time for model updates

### Recommended Hybrid Strategies

**Cost-Optimized Hybrid ($0.55/100 episodes):**

```yaml
speaker_detector_type: openai      # $0.14/100 (minimal cost)
transcription_provider: whisper    # Free (local)
summary_provider: openai           # $0.41/100 (high value)
```

**Privacy-First Hybrid:**

```yaml
speaker_detector_provider: ner     # Free (local, private)
transcription_provider: whisper    # Free (local, private)
summary_provider: openai           # $0.41/100 (convenience)
```

Default generation settings (matching the config field defaults above):

  • Temperature: 0.3 (deterministic, factual)
  • Max Tokens: None (model default)

OpenAI Model Pricing

⚠️ Pricing Information: The pricing table below is provided for reference only and may be outdated. Always check OpenAI Pricing for current rates before making cost decisions.

Model Pricing (per million tokens)

| Model | Input | Output | Context | Notes |
| ----- | ----- | ------ | ------- | ----- |
| **GPT-5** | $1.25 | $10.00 | 256k | Best quality, 65% fewer hallucinations |
| **GPT-5 Mini** | $0.25 | $2.00 | 128k | Recommended for production |
| **GPT-5 Nano** | $0.05 | $0.40 | 64k | Recommended for dev/test |
| **GPT-4o** | ~$5.00 | ~$15.00 | 128k | Legacy, still supported |
| **GPT-4o-mini** | ~$0.15 | ~$0.60 | 128k | Legacy budget option |
| **GPT-4** | $30.00 | $60.00 | 128k | Legacy, expensive |
| **whisper-1** | $0.006/min | N/A | N/A | Audio transcription |

Recommended Model Defaults

| Purpose | Test Default | Production Default | Rationale |
| ------- | ------------ | ------------------ | --------- |
| Speaker Detection | gpt-5-nano | gpt-5-mini | Name extraction is simple |
| Summarization | gpt-5-nano | gpt-5 | Full GPT-5 for quality summaries |
| Transcription | whisper-1 | whisper-1 | Only OpenAI option |

Cost Estimate Per Episode

Assuming ~10,000 words per episode transcript (~13,000 tokens):

| Task | Test (Nano) | Production |
| ---- | ----------- | ---------- |
| Speaker Detection | ~$0.001 | ~$0.004 (Mini) |
| Summarization | ~$0.002 | ~$0.02 (GPT-5) |
| Transcription (60 min) | $0.36 | $0.36 |
| **Total per episode** | **~$0.36** | **~$0.38** |
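
These figures follow directly from the pricing table; the sketch below reproduces the arithmetic for the test defaults (output token counts are assumptions for illustration):

```python
# Worked cost estimate for the test defaults (gpt-5-nano + whisper-1),
# using the pricing table above and ~13,000 input tokens per transcript.
INPUT_TOKENS = 13_000
NANO_INPUT, NANO_OUTPUT = 0.05, 0.40   # $ per 1M tokens
WHISPER_PER_MIN = 0.006                # $ per audio minute

def chat_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

speaker = chat_cost(INPUT_TOKENS, 200, NANO_INPUT, NANO_OUTPUT)    # ~$0.001
summary = chat_cost(INPUT_TOKENS, 1_000, NANO_INPUT, NANO_OUTPUT)  # ~$0.001
transcription = 60 * WHISPER_PER_MIN                               # $0.36
print(f"total: ~${speaker + summary + transcription:.2f}")         # ~$0.36
```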

Cost Comparison: Local vs OpenAI (Per 100 Episodes)

| Component | Local | OpenAI (GPT-5 Mini) | OpenAI (GPT-5) |
| --------- | ----- | ------------------- | -------------- |
| Speaker Detection | Free | $0.40 | $1.60 |
| Summarization | Free | $0.80 | $4.00 |
| Transcription | Free | $36.00 | $36.00 |
| **Total** | **$0** | **$37.20** | **$41.60** |

GPT-5 vs GPT-4 Advantages

  1. 95% cheaper than GPT-4 for input tokens
  2. 83% cheaper for output tokens
  3. 65% fewer hallucinations - more reliable outputs
  4. 50-80% more efficient - needs fewer tokens for same quality
  5. 94.6% math accuracy (relevant for numerical content)

Hybrid Strategy Recommendations

Cost-Optimized (GPT-5 Nano + Local Whisper):

```yaml
speaker_detector_provider: openai
transcription_provider: whisper    # Free (local)
summary_provider: openai
openai_speaker_model: gpt-5-nano
openai_summary_model: gpt-5-nano
```

Cost: ~$0.003/episode (excluding transcription)

Quality-Optimized (GPT-5 + OpenAI Whisper):

```yaml
speaker_detector_provider: openai
transcription_provider: openai
summary_provider: openai
openai_speaker_model: gpt-5-mini
openai_summary_model: gpt-5
```

Cost: ~$0.38/episode

Non-Goals

  • Changing default behavior (transformers/local providers remain default)
  • Supporting other cloud providers in this PRD (OpenAI only)
  • Changing end-user workflow or CLI interface
  • Modifying existing local provider implementations
  • Adding new features beyond provider selection
  • API key management UI or interactive configuration

Personas

  • Quality Seeker Quinn: Wants highest-quality summaries and speaker detection, willing to pay for OpenAI API
  • Resource-Constrained Rachel: Has limited local compute resources, prefers cloud processing
  • Hybrid User Henry: Wants local transcription (privacy) but OpenAI summarization (quality)
  • Developer Devin: Needs to test with OpenAI API during development
  • Privacy-First Pat: Continues using local providers exclusively (no changes to their workflow)

User Stories

  • As Quality Seeker Quinn, I can configure OpenAI API for summarization to get higher-quality summaries without changing my workflow.
  • As Resource-Constrained Rachel, I can use OpenAI API for all capabilities to reduce local resource usage.
  • As Hybrid User Henry, I can use local transcription but OpenAI summarization by configuring providers separately.
  • As Developer Devin, I can set my OpenAI API key in environment variables and test API integration.
  • As Privacy-First Pat, I can continue using the software exactly as before with no changes (local providers remain default).
  • As any operator, I can see which provider was used for each capability in logs and metadata.
  • As any operator, I can switch providers via configuration without code changes.

Functional Requirements

FR1: Provider Selection

  • FR1.1: Add speaker_detector_provider config field with values "ner" (default), "openai" (Note: speaker_detector_type is deprecated but still supported for backward compatibility)
  • FR1.2: Add transcription_provider config field with values "whisper" (default), "openai"
  • FR1.3: Add summary_provider config field with values "transformers" (default), "openai"
  • FR1.4: Provider selection is independent per capability (can mix providers)
  • FR1.5: Default values maintain current behavior (local providers)
  • FR1.6: Invalid provider values result in clear error messages
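
A minimal sketch of FR1.6-style provider selection via a factory mapping (the class names follow this PRD; the placeholder constructors and the factory signature are illustrative):

```python
# Hypothetical sketch of provider selection (FR1.1 / FR1.6).
class NERSpeakerDetector: ...      # placeholder for the local implementation
class OpenAISpeakerDetector: ...   # placeholder for the OpenAI implementation

_SPEAKER_DETECTORS = {
    "ner": NERSpeakerDetector,
    "openai": OpenAISpeakerDetector,
}

def create_speaker_detector(provider: str):
    try:
        return _SPEAKER_DETECTORS[provider]()
    except KeyError:
        valid = ", ".join(sorted(_SPEAKER_DETECTORS))
        raise ValueError(
            f"Unknown speaker_detector_provider {provider!r}; valid values: {valid}"
        ) from None
```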

FR2: API Key Management

  • FR2.1: Support OPENAI_API_KEY environment variable for API authentication
  • FR2.2: Support .env file via python-dotenv for convenient per-environment configuration
  • FR2.3: API key is never stored in source code, config files, or committed files
  • FR2.4: .env file automatically loaded when config.py module is imported
  • FR2.5: examples/.env.example template file provided (safe to commit) with placeholder values
  • FR2.6: Missing API key when OpenAI provider is selected results in clear error message
  • FR2.7: API key validation occurs at provider initialization (fail fast)
  • FR2.8: Support for development and production environments via separate .env files
  • FR2.9: Environment variable priority: config file > system env > .env file
  • FR2.10: Future-proof design for additional API keys
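
A sketch of the fail-fast key resolution described in FR2.7 and FR2.9 (the function name and error wording are illustrative):

```python
import os

def resolve_openai_api_key(config_value: str | None = None) -> str:
    # FR2.9 priority: explicit config value, then system env (which the
    # earlier load_dotenv() call has already merged with .env values).
    key = config_value or os.getenv("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it or add it to a .env file "
            "(see examples/.env.example) before selecting an OpenAI provider."
        )
    return key
```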

FR3: Speaker Detection with OpenAI

  • FR3.1: OpenAI provider uses the configured chat model (openai_speaker_model, default gpt-4o-mini) for entity extraction
  • FR3.2: Maintains same interface as NER provider (detect_hosts, detect_speakers, analyze_patterns)
  • FR3.3: Returns results in same format as NER provider (no workflow changes)
  • FR3.4: Handles API rate limits gracefully (retry with backoff)
  • FR3.5: Logs provider type used in detection logs
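
FR3.4's retry-with-backoff could be implemented along these lines (RateLimitError is the exception the openai package raises on HTTP 429; retry counts and delays are illustrative):

```python
import random
import time

from openai import RateLimitError

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    # Exponential backoff with jitter on rate-limit errors (FR3.4).
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```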

FR4: Transcription with OpenAI

  • FR4.1: OpenAI provider uses Whisper API for transcription
  • FR4.2: Maintains same interface as local Whisper provider (transcribe method)
  • FR4.3: Returns results in same format (text, segments) as local provider
  • FR4.4: Supports same configuration options (language, model selection)
  • FR4.5: Handles file uploads and API responses correctly
  • FR4.6: Maintains parallelism where applicable (API calls can be parallelized)
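
A sketch of the Whisper API call, assuming the openai v1 Python SDK; the verbose_json response format carries per-segment data, which maps onto the local provider's (text, segments) result shape (FR4.3):

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def transcribe(audio_path: str, language: str | None = None) -> dict:
    kwargs = {"model": "whisper-1", "response_format": "verbose_json"}
    if language:  # FR4.4: same language option as the local provider
        kwargs["language"] = language
    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(file=audio_file, **kwargs)
    return {"text": response.text, "segments": response.segments}
```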

FR5: Summarization with OpenAI

  • FR5.1: OpenAI provider uses the configured chat model (openai_summary_model, default gpt-4o-mini) for summarization
  • FR5.2: Maintains same interface as local provider (summarize, summarize_chunks, combine_summaries)
  • FR5.3: Leverages large context window (128k tokens) - can process full transcripts without chunking
  • FR5.4: Returns results in same format as local provider
  • FR5.5: Simplified processing - single API call for most transcripts (no MAP/REDUCE needed)
  • FR5.6: Falls back to chunking only for extremely long transcripts (rare)
  • FR5.7: More cost-efficient than local provider (single API call vs multiple model inferences)
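
A single-call summarization sketch, assuming the openai v1 Python SDK (prompt wording and defaults are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def summarize(transcript: str, model: str = "gpt-4o-mini",
              temperature: float = 0.3) -> str:
    # Single API call: the 128k context window fits most full transcripts,
    # so no MAP/REDUCE chunking is needed (FR5.3, FR5.5).
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system",
             "content": "Summarize this podcast transcript concisely."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```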

FR6: Logging and Observability

  • FR6.1: Log which provider is used for each capability at initialization
  • FR6.2: Log provider type in episode processing logs
  • FR6.3: Include provider information in metadata documents
  • FR6.4: Log API usage (calls, errors, rate limits) for debugging
  • FR6.5: No sensitive information (API keys) in logs

FR7: Error Handling

  • FR7.1: API errors result in clear error messages with provider context
  • FR7.2: Rate limit errors include retry information
  • FR7.3: Network errors are handled gracefully with retries
  • FR7.4: Invalid API key errors are clear and actionable
  • FR7.5: Fallback behavior (if any) is documented and predictable

FR8: Performance and Parallelism

  • FR8.1: OpenAI API calls can be parallelized for batch processing
  • FR8.2: Rate limiting is respected (no overwhelming API)
  • FR8.3: Parallelism works correctly with API providers (no blocking)
  • FR8.4: Performance characteristics are documented (API vs local)
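
A bounded worker pool is one way to satisfy FR8.1 and FR8.2 together; a minimal sketch (the worker count is illustrative and would be tuned to the account's rate limits):

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(items, call_api, max_workers: int = 4):
    # Parallel API calls (FR8.1) with a bounded pool so a batch never
    # overwhelms the provider's rate limits (FR8.2).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_api, items))
```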

Technical Requirements

TR1: Architecture

  • TR1.1: Build on modularization refactoring (provider abstraction already in place)
  • TR1.2: Implement OpenAI providers following existing protocol interfaces
  • TR1.3: No changes to workflow.py logic (uses factory pattern)
  • TR1.4: Provider selection via factory pattern (already designed)

TR2: Dependencies

  • TR2.1: Add openai Python package as optional dependency
  • TR2.2: OpenAI dependency only required when OpenAI provider is used
  • TR2.3: Lazy import OpenAI client (only when provider is selected)
  • TR2.4: Version pinning for OpenAI package
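
TR2.3's lazy import could look like this sketch (function name and error message are illustrative):

```python
def _get_openai_client():
    # TR2.3: import the openai package only when an OpenAI provider is
    # actually selected, so local-only installs never need it (TR2.2).
    try:
        from openai import OpenAI
    except ImportError as exc:
        raise RuntimeError(
            "OpenAI provider selected but the 'openai' package is not "
            "installed; install the optional dependency to use it."
        ) from exc
    return OpenAI()
```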

TR3: Configuration

  • TR3.1: Add provider type fields to config.py (already planned in refactoring)
  • TR3.2: Support environment variable and .env file for API key (via python-dotenv)
  • TR3.3: Config validation ensures provider + API key consistency
  • TR3.4: Backward compatible defaults (local providers)

TR4: Testing

  • TR4.1: Unit tests for OpenAI providers (with mocked API)
  • TR4.2: Integration tests with real API (optional, requires API key)
  • TR4.3: Tests verify same interface as local providers
  • TR4.4: Tests verify error handling and rate limiting
  • TR4.5: Backward compatibility tests (default behavior unchanged)
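
A sketch of a TR4.1-style unit test, assuming a hypothetical OpenAISummarizationProvider that accepts an injected client (the import path and constructor are assumptions):

```python
from unittest.mock import MagicMock

# Hypothetical import path and constructor for illustration only.
from podcast_scraper.providers import OpenAISummarizationProvider

def test_summarize_without_real_api():
    client = MagicMock()  # stands in for openai.OpenAI()
    client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="A short summary."))
    ]
    provider = OpenAISummarizationProvider(client=client, model="gpt-4o-mini")
    assert provider.summarize("transcript text") == "A short summary."
    client.chat.completions.create.assert_called_once()
```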

Success Criteria

  • ✅ Users can select OpenAI provider for any capability via configuration
  • ✅ Default behavior (local providers) remains unchanged
  • ✅ API keys are managed securely via environment variables and .env files (using python-dotenv)
  • ✅ OpenAI providers implement same interfaces as local providers
  • ✅ No changes required to workflow.py or end-user code
  • ✅ Parallelism works correctly with API providers
  • ✅ Error handling is clear and actionable
  • ✅ Documentation explains provider selection and API key setup (including .env file usage)

Out of Scope

  • Other cloud providers (AWS, etc.) - future PRDs
  • API key management UI or interactive setup
  • Cost tracking or usage monitoring
  • Provider fallback chains (use provider A, fallback to provider B)
  • Real-time provider switching during execution

Dependencies

  • Prerequisite: Modularization refactoring must be completed (RFC-021) (✅ Completed)
  • External: OpenAI API access and API key
  • Internal: Provider abstraction interfaces (from refactoring)

Risks & Mitigations

  • Risk: API costs can be high for large batches
  • Mitigation: Document costs, make local providers default, provide cost estimates
  • Risk: API rate limits may slow processing
  • Mitigation: Implement retry logic with backoff, respect rate limits, document limits
  • Risk: API availability issues
  • Mitigation: Clear error messages, retry logic, fallback documentation
  • Risk: Breaking changes if provider interfaces change
  • Mitigation: Protocol-based interfaces, comprehensive tests, backward compatibility

Extensibility & Public API

Extension Points (Public API)

The provider system is designed to be extensible by external contributors:

  1. Protocol Interfaces (Public API) (see the sketch after this list):
     • SpeakerDetector protocol - Public interface for speaker detection providers
     • TranscriptionProvider protocol - Public interface for transcription providers
     • SummarizationProvider protocol - Public interface for summarization providers
     • These protocols define the contract that all providers must implement
  2. Factory Registration (Public API):
     • Factories can be extended to support custom providers
     • Contributors can register their own provider implementations
     • Provider selection via configuration remains the same
  3. Configuration Extensions (Public API):
     • Config fields for provider selection are public
     • Custom providers can add their own config fields
     • Backward compatible with existing configurations
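
As a sketch of the first extension point, here is what the SummarizationProvider protocol could look like, using the method names from FR5.2 (typing details are illustrative):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SummarizationProvider(Protocol):
    """Contract every summarization provider must implement."""

    def summarize(self, text: str) -> str: ...
    def summarize_chunks(self, chunks: list[str]) -> list[str]: ...
    def combine_summaries(self, summaries: list[str]) -> str: ...
```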

Internal Implementations

The project ships internal implementations of these protocols:

  • NERSpeakerDetector - Internal implementation (spaCy-based)
  • WhisperTranscriptionProvider - Internal implementation (local Whisper)
  • LocalSummarizationProvider - Internal implementation (Hugging Face transformers)
  • OpenAISpeakerDetector - Internal implementation (OpenAI API)
  • OpenAITranscriptionProvider - Internal implementation (OpenAI Whisper API)
  • OpenAISummarizationProvider - Internal implementation (OpenAI GPT API)

These are provided as reference implementations and defaults, but the architecture supports external implementations.

Contributor Implementations

We expect and encourage contributors to create their own provider implementations:

  • Custom Speaker Detectors: AWS Comprehend, Google Cloud NLP, custom NER models
  • Custom Transcription Providers: Deepgram, AssemblyAI, Azure Speech Services
  • Custom Summarization Providers: Google Gemini, custom LLMs

Requirements for Contributor Implementations:

  1. Must implement the protocol interface (type checking enforced)
  2. Must pass all protocol tests
  3. Must include unit tests
  4. Must be documented with examples
  5. Must handle errors gracefully
  6. Must respect rate limits (for API providers)
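
For illustration, a toy provider that satisfies the SummarizationProvider protocol sketched earlier; the logic is deliberately trivial, and a real contributor provider would call its backing service and meet the requirements above:

```python
class FirstSentencesSummarizationProvider:
    """Toy contributor provider: 'summarizes' by keeping leading sentences."""

    def summarize(self, text: str) -> str:
        return ". ".join(text.split(". ")[:3])

    def summarize_chunks(self, chunks: list[str]) -> list[str]:
        return [self.summarize(chunk) for chunk in chunks]

    def combine_summaries(self, summaries: list[str]) -> str:
        return self.summarize(" ".join(summaries))
```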

Testing Strategy

Generic Pipeline Testing:

  • Test workflow with mock providers (verify protocol compliance)
  • Test factory selection logic
  • Test provider switching
  • Test error handling and fallbacks
  • Test configuration validation

Implementation Testing:

  • Each provider must have comprehensive unit tests
  • Test protocol interface compliance
  • Test error scenarios
  • Test edge cases
  • Integration tests with real providers (optional, requires API keys in .env file or environment)

Testing Requirements:

  • All providers must pass protocol interface tests
  • All providers must pass generic pipeline tests
  • Internal implementations must have 80%+ test coverage
  • External implementations should follow same testing standards

Documentation & Examples

New Extensibility Section (docs/EXTENSIBILITY.md):

  1. Architecture Overview:
     • How the provider system works
     • Protocol-based design
     • Factory pattern usage
  2. Creating Custom Providers:
     • Step-by-step guide for each provider type
     • Protocol interface documentation
     • Example implementations
  3. Testing Custom Providers:
     • How to test protocol compliance
     • Mock provider examples
     • Integration testing guide
  4. Contributing Providers:
     • Code organization
     • Naming conventions
     • Documentation requirements
     • Pull request process
  5. Examples:
     • Minimal provider implementation
     • Full-featured provider implementation
     • Provider with custom configuration
     • Provider with error handling

Future Considerations

  • Support for other providers (AWS services)
  • Provider fallback chains (try OpenAI, fallback to local)
  • Cost tracking and usage monitoring
  • Provider performance comparison tools
  • Hybrid processing (use API for some episodes, local for others)
  • Provider plugin system (external packages can register providers)