PRD-014: Ollama Provider Integration¶
- Status: ✅ Implemented (v2.5.0)
- Revision: 2
- Date: 2026-02-04
- Related RFCs: RFC-037 (Revised)
- Related PRDs: PRD-006 (OpenAI), PRD-013 (Grok)
Summary¶
Add Ollama as an optional provider for speaker detection and summarization capabilities. Ollama is unique in being a fully local/offline solution that runs open-source LLMs on your own hardware. Unlike cloud providers, Ollama requires NO API keys, has NO rate limits, and incurs NO per-token costs. Like OpenAI, Ollama uses a unified provider pattern where a single OllamaProvider class implements both capabilities. Like Anthropic, DeepSeek, and Grok, Ollama does NOT support audio transcription.
Background & Context¶
Ollama offers several compelling advantages:
- Fully Local/Offline: No internet required, complete privacy
- Zero API Costs: After hardware, unlimited free usage
- No Rate Limits: Process as fast as your hardware allows
- Open Source Models: Llama 3.3, Mistral, Gemma, Phi, Qwen, etc.
- OpenAI-Compatible API: Same format, uses OpenAI SDK
- Model Flexibility: Easily switch between models
This PRD addresses adding Ollama as a privacy-first, cost-free provider option.
Goals¶
- Add Ollama as provider option for speaker detection and summarization
- Maintain 100% backward compatibility
- Follow unified provider pattern (like OpenAI) - single class implementing both protocols
- Support local model selection (user installs models via Ollama CLI)
- Support both Config-based and experiment-based factory modes from the start
- Handle capability gaps gracefully (no transcription)
- Use OpenAI SDK with custom base_url (no new SDK dependency)
- Use environment-based model defaults (test vs production)
- Validate Ollama server is running and models are available
Ollama Model Selection and Cost Analysis¶
Configuration Fields¶
Add to config.py when implementing Ollama providers (following OpenAI pattern):
# Ollama API Configuration (NO API KEY NEEDED - local service)
ollama_api_base: Optional[str] = Field(
default=None,
alias="ollama_api_base",
description="Ollama API base URL (default: http://localhost:11434/v1, for E2E testing)"
)
# Ollama Model Selection (environment-based defaults, like OpenAI)
ollama_speaker_model: str = Field(
default_factory=_get_default_ollama_speaker_model,
alias="ollama_speaker_model",
description="Ollama model for speaker detection (default: environment-based)"
)
ollama_summary_model: str = Field(
default_factory=_get_default_ollama_summary_model,
alias="ollama_summary_model",
description="Ollama model for summarization (default: environment-based)"
)
# Shared settings (like OpenAI)
ollama_temperature: float = Field(
default=0.3,
alias="ollama_temperature",
description="Temperature for Ollama generation (0.0-2.0, lower = more deterministic)"
)
ollama_max_tokens: Optional[int] = Field(
default=None,
alias="ollama_max_tokens",
description="Max tokens for Ollama generation (None = model default)"
)
# Ollama Connection Settings
ollama_timeout: int = Field(
default=120,
alias="ollama_timeout",
description="Timeout in seconds for Ollama API calls (local inference can be slow)"
)
# Ollama Prompt Configuration (following OpenAI pattern)
ollama_speaker_system_prompt: Optional[str] = Field(
default=None,
alias="ollama_speaker_system_prompt",
description="Ollama system prompt for speaker detection (default: ollama/ner/system_ner_v1)"
)
ollama_speaker_user_prompt: str = Field(
default="ollama/ner/guest_host_v1",
alias="ollama_speaker_user_prompt",
description="Ollama user prompt for speaker detection"
)
ollama_summary_system_prompt: Optional[str] = Field(
default=None,
alias="ollama_summary_system_prompt",
description="Ollama system prompt for summarization (default: ollama/summarization/system_v1)"
)
ollama_summary_user_prompt: str = Field(
default="ollama/summarization/long_v1",
alias="ollama_summary_user_prompt",
description="Ollama user prompt for summarization"
)
Environment-based defaults:
- Test environment:
llama3.2:latest(smaller, faster for testing) - Production environment:
llama3.3:latest(best quality, 128k context)
Model Options and Cost Analysis¶
| Model | Size | RAM Required | Context Window | Quality | Speed |
|---|---|---|---|---|---|
| llama3.3:70b | 40GB | 48GB+ | 128k | Best | Slow |
| llama3.2:latest | 2GB | 4GB+ | 128k | Good | Fast |
| mistral:latest | 4GB | 8GB+ | 32k | Good | Fast |
| gemma2:9b | 5GB | 8GB+ | 8k | Good | Fast |
| phi3:medium | 8GB | 16GB+ | 128k | Good | Medium |
| qwen2.5:14b | 9GB | 16GB+ | 128k | Excellent | Medium |
Cost Comparison: Ollama vs Cloud (Per 1000 Episodes)¶
| Component | OpenAI (gpt-4o-mini) | DeepSeek | Grok | Ollama |
|---|---|---|---|---|
| Transcription | $6.00 | ❌ N/A | ❌ N/A | ❌ N/A |
| Speaker Detection | $1.40 | $0.04 | $0.06 | $0.00 |
| Summarization | $4.10 | $0.12 | $0.20 | $0.00 |
| Total | $11.50 | $0.16 | $0.26 | $0.00 |
One-Time Hardware Investment¶
For optimal Ollama performance:
| Hardware | Cost | Performance |
|---|---|---|
| Mac Mini M4 (16GB) | ~$600 | Good for small models |
| Mac Studio M2 Max (64GB) | ~$3,000 | Excellent for 70B models |
| PC with RTX 4090 (24GB VRAM) | ~$2,500 | Excellent speed |
Break-even Analysis: At 10,000 episodes/month with OpenAI, hardware pays for itself in ~3 months.
Processing Speed Comparison¶
| Provider | 1000-word Summary Time |
|---|---|
| OpenAI GPT-4o-mini | ~5 seconds |
| Grok | ~1.0 seconds |
| Ollama (Llama 3.3 on M4 Pro) | ~30 seconds |
| Ollama (Llama 3.2 on M4 Pro) | ~3 seconds |
Note: Ollama is slower than cloud providers but costs nothing per request.
Non-Goals¶
- Transcription support (Ollama hosts LLMs only)
- Automatic model installation (user manages via
ollama pull) - GPU optimization (handled by Ollama itself)
- Changing default behavior
Personas¶
- Privacy-First Pat: Wants all processing on local hardware
- Budget-Conscious Bob: Wants zero ongoing API costs
- Offline Otto: Needs processing without internet connection
- Self-Hosted Steve: Prefers local infrastructure over cloud
- Enterprise Emily: Needs data to never leave corporate network
User Stories¶
- As Privacy-First Pat, I can process podcasts without sending data to cloud APIs.
- As Budget-Conscious Bob, I can process unlimited episodes with no API costs.
- As Offline Otto, I can process podcasts on an airplane or remote location.
- As Self-Hosted Steve, I can run everything on my own hardware.
- As Enterprise Emily, I can ensure sensitive podcast data never leaves our network.
Functional Requirements¶
FR1: Provider Selection¶
- FR1.1: Add
"ollama"as valid value forspeaker_detector_provider - FR1.2: Add
"ollama"as valid value forsummary_provider - FR1.3: Attempting transcription with Ollama results in clear error
- FR1.4: Default values maintain current behavior
- FR1.5: Support both Config-based and experiment-based factory modes from the start
FR2: No API Key Required¶
- FR2.1: Ollama does not require API key (local service)
- FR2.2: Clear error if Ollama server is not running (with helpful instructions)
- FR2.3: Support custom
ollama_api_basefor remote Ollama servers - FR2.4: Support
OLLAMA_API_BASEenvironment variable (likeOPENAI_API_BASE)
FR3: Model Availability Validation¶
- FR3.1: Validate model exists in local Ollama installation
- FR3.2: Provide helpful error with
ollama pullcommand if model missing - FR3.3: List available models in error message
FR4: Speaker Detection with Ollama¶
- FR4.1: Ollama provider uses local LLMs for entity extraction
- FR4.2: Maintains same interface as other providers
- FR4.3: Uses Ollama-specific prompt templates
FR5: Summarization with Ollama¶
- FR5.1: Ollama provider uses local LLMs for summarization
- FR5.2: Leverages model's context window
- FR5.3: Maintains same interface
FR6: OpenAI-Compatible API¶
- FR6.1: Use OpenAI SDK with
base_url=http://localhost:11434/v1 - FR6.2: No separate SDK dependency required
Technical Requirements¶
TR1: Architecture¶
- TR1.1: Follow unified provider pattern (like OpenAI) - single class implementing both protocols
- TR1.2: Create
providers/ollama/ollama_provider.pywith unifiedOllamaProviderclass - TR1.3:
OllamaProviderimplementsSpeakerDetectorandSummarizationProviderprotocols - TR1.4: Update factories to include Ollama option with support for both Config-based and experiment-based modes
- TR1.5: Create
prompts/ollama/directory with provider-specific prompt templates - TR1.6: Use OpenAI SDK with custom base_url (no new SDK dependency)
- TR1.7: Follow OpenAI provider architecture exactly for consistency
TR2: Dependencies¶
- TR2.1: Reuse existing
openaipackage (already installed for OpenAI provider) - TR2.2: Add
httpxpackage for Ollama health checks (connection validation) - TR2.3: External dependency: Ollama must be installed and running
- TR2.4: ImportError with helpful message if
openaiorhttpxpackages not installed
TR3: Connection Handling¶
- TR3.1: Graceful error if Ollama server not running
- TR3.2: Support for custom port/host configuration
- TR3.3: Timeout configuration for slow models
Success Criteria¶
- ✅ Users can select Ollama provider for speaker detection and summarization via unified provider
- ✅ Clear error when Ollama server is not running (with helpful instructions)
- ✅ Clear error when model is not installed (with
ollama pullcommand) - ✅ Works completely offline
- ✅ Zero API costs
- ✅ Environment-based model defaults (test vs production)
- ✅ Both Config-based and experiment-based factory modes supported
- ✅ E2E tests pass (with mock or real Ollama)
- ✅ Follows OpenAI provider pattern exactly for consistency
Provider Capability Matrix (Final)¶
| Capability | Local | OpenAI | Anthropic | Mistral | DeepSeek | Gemini | Grok | Ollama |
|---|---|---|---|---|---|---|---|---|
| Transcription | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ |
| Speaker Detection | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Summarization | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Offline | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Zero Cost | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
Prerequisites¶
Before using Ollama provider:
- Install Ollama:
brew install ollama(macOS) or see https://ollama.ai - Start Ollama:
ollama serve - Pull model:
ollama pull llama3.3
Future Considerations¶
- Whisper support if Ollama adds audio models
- GPU acceleration optimization
- Model recommendation based on available hardware
- Batch processing optimization for local inference
- Support for Ollama running on remote server