PRD-014: Ollama Provider Integration¶

Status: ✅ Implemented (v2.5.0)
Revision: 2
Date: 2026-02-04
Related RFCs: RFC-037 (Revised)
Related PRDs: PRD-006 (OpenAI), PRD-013 (Grok)

Summary¶

Add Ollama as an optional provider for speaker detection and summarization capabilities. Ollama is unique in being a fully local/offline solution that runs open-source LLMs on your own hardware. Unlike cloud providers, Ollama requires NO API keys, has NO rate limits, and incurs NO per-token costs. Like OpenAI, Ollama uses a unified provider pattern where a single OllamaProvider class implements both capabilities. Like Anthropic, DeepSeek, and Grok, Ollama does NOT support audio transcription.

Background & Context¶

Ollama offers several compelling advantages:

Fully Local/Offline: No internet required, complete privacy
Zero API Costs: After hardware, unlimited free usage
No Rate Limits: Process as fast as your hardware allows
Open Source Models: Llama 3.3, Mistral, Gemma, Phi, Qwen, etc.
OpenAI-Compatible API: Same format, uses OpenAI SDK
Model Flexibility: Easily switch between models

This PRD addresses adding Ollama as a privacy-first, cost-free provider option.

Goals¶

Add Ollama as provider option for speaker detection and summarization
Maintain 100% backward compatibility
Follow unified provider pattern (like OpenAI) - single class implementing both protocols
Support local model selection (user installs models via Ollama CLI)
Support both Config-based and experiment-based factory modes from the start
Handle capability gaps gracefully (no transcription)
Use OpenAI SDK with custom base_url (no new SDK dependency)
Use environment-based model defaults (test vs production)
Validate Ollama server is running and models are available

Ollama Model Selection and Cost Analysis¶

Configuration Fields¶

Add to config.py when implementing Ollama providers (following OpenAI pattern):

# Ollama API Configuration (NO API KEY NEEDED - local service)
ollama_api_base: Optional[str] = Field(
    default=None,
    alias="ollama_api_base",
    description="Ollama API base URL (default: http://localhost:11434/v1, for E2E testing)"
)

# Ollama Model Selection (environment-based defaults, like OpenAI)
ollama_speaker_model: str = Field(
    default_factory=_get_default_ollama_speaker_model,
    alias="ollama_speaker_model",
    description="Ollama model for speaker detection (default: environment-based)"
)

ollama_summary_model: str = Field(
    default_factory=_get_default_ollama_summary_model,
    alias="ollama_summary_model",
    description="Ollama model for summarization (default: environment-based)"
)

# Shared settings (like OpenAI)
ollama_temperature: float = Field(
    default=0.3,
    alias="ollama_temperature",
    description="Temperature for Ollama generation (0.0-2.0, lower = more deterministic)"
)

ollama_max_tokens: Optional[int] = Field(
    default=None,
    alias="ollama_max_tokens",
    description="Max tokens for Ollama generation (None = model default)"
)

# Ollama Connection Settings
ollama_timeout: int = Field(
    default=120,
    alias="ollama_timeout",
    description="Timeout in seconds for Ollama API calls (local inference can be slow)"
)

# Ollama Prompt Configuration (following OpenAI pattern)
ollama_speaker_system_prompt: Optional[str] = Field(
    default=None,
    alias="ollama_speaker_system_prompt",
    description="Ollama system prompt for speaker detection (default: ollama/ner/system_ner_v1)"
)

ollama_speaker_user_prompt: str = Field(
    default="ollama/ner/guest_host_v1",
    alias="ollama_speaker_user_prompt",
    description="Ollama user prompt for speaker detection"
)

ollama_summary_system_prompt: Optional[str] = Field(
    default=None,
    alias="ollama_summary_system_prompt",
    description="Ollama system prompt for summarization (default: ollama/summarization/system_v1)"
)

ollama_summary_user_prompt: str = Field(
    default="ollama/summarization/long_v1",
    alias="ollama_summary_user_prompt",
    description="Ollama user prompt for summarization"
)

Environment-based defaults:

Test environment: llama3.2:latest (smaller, faster for testing)
Production environment: llama3.3:latest (best quality, 128k context)

Model Options and Cost Analysis¶

Model	Size	RAM Required	Context Window	Quality	Speed
llama3.3:70b	40GB	48GB+	128k	Best	Slow
llama3.2:latest	2GB	4GB+	128k	Good	Fast
mistral:latest	4GB	8GB+	32k	Good	Fast
gemma2:9b	5GB	8GB+	8k	Good	Fast
phi3:medium	8GB	16GB+	128k	Good	Medium
qwen2.5:14b	9GB	16GB+	128k	Excellent	Medium

Cost Comparison: Ollama vs Cloud (Per 1000 Episodes)¶

Component	OpenAI (gpt-4o-mini)	DeepSeek	Grok	Ollama
Transcription	$6.00	❌ N/A	❌ N/A	❌ N/A
Speaker Detection	$1.40	$0.04	$0.06	$0.00
Summarization	$4.10	$0.12	$0.20	$0.00
Total	$11.50	$0.16	$0.26	$0.00

One-Time Hardware Investment¶

For optimal Ollama performance:

Hardware	Cost	Performance
Mac Mini M4 (16GB)	~$600	Good for small models
Mac Studio M2 Max (64GB)	~$3,000	Excellent for 70B models
PC with RTX 4090 (24GB VRAM)	~$2,500	Excellent speed

Break-even Analysis: At 10,000 episodes/month with OpenAI, hardware pays for itself in ~3 months.

Processing Speed Comparison¶

Provider	1000-word Summary Time
OpenAI GPT-4o-mini	~5 seconds
Grok	~1.0 seconds
Ollama (Llama 3.3 on M4 Pro)	~30 seconds
Ollama (Llama 3.2 on M4 Pro)	~3 seconds

Note: Ollama is slower than cloud providers but costs nothing per request.

Non-Goals¶

Transcription support (Ollama hosts LLMs only)
Automatic model installation (user manages via ollama pull)
GPU optimization (handled by Ollama itself)
Changing default behavior

Personas¶

Privacy-First Pat: Wants all processing on local hardware
Budget-Conscious Bob: Wants zero ongoing API costs
Offline Otto: Needs processing without internet connection
Self-Hosted Steve: Prefers local infrastructure over cloud
Enterprise Emily: Needs data to never leave corporate network

User Stories¶

As Privacy-First Pat, I can process podcasts without sending data to cloud APIs.
As Budget-Conscious Bob, I can process unlimited episodes with no API costs.
As Offline Otto, I can process podcasts on an airplane or remote location.
As Self-Hosted Steve, I can run everything on my own hardware.
As Enterprise Emily, I can ensure sensitive podcast data never leaves our network.

Functional Requirements¶

FR1: Provider Selection¶

FR1.1: Add "ollama" as valid value for speaker_detector_provider
FR1.2: Add "ollama" as valid value for summary_provider
FR1.3: Attempting transcription with Ollama results in clear error
FR1.4: Default values maintain current behavior
FR1.5: Support both Config-based and experiment-based factory modes from the start

FR2: No API Key Required¶

FR2.1: Ollama does not require API key (local service)
FR2.2: Clear error if Ollama server is not running (with helpful instructions)
FR2.3: Support custom ollama_api_base for remote Ollama servers
FR2.4: Support OLLAMA_API_BASE environment variable (like OPENAI_API_BASE)

FR3: Model Availability Validation¶

FR3.1: Validate model exists in local Ollama installation
FR3.2: Provide helpful error with ollama pull command if model missing
FR3.3: List available models in error message

FR4: Speaker Detection with Ollama¶

FR4.1: Ollama provider uses local LLMs for entity extraction
FR4.2: Maintains same interface as other providers
FR4.3: Uses Ollama-specific prompt templates

FR5: Summarization with Ollama¶

FR5.1: Ollama provider uses local LLMs for summarization
FR5.2: Leverages model's context window
FR5.3: Maintains same interface

FR6: OpenAI-Compatible API¶

FR6.1: Use OpenAI SDK with base_url=http://localhost:11434/v1
FR6.2: No separate SDK dependency required

Technical Requirements¶

TR1: Architecture¶

TR1.1: Follow unified provider pattern (like OpenAI) - single class implementing both protocols
TR1.2: Create providers/ollama/ollama_provider.py with unified OllamaProvider class
TR1.3: OllamaProvider implements SpeakerDetector and SummarizationProvider protocols
TR1.4: Update factories to include Ollama option with support for both Config-based and experiment-based modes
TR1.5: Create prompts/ollama/ directory with provider-specific prompt templates
TR1.6: Use OpenAI SDK with custom base_url (no new SDK dependency)
TR1.7: Follow OpenAI provider architecture exactly for consistency

TR2: Dependencies¶

TR2.1: Reuse existing openai package (already installed for OpenAI provider)
TR2.2: Add httpx package for Ollama health checks (connection validation)
TR2.3: External dependency: Ollama must be installed and running
TR2.4: ImportError with helpful message if openai or httpx packages not installed

TR3: Connection Handling¶

TR3.1: Graceful error if Ollama server not running
TR3.2: Support for custom port/host configuration
TR3.3: Timeout configuration for slow models

Success Criteria¶

✅ Users can select Ollama provider for speaker detection and summarization via unified provider
✅ Clear error when Ollama server is not running (with helpful instructions)
✅ Clear error when model is not installed (with ollama pull command)
✅ Works completely offline
✅ Zero API costs
✅ Environment-based model defaults (test vs production)
✅ Both Config-based and experiment-based factory modes supported
✅ E2E tests pass (with mock or real Ollama)
✅ Follows OpenAI provider pattern exactly for consistency

Provider Capability Matrix (Final)¶

Capability	Local	OpenAI	Anthropic	Mistral	DeepSeek	Gemini	Grok	Ollama
Transcription	✅	✅	❌	✅	❌	✅	❌	❌
Speaker Detection	✅	✅	✅	✅	✅	✅	✅	✅
Summarization	✅	✅	✅	✅	✅	✅	✅	✅
Offline	✅	❌	❌	❌	❌	❌	❌	✅
Zero Cost	✅	❌	❌	❌	❌	❌	❌	✅

Prerequisites¶

Before using Ollama provider:

Install Ollama: brew install ollama (macOS) or see https://ollama.ai
Start Ollama: ollama serve
Pull model: ollama pull llama3.3

Future Considerations¶

Whisper support if Ollama adds audio models
GPU acceleration optimization
Model recommendation based on available hardware
Batch processing optimization for local inference
Support for Ollama running on remote server