# Ollama Provider Guide

Complete guide for using the Ollama provider with podcast_scraper, including installation, setup, troubleshooting, and testing.

## Overview
Ollama is a local, self-hosted LLM solution that runs entirely on your machine. It provides:
- ✅ Zero API costs - All processing happens locally
- ✅ Complete privacy - Data never leaves your machine
- ✅ Offline operation - No internet required after setup
- ✅ Unlimited usage - No rate limits or quotas
- ✅ Full-stack - Transcription, speaker detection, and summarization
## Installation

### Step 1: Install Ollama
macOS (Homebrew):
brew install ollama
macOS (Direct Download):
- Download from https://ollama.ai
- Open the `.dmg` file and drag Ollama to Applications
- Launch Ollama from Applications
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Windows:
- Download from https://ollama.ai
- Run the installer
- Ollama will start automatically
### Step 2: Start Ollama Server
Option A: Manual Start (Recommended for Testing)
# Start server in foreground (see logs)
ollama serve
# Or start in background
ollama serve &
Option B: Service Mode (macOS)
# Check if installed as service
brew services list | grep ollama
# Start as service
brew services start ollama
# Check status
brew services info ollama
Verify Server is Running:
# Test connection
curl http://localhost:11434/api/tags
# Should return JSON (empty list if no models yet)
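The same check can be scripted. Below is a minimal Python sketch (stdlib only; the URL assumes the default port) that extracts the pulled model tags from the `/api/tags` response:

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default local port

def parse_model_names(payload: dict) -> list:
    """Extract pulled model tags from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_ollama_models(url: str = OLLAMA_TAGS_URL) -> list:
    """Return pulled model tags; raises URLError if the server is down."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_model_names(json.load(resp))

# usage (with the server running):
# print(list_ollama_models())
```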
### Step 3: Pull Required Models
# Best for structured JSON output and GIL extraction (recommended default)
ollama pull qwen2.5:7b
# Same family, larger — higher quality; needs much more RAM/VRAM (see table)
ollama pull qwen2.5:32b
# Qwen 3.5 (three-tier checklist in this guide + repository alignment)
# ollama pull qwen3.5:9b
# ollama pull qwen3.5:27b
# ollama pull qwen3.5:35b
# General purpose, good all-rounder (recommended default for speaker detection)
ollama pull llama3.1:8b
# Fast inference, good for speaker detection
ollama pull mistral:7b
# Larger Mistral-family tags (see table below)
ollama pull mistral-nemo:12b
ollama pull mistral-small3.2
# Balanced quality/speed for summarization
ollama pull gemma2:9b
# Lightweight for development/testing (lowest RAM requirement)
ollama pull phi3:mini
# Verify models are available
ollama list
Recommended Models:
| Model | Size | RAM Required | Speed | Quality | Best For | Use Case |
|---|---|---|---|---|---|---|
| Qwen 2.5 7B | 4.4GB | 8GB+ | Medium | High | Structured JSON, GIL extraction | Summarization, GIL extraction (best JSON output) |
| Qwen 2.5 32B | ~19GB | 32GB+ | Slower | Higher | Same use cases as 7B, more capacity | When 7B quality is not enough and hardware allows |
| Llama 3.1 8B | 4.7GB | 8GB+ | Medium | High | General purpose | Speaker detection (default), summarization |
| Mistral 7B | 4.1GB | 8GB+ | Fast | Good | Fast inference | Speaker detection (fastest), summarization |
| Mistral Nemo 12B | varies | 16GB+ | Medium | Good | Larger Mistral instruct | Summarization when 7B is tight |
| Mistral Small 3.2 | varies | 16GB+ | Medium–slow | Higher | Instruct / long-context class | Summarization; confirm tag on Ollama library |
| Gemma 2 9B | 5.5GB | 12GB+ | Medium | High | Balanced quality/speed | Summarization (balanced) |
| Phi-3 Mini | 2.3GB | 4GB+ | Fast | Acceptable | Lightweight | Development, testing, low-resource |
| llama3.1:8b | ~4.7GB | 6-8GB | Fastest | Good | Limited RAM (6-8GB), testing, development | N/A |
| llama3.2:latest | ~4GB | 8-12GB | Fast | Good | Standard systems (8-12GB), testing, development | N/A |
| llama3.3:latest | ~4GB | 12-16GB | Medium | Better | Production (12-16GB recommended) | N/A |
| llama3.3:70b | ~40GB | 48GB+ | Slow | Best | High-quality production (48GB+ RAM) | N/A |
### Qwen 3.5 (Ollama): three-tier checklist
Qwen 3.5 is a newer Ollama library family than Qwen 2.5 (multimodal-capable tags; large context on many variants). Smoke evals and docs still use Qwen 2.5 7B as the primary baseline, but this repo ships model-specific prompts, integration tests, and optional hybrid smoke YAML for the three checklist tiers (see Repository alignment below). After ollama pull, always confirm summaries and any structured JSON / GIL outputs still parse. Tags and on-disk sizes change over time—see Ollama: qwen3.5.
Use the same config fields as other Ollama models (ollama_summary_model, ollama_speaker_model, and/or hybrid_reduce_model with hybrid_reduce_backend: ollama). Model-specific prompts live under src/podcast_scraper/prompts/ollama/<dir>/ where <dir> replaces : with _ (e.g. qwen3.5:9b → qwen3.5_9b). If you use a variant tag not covered below (e.g. qwen3.5:27b-q4_K_M), the provider falls back to generic ollama/ner/ and ollama/summarization/ prompts unless you add a matching <dir>.
For Qwen 3.5 tags, chat requests use Ollama’s OpenAI-compatible field reasoning_effort: none so the assistant answer is returned in content (not only chain-of-thought in reasoning, which can consume max_tokens and look like “empty summaries”). See Ollama: Thinking.
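To make the request shape concrete, here is a sketch of the chat body this implies for POST /v1/chat/completions. This is illustrative only: the provider constructs its own request, and everything beyond `model`/`messages` is an assumption based on the paragraph above.

```python
def build_qwen35_chat_body(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Sketch of a chat request body for a Qwen 3.5 tag.

    reasoning_effort "none" asks Ollama to return the answer in
    `content` instead of spending the token budget on `reasoning`.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "reasoning_effort": "none",
    }

# usage:
# body = build_qwen35_chat_body("qwen3.5:9b", "Summarize this episode in two sentences.")
```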
#### Tier 1 — qwen3.5:9b (fast default for Qwen 3.5)
Roughly ~6–7 GB on disk for the common quantized tag (Ollama default latest often tracks this class).
- [ ] `ollama pull qwen3.5:9b` (or `ollama pull qwen3.5:latest` if you intend the library default)
- [ ] `ollama list` includes the pulled tag
- [ ] Point config at the exact tag (example: `hybrid_reduce_model: qwen3.5:9b`)
- [ ] Run one short episode or a smoke eval; verify summary quality and JSON validity if your pipeline expects structured output
#### Tier 2 — qwen3.5:27b (quality)
Roughly ~17 GB for the common q4_K_M-class tag on the library page—suitable for 24 GB+ unified memory with headroom for the app and OS.
- [ ] `ollama pull qwen3.5:27b`
- [ ] `ollama list` includes `qwen3.5:27b` (or the specific variant you pulled, e.g. `qwen3.5:27b-q4_K_M`)
- [ ] Set `hybrid_reduce_model` / `ollama_summary_model` to that tag
- [ ] Re-run the same validation as Tier 1; compare latency vs Tier 1
#### Tier 3 — qwen3.5:35b (heavy / Apple Silicon 48 GB class)
MoE-style tags (e.g. 35b-a3b on the library) are often ~24 GB for a common quant—feasible on 48 GB unified RAM if little else is memory-heavy; expect slower inference than Tier 1–2.
- [ ] `ollama pull qwen3.5:35b` (or the explicit MoE tag you want, e.g. `qwen3.5:35b-a3b`, from the library)
- [ ] `ollama list` shows the model; confirm Activity Monitor memory stays acceptable under load
- [ ] Set config to the exact pulled tag
- [ ] Validate outputs as in Tier 1; if quality gains are marginal, prefer Tier 2 for speed
#### Qwen 3.5 — repository alignment
Use this block after you work through Tier 1–3. It ties the operational checklist to what the repo provides.
- [ ] Tier 1 prompts (9B): `src/podcast_scraper/prompts/ollama/qwen3.5_9b/` (`ner/` + `summarization/`, same output contracts as the Qwen 2.5 7B templates)
- [ ] Tier 2 prompts (27B): `src/podcast_scraper/prompts/ollama/qwen3.5_27b/`
- [ ] Tier 3 prompts (35B): `src/podcast_scraper/prompts/ollama/qwen3.5_35b/`; for tag `qwen3.5:35b-a3b`, use `src/podcast_scraper/prompts/ollama/qwen3.5_35b-a3b/`
- [ ] Integration tests: `pytest tests/integration/providers/ollama/test_qwen3_5_ollama_tiers.py` (or run the full Ollama integration slice via `make test-integration`)
- [ ] Eval configs (`data/eval/configs/`): `llm_ollama_*_smoke_v1` = pure Ollama summarization (no LongT5; same shape as `llm_mistral_smoke_v1`). Mistral-family Ollama smokes include `llm_ollama_mistral_7b_smoke_v1`, `llm_ollama_mistral_nemo_12b_smoke_v1`, `llm_ollama_mistral_small3_2_smoke_v1`. Llama 3.x smokes include `llm_ollama_llama32_3b_smoke_v1`, `llm_ollama_llama33_70b_q3km_smoke_v1`. `hybrid_ml_tier2_*` = LongT5 MAP + Ollama REDUCE (needs HF MAP cache + `ollama pull`). Qwen 3.5 tier extras: `hybrid_ml_tier2_qwen35_*b_authority_v1`, `hybrid_ml_tier2_qwen35_9b_smoke_tuned_v1`, `hybrid_ml_tier2_qwen35_9b_smoke_paragraph_v1`.
- [ ] Acceptance configs (`config/acceptance/summarization/`, mirror Qwen 2.5): hybrid: `acceptance_planet_money_hybrid_ollama_qwen35_9b.yaml`, `_27b`, `_35b`; full Ollama stack: `acceptance_planet_money_ollama_qwen3_5_9b.yaml`, `_27b`, `_35b`. See `config/acceptance/README.md`.
### Step 4: Install Podcast Scraper Dependencies
# Install with LLM support (includes Ollama dependencies)
pip install -e ".[dev,ml,llm]"
Required Dependencies:
- `openai` - Used for the OpenAI-compatible API client
- `httpx` - Used for health checks
## Basic Setup

### Configuration File
# config.yaml
transcription_provider: ollama
speaker_detector_provider: ollama
summary_provider: ollama
# Ollama configuration
ollama_api_base: http://localhost:11434/v1 # Default, can be omitted
ollama_speaker_model: llama3.3:latest # Production model
ollama_summary_model: llama3.3:latest # Production model
ollama_temperature: 0.3 # Lower = more deterministic
ollama_timeout: 300 # 5 minutes for slow inference
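Because Ollama speaks the OpenAI-compatible protocol, any OpenAI client can target it with these settings. A minimal sketch follows; the helper name is hypothetical, and the placeholder API key is required by the `openai` client but ignored by Ollama:

```python
def ollama_client_kwargs(api_base: str = "http://localhost:11434/v1") -> dict:
    """Constructor kwargs for an OpenAI-compatible client aimed at Ollama.

    Ollama ignores the API key, but the openai client requires a
    non-empty string, so a placeholder is passed.
    """
    return {"base_url": api_base, "api_key": "ollama"}

# usage (requires the `openai` package and a running server):
# from openai import OpenAI
# client = OpenAI(**ollama_client_kwargs())
```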
### CLI Usage
# Basic usage (all capabilities)
podcast-scraper --rss https://example.com/feed.xml \
--transcription-provider ollama \
--speaker-detector-provider ollama \
--summary-provider ollama
# With custom model
podcast-scraper --rss https://example.com/feed.xml \
--transcription-provider ollama \
--speaker-detector-provider ollama \
--summary-provider ollama \
--ollama-speaker-model llama3.2:latest \
--ollama-summary-model llama3.2:latest
# With custom timeout (for slow models)
podcast-scraper --rss https://example.com/feed.xml \
--transcription-provider ollama \
--speaker-detector-provider ollama \
--summary-provider ollama \
--ollama-timeout 600 # 10 minutes
### Using Ollama as REDUCE Backend (hybrid_ml)
When using hybrid MAP-REDUCE summarization (summary_provider: hybrid_ml), you can set hybrid_reduce_backend: ollama so the REDUCE phase runs on a local Ollama model instead of transformers. The MAP phase (e.g. LongT5-base) still runs locally; only the final synthesis uses Ollama.
- Config: `summary_provider: hybrid_ml`, `hybrid_reduce_backend: ollama`, `hybrid_reduce_model: <ollama_tag>` (e.g. `llama3.1:8b`, `mistral:7b`, `qwen2.5:7b`, `qwen2.5:32b`).
- No template file: The reduce instruction is sent to Ollama as an inline prompt; no `custom.j2` or other template file is required.
- Acceptance tests: Example configs under `config/acceptance/` use Ollama for REDUCE (e.g. `acceptance_planet_money_hybrid_ollama_llama3_8b.yaml` with `llama3.1:8b`). Ensure the model is available (`ollama pull llama3.1:8b` or equivalent).
See ML Provider Reference for full hybrid_ml configuration.
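Put together, a minimal config sketch for this mode (values are examples taken from the list above, not required defaults):

```yaml
# config.yaml: hybrid MAP-REDUCE with a local Ollama REDUCE step
summary_provider: hybrid_ml
hybrid_reduce_backend: ollama
hybrid_reduce_model: qwen2.5:7b   # any pulled tag works, e.g. llama3.1:8b
```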
### Environment Variables
# Set custom API base (if Ollama is on different host/port)
export OLLAMA_API_BASE=http://192.168.1.100:11434/v1
# Then use in CLI or config
podcast-scraper --rss https://example.com/feed.xml \
--speaker-detector-provider ollama
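The precedence is simple: the environment variable wins when set, otherwise the built-in default applies. A sketch of that rule (the helper is hypothetical, not the provider's actual code):

```python
import os

DEFAULT_OLLAMA_API_BASE = "http://localhost:11434/v1"

def resolve_ollama_api_base(env=None):
    """Return OLLAMA_API_BASE if set, else the built-in default.

    Pass a dict for testing, or leave env=None to read the real
    process environment.
    """
    if env is None:
        env = os.environ
    return env.get("OLLAMA_API_BASE", DEFAULT_OLLAMA_API_BASE)
```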
## Checklist: add support for a new Ollama model (in this repo)
Use this when you introduce another Ollama model tag (e.g. a larger variant of the same family) and want it to be easy to run and document—not for one-off local experiments only.
- Runtime (enough to use it): `ollama pull <tag>`. Point config or CLI at the tag (`ollama_speaker_model`, `ollama_summary_model`, and/or `hybrid_reduce_model` with `hybrid_reduce_backend: ollama`). There is no new provider type: the same Ollama integration applies; only the model string changes (same idea as switching OpenAI model names).
- Prompts (only if needed): Optional model-specific templates live under `prompts/.../ollama/<dir>/`, where `<dir>` is the tag with `:` replaced by `_` (e.g. `qwen2.5:32b` → `qwen2.5_32b`). If that path does not exist, the provider falls back to generic `ollama/<task>/...` prompts.
- Eval configs (optional): Add a new file under `data/eval/configs/` with a unique `id`, copied from a similar run; set `reduce_model` or the relevant Ollama fields to `<tag>` so multi-run / comparison workflows can name the run.
- Acceptance configs (optional): Add something like `config/acceptance/summarization/acceptance_planet_money_hybrid_ollama_<model>.yaml` (mirror an existing hybrid or full-Ollama YAML). Update the table in `config/acceptance/README.md` if the config should be discoverable. Do not add Ollama-heavy stems to `config/acceptance/FAST_CONFIGS.txt` unless you want PR CI to run them.
- Tests (optional): Extra integration tests are only warranted if you need to pin behavior for that tag; otherwise existing Ollama tests plus config strings usually suffice.
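The tag-to-directory mapping in the prompts step is simple enough to sketch. Both helpers below are hypothetical illustrations, not the repository's actual lookup code:

```python
def tag_to_prompt_dir(tag: str) -> str:
    """Map an Ollama tag to its prompt directory name (':' becomes '_')."""
    return tag.replace(":", "_")

def select_prompt_dir(tag: str, existing_dirs: set, task: str) -> str:
    """Prefer a model-specific dir, else fall back to the generic task dir."""
    specific = f"ollama/{tag_to_prompt_dir(tag)}"
    return specific if specific in existing_dirs else f"ollama/{task}"
```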
## Troubleshooting

### Issue: `ollama list` Hangs or Times Out
Symptom: Command hangs with no output.
Cause: Ollama server is not running.
Solution:
# 1. Start Ollama server
ollama serve
# 2. Keep that terminal open, then in another terminal:
ollama list
# 3. Verify server is responding
curl http://localhost:11434/api/tags
### Issue: "Ollama server is not running"
Symptom: Error message when trying to use Ollama provider.
Cause: Ollama server is not accessible at http://localhost:11434.
Solution:
# Check if server is running
curl http://localhost:11434/api/tags
# If connection fails, start server:
ollama serve
# If using custom host/port, set environment variable:
export OLLAMA_API_BASE=http://your-host:11434/v1
### Issue: "Model 'llama3.3:latest' is not available"
Symptom: Error message about model not found.
Cause: Model hasn't been pulled yet.
Solution:
# Pull the required model
ollama pull llama3.3:latest
# Verify it's available
ollama list
# Test the model directly
ollama run llama3.3:latest "Hello, test"
### Issue: HTTP 500 from Ollama / Experiment Seems Stuck
Symptom: Logs or a proxy show 500 from POST .../v1/chat/completions, or a summarization experiment sits on one episode for a long time before failing.
Typical causes: Model out of memory (especially large quants on limited RAM), context length too large for the loaded model, or a transient Ollama bug. The OpenAI-compatible client used to retry 5xx errors; for local Ollama, HTTP 500 is no longer retried (fail fast). Timeouts can still wait up to ollama_timeout seconds (default 120) per attempt.
What to try:
- Check Ollama logs (macOS example): `cat ~/.ollama/logs/server.log`, or run `ollama serve` in a terminal and watch stderr.
- Run a tiny prompt: `ollama run <your-tag> "Say hello in one sentence."`
- Use a smaller quant / tag, close other GPU-heavy apps, or set a longer read timeout for eval runs only: `EXPERIMENT_OLLAMA_READ_TIMEOUT=600 make experiment-run CONFIG=...`
- For persistent 500s on one model only, try another tag from ollama.com/library or reduce transcript size (preprocessing / chunking).
### Issue: Connection Refused on Custom Port
Symptom: Error connecting to Ollama on non-default port.
Cause: Ollama is running on default port (11434) but config specifies different port.
Solution:
# Check what port Ollama is actually using
ps aux | grep ollama
# Or check Ollama config
cat ~/.ollama/config.json # If exists
# Update config to match actual port:
# ollama_api_base: http://localhost:11434/v1 # Default
### Issue: Slow Performance
Symptom: Ollama inference is very slow.
Causes & Solutions:
- Model too large for hardware:
# Use smaller model (best for limited RAM)
ollama pull llama3.1:8b
# Update config: ollama_speaker_model: llama3.1:8b
# OR for slightly better quality with more RAM:
# ollama pull llama3.2:latest
- Insufficient RAM:
# Check available RAM
free -h # Linux
vm_stat # macOS
# Use smaller model or add more RAM
- CPU-only inference (no GPU):
- Ollama will use CPU if no GPU available
- Consider using GPU-accelerated models if available
- Increase timeout: `ollama_timeout: 600`
- Multiple concurrent requests:
  - Ollama processes requests sequentially by default
  - This is expected behavior for local inference
### Issue: Process Won't Die
Symptom: Can't kill Ollama process.
Solution:
# Kill by name
pkill ollama
# Or force kill
killall ollama
# Or find and kill manually
ps aux | grep ollama
kill <PID>
# If running as service (macOS)
brew services stop ollama
### Issue: Port Already in Use
Symptom: Error starting Ollama server - port 11434 already in use.
Solution:
# Find what's using the port
lsof -i :11434
# Kill the process
kill <PID>
# Or use different port (requires Ollama config change)
# Not recommended - better to kill existing process
### Issue: Permission Denied
Symptom: Permission errors when starting Ollama.
Solution:
# macOS: Grant network permissions in System Settings
# Settings > Privacy & Security > Network Extensions
# Linux: Check user permissions
# Ollama should run as your user, not root
## Testing with Real Models

### E2E Tests with Real Ollama
# 1. Ensure Ollama server is running
ollama serve # In separate terminal
# 2. Pull required models
ollama pull llama3.3:latest
# 3. Run E2E tests with real Ollama
USE_REAL_OLLAMA_API=1 \
pytest tests/e2e/test_ollama_provider_integration_e2e.py -v
# 4. Test specific capability
USE_REAL_OLLAMA_API=1 \
pytest tests/e2e/test_ollama_provider_integration_e2e.py::TestOllamaProviderE2E::test_ollama_speaker_detection_in_workflow -v
### Test with Real RSS Feed
USE_REAL_OLLAMA_API=1 \
LLM_TEST_RSS_FEED="https://feeds.npr.org/510289/podcast.xml" \
pytest tests/e2e/test_ollama_provider_integration_e2e.py::TestOllamaProviderE2E::test_ollama_all_providers_in_pipeline -v
### Test Multiple Episodes
USE_REAL_OLLAMA_API=1 \
LLM_TEST_MAX_EPISODES=3 \
pytest tests/e2e/test_ollama_provider_integration_e2e.py::TestOllamaProviderE2E::test_ollama_all_providers_in_pipeline -v
## Performance Tips

### Model Selection
By Use Case:
- Structured JSON / GIL Extraction: Use `qwen2.5:7b` (best-in-class JSON output)
- Speaker Detection (Default): Use `llama3.1:8b` (strong all-rounder)
- Fast Speaker Detection: Use `mistral:7b` (fastest inference)
- Summarization (Balanced): Use `gemma2:9b` (balanced quality/speed)
- Development/Testing: Use `phi3:mini` (lightweight, lowest RAM)
By RAM Available:
- 4GB+ RAM: Use `phi3:mini` (lightweight, dev/test only)
- 8GB+ RAM: Use `qwen2.5:7b`, `llama3.1:8b`, or `mistral:7b` (recommended)
- 12GB+ RAM: Use `gemma2:9b` (balanced quality/speed)
Default Recommendations:
- Speaker Detection: `llama3.1:8b` (general purpose, good quality)
- Summarization: `qwen2.5:7b` (best structured JSON, ideal for GIL extraction)
- Development: `phi3:mini` (lightweight, fast iteration)
### Timeout Configuration
# For fast models (llama3.2)
ollama_timeout: 120 # 2 minutes
# For medium models (llama3.3)
ollama_timeout: 300 # 5 minutes
# For large models (70b)
ollama_timeout: 600 # 10 minutes
### Hardware Recommendations
| Model | Size | Minimum RAM | Recommended RAM | Speed | Best For |
|---|---|---|---|---|---|
| Phi-3 Mini | 2.3GB | 4GB | 4GB+ | Fast | Dev/test, low-resource |
| Mistral 7B | 4.1GB | 8GB | 8GB+ | Fastest | Fast speaker detection |
| Qwen 2.5 7B | 4.4GB | 8GB | 8GB+ | Medium | Structured JSON, GIL extraction |
| Llama 3.1 8B | 4.7GB | 8GB | 8GB+ | Medium | General purpose (default) |
| Gemma 2 9B | 5.5GB | 12GB | 12GB+ | Medium | Balanced quality/speed |
## Common Workflows

### Development/Testing Setup
# 1. Start Ollama
ollama serve
# 2. Pull model (choose based on your RAM and use case)
ollama pull phi3:mini # Lightweight (4GB+ RAM, dev/test)
ollama pull mistral:7b # Fast (8GB+ RAM, fast speaker detection)
ollama pull mistral-nemo:12b # Mistral Nemo 12B (16GB+ RAM typical)
ollama pull mistral-small3.2 # Mistral Small 3.2 (see library for exact size)
ollama pull llama3.1:8b # General purpose (8GB+ RAM, default)
ollama pull qwen2.5:7b # Best JSON (8GB+ RAM, GIL extraction)
ollama pull gemma2:9b # Balanced (12GB+ RAM, summarization)
# 3. Configure (model-specific prompts are automatically selected)
# config.yaml:
# ollama_speaker_model: llama3.1:8b # or mistral:7b for speed
# ollama_summary_model: qwen2.5:7b # or gemma2:9b for quality
# ollama_timeout: 120
### Production Setup
# 1. Start Ollama as service
brew services start ollama # macOS
# Or: systemctl start ollama # Linux
# 2. Pull production model
ollama pull llama3.3:latest
# 3. Configure for quality
# config.yaml:
# ollama_speaker_model: llama3.3:latest
# ollama_summary_model: llama3.3:latest
# ollama_timeout: 300
### Hybrid Setup (Ollama + Other Providers)
# Use Ollama for privacy-sensitive operations
# Use other providers for speed/cost optimization
transcription_provider: ollama # Local Ollama transcription (privacy)
speaker_detector_provider: ollama # Local speaker detection (privacy)
summary_provider: ollama # Local summarization (privacy)
# Or mix with other providers:
# transcription_provider: whisper # Local Whisper (alternative)
# speaker_detector_provider: ollama # Local speaker detection
# summary_provider: openai # Cloud summarization (quality)
## Verification

### Quick Health Check
# 1. Check server is running
curl http://localhost:11434/api/tags
# 2. List available models
ollama list
# 3. Test model directly
ollama run llama3.3:latest "What is a podcast?"
# 4. Test with podcast_scraper
podcast-scraper --rss https://example.com/feed.xml \
--transcription-provider ollama \
--speaker-detector-provider ollama \
--summary-provider ollama \
--max-episodes 1
### Verify Provider is Working
# Run with verbose logging
podcast-scraper --rss https://example.com/feed.xml \
--transcription-provider ollama \
--speaker-detector-provider ollama \
--summary-provider ollama \
--log-level DEBUG \
--max-episodes 1
# Check metadata files for provider information
cat output/*/metadata/*.json | jq '.processing.config_snapshot.ml_providers'
## Related Documentation
- Provider Configuration Quick Reference - Configuration options
- AI Provider Comparison Guide - Ollama vs other providers
- E2E Testing Guide - Testing with real Ollama
- Troubleshooting Guide - Common issues and solutions
## Additional Resources
- Ollama Documentation: https://github.com/ollama/ollama
- Ollama Models: https://ollama.ai/library
- Ollama API Reference: https://github.com/ollama/ollama/blob/main/docs/api.md