Ollama Provider Guide

Complete guide for using the Ollama provider with podcast_scraper, including installation, setup, troubleshooting, and testing.

Overview

Ollama is a local, self-hosted LLM solution that runs entirely on your machine. It provides:

  • Zero API costs - All processing happens locally
  • Complete privacy - Data never leaves your machine
  • Offline operation - No internet required after setup
  • Unlimited usage - No rate limits or quotas
  • Full-stack - Transcription, speaker detection, and summarization

Installation

Step 1: Install Ollama

macOS (Homebrew):

brew install ollama

macOS (Direct Download):

  1. Download from https://ollama.ai
  2. Open the .dmg file and drag Ollama to Applications
  3. Launch Ollama from Applications

Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Windows:

  1. Download from https://ollama.ai
  2. Run the installer
  3. Ollama will start automatically

Step 2: Start Ollama Server

Option A: Manual Start (Recommended for Testing)

# Start server in foreground (see logs)
ollama serve

# Or start in background
ollama serve &

Option B: Service Mode (macOS)

# Check if installed as service
brew services list | grep ollama

# Start as service
brew services start ollama

# Check status
brew services info ollama

Verify Server is Running:

# Test connection
curl http://localhost:11434/api/tags

# Should return JSON (empty list if no models yet)
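
The same check can be scripted. A minimal Python sketch, assuming the documented `/api/tags` response shape (`{"models": [{"name": ...}, ...]}`) that the curl call above returns; the helper names are illustrative, not part of podcast_scraper:

```python
import json
import urllib.request

def list_model_tags(payload: dict) -> list[str]:
    """Extract model tags from an Ollama /api/tags response payload."""
    return [m["name"] for m in payload.get("models", [])]

def fetch_model_tags(base: str = "http://localhost:11434") -> list[str]:
    """Query a running Ollama server for its installed model tags."""
    with urllib.request.urlopen(f"{base}/api/tags", timeout=5) as resp:
        return list_model_tags(json.load(resp))

# The parsing half works without a live server:
sample = {"models": [{"name": "llama3.1:8b"}, {"name": "qwen2.5:7b"}]}
print(list_model_tags(sample))  # ['llama3.1:8b', 'qwen2.5:7b']
```

An empty `models` list (no models pulled yet) simply yields `[]`, matching the "empty list if no models yet" note above.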

Step 3: Pull Required Models

# Best for structured JSON output and GIL extraction (recommended default)
ollama pull qwen2.5:7b

# Same family, larger — higher quality; needs much more RAM/VRAM (see table)
ollama pull qwen2.5:32b

# Qwen 3.5 (three-tier checklist in this guide + repository alignment)
# ollama pull qwen3.5:9b
# ollama pull qwen3.5:27b
# ollama pull qwen3.5:35b

# General purpose, good all-rounder (recommended default for speaker detection)
ollama pull llama3.1:8b

# Fast inference, good for speaker detection
ollama pull mistral:7b

# Larger Mistral-family tags (see table below)
ollama pull mistral-nemo:12b
ollama pull mistral-small3.2

# Balanced quality/speed for summarization
ollama pull gemma2:9b

# Lightweight for development/testing (lowest RAM requirement)
ollama pull phi3:mini

# Verify models are available
ollama list

Recommended Models:

| Model | Size | RAM Required | Speed | Quality | Best For | Use Case |
|---|---|---|---|---|---|---|
| Qwen 2.5 7B | 4.4GB | 8GB+ | Medium | High | Structured JSON, GIL extraction | Summarization, GIL extraction (best JSON output) |
| Qwen 2.5 32B | ~19GB | 32GB+ | Slower | Higher | Same use cases as 7B, more capacity | When 7B quality is not enough and hardware allows |
| Llama 3.1 8B | 4.7GB | 8GB+ | Medium | High | General purpose | Speaker detection (default), summarization |
| Mistral 7B | 4.1GB | 8GB+ | Fast | Good | Fast inference | Speaker detection (fastest), summarization |
| Mistral Nemo 12B | varies | 16GB+ | Medium | Good | Larger Mistral instruct | Summarization when 7B is tight |
| Mistral Small 3.2 | varies | 16GB+ | Medium-slow | Higher | Instruct / long-context class | Summarization; confirm tag on Ollama library |
| Gemma 2 9B | 5.5GB | 12GB+ | Medium | High | Balanced quality/speed | Summarization (balanced) |
| Phi-3 Mini | 2.3GB | 4GB+ | Fast | Acceptable | Lightweight | Development, testing, low-resource |

Llama 3.x tags:

| Model | Size | RAM Required | Speed | Quality | Best For |
|---|---|---|---|---|---|
| llama3.1:8b | ~4.7GB | 6-8GB | Fastest | Good | Limited RAM (6-8GB), testing, development |
| llama3.2:latest | ~4GB | 8-12GB | Fast | Good | Standard systems (8-12GB), testing, development |
| llama3.3:latest | ~4GB | 12-16GB | Medium | Better | Production (12-16GB recommended) |
| llama3.3:70b | ~40GB | 48GB+ | Slow | Best | High-quality production (48GB+ RAM) |

Qwen 3.5 (Ollama): three-tier checklist

Qwen 3.5 is a newer Ollama library family than Qwen 2.5 (multimodal-capable tags; large context on many variants). Smoke evals and docs still use Qwen 2.5 7B as the primary baseline, but this repo ships model-specific prompts, integration tests, and optional hybrid smoke YAML for the three checklist tiers (see Repository alignment below). After ollama pull, always confirm summaries and any structured JSON / GIL outputs still parse. Tags and on-disk sizes change over time—see Ollama: qwen3.5.

Use the same config fields as other Ollama models (ollama_summary_model, ollama_speaker_model, and/or hybrid_reduce_model with hybrid_reduce_backend: ollama). Model-specific prompts live under src/podcast_scraper/prompts/ollama/<dir>/, where <dir> is the tag with : replaced by _ (e.g. qwen3.5:9b → qwen3.5_9b). If you use a variant tag not covered below (e.g. qwen3.5:27b-q4_K_M), the provider falls back to the generic ollama/ner/ and ollama/summarization/ prompts unless you add a matching <dir>.
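
The tag-to-directory rule can be sketched as a one-liner (the helper name is hypothetical; only the `:` → `_` substitution is from this guide):

```python
def prompt_dir_for_tag(tag: str) -> str:
    """Map an Ollama model tag to its model-specific prompt directory name."""
    return tag.replace(":", "_")

print(prompt_dir_for_tag("qwen3.5:9b"))       # qwen3.5_9b
print(prompt_dir_for_tag("qwen2.5:32b"))      # qwen2.5_32b
print(prompt_dir_for_tag("qwen3.5:35b-a3b"))  # qwen3.5_35b-a3b
```
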

For Qwen 3.5 tags, chat requests use Ollama’s OpenAI-compatible field reasoning_effort: none so the assistant answer is returned in content (not only chain-of-thought in reasoning, which can consume max_tokens and look like “empty summaries”). See Ollama: Thinking.
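
In OpenAI-compatible terms, the request body looks roughly like the payload below. This is an illustrative sketch, not the repo's actual client code; field support for reasoning_effort varies by Ollama version:

```python
def build_chat_payload(model: str, user_text: str, max_tokens: int = 512) -> dict:
    """Sketch of a /v1/chat/completions request body for an Ollama model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    if model.startswith("qwen3.5"):
        # Ask Ollama to skip chain-of-thought so the answer lands in
        # `content` instead of consuming max_tokens inside `reasoning`.
        payload["reasoning_effort"] = "none"
    return payload

body = build_chat_payload("qwen3.5:9b", "Summarize this episode.")
print(body["reasoning_effort"])  # none
```

Non-Qwen-3.5 tags get no reasoning_effort field, matching the tag-specific behavior described above.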

Tier 1 — qwen3.5:9b (fast default for Qwen 3.5)

Roughly 6–7 GB on disk for the common quantized tag (the Ollama default latest often tracks this class).

  • [ ] ollama pull qwen3.5:9b (or ollama pull qwen3.5:latest if you intend the library default)
  • [ ] ollama list includes the pulled tag
  • [ ] Point config at the exact tag (example: hybrid_reduce_model: qwen3.5:9b)
  • [ ] Run one short episode or a smoke eval; verify summary quality and JSON validity if your pipeline expects structured output

Tier 2 — qwen3.5:27b (quality)

Roughly 17 GB for the common q4_K_M-class tag on the library page; suitable for 24 GB+ unified memory with headroom for the app and OS.

  • [ ] ollama pull qwen3.5:27b
  • [ ] ollama list includes qwen3.5:27b (or the specific variant you pulled, e.g. qwen3.5:27b-q4_K_M)
  • [ ] Set hybrid_reduce_model / ollama_summary_model to that tag
  • [ ] Re-run the same validation as Tier 1; compare latency vs Tier 1

Tier 3 — qwen3.5:35b (heavy / Apple Silicon 48 GB class)

MoE-style tags (e.g. 35b-a3b on the library) are often ~24 GB for a common quant, which is feasible on 48 GB of unified RAM if little else is memory-heavy; expect slower inference than Tiers 1–2.

  • [ ] ollama pull qwen3.5:35b (or the explicit MoE tag you want, e.g. qwen3.5:35b-a3b, from the library)
  • [ ] ollama list shows the model; confirm Activity Monitor memory stays acceptable under load
  • [ ] Set config to the exact pulled tag
  • [ ] Validate outputs as in Tier 1; if quality gains are marginal, prefer Tier 2 for speed

Qwen 3.5 — repository alignment

Use this block after you work through Tier 1–3. It ties the operational checklist to what the repo provides.

  • [ ] Tier 1 prompts (9B): src/podcast_scraper/prompts/ollama/qwen3.5_9b/ner/ + summarization/ (same output contracts as Qwen 2.5 7B templates)
  • [ ] Tier 2 prompts (27B): src/podcast_scraper/prompts/ollama/qwen3.5_27b/
  • [ ] Tier 3 prompts (35B): src/podcast_scraper/prompts/ollama/qwen3.5_35b/; for tag qwen3.5:35b-a3b, use src/podcast_scraper/prompts/ollama/qwen3.5_35b-a3b/
  • [ ] Integration tests: pytest tests/integration/providers/ollama/test_qwen3_5_ollama_tiers.py (or run the full Ollama integration slice via make test-integration)
  • [ ] Eval configs (data/eval/configs/): llm_ollama_*_smoke_v1 = pure Ollama summarization (no LongT5; same shape as llm_mistral_smoke_v1). Mistral-family Ollama smokes include llm_ollama_mistral_7b_smoke_v1, llm_ollama_mistral_nemo_12b_smoke_v1, llm_ollama_mistral_small3_2_smoke_v1. Llama 3.x smokes include llm_ollama_llama32_3b_smoke_v1, llm_ollama_llama33_70b_q3km_smoke_v1. hybrid_ml_tier2_* = LongT5 MAP + Ollama REDUCE (needs HF MAP cache + ollama pull). Qwen 3.5 tier extras: hybrid_ml_tier2_qwen35_*b_authority_v1, hybrid_ml_tier2_qwen35_9b_smoke_tuned_v1, hybrid_ml_tier2_qwen35_9b_smoke_paragraph_v1.
  • [ ] Acceptance configs (config/acceptance/summarization/, mirror Qwen 2.5): hybrid — acceptance_planet_money_hybrid_ollama_qwen35_9b.yaml, _27b, _35b; full Ollama stack — acceptance_planet_money_ollama_qwen3_5_9b.yaml, _27b, _35b. See config/acceptance/README.md.

Step 4: Install Podcast Scraper Dependencies

# Install with LLM support (includes Ollama dependencies)
pip install -e ".[dev,ml,llm]"

Required Dependencies:

  • openai - Used for OpenAI-compatible API client
  • httpx - Used for health checks

Basic Setup

Configuration File

# config.yaml
transcription_provider: ollama
speaker_detector_provider: ollama
summary_provider: ollama

# Ollama configuration
ollama_api_base: http://localhost:11434/v1  # Default, can be omitted
ollama_speaker_model: llama3.3:latest       # Production model
ollama_summary_model: llama3.3:latest       # Production model
ollama_temperature: 0.3                      # Lower = more deterministic
ollama_timeout: 300                          # 5 minutes for slow inference

CLI Usage

# Basic usage (all capabilities)
podcast-scraper --rss https://example.com/feed.xml \
  --transcription-provider ollama \
  --speaker-detector-provider ollama \
  --summary-provider ollama

# With custom model
podcast-scraper --rss https://example.com/feed.xml \
  --transcription-provider ollama \
  --speaker-detector-provider ollama \
  --summary-provider ollama \
  --ollama-speaker-model llama3.2:latest \
  --ollama-summary-model llama3.2:latest

# With custom timeout (for slow models)
podcast-scraper --rss https://example.com/feed.xml \
  --transcription-provider ollama \
  --speaker-detector-provider ollama \
  --summary-provider ollama \
  --ollama-timeout 600  # 10 minutes

Using Ollama as REDUCE Backend (hybrid_ml)

When using hybrid MAP-REDUCE summarization (summary_provider: hybrid_ml), you can set hybrid_reduce_backend: ollama so the REDUCE phase runs on a local Ollama model instead of transformers. The MAP phase (e.g. LongT5-base) still runs locally; only the final synthesis uses Ollama.

  • Config: summary_provider: hybrid_ml, hybrid_reduce_backend: ollama, hybrid_reduce_model: <ollama_tag> (e.g. llama3.1:8b, mistral:7b, qwen2.5:7b, qwen2.5:32b).
  • No template file: The reduce instruction is sent to Ollama as an inline prompt; no custom.j2 or other template file is required.
  • Acceptance tests: Example configs under config/acceptance/ use Ollama for REDUCE (e.g. acceptance_planet_money_hybrid_ollama_llama3_8b.yaml with llama3.1:8b). Ensure the model is available (ollama pull llama3.1:8b or equivalent).

See ML Provider Reference for full hybrid_ml configuration.
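
Put together, a minimal hybrid config might look like this (field names come from the bullets above; the model tag is just an example, any pulled Ollama tag works):

```yaml
# config.yaml — hybrid MAP-REDUCE with a local Ollama REDUCE step
summary_provider: hybrid_ml
hybrid_reduce_backend: ollama
hybrid_reduce_model: qwen2.5:7b   # example tag; must be pulled first
```
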

Environment Variables

# Set custom API base (if Ollama is on different host/port)
export OLLAMA_API_BASE=http://192.168.1.100:11434/v1

# Then use in CLI or config
podcast-scraper --rss https://example.com/feed.xml \
  --speaker-detector-provider ollama
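
Note that ollama_api_base / OLLAMA_API_BASE point at the OpenAI-compatible /v1 endpoint, while health checks hit the native API root. A small helper (hypothetical name, illustrating the relationship only) derives one from the other:

```python
def health_check_url(api_base: str) -> str:
    """Derive the native /api/tags health-check URL from an OpenAI-style base."""
    root = api_base.rstrip("/")
    if root.endswith("/v1"):
        root = root[: -len("/v1")]  # native API lives at the server root
    return f"{root}/api/tags"

print(health_check_url("http://192.168.1.100:11434/v1"))
# http://192.168.1.100:11434/api/tags
```
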

Checklist: add support for a new Ollama model (in this repo)

Use this when you introduce another Ollama model tag (e.g. a larger variant of the same family) and want it to be easy to run and document, rather than a one-off local experiment.

  1. Runtime (enough to use it): ollama pull <tag>. Point config or CLI at the tag (ollama_speaker_model, ollama_summary_model, and/or hybrid_reduce_model with hybrid_reduce_backend: ollama). There is no new provider type: the same Ollama integration applies; only the model string changes (same idea as switching OpenAI model names).

  2. Prompts (only if needed): Optional model-specific templates live under prompts/.../ollama/<dir>/, where <dir> is the tag with : replaced by _ (e.g. qwen2.5:32b → qwen2.5_32b). If that path does not exist, the provider falls back to generic ollama/<task>/... prompts.

  3. Eval configs (optional): Add a new file under data/eval/configs/ with a unique id, copied from a similar run; set reduce_model or the relevant Ollama fields to <tag> so multi-run / comparison workflows can name the run.

  4. Acceptance configs (optional): Add something like config/acceptance/summarization/acceptance_planet_money_hybrid_ollama_<model>.yaml (mirror an existing hybrid or full-Ollama YAML). Update the table in config/acceptance/README.md if the config should be discoverable. Do not add Ollama-heavy stems to config/acceptance/FAST_CONFIGS.txt unless you want PR CI to run them.

  5. Tests (optional): Extra integration tests are only warranted if you need to pin behavior for that tag; otherwise existing Ollama tests plus config strings usually suffice.

Troubleshooting

Issue: ollama list Hangs or Times Out

Symptom: Command hangs with no output.

Cause: Ollama server is not running.

Solution:

# 1. Start Ollama server
ollama serve

# 2. Keep that terminal open, then in another terminal:
ollama list

# 3. Verify server is responding
curl http://localhost:11434/api/tags

Issue: "Ollama server is not running"

Symptom: Error message when trying to use Ollama provider.

Cause: Ollama server is not accessible at http://localhost:11434.

Solution:

# Check if server is running
curl http://localhost:11434/api/tags

# If connection fails, start server:
ollama serve

# If using custom host/port, set environment variable:
export OLLAMA_API_BASE=http://your-host:11434/v1

Issue: "Model 'llama3.3:latest' is not available"

Symptom: Error message about model not found.

Cause: Model hasn't been pulled yet.

Solution:

# Pull the required model
ollama pull llama3.3:latest

# Verify it's available
ollama list

# Test the model directly
ollama run llama3.3:latest "Hello, test"

Issue: HTTP 500 from Ollama / experiment seems stuck

Symptom: Logs or a proxy show 500 from POST .../v1/chat/completions, or a summarization experiment sits on one episode for a long time before failing.

Typical causes: Model out of memory (especially large quants on limited RAM), context length too large for the loaded model, or a transient Ollama bug. The OpenAI-compatible client used to retry 5xx errors; for local Ollama, HTTP 500 is no longer retried (fail fast). Timeouts can still wait up to ollama_timeout seconds (default 120) per attempt.

What to try:

  1. Check Ollama logs (macOS example): cat ~/.ollama/logs/server.log or run ollama serve in a terminal and watch stderr.
  2. Run a tiny prompt: ollama run <your-tag> "Say hello in one sentence."
  3. Use a smaller quant / tag, close other GPU-heavy apps, or set a longer read timeout for eval runs only: EXPERIMENT_OLLAMA_READ_TIMEOUT=600 make experiment-run CONFIG=...
  4. For persistent 500s on one model only, try another tag from ollama.com/library or reduce transcript size (preprocessing / chunking).
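
The fail-fast-on-500 behavior described above can be sketched with a tiny retry wrapper. This is illustrative only, not the repo's actual client code: timeouts are retried up to a limit, while a server error is surfaced immediately.

```python
import time

class ServerError(Exception):
    """Stand-in for an HTTP 5xx response from local Ollama."""

def call_with_retries(request, attempts: int = 3, backoff: float = 0.0):
    """Retry transient timeouts, but fail fast on server errors."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ServerError:
            raise  # HTTP 500 from local Ollama: no retry
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts
            time.sleep(backoff)

# A request that times out twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return "ok"

print(call_with_retries(flaky))  # ok
```

Each timed-out attempt can still wait up to the configured read timeout, which is why a stuck episode takes minutes rather than seconds to fail.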

Issue: Connection Refused on Custom Port

Symptom: Error connecting to Ollama on non-default port.

Cause: Ollama is running on default port (11434) but config specifies different port.

Solution:

# Check what port Ollama is actually using
ps aux | grep ollama

# Or check Ollama config
cat ~/.ollama/config.json  # If exists

# Update config to match actual port:
# ollama_api_base: http://localhost:11434/v1  # Default

Issue: Slow Performance

Symptom: Ollama inference is very slow.

Causes & Solutions:

  1. Model too large for hardware:

# Use smaller model (best for limited RAM)
ollama pull llama3.1:8b
# Update config: ollama_speaker_model: llama3.1:8b
# OR for slightly better quality with more RAM:
# ollama pull llama3.2:latest

  2. Insufficient RAM:

# Check available RAM
free -h  # Linux
vm_stat  # macOS

# Use smaller model or add more RAM

  3. CPU-only inference (no GPU):

  • Ollama falls back to CPU when no GPU is available
  • Consider GPU-accelerated hardware if available
  • Increase the timeout: ollama_timeout: 600

  4. Multiple concurrent requests:

  • Ollama processes requests sequentially by default
  • This is expected behavior for local inference

Issue: Process Won't Die

Symptom: Can't kill Ollama process.

Solution:

# Kill by name
pkill ollama

# Or force kill
killall ollama

# Or find and kill manually
ps aux | grep ollama
kill <PID>

# If running as service (macOS)
brew services stop ollama

Issue: Port Already in Use

Symptom: Error starting Ollama server - port 11434 already in use.

Solution:

# Find what's using the port
lsof -i :11434

# Kill the process
kill <PID>

# Or use different port (requires Ollama config change)
# Not recommended - better to kill existing process

Issue: Permission Denied

Symptom: Permission errors when starting Ollama.

Solution:

# macOS: Grant network permissions in System Settings
# Settings > Privacy & Security > Network Extensions

# Linux: Check user permissions
# Ollama should run as your user, not root

Testing with Real Models

E2E Tests with Real Ollama

# 1. Ensure Ollama server is running
ollama serve  # In separate terminal

# 2. Pull required models
ollama pull llama3.3:latest

# 3. Run E2E tests with real Ollama
USE_REAL_OLLAMA_API=1 \
pytest tests/e2e/test_ollama_provider_integration_e2e.py -v

# 4. Test specific capability
USE_REAL_OLLAMA_API=1 \
pytest tests/e2e/test_ollama_provider_integration_e2e.py::TestOllamaProviderE2E::test_ollama_speaker_detection_in_workflow -v

Test with Real RSS Feed

USE_REAL_OLLAMA_API=1 \
LLM_TEST_RSS_FEED="https://feeds.npr.org/510289/podcast.xml" \
pytest tests/e2e/test_ollama_provider_integration_e2e.py::TestOllamaProviderE2E::test_ollama_all_providers_in_pipeline -v

Test Multiple Episodes

USE_REAL_OLLAMA_API=1 \
LLM_TEST_MAX_EPISODES=3 \
pytest tests/e2e/test_ollama_provider_integration_e2e.py::TestOllamaProviderE2E::test_ollama_all_providers_in_pipeline -v

Performance Tips

Model Selection

By Use Case:

  • Structured JSON / GIL Extraction: Use qwen2.5:7b (best-in-class JSON output)
  • Speaker Detection (Default): Use llama3.1:8b (strong all-rounder)
  • Fast Speaker Detection: Use mistral:7b (fastest inference)
  • Summarization (Balanced): Use gemma2:9b (balanced quality/speed)
  • Development/Testing: Use phi3:mini (lightweight, lowest RAM)

By RAM Available:

  • 4GB+ RAM: Use phi3:mini (lightweight, dev/test only)
  • 8GB+ RAM: Use qwen2.5:7b, llama3.1:8b, or mistral:7b (recommended)
  • 12GB+ RAM: Use gemma2:9b (balanced quality/speed)
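
That ladder can be encoded as a small lookup. The model choices are taken from the bullets above; the helper name and thresholds are an illustrative sketch, not part of podcast_scraper:

```python
def default_model_for_ram(ram_gb: int) -> str:
    """Pick a reasonable default Ollama tag for the available RAM."""
    if ram_gb >= 12:
        return "gemma2:9b"   # balanced quality/speed
    if ram_gb >= 8:
        return "qwen2.5:7b"  # recommended default
    return "phi3:mini"       # lightweight, dev/test only

print(default_model_for_ram(16))  # gemma2:9b
print(default_model_for_ram(8))   # qwen2.5:7b
print(default_model_for_ram(4))   # phi3:mini
```
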

Default Recommendations:

  • Speaker Detection: llama3.1:8b (general purpose, good quality)
  • Summarization: qwen2.5:7b (best structured JSON, ideal for GIL extraction)
  • Development: phi3:mini (lightweight, fast iteration)

Timeout Configuration

# For fast models (llama3.2)
ollama_timeout: 120  # 2 minutes

# For medium models (llama3.3)
ollama_timeout: 300  # 5 minutes

# For large models (70b)
ollama_timeout: 600  # 10 minutes

Hardware Recommendations

| Model | Size | Minimum RAM | Recommended RAM | Speed | Best For |
|---|---|---|---|---|---|
| Phi-3 Mini | 2.3GB | 4GB | 4GB+ | Fast | Dev/test, low-resource |
| Mistral 7B | 4.1GB | 8GB | 8GB+ | Fastest | Fast speaker detection |
| Qwen 2.5 7B | 4.4GB | 8GB | 8GB+ | Medium | Structured JSON, GIL extraction |
| Llama 3.1 8B | 4.7GB | 8GB | 8GB+ | Medium | General purpose (default) |
| Gemma 2 9B | 5.5GB | 12GB | 12GB+ | Medium | Balanced quality/speed |

Common Workflows

Development/Testing Setup

# 1. Start Ollama
ollama serve

# 2. Pull model (choose based on your RAM and use case)
ollama pull phi3:mini        # Lightweight (4GB+ RAM, dev/test)
ollama pull mistral:7b       # Fast (8GB+ RAM, fast speaker detection)
ollama pull mistral-nemo:12b # Mistral Nemo 12B (16GB+ RAM typical)
ollama pull mistral-small3.2 # Mistral Small 3.2 (see library for exact size)
ollama pull llama3.1:8b      # General purpose (8GB+ RAM, default)
ollama pull qwen2.5:7b       # Best JSON (8GB+ RAM, GIL extraction)
ollama pull gemma2:9b        # Balanced (12GB+ RAM, summarization)

# 3. Configure (model-specific prompts are automatically selected)
# config.yaml:
#   ollama_speaker_model: llama3.1:8b  # or mistral:7b for speed
#   ollama_summary_model: qwen2.5:7b    # or gemma2:9b for quality
#   ollama_timeout: 120

Production Setup

# 1. Start Ollama as service
brew services start ollama  # macOS
# Or: systemctl start ollama  # Linux

# 2. Pull production model
ollama pull llama3.3:latest

# 3. Configure for quality
# config.yaml:
#   ollama_speaker_model: llama3.3:latest
#   ollama_summary_model: llama3.3:latest
#   ollama_timeout: 300

Hybrid Setup (Ollama + Other Providers)

# Use Ollama for privacy-sensitive operations
# Use other providers for speed/cost optimization

transcription_provider: ollama       # Local Ollama transcription (privacy)
speaker_detector_provider: ollama    # Local speaker detection (privacy)
summary_provider: ollama             # Local summarization (privacy)

# Or mix with other providers:
# transcription_provider: whisper    # Local Whisper (alternative)
# speaker_detector_provider: ollama  # Local speaker detection
# summary_provider: openai           # Cloud summarization (quality)

Verification

Quick Health Check

# 1. Check server is running
curl http://localhost:11434/api/tags

# 2. List available models
ollama list

# 3. Test model directly
ollama run llama3.3:latest "What is a podcast?"

# 4. Test with podcast_scraper
podcast-scraper --rss https://example.com/feed.xml \
  --transcription-provider ollama \
  --speaker-detector-provider ollama \
  --summary-provider ollama \
  --max-episodes 1

Verify Provider is Working

# Run with verbose logging
podcast-scraper --rss https://example.com/feed.xml \
  --transcription-provider ollama \
  --speaker-detector-provider ollama \
  --summary-provider ollama \
  --log-level DEBUG \
  --max-episodes 1

# Check metadata files for provider information
cat output/*/metadata/*.json | jq '.processing.config_snapshot.ml_providers'
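
The same check works without jq. A Python sketch that assumes the metadata path used in the jq filter above (processing.config_snapshot.ml_providers); the helper names are illustrative:

```python
import json
from pathlib import Path

def ml_providers(metadata: dict) -> dict:
    """Pull the provider snapshot out of one episode's metadata dict."""
    return metadata["processing"]["config_snapshot"]["ml_providers"]

def scan(output_dir: str = "output") -> None:
    """Print the provider snapshot for every metadata file under output/."""
    for path in Path(output_dir).glob("*/metadata/*.json"):
        print(path, ml_providers(json.loads(path.read_text())))

# The extraction works on an in-memory example too:
sample = {"processing": {"config_snapshot": {"ml_providers": {"summary": "ollama"}}}}
print(ml_providers(sample))  # {'summary': 'ollama'}
```
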

Additional Resources