ADR-077: Local Ollama Model Selection — Core 5 from 11

Status: Accepted
Date: 2026-04-18
Context: Autoresearch v2 eval (PR #568) + pipeline validation (#591)


Decision

Standardize on 5 Ollama models (from 11 previously tested) for regular sweeps, pipeline validation, and optimization work. One per model family plus one large-model reference.

Core 5

| Model | Family | Size | v2 Bullets | Role |
|---|---|---|---|---|
| qwen3.5:9b | Qwen 3.5 | 9B | 0.580 | Champion: best local quality, default pick |
| llama3.1:8b | Llama | 8B | 0.518 | Llama family: full pipeline at 8B (llama3.2:3b demoted: KG unreliable at 3B) |
| mistral:7b | Mistral | 7B | 0.526 | Mid-tier: best non-Qwen, balanced |
| gemma2:9b | Gemma | 9B | 0.492 | Diversity: Google architecture |
| qwen3.5:35b | Qwen 3.5 | 35B | 0.576 | Scale reference: "does bigger help?" |

Dropped (6 models)

| Model | Size | v2 Bullets | Why dropped |
|---|---|---|---|
| qwen3.5:27b | 27B | 0.543 | Worse than 9b at 15× latency. Same family as champion. |
| qwen2.5:7b | 7B | 0.477 | Previous generation; superseded by qwen3.5:9b on every metric. |
| llama3.2:3b | 3B | see full matrix | Demoted after #591: KG entity extraction returned 0 entities on 2/5 episodes. Replaced in Core 5 by llama3.1:8b (0.518), which had previously been dropped for a marginal gain (+3%) with paragraph contested 5/5. |
| mistral-nemo:12b | 12B | 0.497 | Worse than mistral:7b (-6%) despite 2× size. |
| mistral-small3.2 | 22B | 0.536 | Only +1% bullets over mistral:7b; paragraph contested; 79s/ep. |
| phi3:mini | 3.8B | 0.475 | 4k context variant silently truncates our ~9k prompts. Structurally unsuitable. |
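
The phi3:mini failure mode (a 4k context silently truncating ~9k prompts) is the kind of problem a pre-flight check could catch before a sweep starts. The following is a minimal sketch only: `estimate_tokens` and `fits_context` are hypothetical helpers, the 4-characters-per-token ratio is a rough assumption rather than a real tokenizer, and the context sizes are passed in by hand rather than read from Ollama.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the chars/token ratio is an assumption, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context(prompt: str, context_window: int, reserve_for_output: int = 512) -> bool:
    """Return True if the prompt plus an output reserve fits in the context window."""
    return estimate_tokens(prompt) + reserve_for_output <= context_window


prompt = "x" * 36_000  # stands in for a ~9k-token prompt under the heuristic above
print(fits_context(prompt, context_window=4_096))   # False
print(fits_context(prompt, context_window=32_768))  # True
```

A check like this would turn silent truncation into an explicit skip or failure for that model.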

Rationale

Family diversity over family depth

Testing qwen3.5 at 9b, 27b, AND 35b measures the same model family at different scales — useful once to establish "bigger ≠ better" (confirmed: 9b > 27b > 35b on paragraph), not useful to repeat on every sweep.

One model per family tests whether a finding is architecture-specific or general. If qwen3.5:9b passes pipeline validation but gemma2:9b fails, the issue is Gemma-specific. If we tested three Qwen variants and they all passed, we'd still not know about Gemma.
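
The reasoning above can be sketched as a small classifier over per-model pass/fail results. This is illustrative only: `diagnose` and the `FAMILY` mapping are hypothetical names, and the results in the usage example are invented to mirror the "Gemma fails, Qwen passes" scenario, not taken from a real run.

```python
from collections import defaultdict

# Family labels follow the Core 5 table; any results fed in are up to the caller.
FAMILY = {
    "qwen3.5:9b": "qwen",
    "llama3.1:8b": "llama",
    "mistral:7b": "mistral",
    "gemma2:9b": "gemma",
    "qwen3.5:35b": "qwen",
}


def diagnose(results: dict[str, bool]) -> str:
    """Classify per-model pass/fail results as general or family-specific."""
    by_family = defaultdict(list)
    for model, passed in results.items():
        by_family[FAMILY[model]].append(passed)
    failing = sorted(f for f, outcomes in by_family.items() if not all(outcomes))
    if not failing:
        return "all families pass"
    if len(failing) == len(by_family):
        return "general failure"
    return "family-specific: " + ", ".join(failing)


# Hypothetical scenario: only the Gemma model fails validation.
results = {model: True for model in FAMILY}
results["gemma2:9b"] = False
print(diagnose(results))  # family-specific: gemma
```

With three Qwen variants and no Gemma model in the sweep, the same function could only ever report on the Qwen family, which is the gap the Core 5 selection avoids.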

Size diversity matters for deployment decisions

The 5 models span 7B → 35B:

  • 7-9B (mistral, llama, qwen, gemma): laptop/workstation, balanced
  • 35B (qwen3.5): dedicated GPU, quality-first

The 3B edge tier (llama3.2:3b) is no longer covered: it failed KG extraction in pipeline validation (#591). The remaining span tells us whether pipeline fixes work from the smallest viable size (7-8B) up to the 35B scale reference.

Sweep time reduction

  • Before: 11 models × 3 tasks × 5 episodes = ~4 hours
  • After: 5 models × 3 tasks × 5 episodes = ~1.5 hours
  • Roughly 60% less sweep time (4 h → 1.5 h is 62.5%) while retaining one model per family plus the 35B scale reference.
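
The arithmetic behind these figures, as a quick check. The run counts follow directly from models × tasks × episodes; the hour values are the approximate wall-clock estimates quoted in the list, not measurements made here.

```python
TASKS, EPISODES = 3, 5

runs_before = 11 * TASKS * EPISODES  # 165 runs
runs_after = 5 * TASKS * EPISODES    # 75 runs

hours_before, hours_after = 4.0, 1.5  # approximate figures quoted in the list above

run_reduction = 1 - runs_after / runs_before
time_reduction = 1 - hours_after / hours_before

print(f"runs: {runs_before} -> {runs_after} ({run_reduction:.0%} fewer)")
print(f"time: {hours_before}h -> {hours_after}h ({time_reduction:.1%} less)")
```

Run count falls by about 55% while estimated time falls by 62.5%, which is consistent with some of the dropped models being slow per episode (for example qwen3.5:27b and mistral-small3.2).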

Pipeline validation findings (2026-04-18, #591)

Full pipeline validation (summary → GI → KG → bridge) on 5 held-out episodes:

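| Model | Summary | GI | Grounding | KG | Bridge |
|---|---|---|---|---|---|
| qwen3.5:9b | pass | pass | 100% | pass | pass |
| llama3.1:8b | pass | pass | pass | pass | pass |
| mistral:7b | pass | pass | 98% | pass | pass |
| gemma2:9b | pass | ⚠️ 7.8/ep | 95% | pass | pass |
| qwen3.5:35b | pass | pass | 98% | pass | pass |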

llama3.2:3b (3B) demoted: KG entity extraction fails (0 entities on 2/5 episodes). 3B params insufficient for structured JSON extraction. Replaced by llama3.1:8b (8B) which passes all stages.

gemma2:9b instruction-following gap: produces 7-8 insights when asked for 12 (avg 7.8/ep, threshold 8). Not a pipeline bug — Gemma2 is concise by nature. Grounding (95%) and KG are fine. Keep in Core 5 for architecture diversity but note the GI limitation.

Minimum viable size for full pipeline: 7-8B. All 7B+ models pass (gemma2 borderline on insight count but functional).
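
The two findings above amount to simple per-stage gates. The sketch below is an assumption about how such gates might look, not the logic in pipeline_validate.py: `gi_status` and `kg_status` are hypothetical names, the "any zero-entity episode fails" rule is inferred from the llama3.2:3b note, and the per-episode counts are invented to reproduce the reported aggregates.

```python
from statistics import mean

INSIGHT_THRESHOLD = 8  # from the gemma2:9b note: avg 7.8/ep against a threshold of 8


def gi_status(insights_per_episode: list[int]) -> str:
    """Warn when the average insight count falls below the threshold."""
    avg = mean(insights_per_episode)
    return "pass" if avg >= INSIGHT_THRESHOLD else f"warn ({avg:.1f}/ep)"


def kg_status(entities_per_episode: list[int]) -> str:
    """Fail when any episode yields zero entities (assumed rule)."""
    empty = sum(1 for n in entities_per_episode if n == 0)
    total = len(entities_per_episode)
    return "pass" if empty == 0 else f"fail ({empty}/{total} episodes empty)"


# Hypothetical per-episode counts chosen to match the reported aggregates.
print(gi_status([8, 8, 8, 8, 7]))    # warn (7.8/ep)
print(kg_status([14, 0, 11, 0, 9]))  # fail (2/5 episodes empty)
```

Treating the GI shortfall as a warning and the empty KG output as a failure matches how the two models were handled: gemma2:9b stays in Core 5, llama3.2:3b does not.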

Consequences

  • pipeline_validate.py LOCAL_PROVIDERS reduced from 11 to 5
  • --local-fast flag skips the 35b model (runs 4 models in ~1 hr)
  • Dropped models are NOT deleted from Ollama or from v2 eval configs — historical runs remain for reference. They're just excluded from future regular sweeps.
  • If a new model generation ships (e.g., Llama 4, Gemma 3), add the best variant to Core 5 and drop the old family member.

Related

  • EVAL_HELDOUT_V2_2026_04.md §Local models (full 11-model matrix)
  • AI_PROVIDER_COMPARISON_GUIDE.md §Local picks
  • #591 (pipeline validation; uses Core 5)
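
The LOCAL_PROVIDERS reduction and the --local-fast flag listed under Consequences might be wired roughly as follows. This is a sketch under assumptions: the actual shape of LOCAL_PROVIDERS in pipeline_validate.py is not shown in this ADR, and `select_models` and `parse_args` are hypothetical helpers.

```python
import argparse

# Core 5 per this ADR; a flat list of Ollama tags is an assumed representation.
LOCAL_PROVIDERS = [
    "qwen3.5:9b",
    "llama3.1:8b",
    "mistral:7b",
    "gemma2:9b",
    "qwen3.5:35b",
]


def select_models(local_fast: bool) -> list[str]:
    """Return the models to sweep; --local-fast drops the 35b scale reference."""
    if local_fast:
        return [m for m in LOCAL_PROVIDERS if not m.endswith(":35b")]
    return list(LOCAL_PROVIDERS)


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the sweep flags from an explicit argument list."""
    parser = argparse.ArgumentParser(description="Pipeline validation sweep (sketch)")
    parser.add_argument("--local-fast", action="store_true", help="skip the 35b model")
    return parser.parse_args(argv)


args = parse_args(["--local-fast"])
print(select_models(args.local_fast))
# ['qwen3.5:9b', 'llama3.1:8b', 'mistral:7b', 'gemma2:9b']
```

Keeping dropped models out of this list, rather than deleting them from Ollama, matches the intent that historical runs remain reproducible.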