ADR-072: Llama 3.2:3b as Tier 3 Local LLM¶
- Status: Accepted
- Date: 2026-04-05
- Authors: Podcast Scraper Team
- Related RFCs: RFC-057, RFC-042
- See Also: ADR-069, ADR-071
Context & Problem Statement¶
RFC-057 Track B required selecting the best Ollama model for the LLM reduce stage of the hybrid pipeline (and later, direct LLM inference). Multiple models were evaluated across the 3B–12B parameter range. The selection needed to balance ROUGE-L quality, latency, and local resource constraints (Apple Silicon MPS, no GPU).
Decision¶
Use llama3.2:3b (Llama 3.2 3B Instruct) as the Tier 3 local LLM for direct
podcast summarization.
Evidence¶
Model comparison sweep (smoke dataset, 5 episodes)¶
All models used temperature=0.5, top_p=1.0 (hybrid sweep winners) as REDUCE stage of
BART-base MAP + Ollama REDUCE pipeline.
| Model | Params | ROUGE-L | Embed | Latency | Notes |
|---|---|---|---|---|---|
| llama3.2:3b | 3B | 23.7% | 72.9% | 15.7s | Winner |
| mistral-nemo:12b | 12B | 19.4% | 71.4% | 48.2s | Worse quality at 4x latency |
| qwen2.5:7b | 7B | 20.1% | 70.8% | 28.4s | Generic text style mismatch |
| llama3.1:8b | 8B | 18.8% | 69.3% | 32.1s | Instruction-following weaker |
| mistral:7b | 7B | 17.2% | 68.1% | 25.3s | Older instruct tuning |
Direct inference (Tier 3 benchmark, 10 episodes)¶
After establishing that the MAP stage is lossy (see ADR-071), llama3.2:3b was evaluated
in direct mode with autoresearch-tuned parameters.
Config: direct_llama32_3b_autoresearch_v1.yaml
Dataset: curated_5feeds_benchmark_v1 (10 episodes)
Ref: silver_sonnet46_benchmark_v1
| Param | Hybrid sweep winner | Direct sweep winner |
|---|---|---|
| temperature | 0.5 | 0.3 |
| max_tokens / max_length | 1000 | 1000 |
| ROUGE-L | 23.7% (hybrid) | 26.4% (direct) |
| Embed | 72.9% | 79.5% |
| Latency | 15.7s/ep | 7.5s/ep |
Direct inference with temperature=0.3 outperforms the hybrid on all metrics.
Why Instruction-Following Beats Size¶
The key finding from the sweep is that 3B beats 7-12B models for this task. The mechanism is instruction-following quality, not raw capacity:
-
Task is structured extraction, not open-ended generation. Podcast summarization requires following a format contract (paragraph or bullet JSON), respecting length bounds, and honouring "no invented facts" and "use transcript vocabulary" constraints. Instruction-tuned 3B models outperform larger models with weaker instruction tuning.
-
Llama 3.2:3b is specifically optimised for edge/local deployment. Meta's training for this model emphasises instruction-following fidelity at small scale. Larger general-purpose models (Mistral-Nemo 12B, Qwen 2.5 7B) have more capacity but are not trained to the same instruction-compliance standard for constrained tasks.
-
BART chunk input is already compressed. In hybrid mode, the LLM receives pre-summarised chunk notes, not raw text. The synthesis task is bounded — larger context windows offer no advantage when the input is already short.
-
Temperature asymmetry reveals input quality difference. Hybrid optimal temp=0.5 (BART chunk noise requires more diversity); direct optimal temp=0.3 (clean transcript requires focused synthesis). This confirms the model is performing different cognitive tasks, and 3B is well-suited to both.
Temperature Findings¶
| Mode | Optimal temp | Reason |
|---|---|---|
| Hybrid (BART MAP input) | 0.5 | BART chunk compression introduces noise; LLM needs creative diversity to synthesise coherently |
| Direct (full transcript) | 0.3 | Clean, uncompressed input; focused deterministic generation wins |
Deployment Configuration¶
# config_constants.py
OLLAMA_DEFAULT_SUMMARY_MODEL = "llama3.2:3b"
# direct_llama32_3b_autoresearch_v1.yaml
params:
temperature: 0.3
max_length: 1000
min_length: 200
Consequences¶
- Positive: 3B model is fast (~7.5s/ep on Apple Silicon MPS) and fits in ~2GB RAM — practical for local deployment without dedicated GPU.
- Positive:
llama3.2:3bpull is ~2GB vs 4-8GB for 7-12B alternatives — lower barrier to entry for new deployments. - Positive: Direct mode removes the BART dependency for Tier 3, simplifying the stack.
- Neutral: Quality ceiling for local inference is ~26% ROUGE-L vs ~33% for cloud (Tier 4). The ~7pp gap is the cost of local privacy.
- Negative: If episode lengths grow beyond ~15K tokens, the 3B context window may become a constraint and the hybrid architecture would need to be revisited.
References¶
- RFC-057: AutoResearch Optimization Loop
- ADR-069: Hybrid ML Pipeline as Production Direction
- ADR-070: BART-base as Hybrid MAP Stage
- ADR-071: Four-Tier Summarization Strategy
- Sweep TSV:
autoresearch/ml_param_tuning/results/direct_llama32_3b_sweep_20260404_183918.tsv - Canonical config:
data/eval/configs/summarization/direct_llama32_3b_autoresearch_v1.yaml