ADR-078: GIL Evidence Stack Bundling — Per-Provider Champion Modes¶
- Status: Accepted
- Date: 2026-05-03
- Authors: Podcast Scraper Team
- Related Issues: #698 (GIL evidence stack bundling)
- Related PRs: #711 (implementation + matrix results)
- See Also: ADR-073 (autoresearch closure), ADR-077 (Ollama model selection)
Context & Problem Statement¶
The Grounded Insight List (GIL) evidence stack at
src/podcast_scraper/gi/pipeline.py issued ~76 sequential LLM calls per episode
in its baseline staged shape:
- ~14
extract_quotescalls (one per insight) - ~62
score_entailment(NLI) calls (one per insight × candidate pair)
That call shape dominated end-to-end latency on cloud providers and made the stage operationally infeasible on local Ollama where per-call overhead (~20-30s cold + queue) compounds to 25-40 minutes per episode.
Issue #698 introduces two bundling layers to compress those calls:
- Layer A (
extract_quotes_bundled): one LLM call returns{insight_id: [quotes]}for all insights in the episode. - Layer B (
score_entailment_bundled): one LLM call (chunked at 15 pairs) returns{pair_id: {label, score}}for all (insight, candidate) pairs in the episode.
The decision was: which combination — staged / Layer A only / Layer B only / both bundled — should be the champion per provider, and at what tradeoff between coverage (vs Sonnet-4.6 silver) and call-count reduction?
Decision¶
Per-provider champion modes for GIL evidence (gil_evidence_quote_mode ×
gil_evidence_nli_mode):
| Provider | Model | Champion | Coverage | Calls cut |
|---|---|---|---|---|
| Gemini | 2.5-flash-lite | bundled_ab | 82% (+7pp) | 92% |
| Anthropic | claude-haiku-4-5 | bundled_ab | 82.5% (+5pp) | 92% |
| OpenAI | gpt-4o-mini | bundled_ab | 72.5% (-5pp tradeoff) | 92% |
| DeepSeek | deepseek-chat | bundled_ab | 78% (+6pp) | 92% |
| Grok | grok-3-fast | bundled_ab | 87.5% (-2.5pp) | 92% |
| Mistral | mistral-small-latest | bundled_b_only | 80% (+10pp) | 75% |
| Ollama | qwen3.5:9b | bundled_ab | 72% | 92% |
| Ollama | mistral-small3.2 (24B) | bundled_ab | 70% | 92% |
| Ollama | llama3.2:3b | bundled_ab (fragile) | 62% | 92% |
Default for new providers / unmeasured combinations: bundled_ab with
the shared prompts in src/podcast_scraper/providers/common/bundled_prompts.py.
Mistral is the one provider where Layer A regresses coverage; it gets a
provider-specific override to bundled_b_only.
For local Ollama, only bundled_ab is operationally viable —
the staged baseline is not representative of how operators would deploy
the pipeline on local models, so we did not measure staged Ollama as a
"baseline." Ollama llama3.2:3b is documented as fragile (1+ JSON parse
fallback observed at chunk_size=15); operators preferring a small local
model should pin chunk_size=8 or stay on qwen3.5:9b.
Rationale¶
Cost-aware champion selection. Initial reading of the matrix
(coverage-only) would have picked bundled_a_only for OpenAI (77.5% cov)
over bundled_ab (72.5%). Operator pushed back: more calls = more rate
limit pressure + higher cost + higher latency under contention. Champion
selection was redone treating call-count as a primary axis, accepting up
to 5pp coverage drop in exchange for 92% call cut on cloud providers.
This re-derivation is recorded in autoresearch/gil_evidence_bundling/results.tsv.
Mistral exception. Layer A (bundled extract) on Mistral regressed
coverage from 70% (staged) → 67.5% (bundled_ab). Layer B alone held
+10pp. Hypothesis: Mistral's response shape on the bundled extract prompt
under-extracts when asked for "3-5 quotes per insight" across many
insights at once. The provider-specific override (Layer B bundled, Layer
A staged) is a 75% call cut without coverage regression — strictly better
than rolling back both layers.
Ollama treatment. Per-call overhead on local Ollama makes the staged
shape (76 calls × 20-30s ≈ 30-40min/ep) operationally untenable. The
4-cell matrix used for cloud providers does not map; the bundled shape is
the only one that fits within an operator's actual run budget. We
measured 3 representative local models (3B, 9B, 24B) on bundled_ab
only and documented that staged is not measured.
num_ctx requirement. The bundled prompts cross Ollama's default 2048
context window. Bundled methods on OllamaProvider pass num_ctx=32k
explicitly via _ollama_openai_chat_extra_kwargs. Without that,
mistral-small3.2 silently truncated and timed out. The fix is in
src/podcast_scraper/providers/ollama/ollama_provider.py.
Alternatives Considered¶
-
Per-provider Jinja templates from day one (Path B). Would have produced higher quality on each provider but ballooned implementation surface to 14+ Jinja files maintained per provider. Rejected: shared-prompts result is already at or above silver baseline coverage for 5 of 7 providers; per-provider tuning (Track A) is reserved as follow-up only for providers that fail Phase A gates.
-
Skip Layer A entirely (only ship
bundled_b_only). Layer B alone accounts for the majority of the call cut (75% of total) since NLI calls outnumber extract calls 4:1. Rejected: Layer A delivers an additional 17% call cut and on most providers does not regress coverage (Mistral excepted). Skipping it would leave easy savings on the table. -
Promote
bundled_abas the universal default with no per-provider exceptions. Rejected: Mistral measurably regresses onbundled_ab(67.5% cov, -2.5pp). Layer A regression is a real provider-specific behavior, not noise. -
Run a 4-cell matrix on Ollama too. Rejected: staged Ollama is not representative deployment. We have staged-vs-bundled deltas from 5 cloud providers; Ollama's job in the matrix was to confirm bundled parses cleanly and produces reasonable insights on local models.
Consequences¶
-
Positive: GIL evidence stage is now 8x cheaper in LLM calls on cloud providers (76 → ~6 calls/ep) and operationally viable on local Ollama for the first time. Coverage is preserved or improved on 6 of 7 providers vs the staged baseline.
-
Positive: Bundled fallback path remains in place — if the bundled prompt returns invalid JSON or under-extracts, the system falls back silently to staged. This was observed on llama3.2:3b (3B model is fragile on JSON output) without breaking the run.
-
Negative: Per-provider override surface (Mistral) introduces a small dispatch branch in
gi/pipeline.py. If a future provider also shows Layer A regression, the override list grows. Mitigated by keeping the override mechanism config-driven (per-experimentgil_evidence_*_modeparams) — no provider-class subtyping. -
Neutral: Default profile flips landed in PR #711 alongside the implementation. Profiles affected:
cloud_balanced,cloud_thin,cloud_quality(Gemini / Anthropic), andlocal(Ollama). The airgapped profiles (airgapped,airgapped_thin) use local CrossEncoder for evidence (quote_extraction_provider: transformers) and are unaffected by this flip. The CLI--configpath was also patched to forwardgil_evidence_*_modethrough_build_config's payload — without that the YAML override would have been silently dropped (same #646-class bug).
Implementation Notes¶
-
Module:
src/podcast_scraper/providers/common/bundled_prompts.py(shared system + user prompt builders),src/podcast_scraper/providers/<provider>/<provider>_provider.py(extract_quotes_bundled,score_entailment_bundledper provider),src/podcast_scraper/gi/pipeline.py(mode dispatch + fallback logic). -
Config:
gil_evidence_quote_mode∈ {staged,bundled},gil_evidence_nli_mode∈ {staged,bundled},gil_evidence_nli_chunk_size(default 15). Wired throughmerge_eval_task_into_summarizer_configso experiment YAMLs can override per-cell. -
Pattern: Provider-agnostic dispatch on the bundled methods (any provider exposing
extract_quotes_bundledautomatically participates; no provider-type switching in pipeline code). -
Telemetry:
llm_gi_extract_quotes_*+llm_gi_score_entailment_*substage cost breakdown per episode (committed in #698 phase 1).bundled_fallback_rateis computed at run end and gated at ≤20%.
References¶
- #698 — GIL evidence stack bundling
- PR #711 — implementation + matrix results
- Eval report — EVAL_GIL_BUNDLING_2026_05.md
- autoresearch/gil_evidence_bundling/ — full results.tsv, scaffolds, and per-cell experiment YAMLs
- ADR-077 — Local Ollama model selection
- RFC-073 — Autoresearch v2 framework