Skip to content

ADR-078: GIL Evidence Stack Bundling — Per-Provider Champion Modes

  • Status: Accepted
  • Date: 2026-05-03
  • Authors: Podcast Scraper Team
  • Related Issues: #698 (GIL evidence stack bundling)
  • Related PRs: #711 (implementation + matrix results)
  • See Also: ADR-073 (autoresearch closure), ADR-077 (Ollama model selection)

Context & Problem Statement

The Grounded Insight List (GIL) evidence stack at src/podcast_scraper/gi/pipeline.py issued ~76 sequential LLM calls per episode in its baseline staged shape:

  • ~14 extract_quotes calls (one per insight)
  • ~62 score_entailment (NLI) calls (one per insight × candidate pair)

That call shape dominated end-to-end latency on cloud providers and made the stage operationally infeasible on local Ollama where per-call overhead (~20-30s cold + queue) compounds to 25-40 minutes per episode.

Issue #698 introduces two bundling layers to compress those calls:

  • Layer A (extract_quotes_bundled): one LLM call returns {insight_id: [quotes]} for all insights in the episode.
  • Layer B (score_entailment_bundled): one LLM call (chunked at 15 pairs) returns {pair_id: {label, score}} for all (insight, candidate) pairs in the episode.

The decision was: which combination — staged / Layer A only / Layer B only / both bundled — should be the champion per provider, and at what tradeoff between coverage (vs Sonnet-4.6 silver) and call-count reduction?

Decision

Per-provider champion modes for GIL evidence (gil_evidence_quote_mode × gil_evidence_nli_mode):

Provider Model Champion Coverage Calls cut
Gemini 2.5-flash-lite bundled_ab 82% (+7pp) 92%
Anthropic claude-haiku-4-5 bundled_ab 82.5% (+5pp) 92%
OpenAI gpt-4o-mini bundled_ab 72.5% (-5pp tradeoff) 92%
DeepSeek deepseek-chat bundled_ab 78% (+6pp) 92%
Grok grok-3-fast bundled_ab 87.5% (-2.5pp) 92%
Mistral mistral-small-latest bundled_b_only 80% (+10pp) 75%
Ollama qwen3.5:9b bundled_ab 72% 92%
Ollama mistral-small3.2 (24B) bundled_ab 70% 92%
Ollama llama3.2:3b bundled_ab (fragile) 62% 92%

Default for new providers / unmeasured combinations: bundled_ab with the shared prompts in src/podcast_scraper/providers/common/bundled_prompts.py. Mistral is the one provider where Layer A regresses coverage; it gets a provider-specific override to bundled_b_only.

For local Ollama, only bundled_ab is operationally viable — the staged baseline is not representative of how operators would deploy the pipeline on local models, so we did not measure staged Ollama as a "baseline." Ollama llama3.2:3b is documented as fragile (1+ JSON parse fallback observed at chunk_size=15); operators preferring a small local model should pin chunk_size=8 or stay on qwen3.5:9b.

Rationale

Cost-aware champion selection. Initial reading of the matrix (coverage-only) would have picked bundled_a_only for OpenAI (77.5% cov) over bundled_ab (72.5%). Operator pushed back: more calls = more rate limit pressure + higher cost + higher latency under contention. Champion selection was redone treating call-count as a primary axis, accepting up to 5pp coverage drop in exchange for 92% call cut on cloud providers. This re-derivation is recorded in autoresearch/gil_evidence_bundling/results.tsv.

Mistral exception. Layer A (bundled extract) on Mistral regressed coverage from 70% (staged) → 67.5% (bundled_ab). Layer B alone held +10pp. Hypothesis: Mistral's response shape on the bundled extract prompt under-extracts when asked for "3-5 quotes per insight" across many insights at once. The provider-specific override (Layer B bundled, Layer A staged) is a 75% call cut without coverage regression — strictly better than rolling back both layers.

Ollama treatment. Per-call overhead on local Ollama makes the staged shape (76 calls × 20-30s ≈ 30-40min/ep) operationally untenable. The 4-cell matrix used for cloud providers does not map; the bundled shape is the only one that fits within an operator's actual run budget. We measured 3 representative local models (3B, 9B, 24B) on bundled_ab only and documented that staged is not measured.

num_ctx requirement. The bundled prompts cross Ollama's default 2048 context window. Bundled methods on OllamaProvider pass num_ctx=32k explicitly via _ollama_openai_chat_extra_kwargs. Without that, mistral-small3.2 silently truncated and timed out. The fix is in src/podcast_scraper/providers/ollama/ollama_provider.py.

Alternatives Considered

  1. Per-provider Jinja templates from day one (Path B). Would have produced higher quality on each provider but ballooned implementation surface to 14+ Jinja files maintained per provider. Rejected: shared-prompts result is already at or above silver baseline coverage for 5 of 7 providers; per-provider tuning (Track A) is reserved as follow-up only for providers that fail Phase A gates.

  2. Skip Layer A entirely (only ship bundled_b_only). Layer B alone accounts for the majority of the call cut (75% of total) since NLI calls outnumber extract calls 4:1. Rejected: Layer A delivers an additional 17% call cut and on most providers does not regress coverage (Mistral excepted). Skipping it would leave easy savings on the table.

  3. Promote bundled_ab as the universal default with no per-provider exceptions. Rejected: Mistral measurably regresses on bundled_ab (67.5% cov, -2.5pp). Layer A regression is a real provider-specific behavior, not noise.

  4. Run a 4-cell matrix on Ollama too. Rejected: staged Ollama is not representative deployment. We have staged-vs-bundled deltas from 5 cloud providers; Ollama's job in the matrix was to confirm bundled parses cleanly and produces reasonable insights on local models.

Consequences

  • Positive: GIL evidence stage is now 8x cheaper in LLM calls on cloud providers (76 → ~6 calls/ep) and operationally viable on local Ollama for the first time. Coverage is preserved or improved on 6 of 7 providers vs the staged baseline.

  • Positive: Bundled fallback path remains in place — if the bundled prompt returns invalid JSON or under-extracts, the system falls back silently to staged. This was observed on llama3.2:3b (3B model is fragile on JSON output) without breaking the run.

  • Negative: Per-provider override surface (Mistral) introduces a small dispatch branch in gi/pipeline.py. If a future provider also shows Layer A regression, the override list grows. Mitigated by keeping the override mechanism config-driven (per-experiment gil_evidence_*_mode params) — no provider-class subtyping.

  • Neutral: Default profile flips landed in PR #711 alongside the implementation. Profiles affected: cloud_balanced, cloud_thin, cloud_quality (Gemini / Anthropic), and local (Ollama). The airgapped profiles (airgapped, airgapped_thin) use local CrossEncoder for evidence (quote_extraction_provider: transformers) and are unaffected by this flip. The CLI --config path was also patched to forward gil_evidence_*_mode through _build_config's payload — without that the YAML override would have been silently dropped (same #646-class bug).

Implementation Notes

  • Module: src/podcast_scraper/providers/common/bundled_prompts.py (shared system + user prompt builders), src/podcast_scraper/providers/<provider>/<provider>_provider.py (extract_quotes_bundled, score_entailment_bundled per provider), src/podcast_scraper/gi/pipeline.py (mode dispatch + fallback logic).

  • Config: gil_evidence_quote_mode ∈ {staged, bundled}, gil_evidence_nli_mode ∈ {staged, bundled}, gil_evidence_nli_chunk_size (default 15). Wired through merge_eval_task_into_summarizer_config so experiment YAMLs can override per-cell.

  • Pattern: Provider-agnostic dispatch on the bundled methods (any provider exposing extract_quotes_bundled automatically participates; no provider-type switching in pipeline code).

  • Telemetry: llm_gi_extract_quotes_* + llm_gi_score_entailment_* substage cost breakdown per episode (committed in #698 phase 1). bundled_fallback_rate is computed at run end and gated at ≤20%.

References