GIL Evidence Stack Bundling — Cross-Provider Matrix (2026-05)¶

Snapshot: 2026-05-03 · 7 providers (Gemini + 5 cloud + 3 Ollama models) · curated_5feeds_benchmark_v2 · vs silver_sonnet46_gi_multiquote_benchmark_v2 · #698 / PR #711.

This report measures the impact of bundling the two LLM-bound substages in the Grounded Insight List (GIL) evidence stack:

Layer A (extract_quotes_bundled): all insights → 1 LLM call.
Layer B (score_entailment_bundled): all (insight, candidate) NLI pairs → chunked LLM call (chunk_size=15).

Baseline staged shape issues ~76 sequential LLM calls per episode (~14 extract + ~62 NLI). Question: how much of that can we cut without regressing coverage vs Sonnet-4.6 silver?

Methodology¶

Dataset: curated_5feeds_benchmark_v2 (5 episodes, 5 feeds).
Silver: silver_sonnet46_gi_multiquote_benchmark_v2 (Sonnet 4.6, 40 silver insights total).
Scorer: autoresearch/gil_evidence_bundling/eval/score_gi_vs_silver.py (embedding cosine via all-MiniLM-L6-v2; threshold 0.55 = covered). This is a GI-specific scorer mirroring the canonical score_gi_insight_coverage.py but reading insights from output.gil.nodes[type=Insight] instead of summary bullets.
Cells per cloud provider: 4 (staged baseline / Layer A only / Layer B only / both bundled). Cells per Ollama model: 1 (bundled_ab only — staged is operationally unviable on local CPU; see "Ollama treatment" below).
Champion gates: silver coverage ≥ 70%, fallback rate ≤ 20%, ≥ 30% cost reduction vs staged.

Cross-provider results¶

Cloud providers¶

Provider	Model	Staged cov	bundled_a	bundled_b	bundled_ab	Champion	Calls cut
Gemini	2.5-flash-lite	75%	80%	82%	82%	bundled_ab	92%
Anthropic	claude-haiku-4-5	77.5%	82.5%	80%	82.5%	bundled_ab	92%
OpenAI	gpt-4o-mini	72%	77.5%	72.5%	72.5%	bundled_ab	92%
DeepSeek	deepseek-chat	72%	75%	78%	78%	bundled_ab	92%
Grok	grok-3-fast	90%	87.5%	87.5%	87.5%	bundled_ab	92%
Mistral	mistral-small-latest	70%	70%	80%	67.5%	bundled_b_only	75%

Latency (all cells):

Provider	Staged	bundled_ab	Δ
Gemini	95.6s	36.0s	-62%
Anthropic	42.9s	17.9s	-58%
DeepSeek	51.3s	32.0s	-38%
OpenAI	38.8s	28.1s	-28%
Grok	41.6s	34.4s	-17%
Mistral	35.3s	28.1s (bundled_b 28.1s)	-20%

Ollama (local)¶

Bundled-only matrix; staged not measured (per-call overhead × 76 calls = 30-40 min/ep, not a representative deployment).

Model	Params	Coverage	Grounded	Latency/ep	Notes
qwen3.5:9b	9B	72%	72%	122s	Ollama champion; clean run, no fallbacks
mistral-small3.2	24B	70%	70%	852s	Comparable quality; 7x slower than qwen on local CPU
llama3.2:3b	3B	62%	45%	60s	Below 70% gate; 1+ JSON parse fallback observed (3B model fragile on bundled prompts at chunk_size=15)

Ollama-specific finding: bundled prompts cross Ollama's default 2048 context window. Bundled methods on OllamaProvider pass num_ctx=32k explicitly via _ollama_openai_chat_extra_kwargs. Without that fix, mistral-small3.2 silently truncated input and timed out on retries.

Champion call per provider¶

bundled_ab is the default champion (5 of 7 providers). Mistral is the exception: Layer A regresses Mistral's coverage from 70% → 67.5%, so Mistral pins to bundled_b_only (75% call cut, +10pp coverage). Llama3.2:3b on Ollama passes the operational viability test but falls below the 70% silver coverage gate and is documented as "fragile" (use chunk_size=8 or pin to qwen3.5:9b for higher quality).

OpenAI is a notable case study: pure-quality champion would be bundled_a_only at 77.5% coverage vs bundled_ab at 72.5%. Operator explicitly chose bundled_ab accepting the 5pp drop in exchange for the extra call cut and rate-limit-pressure relief. This decision is recorded in ADR-078 and the per-cell rationale is in autoresearch/gil_evidence_bundling/results.tsv.

Surprises¶

bundled_ab improved coverage on most providers, not just held it. Hypothesis: the bundled extract prompt asking for 3-5 distinct quotes per insight surfaces more candidate evidence than the staged one-quote-per-call prompt did. The +7pp coverage boost on Gemini was the first signal; held up on Anthropic (+5pp), DeepSeek (+6pp).
Mistral regression on Layer A is provider-specific, not noise. Bundled extract on Mistral under-extracts when asked for many insights at once. Layer B alone (NLI bundling) preserves the gain. Track A (provider-specific Jinja prompts) is reserved as follow-up if needed; the bundled_b_only config-level override sidesteps it for now.
Ollama on local CPU works. With num_ctx=32k the bundled stage completes cleanly on 9B and 24B models within minutes, not hours. This is the first time the GIL evidence stage is operationally feasible on local Ollama.

Reproducing¶

Per-cell experiment YAMLs are in autoresearch/gil_evidence_bundling/experiments/. Run + score pattern:

PYTHONPATH=. .venv/bin/python scripts/eval/experiment/run_experiment.py \
    autoresearch/gil_evidence_bundling/experiments/<cell>_<provider>.yaml \
    --reference silver_sonnet46_gi_multiquote_benchmark_v2 \
    --force --log-level INFO

PYTHONPATH=. .venv/bin/python autoresearch/gil_evidence_bundling/eval/score_gi_vs_silver.py \
    --run-id gil_bundling_<cell>_<provider>_v1 \
    --silver silver_sonnet46_gi_multiquote_benchmark_v2 \
    --dataset curated_5feeds_benchmark_v2

Full results log: autoresearch/gil_evidence_bundling/results.tsv (28 rows: 4 Gemini cells + 4×5=20 cloud-provider cells + 3 Ollama cells).

Decision¶

See ADR-078. Profile defaults flipped in PR #711 (cloud_balanced, cloud_thin, cloud_quality, local); airgapped profiles use local CrossEncoder and are unaffected. CLI --config path also patched to forward the new fields (_gil_tuning_keys in cli.py).