ADR-073: RFC-057 Autoresearch Loop - Closure and Final State

Status: Accepted
Date: 2026-04-06
Authors: Podcast Scraper Team
Related RFCs: RFC-057
See Also: ADR-067, ADR-068, ADR-069, ADR-070, ADR-071, ADR-072

Context & Problem Statement

RFC-057 defined an autoresearch optimization loop with two parallel tracks:

- Track A (prompt tuning): edit Jinja templates, score with `make autoresearch-score`, and ratchet when the scalar composite score improves by ≥1% relative; stop early after 3 consecutive failures.
- Track B (ML inference parameter sweep): run a greedy one-parameter-at-a-time sweep on the hybrid ML pipeline, using the same ratchet rules.
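The ratchet rule shared by both tracks can be sketched as follows. This is a minimal illustration, not the RFC-057 implementation: `RatchetState` and `ratchet_step` are hypothetical names, and the sketch assumes the composite score is positive and that every rejected candidate counts as one failure.

```python
from dataclasses import dataclass


@dataclass
class RatchetState:
    """Current champion score plus the run of consecutive rejected candidates."""
    best_score: float
    consecutive_failures: int = 0


def ratchet_step(state, candidate_score, min_rel_gain=0.01, max_failures=3):
    """Apply one ratchet decision and return (accepted, should_stop).

    A candidate is accepted only if it improves on the champion by at least
    `min_rel_gain` relative; `max_failures` rejections in a row trigger early stop.
    """
    if state.best_score <= 0:
        raise ValueError("relative gain is undefined for a non-positive baseline")
    rel_gain = (candidate_score - state.best_score) / state.best_score
    if rel_gain >= min_rel_gain:
        state.best_score = candidate_score
        state.consecutive_failures = 0
        return True, False
    state.consecutive_failures += 1
    return False, state.consecutive_failures >= max_failures
```

A caller would loop over candidate edits, call `ratchet_step` after scoring each one, and break out of the loop as soon as the second element of the result is true.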

This ADR documents what was decided, what was promoted to production, and closes the RFC as complete. It also records the final state of the eval infrastructure that was built or significantly improved as part of this work.

Track B - ML Parameter Sweep (completed 2026-04-04)

What was swept

Two pipeline families were swept in sequence:

- BART+LED (pure HuggingFace): map `max_new_tokens`, reduce `max_new_tokens`, `num_beams`, `no_repeat_ngram_size`, `repetition_penalty`.
- Hybrid BART+Llama 3.2:3b (HuggingFace MAP + Ollama REDUCE): `temperature`, `top_p`, `max_tokens`, `frequency_penalty`, map `num_beams`, map `max_new_tokens`.
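A greedy one-parameter-at-a-time sweep over parameters like those listed above might look like the sketch below. `greedy_sweep` and `toy_score` are hypothetical names. In particular, `toy_score` is a made-up objective standing in for a real evaluation run, and the grid values are illustrative rather than the ones actually swept.

```python
def greedy_sweep(base_params, grid, score_fn, min_rel_gain=0.01):
    """Sweep one parameter at a time, keeping a value only if it beats the
    current best score by at least `min_rel_gain` (relative)."""
    best = dict(base_params)
    best_score = score_fn(best)
    for name, values in grid.items():
        for value in values:
            if best.get(name) == value:
                continue  # this setting is already part of the current best
            trial = {**best, name: value}
            score = score_fn(trial)
            if (score - best_score) / best_score >= min_rel_gain:
                best, best_score = trial, score
    return best, best_score


def toy_score(params):
    """Made-up objective (NOT real eval data); peaks at temperature=0.5, top_p=1.0."""
    return 1.0 - abs(params["temperature"] - 0.5) - abs(params["top_p"] - 1.0)


base = {"temperature": 0.3, "top_p": 0.9}
grid = {"temperature": [0.1, 0.3, 0.5, 0.7], "top_p": [0.8, 0.9, 1.0]}
best_params, best_score = greedy_sweep(base, grid, toy_score)
print(best_params, best_score)
```

Because each parameter is fixed before the next one is varied, the sweep costs the sum of the grid sizes rather than their product, at the price of missing interactions between parameters.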

Champions promoted

| Mode | Config ID | ROUGE-L (benchmark) | Embedding similarity | Latency per episode |
| --- | --- | --- | --- | --- |
| Tier 2 ML Prod | `ml_bart_led_autoresearch_v1` | 20.5% | 68.2% | 26 s |
| Hybrid (archived) | `ml_hybrid_bart_llama32_3b_autoresearch_v1` | 21.1% | 76.6% | 15 s |

`ml_bart_led_autoresearch_v1` is set as `PROD_DEFAULT_SUMMARY_MODE_ID` in `config_constants.py`. The hybrid is retained in the registry but is not the production default, because direct Llama outperforms it when Ollama is available.

Key findings

- `temperature=0.5` was the single largest lever for the hybrid (+10% ROUGE-L vs. the previously hardcoded 0.3); the REDUCE step over noisy BART chunks benefits from more sampling diversity.
- `top_p=1.0` (no nucleus filtering): BART-extracted chunks are clean enough that filtering hurts.
- `frequency_penalty=0`: any penalty hurt BART chunk diversity downstream.
- Instruction-following matters more than size: `llama3.2:3b` (3B) beat all tested 7-12B models on the hybrid REDUCE step.
- Hybrid variance at benchmark scale: `temperature=0.5` sampling noise averages out over 10 episodes, and the hybrid's advantage over ML-prod shrinks from +4.9pp (smoke) to +0.6pp (benchmark). Direct Llama at `temperature=0.3` is the more reliable Tier 3 choice.

Track A - Prompt Tuning (completed 2026-04-05)

Scope

Six cloud providers were tuned on the paragraph track; Anthropic Haiku was then tuned on the bullets track. All tuning used the smoke dataset (5 episodes) for speed, and wins were verified at benchmark scale (10 episodes).

Results summary

| Track | Provider | Start score | Final score | Key wins |
| --- | --- | --- | --- | --- |
| Paragraph | Anthropic | 0.287 | 0.523 | Thesis opener (+0.201), vocab alignment (+0.024) |
| Paragraph | Gemini | 0.446 | 0.475 | Thesis opener + vocab + anchor |
| Paragraph | Grok | 0.437 | 0.456 | Thesis opener + vocab + anchor |
| Paragraph | DeepSeek | 0.502 | 0.502 | Saturated; no wins |
| Paragraph | Mistral | 0.468 | 0.480 | Cause-effect relationships |
| Paragraph | OpenAI | 0.474 | 0.474 | Saturated; judge divergence on all candidates |
| Bullets | Anthropic | 0.546 | 0.599 | No-fence JSON constraint (+3.9%), single-sentence bullets (+5.7%) |

Template changes accepted (shared across all providers)

Both changes are in src/podcast_scraper/prompts/shared/summarization/bullets_json_v1.j2:

- Explicit JSON boundary constraint: "Your response must start with { and end with }. No markdown, no code fences, no prose." This fixed the code-fence wrapping by Haiku that had dropped ROUGE-L from 40.1% to 37.1%.
- Single-sentence bullet rule: "Write each bullet as a single, complete sentence; do not split a bullet into two sentences." This aligns bullet structure with the silver reference style (+5.7% for Anthropic bullets).
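The two template rules lend themselves to a mechanical check on model output. The sketch below is one hedged way to do that and is not part of the repository: `check_bullets_response` is a hypothetical helper, the `"bullets"` key is an assumed schema, surrounding whitespace is tolerated by assumption, and the sentence-count test is a rough heuristic that abbreviations such as "e.g. Foo" would trip.

```python
import json
import re

# Heuristic: sentence-ending punctuation, then whitespace, then a capital letter.
_SENTENCE_BREAK = re.compile(r"[.!?]\s+[A-Z]")


def check_bullets_response(text, key="bullets"):
    """Return a list of rule violations; an empty list means the response complies."""
    stripped = text.strip()
    # Rule 1: the response must be a bare JSON object with no fences or prose.
    if not (stripped.startswith("{") and stripped.endswith("}")):
        return ["response must start with { and end with }"]
    try:
        payload = json.loads(stripped)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    # Rule 2: each bullet should read as a single sentence (heuristic check).
    problems = []
    for i, bullet in enumerate(payload.get(key, [])):
        if _SENTENCE_BREAK.search(str(bullet)):
            problems.append(f"bullet {i} looks like more than one sentence")
    return problems
```

A check like this could run before scoring, so that a formatting failure is reported separately from a genuine drop in summary quality.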

Ceiling assessment

All tracks reached early stop (3 consecutive failures). The remaining quality gap vs. the silver reference is structural (the reference model writes at a different density and vocabulary register) and is not addressable by prompt instructions. A new silver reference from a better frontier model would shift the ceiling.

Eval infrastructure built during RFC-057

The following infrastructure was created or significantly improved as part of this RFC and is now the standard eval process:

Silver references (paragraph + bullets × smoke + benchmark)

| Reference ID | Model | Dataset | Format | Status |
| --- | --- | --- | --- | --- |
| `silver_sonnet46_smoke_v1` | Claude Sonnet 4.6 | smoke (5 eps) | prose | Active |
| `silver_sonnet46_benchmark_v1` | Claude Sonnet 4.6 | benchmark (10 eps) | prose | Active |
| `silver_sonnet46_smoke_bullets_v1` | Claude Sonnet 4.6 | smoke (5 eps) | JSON bullets | Active |
| `silver_sonnet46_benchmark_bullets_v1` | Claude Sonnet 4.6 | benchmark (10 eps) | JSON bullets | Active |

The references were selected via a pairwise LLM-judge tournament (dual OpenAI + Anthropic judges): Sonnet 4.6 beat GPT-5.4 (3-1-1) and swept Gemini 2.0 Flash (5-0) on the smoke dataset.

The `silver_gpt4o_*` references are archived and retained for historical traceability.

2×2 eval matrix

Every provider now has 4 configs: smoke/benchmark × paragraph/bullets. The full matrix (6 cloud + 12 Ollama = 18 providers × 4 = 72 configs) is documented in data/eval/configs/README.md with explicit silver reference pairing rules and trigger conditions.
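The 72-config count follows directly from the cross product. The sketch below enumerates it with placeholder provider names. `build_matrix` and the `cloud_*` / `ollama_*` names are hypothetical, and pairing each config with the silver reference of the same dataset and format (treating "prose" as the paragraph format) is an assumption; the authoritative pairing rules live in the README.

```python
from itertools import product

DATASETS = ("smoke", "benchmark")
FORMATS = ("paragraph", "bullets")

# Silver reference IDs as listed in this ADR, keyed by (dataset, format).
SILVER = {
    ("smoke", "paragraph"): "silver_sonnet46_smoke_v1",
    ("benchmark", "paragraph"): "silver_sonnet46_benchmark_v1",
    ("smoke", "bullets"): "silver_sonnet46_smoke_bullets_v1",
    ("benchmark", "bullets"): "silver_sonnet46_benchmark_bullets_v1",
}


def build_matrix(providers):
    """Return one config dict per provider x dataset x format combination."""
    return [
        {"provider": p, "dataset": d, "format": f, "silver": SILVER[(d, f)]}
        for p, d, f in product(providers, DATASETS, FORMATS)
    ]


# Placeholder provider names (hypothetical): 6 cloud + 12 Ollama.
providers = [f"cloud_{i}" for i in range(1, 7)] + [f"ollama_{i}" for i in range(1, 13)]
matrix = build_matrix(providers)
print(len(matrix))  # 18 providers x 2 datasets x 2 formats = 72
```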

Benchmark-scale runs and report

The first full 10-episode benchmark sweep was completed (39 runs in total):

- 6 cloud providers × {paragraph, bullets} = 12 runs
- 12 Ollama models × {paragraph, bullets} = 24 runs
- 3 ML/hybrid baselines × paragraph = 3 runs

Results documented in docs/guides/eval-reports/EVAL_BENCHMARK_V1_2026_04.md.

Prompt tuning loop infrastructure

make autoresearch-score and scripts/eval/autoresearch_score.py provide a composite scalar (ROUGE-L × 0.70 + dual LLM judge × 0.30) used as the ratchet criterion for Track A. The loop is documented in RFC-057.
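The weighting can be written out as a small function. This is a sketch of the formula as stated, not the contents of `autoresearch_score.py`: `composite_score` is a hypothetical name, all inputs are assumed to be on a 0-1 scale, and averaging the two judges into a single judge term is an assumption about how the "dual LLM judge" value is formed.

```python
def composite_score(rouge_l, judge_a, judge_b, w_rouge=0.70, w_judge=0.30):
    """Weighted composite: ROUGE-L x 0.70 + dual-judge score x 0.30.

    Assumes every input is in [0, 1] and that the dual-judge score is the
    mean of the two individual judge scores.
    """
    judge = (judge_a + judge_b) / 2
    return w_rouge * rouge_l + w_judge * judge


# Illustrative inputs only, not measured results.
print(round(composite_score(0.40, 0.80, 0.60), 3))
```

With ROUGE-L carrying 70% of the weight, a change that only moves the judges needs a large judge swing to clear the 1% relative ratchet threshold.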

Final production state

| Component | Value |
| --- | --- |
| `PROD_DEFAULT_SUMMARY_MODE_ID` | `ml_bart_led_autoresearch_v1` (Tier 2, air-gap safe) |
| `OLLAMA_DEFAULT_SUMMARY_MODEL` | `llama3.2:3b` (Tier 3 pointer) |
| Active silver references | 4 × sonnet46 (smoke/benchmark × paragraph/bullets) |
| Paragraph prompt | `long_v1.j2` with thesis opener + vocab alignment instructions |
| Bullets prompt | `bullets_json_v1.j2` with explicit JSON boundary + single-sentence rule |
| Eval matrix | 72 configs; complete 2×2 for all active providers |
| Benchmark report | `EVAL_BENCHMARK_V1_2026_04.md` (first 10-episode sweep) |

Decision

RFC-057 is closed. No further autoresearch loop iterations are scheduled. A new loop would be triggered by any of the following:

- A new frontier model that qualitatively outperforms Sonnet 4.6 → replace the silver reference, re-run all providers, and run a prompt tuning round.
- A new output format (e.g. structured JSON with metadata fields) → a new silver reference and prompt tuning track are needed.
- A significant regression in production ROUGE-L (>3pp) detected in nightly metrics → diagnose the root cause first; this may or may not require a new loop.
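The third trigger is a simple threshold on percentage points. A minimal sketch follows, assuming ROUGE-L values are expressed in percent; `needs_diagnosis` is a hypothetical name, and how the nightly metrics are actually collected is not specified here.

```python
def needs_diagnosis(baseline_rouge_l_pct, nightly_rouge_l_pct, threshold_pp=3.0):
    """True when nightly ROUGE-L dropped by more than `threshold_pp` percentage points."""
    drop_pp = baseline_rouge_l_pct - nightly_rouge_l_pct
    return drop_pp > threshold_pp


# 20.5 is the benchmark ROUGE-L of the production config; the nightly values are made up.
print(needs_diagnosis(20.5, 17.0))  # drop of 3.5pp
print(needs_diagnosis(20.5, 18.0))  # drop of 2.5pp
```

A positive result only flags the run for root-cause analysis; it does not by itself reopen the loop.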

Consequences

- All modes in the four-tier summarization scheme are stable; no changes are planned.
- The eval matrix is now complete; new providers should follow the 4-config template documented in `data/eval/configs/README.md`.
- Silver references should not be replaced without a pairwise judge tournament (see `data/eval/configs/README.md` § Silver reference selection).
- The hybrid mode (`ml_hybrid_bart_llama32_3b_autoresearch_v1`) remains registered for long-transcript use cases but is no longer the recommended Tier 3 default.