RFC-057: AutoResearch Optimization Loop — Prompt & ML Parameter Tuning¶
- Status: Draft
- Authors: Podcast Scraper Team
- Stakeholders: Core team
- Related PRDs: (none yet — candidate for PRD-020)
- Related RFCs (reference, analogous patterns):
  - docs/rfc/RFC-017-prompt-management.md: Jinja prompt store, shared vs per-provider overrides (mutable prompt targets must align with this)
  - docs/rfc/RFC-015-ai-experiment-pipeline.md: experiment configs, data/eval/runs layout
  - docs/rfc/RFC-053-adaptive-summarization-routing.md: summarization pipeline (primary optimization target)
  - docs/rfc/RFC-049-grounded-insight-layer-core.md: provider abstraction patterns
  - docs/rfc/RFC-055-knowledge-graph-layer-core.md: eval dataset conventions
- Related Documents:
  - data/eval/: Gold-standard eval dataset (episodes with verified transcripts and summaries)
  - scripts/eval/run_experiment.py: experiment runner (predictions → data/eval/runs/<run_id>/)
  - src/podcast_scraper/evaluation/: scoring (scorer.py, gi_scorer.py, kg_scorer.py, eval_gi_kg_runtime.py, experiment config)
  - src/podcast_scraper/prompts/: versioned .j2 templates (see prompts/shared/README.md)
  - docs/api/CONFIGURATION.md: summary_prompt_params, summarization / Whisper fields
  - docs/architecture/ARCHITECTURE.md: Provider system and module boundaries
  - docs/architecture/TESTING_STRATEGY.md: Existing test pyramid
Abstract
This RFC proposes an AutoResearch optimization loop for podcast_scraper, adapting Karpathy's autoresearch pattern to two distinct optimization targets:
- Prompt template tuning — Iteratively improve LLM provider prompts (summarization, speaker detection, transcription) using expensive frontier models as judges.
- ML inference parameter tuning — Iteratively improve local model inference parameters (Whisper, BART/LED) scored against the same eval dataset.
Both tracks share a common ratchet loop: an AI coding agent mutates one target file (or allowlisted set of files), runs a scored eval, commits improvements via git or reverts failures, and repeats autonomously. The human defines the eval harness and walks away.
Problem Statement
The pipeline's LLM prompts and ML inference parameters were set by hand and have never been systematically optimized. Without an automated loop:
- Prompt quality degrades silently across provider updates and new podcast domains.
- ML inference params (beam search width, length penalty, silence thresholds) are static defaults, not tuned to this project's content.
- Manual iteration is too slow — a human can reasonably run 5–10 prompt variants; an agent can run 100 overnight.
Use cases:
- Overnight prompt tuning: Agent rewrites summarization/speaker-detection templates, judges score each variant, best changes committed under src/podcast_scraper/prompts/ (per RFC-017).
- ML param sweep: Agent mutates Whisper / local summarization inference settings, runs local eval on Apple Silicon (MPS), keeps improvements.
- Provider comparison: Run the loop independently per provider (OpenAI, Gemini, local) to surface which provider + prompt combination scores best on the eval set.
Goals
- Define the three-file contract (program.md, mutable target(s), immutable eval harness) for each optimization track.
- Define the eval harness: scoring script structure, judge LLM selection, rubric format, reproducibility, and cost controls.
- Define cost guardrails: experiment caps, two-stage filtering, episode subset sizing (smoke vs default).
- Define the mutable targets: exact paths and allowlists aligned with the prompt store and Config.
- Reuse existing eval plumbing where possible; avoid parallel scoring implementations that drift from podcast_scraper.evaluation.
- Define git conventions for autoresearch branches to avoid polluting main.
- Isolate autoresearch spend: dedicated credentials and guardrails so overnight loops cannot silently drain production keys (see §Credentials and cost isolation).
Reuse-First Thin Automation Layer (v1 Non-Negotiable)
Autoresearch is not a second product pipeline. It is a thin control loop on top of existing code:
- autoresearch/<track>/eval/score.py only orchestrates: CLI / env → invoke scripts/eval/run_experiment.py and/or podcast_scraper.evaluation (ExperimentConfig, scorer.score_run, comparators) → optional judge calls → single scalar on stdout. Do not reimplement ROUGE, WER, prediction I/O, or experiment layout under autoresearch/ except minimal glue.
- New Python under autoresearch/ should stay small (orchestrator, judge HTTP/SDK calls, rubric loading). If logic is generally useful, add it under src/podcast_scraper/evaluation/ (or shared utils) and import it from score.py so production and autoresearch share one implementation.
- The coding agent + program.md replace the human clicking; they do not replace data/eval/, run_experiment.py, or Config.
Relationship to Existing Code (Implementation Alignment)
Prompts (Track A)
- Templates are Jinja2 (.j2) files loaded via podcast_scraper.prompts.store, not free-standing YAML prompt blobs under config/prompts/.
- Resolution order: provider-specific path under src/podcast_scraper/prompts/<provider>/... overrides src/podcast_scraper/prompts/shared/... for the same logical template name (see prompts/shared/README.md).
- Mutable targets for a given run are an allowlist in autoresearch/prompt_tuning/program.md, e.g. one or more of:
  - src/podcast_scraper/prompts/shared/summarization/bullets_json_v1.j2 (bundled autoresearch experiment YAML uses logical name shared/summarization/bullets_json_v1)
  - src/podcast_scraper/prompts/openai/summarization/bullets_json_v1.j2 (optional provider override; switch YAML to openai/summarization/bullets_json_v1 to prefer it)
  - Other .j2 paths for speaker detection / cleaning as scoped by the run.
- Optional secondary target: structured knobs already supported by config (e.g. summary_prompt_params in YAML config) if the run optimizes parameters passed into templates; document those keys in program.md and keep them consistent with docs/api/CONFIGURATION.md.
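One way to enforce the allowlist after each agent edit is to compare the set of changed paths against it. The following is a minimal sketch under that assumption; `ALLOWLIST` and `disallowed_changes` are hypothetical names, and this RFC does not prescribe how enforcement is implemented.

```python
import posixpath

# Hypothetical allowlist mirroring the example targets named in this section.
ALLOWLIST = {
    "src/podcast_scraper/prompts/shared/summarization/bullets_json_v1.j2",
    "src/podcast_scraper/prompts/openai/summarization/bullets_json_v1.j2",
}


def disallowed_changes(changed_paths, allowlist=ALLOWLIST):
    """Return the changed paths that are not on the allowlist, sorted."""
    # Normalize so "./a/b" and "a/b" compare equal.
    normalized = {posixpath.normpath(p) for p in changed_paths}
    return sorted(normalized - set(allowlist))
```

The list of changed paths could come from `git diff --name-only`; any non-empty result would be grounds to revert the experiment rather than score it.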
Eval & scoring
- Preferred: autoresearch/<track>/eval/score.py is a thin orchestrator run from the repository root (e.g. python autoresearch/prompt_tuning/eval/score.py) that:
  - Imports from podcast_scraper.evaluation.* and, where applicable, shells out to or reuses patterns from scripts/eval/run_experiment.py.
  - Reads episodes from data/eval/ and writes/reads predictions under data/eval/runs/ (or a dedicated data/eval/autoresearch_runs/ subtree if isolation is needed; decide once in implementation). Either way, never mutate golden inputs under the curated eval corpus paths.
- GIL/KG autoresearch (future): If the loop later optimizes insight or graph prompts, reuse eval_gi_kg_runtime.py, gi_scorer.py, and kg_scorer.py rather than duplicating metrics.
- Summarization-only Track A (v1): May compose scorer.score_run / experiment YAML contracts already used by scripts/eval/; extend only when the rubric requires judge LLM logic not present in scorer.py.
ML parameters (Track B)
- Today, many inference knobs live on Config (e.g. Whisper model/device fields, map_inference_params / reduce_inference_params dicts; see src/podcast_scraper/config.py).
- Mutable target (v1): a single YAML file, e.g. config/autoresearch/ml_params.yaml, whose schema is documented in the harness and maps to Config fields (or to a small loader that builds a partial Config overlay). The agent must not edit config.py directly.
- If the project later introduces a first-class config/ml_params.yaml consumed by the app, align this file with that schema or merge them in one RFC follow-up.
Credentials and Cost Isolation
You do not strictly need a second “pair” of accounts. What you need is clear separation of spend and blast radius. Practical stack (combine as needed):
1. Dedicated environment variables (required for v1 harness)
   score.py (and any judge helper it imports) should read autoresearch-prefixed vars for every LLM call the harness makes, at minimum the judges and any cheap filter model, e.g.:
   - AUTORESEARCH_JUDGE_OPENAI_API_KEY
   - AUTORESEARCH_JUDGE_ANTHROPIC_API_KEY

   Optional, if you want summarization-under-test billed separately from day-to-day CLI:
   - AUTORESEARCH_EXPERIMENT_OPENAI_API_KEY (or per-provider equivalents)

   Do not read bare OPENAI_API_KEY / ANTHROPIC_API_KEY in the harness unless an explicit opt-in exists, e.g. AUTORESEARCH_ALLOW_PRODUCTION_KEYS=1 for local dev only (default off). Document all vars in config/examples/.env.example.
2. Second API keys vs same account
   - Same org, new API key (many providers allow multiple keys): point autoresearch vars at those keys; dashboards often aggregate by key or project.
   - Stronger isolation: separate project / workspace (OpenAI, Anthropic, etc.) with its own keys and monthly budget / alerts. This is still two keys, not necessarily two billing identities.
   - Same key as production only for quick hacks; you lose cost separation and risk accidental overlap with manual runs.
3. Provider-side controls
   Set budgets, alerts, and hard caps in the cloud console for the project tied to autoresearch keys.
4. Harness-side controls (already in this RFC)
   AUTORESEARCH_EVAL_N, experiment cap in program.md, two-stage judge filter, --dry-run, and optional max judge calls / max USD per invocation implemented once in score.py (fail closed).
Summary: Prefer autoresearch-dedicated env vars + keys tied to a small-budget cloud project; a literal “second pair” of keys is the usual way to achieve that, not a separate requirement from (1) and (2).
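The key-lookup rule described in this section can be sketched as follows. This is an illustration only: `resolve_judge_key` is a hypothetical helper, and the variable naming pattern is inferred from the two example names given above.

```python
import os


def resolve_judge_key(provider, env=None):
    """Return the API key for a judge provider, preferring AUTORESEARCH_* vars."""
    env = os.environ if env is None else env
    name = provider.upper()
    key = env.get(f"AUTORESEARCH_JUDGE_{name}_API_KEY")
    if key:
        return key
    # Fall back to the bare production variable only on explicit opt-in.
    if env.get("AUTORESEARCH_ALLOW_PRODUCTION_KEYS") == "1":
        key = env.get(f"{name}_API_KEY")
        if key:
            return key
    # Fail closed: no silent use of production credentials.
    raise RuntimeError(f"No autoresearch key configured for judge provider {provider!r}")
```

Passing `env` explicitly keeps the helper testable without touching the real process environment.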
Constraints & Assumptions
Constraints:
- Must not modify curated gold inputs under data/eval/ during a run; eval inputs are immutable ground truth.
- Must not allow the agent to edit the scoring script (eval/score.py or equivalent), the same principle as autoresearch's immutable prepare.py.
- Must run within a fixed per-experiment budget (API cost cap for prompt track; wall-clock time cap for ML track).
- Must operate on a dedicated autoresearch/<tag> branch, never directly on main.
- Prompt track must be runnable without GPU (API-only, laptop or CI).
- ML track must be runnable on Apple Silicon via MPS (no CUDA required).
Assumptions:
- data/eval/ contains a representative set of ≥10 episodes with verified transcripts and reference summaries.
- Claude Code (or equivalent coding agent) is the loop driver, pointed at program.md.
- Autoresearch LLM calls use dedicated AUTORESEARCH_* API keys (see §Credentials and cost isolation), documented in config/examples/.env.example.
- Per-experiment cost for prompt track is ~$0.10–0.20 at 10 eval episodes; total overnight run ~$15–30 (scales with episode count and judge models).
Design & Implementation (High Level)
1. Three-File Contract
Each optimization track instantiates the same contract:
| File | Role | Editable by agent? |
|---|---|---|
| autoresearch/<track>/program.md | Agent instructions, loop rules, stopping criteria, allowlisted mutable paths | Human only |
| Mutable target(s) | .j2 templates and/or config/autoresearch/ml_params.yaml (see §Relationship to Existing Code) | Agent only (within allowlist) |
| autoresearch/<track>/eval/score.py | Immutable scoring harness | Never |
The agent reads program.md, mutates only allowlisted targets, runs score.py, reads the score, commits or reverts, logs to results.tsv, and repeats.
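The loop described above can be sketched as a small ratchet. This is a minimal illustration, not the agent's actual control flow: `run_ratchet` and its four callables are hypothetical stand-ins for the agent's edit step, `score.py`, `git commit`, and `git reset --hard HEAD`.

```python
def run_ratchet(propose, score, commit, revert, max_experiments=50):
    """Keep a change only when it beats the best score seen so far."""
    best = score()  # baseline on the unmodified target
    history = [("baseline", best)]
    for n in range(1, max_experiments + 1):
        propose(n)  # agent mutates an allowlisted target
        candidate = score()
        if candidate > best:
            commit(n)  # improvement: keep it
            best = candidate
            history.append(("kept", candidate))
        else:
            revert(n)  # no improvement: discard the change
            history.append(("reverted", candidate))
    return best, history
```

Because the best score never decreases, a long run can only hold or improve on the baseline as measured by the harness; the quality of the result therefore depends entirely on the quality of the scoring script.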
2. Track A: Prompt Template Tuning
Mutable target: Allowlisted src/podcast_scraper/prompts/**/*.j2 files (and optionally config keys documented in program.md), per §Relationship to Existing Code.
Eval harness (score.py) flow:
1. Load data/eval/ episodes (transcripts + reference summaries) per existing eval conventions.
2. Run the pipeline or summarization provider using current templates (e.g. via experiment config + run_experiment.py or direct provider calls; choose the smallest integration that stays representative).
3. Call judge LLM with a fixed rubric (see §Rubric) for each output.
4. Aggregate scores → single float (0.0–1.0). Higher is better.
5. Print score to stdout. Exit non-zero on failure.
Judge selection: Two judges to reduce bias (e.g. OpenAI and Anthropic flagship chat models) with averaged scores. Pin API model identifiers (e.g. gpt-4o-2024-08-06, claude-3-5-sonnet-20241022) in autoresearch/prompt_tuning/eval/judge_config.yaml (human-only, immutable during a run) and log the exact strings used on each experiment row in results.tsv. If scores diverge by >0.15, log as contested and treat as no-improvement (conservative).
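The averaging and divergence rule can be sketched as below. The function names are hypothetical; only the 0.15 threshold and the "contested means no improvement" policy come from this section.

```python
CONTESTED_THRESHOLD = 0.15  # divergence limit stated in this section


def combine_judges(score_a, score_b, threshold=CONTESTED_THRESHOLD):
    """Average two judge scores and flag the pair as contested if they diverge."""
    mean = (score_a + score_b) / 2
    contested = abs(score_a - score_b) > threshold
    return mean, contested


def is_improvement(score_a, score_b, best_so_far):
    """Conservative policy: a contested pair never counts as an improvement."""
    mean, contested = combine_judges(score_a, score_b)
    return (not contested) and mean > best_so_far
```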
Reproducibility: Same pinned models + rubric version + data/eval/ dataset id (commit SHA or manifest hash) should be recorded per run so scores are comparable across days.
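One plausible way to produce the rubric version recorded per run is a content hash of the rubric file. A minimal sketch, assuming a truncated SHA-256 digest is acceptable; `rubric_hash` and the 12-character length are illustrative choices, not part of this RFC.

```python
import hashlib
from pathlib import Path


def rubric_hash(path, length=12):
    """Short SHA-256 digest of the rubric file's bytes."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest[:length]
```

Any edit to the rubric, including whitespace, changes the digest, so two rows in results.tsv with the same value were scored against byte-identical rubric text.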
Cost controls:
- Default eval subset: 10 episodes (configurable via AUTORESEARCH_EVAL_N). Smoke / first loop: 5 episodes (same mechanism, smaller N); see Rollout.
- Hard cap: 50 experiments per run (set in program.md; agent must stop).
- Two-stage filter: run cheap judge first; only call expensive judges if cheap score improves over baseline by ≥0.02.
- Pre-cache all transcript inputs before the loop starts — only prompt outputs vary per experiment.
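The two-stage filter from the list above can be sketched as follows. The judges are passed in as plain callables so the example stays self-contained; `two_stage_score` and its parameter names are hypothetical, and only the 0.02 gain threshold is taken from this section.

```python
MIN_CHEAP_GAIN = 0.02  # threshold stated in the cost controls above


def two_stage_score(cheap_judge, expensive_judges, output, cheap_baseline,
                    min_gain=MIN_CHEAP_GAIN):
    """Return (score, stage); expensive judges run only if the cheap one improves."""
    cheap = cheap_judge(output)
    if cheap - cheap_baseline < min_gain:
        # Treated as no-improvement; no expensive calls are made.
        return None, "filtered"
    scores = [judge(output) for judge in expensive_judges]
    return sum(scores) / len(scores), "judged"
```

Most candidates in a long run are expected to be rejected, so gating on the cheap judge keeps the expensive calls proportional to the number of promising variants rather than the number of experiments.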
3. Track B: ML Inference Parameter Tuning
Mutable target: config/autoresearch/ml_params.yaml (maps to Config inference-related fields), per §Relationship to Existing Code.
Eval harness (score.py) flow:
1. Load data/eval/ episodes (raw audio + reference transcripts for Whisper; transcripts + reference summaries for BART/LED).
2. Run inference with current params on MPS.
3. Score: Word Error Rate (WER) for transcription; ROUGE-L + optional LLM judge for summarization.
4. Combine into a single scalar using a fixed formula documented in program.md and implemented once in score.py, e.g.

        score = w_wer * (1 - norm_wer) + w_rouge * norm_rouge + w_judge * norm_judge

   with weights summing to 1 and normalization chosen so each term is in [0, 1] (document norm_wer / baseline WER for division). Constraint pass (recommended): reject any candidate that regresses WER vs. baseline by more than a small ε even if ROUGE improves; encode as hard filter or penalty term.
5. Print score to stdout.
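The scalar formula and the recommended constraint pass can be sketched together. The weights, the ε value, and the choice of a hard filter that returns 0.0 are all illustrative assumptions; the RFC leaves these to program.md, and `combined_score` is a hypothetical name.

```python
def combined_score(norm_wer, norm_rouge, norm_judge, wer, baseline_wer,
                   eps=0.01, w_wer=0.4, w_rouge=0.3, w_judge=0.3):
    """Weighted scalar in [0, 1]; returns 0.0 if raw WER regresses past eps."""
    if abs(w_wer + w_rouge + w_judge - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for term in (norm_wer, norm_rouge, norm_judge):
        if not 0.0 <= term <= 1.0:
            raise ValueError("normalized terms must lie in [0, 1]")
    if wer > baseline_wer + eps:
        return 0.0  # hard filter: WER regression overrides any ROUGE gain
    return w_wer * (1 - norm_wer) + w_rouge * norm_rouge + w_judge * norm_judge
```

With the example weights, inputs of norm_wer = 0.2, norm_rouge = 0.5, norm_judge = 0.6 give 0.4 × 0.8 + 0.3 × 0.5 + 0.3 × 0.6 = 0.65. A penalty term instead of a hard filter would keep the score continuous, at the cost of letting small WER regressions through.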
Episode subsets: Use 5 episodes for smoke / speed validation (same AUTORESEARCH_EVAL_N as Track A). Scale to 10+ once wall-clock per experiment is acceptable.
Time controls:
- Fixed wall-clock budget per experiment: 10 minutes on Apple Silicon (vs. 5 min in original autoresearch on H100 — adjusted for MPS).
- Kill and mark as failure if exceeded.
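The wall-clock budget can be enforced with the standard library alone. A minimal sketch, assuming the harness prints the score as the last line of stdout; `run_with_budget` is a hypothetical wrapper, and the demo command is a stand-in for the real score.py invocation.

```python
import subprocess
import sys


def run_with_budget(cmd, budget_seconds=600):
    """Run a scoring command; return its float score, or None on timeout or failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        return None  # child was killed: mark the experiment as a failure
    if proc.returncode != 0:
        return None
    try:
        # The score is assumed to be the last line printed to stdout.
        return float(proc.stdout.strip().splitlines()[-1])
    except (IndexError, ValueError):
        return None


if __name__ == "__main__":
    # Stand-in for running the Track B score.py from the repository root.
    print(run_with_budget([sys.executable, "-c", "print(0.73)"], budget_seconds=30))
```

`subprocess.run` kills the child process when the timeout expires, which matches the "kill and mark as failure" rule; the default of 600 seconds corresponds to the 10-minute budget above.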
4. Rubric (Prompt Track Judge)
The scoring rubric is defined in autoresearch/prompt_tuning/eval/rubric.md and passed verbatim to the judge LLM. It must not change during a run. Example dimensions (finalize before first run):
| Dimension | Weight | Description |
|---|---|---|
| Topic coverage | 30% | Does the summary mention the main topics discussed? |
| Factual accuracy | 30% | No hallucinated names, dates, or claims vs. transcript |
| Speaker attribution | 20% | Speakers correctly identified where detectable |
| Conciseness | 20% | No padding, appropriate length for episode duration |
Rubric changes require a new autoresearch/<tag> branch — never mid-run.
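Assuming the judge returns one score in [0, 1] per rubric dimension, the example weights in the table combine into a single value as sketched below. The dictionary keys and `rubric_score` are hypothetical names; the weights are the example values above and may change before the first run.

```python
# Example weights from the rubric table (to be finalized before the first run).
RUBRIC_WEIGHTS = {
    "topic_coverage": 0.30,
    "factual_accuracy": 0.30,
    "speaker_attribution": 0.20,
    "conciseness": 0.20,
}


def rubric_score(dimension_scores, weights=RUBRIC_WEIGHTS):
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    if set(dimension_scores) != set(weights):
        raise ValueError("scores must cover exactly the rubric dimensions")
    return sum(weights[d] * dimension_scores[d] for d in weights)
```

For instance, scores of 1.0, 0.5, 0.0, and 1.0 in table order give 0.30 + 0.15 + 0.00 + 0.20 = 0.65.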
5. Git Conventions
- Branch: autoresearch/<tag> where <tag> is date-based (e.g. 20260401-prompts, 20260401-ml). Branch is agent-only until the PR: do not mix human feature work on the same branch, so git reset --hard HEAD after a failed experiment never drops uncommitted human changes.
- results.tsv committed per experiment. Minimum columns: experiment_id, score, delta, status (kept / reverted / crash / contested), notes, judge_a_model, judge_b_model, rubric_hash, eval_dataset_ref (git SHA or manifest id).
- Successful experiments: git commit with message [autoresearch] exp-<N>: <one-line hypothesis>.
- Failed experiments: git reset --hard HEAD, no commit.
- After a run: open a PR from the branch for human review before merging to main.
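Appending a row to results.tsv with the minimum columns listed above could look like the following sketch. `append_result` is a hypothetical helper; the column names and status values are the ones named in this section, and missing optional fields are written as empty strings.

```python
import csv
from pathlib import Path

COLUMNS = ["experiment_id", "score", "delta", "status", "notes",
           "judge_a_model", "judge_b_model", "rubric_hash", "eval_dataset_ref"]
STATUSES = {"kept", "reverted", "crash", "contested"}


def append_result(tsv_path, row):
    """Append one experiment row, writing the header if the file is new or empty."""
    if row.get("status") not in STATUSES:
        raise ValueError(f"unknown status: {row.get('status')!r}")
    path = Path(tsv_path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)  # raises ValueError on keys outside COLUMNS
```

Append-only writes keep the log readable with plain `git diff` during PR review.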
6. Directory Layout
    autoresearch/
      prompt_tuning/
        program.md              # Agent instructions (Track A) + allowlisted paths
        eval/
          score.py              # Immutable harness (thin wrapper → podcast_scraper.evaluation)
          rubric.md             # Judge rubric (immutable during run)
          judge_config.yaml     # Pinned judge model IDs (human-only, immutable during run)
        results.tsv
      ml_params/
        program.md              # Agent instructions (Track B) + scalar formula / weights
        eval/
          score.py              # Immutable harness
        results.tsv
    src/podcast_scraper/prompts/
      shared/ ...               # Shared .j2 (mutable only if allowlisted)
      <provider>/ ...           # Provider overrides (typical Track A target)
    config/
      autoresearch/
        ml_params.yaml          # Mutable target (Track B)
    data/
      eval/                     # Immutable gold dataset (inputs)
      eval/runs/                # Predictions / run outputs (existing convention; see RFC-015)
Cost Model
| Phase | Track | Estimated Cost |
|---|---|---|
| Eval harness development (manual iteration) | Prompt | $20–40 one-time |
| Single overnight run (50 experiments, 10 episodes) | Prompt | $15–30 |
| Additional runs | Prompt | $15–30 each |
| Single overnight run (50 experiments, 5 episodes) | ML | ~$0 (local compute only) |
Total budget to reach stable optimized prompts: ~$80–120 across 3–4 runs.
Testing Strategy
- Harness tests: Unit test score.py with fixture outputs to verify scoring logic before any agent run; place tests under tests/unit/... following the existing layout (e.g. tests/unit/autoresearch/ or next to evaluation tests).
- Dry-run mode: score.py --dry-run prints score for current baseline without calling judge; validates setup is working.
- Baseline lock: First experiment always runs against unmodified target to establish baseline_score in results.tsv. Agent must not proceed if baseline run fails.
- Regression guard: If final merged prompts regress on the existing tests/ suite, revert before merging to main.
Rollout
1. Implement Track A (prompt tuning) first: lower cost, faster feedback loop.
2. Add judge_config.yaml with pinned models; document them in results.tsv.
3. Validate eval harness independently with --dry-run before first agent run.
4. Run Track A on a 5-episode subset (AUTORESEARCH_EVAL_N=5), review results.tsv, confirm loop is behaving.
5. Scale to 10-episode default for overnight run.
6. Implement Track B (ML params) after Track A prompts are stable.
7. Document both tracks in docs/guides/AUTORESEARCH_GUIDE.md.
Alternatives Considered
- Manual prompt iteration — Rejected: too slow, not reproducible, no structured logging of what was tried.
- Hyperparameter tuning libraries (Optuna, Ray Tune) — Rejected for prompt track: these optimize numerical search spaces, not open-ended text. For ML params, may revisit as a complement to the agent loop.
- Single judge LLM — Rejected: single-judge scoring introduces provider bias; dual-judge with divergence detection is more robust.
- Run on cloud GPU — Deferred: Apple Silicon (MPS) is sufficient for Track B inference param tuning. Revisit if eval set grows beyond ~20 episodes or weight fine-tuning is added.
- YAML prompt files under config/prompts/. Rejected for Track A v1: conflicts with RFC-017 and the existing .j2 prompt store; would require a parallel loading path.
Track A v1 Implementation Checklist
Use this as the minimal first PR scope; extend only after one green agent smoke run.
Decisions (record in program.md or team notes)
- [ ] Run output root: reuse data/eval/runs/ vs data/eval/autoresearch_runs/ (pick one; avoid touching gold inputs).
- [ ] Summarization keys: production keys for candidate summaries vs AUTORESEARCH_EXPERIMENT_* only (see §Credentials).
- [ ] First provider + template: e.g. OpenAI + one allowlisted .j2 path only.
Repository layout
- [ ] Add autoresearch/prompt_tuning/program.md (allowlist, experiment cap, how to run score.py, git rules).
- [ ] Add autoresearch/prompt_tuning/eval/score.py: thin orchestrator; calls existing run_experiment.py / podcast_scraper.evaluation (no forked metrics).
- [ ] Add autoresearch/prompt_tuning/eval/rubric.md and judge_config.yaml (pinned model IDs).
- [ ] Add autoresearch/prompt_tuning/results.tsv (seed header row) or document append-only creation on first run.
- [ ] Add config/examples/.env.example entries for all AUTORESEARCH_* vars used by the harness.
Behavior
- [ ] score.py implements --dry-run (no judge calls; exercise experiment + metric path).
- [ ] Baseline row required before agent continues (per Testing Strategy).
- [ ] AUTORESEARCH_EVAL_N respected; default smoke = 5, documented default full = 10.
- [ ] Judge calls use only AUTORESEARCH_JUDGE_* keys unless AUTORESEARCH_ALLOW_PRODUCTION_KEYS=1.
- [ ] Each result line logs judge_*_model, rubric_hash, eval_dataset_ref.
Quality gates
- [ ] Unit tests for scoring aggregation / contested logic with fixtures (no live API).
- [ ] make lint / tests green for new modules.
- [ ] Optional: docs/guides/AUTORESEARCH_GUIDE.md stub pointing at this RFC (full guide can follow Rollout step 7).
References
- karpathy/autoresearch — Original pattern
- autoresearch-mlx fork — Apple Silicon adaptation
- RFC-017: Prompt Management
- RFC-015: AI Experiment Pipeline
- RFC-053: Adaptive Summarization Routing
- RFC-049: GIL Core
data/eval/— Gold eval dataset