Optimization Workflow Guide
This guide explains how to investigate and solve performance or cost problems in the pipeline. The process is data-driven: every decision is backed by measured numbers, and all artifacts live in dedicated folders so you can always trace back what was measured, when, and why.
Related guides (read first if unfamiliar with the tooling):
- Experiment Guide -- eval system: datasets, baselines, configs, scoring, references, silver/gold promotion
- Performance Profile Guide -- frozen profiles (make profile-freeze), psutil sampling, stage attribution, stage_truth.json (RFC-064)
- Experiment Guide -- what each metric means (ROUGE, gates, latency, cost), metrics.json schema, vs_reference scoring (see Step 4: Evaluate Results)
- Live Pipeline Monitor -- real-time stage / RSS / CPU view during a dev run (--monitor, RFC-065)
- Performance Guide -- general performance considerations, cache behavior, audio preprocessing costs
- AI Provider Comparison Guide -- provider decision matrices, cost analysis, eval reports
- Experiment Guide -- how runs become baselines, silvers, or app defaults (see "Step 5: Promote a run")
The process
1. Start from a GitHub issue
The GitHub issue is the source of truth for the optimization. It captures the objective, constraints, product decisions, and all results.
- Open (or find) an issue that describes the problem or opportunity.
- Use the issue body for scope, constraints, and success criteria.
- Post updates as comments: hypothesis results, baseline numbers, experiment outcomes, and the final decision.
2. Analyze and make hypotheses
Before writing any code, understand where the time and money go.
- Read existing profiles (data/profiles/) for wall time per stage -- see Performance Profile Guide for how to interpret them.
- Read existing eval runs (data/eval/runs/) for quality and cost numbers -- see Experiment Guide (Step 4) for what each metric means.
- Optionally use Live Pipeline Monitor (--monitor) to watch a single run in real time and spot where the pipeline stalls.
- Check AI Provider Comparison Guide for existing cross-provider benchmarks that may already answer your question.
- Write down testable hypotheses (e.g. "cleaning + summarization dominate sequential cost per episode").
- Each hypothesis needs a how to test (which metric, which tool) and a what it means if true (which lever to pull).
- Post hypotheses on the GitHub issue.
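The hypothesis structure described above (a statement, a how to test, and a what it means if true) can be kept as a small record so every hypothesis is written in the same shape before it goes on the issue. This is a minimal sketch, not part of the project's tooling: the `Hypothesis` class, its field names, and the example lever text are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One testable hypothesis (hypothetical helper, not a project API)."""

    statement: str        # the claim, phrased so numbers can confirm or reject it
    how_to_test: str      # which metric and which tool produce the evidence
    if_true: str          # which lever to pull if the claim holds
    status: str = "open"  # set to "confirmed" or "rejected" once measured

    def to_issue_comment(self) -> str:
        """Render the hypothesis as plain text for a GitHub issue comment."""
        return (
            f"Hypothesis: {self.statement}\n"
            f"How to test: {self.how_to_test}\n"
            f"If true: {self.if_true}\n"
            f"Status: {self.status}"
        )


# Example values are illustrative only.
h = Hypothesis(
    statement="cleaning + summarization dominate sequential cost per episode",
    how_to_test="per-stage wall time in the frozen profile",
    if_true="focus the optimization on those two stages",
)
print(h.to_issue_comment())
```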
3. Measure the baseline (eval + profile)
Create a dedicated folder inside data/eval/ and data/profiles/ for this work (e.g. data/eval/issue-NNN/, data/profiles/issue-NNN/). Everything for this optimization lives there: configs, outputs, READMEs.
Quality + cost baseline (eval):
Use the experiment system described in Experiment Guide. Create an experiment config YAML in your dedicated folder and run it with --cost-report to capture token counts:
.venv/bin/python3 scripts/eval/experiment/run_experiment.py \
data/eval/issue-NNN/baseline_config.yaml \
--reference silver_sonnet46_smoke_v1 --cost-report
This produces metrics.json (ROUGE, gates, latency -- see Experiment Guide Step 4) and eval_pipeline_metrics.json (tokens, calls, estimated USD per stage).
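Once the cost report exists, a short script can rank stages by their share of the estimated spend. The sketch below assumes a hypothetical layout for eval_pipeline_metrics.json (a top-level `stages` mapping with `tokens`, `calls`, and `estimated_usd` per stage); the real schema may differ, so adjust the keys to match your file. The sample numbers are made up for illustration.

```python
import json

# Hypothetical shape and made-up numbers; the real file's schema may differ.
SAMPLE = json.loads("""
{
  "stages": {
    "cleaning":      {"tokens": 1200, "calls": 2, "estimated_usd": 0.25},
    "summarization": {"tokens": 3000, "calls": 2, "estimated_usd": 0.75}
  }
}
""")


def cost_share_by_stage(metrics: dict) -> list[tuple[str, float, float]]:
    """Return (stage, estimated_usd, share_of_total), most expensive first."""
    stages = metrics["stages"]
    total = sum(s["estimated_usd"] for s in stages.values())
    rows = [
        (name, s["estimated_usd"], s["estimated_usd"] / total if total else 0.0)
        for name, s in stages.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, usd, share in cost_share_by_stage(SAMPLE):
    print(f"{name:15s} ${usd:.2f}  {share:.0%}")
```

To use it on a real run, replace `SAMPLE` with `json.load(...)` on the file written by your experiment.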
Performance baseline (profile):
Use the profiling system described in Performance Profile Guide. Create a capture config in your dedicated folder and run a freeze:
.venv/bin/python3 scripts/eval/profile/freeze_profile.py \
--version issue-NNN-staged \
--pipeline-config data/profiles/issue-NNN/capture_baseline.yaml \
--dataset-id e2e_podcast1_mtb_n2 \
--output data/profiles/issue-NNN/issue-NNN-staged.yaml
This produces a frozen YAML profile (per-stage wall time, CPU%, RSS) and a companion stage_truth.json (token counts, per-episode breakdowns, parallelism ratio).
What to capture in the baseline:
| Dimension | Source | Key metrics |
|---|---|---|
| Quality | metrics.json | ROUGE-L, gates, embedding cosine |
| Cost | eval_pipeline_metrics.json | Tokens per stage, calls, est. USD |
| Time | Profile YAML + stage_truth | Wall time per stage, CPU%, parallelism |
4. Validate hypotheses with numbers
Compare the measured data against each hypothesis. For every hypothesis, state: confirmed or rejected, with the specific numbers.
- Post the validation results on the GitHub issue.
- Update the experiment plan (docs/wip/) with a Problem Definition section that combines quality, cost, and time baselines.
- If a hypothesis is rejected, re-analyze before proceeding.
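As an illustration of stating "confirmed or rejected, with the specific numbers", the sketch below checks a dominance hypothesis against per-stage wall times. The function name, the 50% threshold, and the sample timings are assumptions for the example; in practice the numbers would come from your frozen profile.

```python
def validate_dominance(
    wall_s: dict[str, float], stages: list[str], threshold: float = 0.5
) -> str:
    """Confirm or reject "these stages dominate wall time", quoting the numbers.

    threshold is an assumed cut-off: the named stages must account for at
    least this fraction of total wall time for the hypothesis to be confirmed.
    """
    total = sum(wall_s.values())
    part = sum(wall_s[name] for name in stages)
    share = part / total
    verdict = "confirmed" if share >= threshold else "rejected"
    return (
        f"{verdict}: {' + '.join(stages)} = "
        f"{part:.1f}s of {total:.1f}s ({share:.0%})"
    )


# Made-up per-stage wall times in seconds.
sample = {"transcription": 40.0, "cleaning": 25.0, "summarization": 35.0}
print(validate_dominance(sample, ["cleaning", "summarization"]))
# prints: confirmed: cleaning + summarization = 60.0s of 100.0s (60%)
```

The returned string is in a form that can be pasted directly into the issue comment.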
5. Implement the solution
Now that the problem is defined with data, implement the fix behind an opt-in flag or config (so the existing path remains the default).
- Keep the implementation focused on the lever identified in step 4.
- Do not change multiple things at once -- one lever per experiment so diffs stay interpretable.
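One way to keep the existing path as the default is a boolean config field that is off unless a config explicitly turns it on. The sketch below is illustrative only: `PipelineConfig`, `use_optimized_path`, and the two placeholder functions are hypothetical names, not the pipeline's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical flag name. It defaults to False so that existing configs
    # keep running the current code path unchanged.
    use_optimized_path: bool = False


def _existing(text: str) -> str:
    """Placeholder for the current implementation."""
    return f"existing:{text}"


def _optimized(text: str) -> str:
    """Placeholder for the new implementation under test."""
    return f"optimized:{text}"


def process_episode(text: str, config: PipelineConfig = PipelineConfig()) -> str:
    """Dispatch on the single opt-in flag: one lever, one branch."""
    if config.use_optimized_path:
        return _optimized(text)
    return _existing(text)


print(process_episode("ep1"))
print(process_episode("ep1", PipelineConfig(use_optimized_path=True)))
```

With this shape, the baseline and optimized experiment configs differ in exactly one field, which keeps the later diff interpretable.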
6. Validate against the baseline
Re-run the same eval and profile from step 3, but with the new code path enabled. Compare against the baseline:
# Eval (same reference, same dataset -- see Experiment Guide)
.venv/bin/python3 scripts/eval/experiment/run_experiment.py \
data/eval/issue-NNN/optimized_config.yaml \
--reference silver_sonnet46_smoke_v1 --cost-report
# Profile (same capture setup -- see Performance Profile Guide)
.venv/bin/python3 scripts/eval/profile/freeze_profile.py \
--version issue-NNN-optimized \
--pipeline-config data/profiles/issue-NNN/capture_optimized.yaml \
--dataset-id e2e_podcast1_mtb_n2 \
--output data/profiles/issue-NNN/issue-NNN-optimized.yaml
# Diff the two profiles
.venv/bin/python3 scripts/eval/profile/diff_profiles.py \
data/profiles/issue-NNN/issue-NNN-staged.yaml \
data/profiles/issue-NNN/issue-NNN-optimized.yaml
Decision gate: Compare quality, cost, and time against the baseline, using the success criteria recorded on the issue in step 1. Post the results on the GitHub issue and decide whether to ship, iterate, or reject.
If the optimization passes, consider whether the new config should be promoted to a baseline or app default -- see Experiment Guide (Step 5: Promote a run).
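The decision gate can be written down as a small function so the rule is explicit before the numbers arrive. This is a sketch under stated assumptions: the `Snapshot` and `Thresholds` fields, and the policy itself (reject on a quality regression, ship only when both cost and time targets are met, otherwise iterate), are illustrative choices rather than the project's rule. Baseline cost and wall time are assumed to be non-zero.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    rouge_l: float  # quality, from metrics.json
    est_usd: float  # cost, from eval_pipeline_metrics.json
    wall_s: float   # time, from the frozen profile


@dataclass
class Thresholds:
    max_rouge_l_drop: float  # largest absolute ROUGE-L drop tolerated
    min_cost_saving: float   # required cost reduction, as a fraction
    min_time_saving: float   # required wall-time reduction, as a fraction


def decide(base: Snapshot, opt: Snapshot, t: Thresholds) -> str:
    """Return "ship", "iterate", or "reject" under the assumed policy."""
    quality_ok = (base.rouge_l - opt.rouge_l) <= t.max_rouge_l_drop
    cost_saving = 1 - opt.est_usd / base.est_usd
    time_saving = 1 - opt.wall_s / base.wall_s
    if not quality_ok:
        return "reject"
    if cost_saving >= t.min_cost_saving and time_saving >= t.min_time_saving:
        return "ship"
    return "iterate"


# Made-up numbers for illustration.
limits = Thresholds(max_rouge_l_drop=0.02, min_cost_saving=0.2, min_time_saving=0.1)
baseline = Snapshot(rouge_l=0.40, est_usd=1.00, wall_s=100.0)
optimized = Snapshot(rouge_l=0.39, est_usd=0.70, wall_s=80.0)
print(decide(baseline, optimized, limits))
```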
Folder layout
data/eval/issue-NNN/
README.md # What each config measures, commands
baseline_config.yaml # Eval config for the current path
optimized_config.yaml # Eval config for the new path
data/profiles/issue-NNN/
README.md # What each profile measures, commands
capture_baseline.yaml # Pipeline config for baseline profile
capture_optimized.yaml # Pipeline config for optimized profile
issue-NNN-staged.yaml # Frozen baseline profile
issue-NNN-staged.stage_truth.json
issue-NNN-optimized.yaml # Frozen optimized profile
issue-NNN-optimized.stage_truth.json
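The two folders can be created by hand or with a few lines of `pathlib`. The `scaffold` helper below is a hypothetical convenience, not an existing project script; it creates only the directories and a stub README in each, and leaves configs and profiles for you to add.

```python
from pathlib import Path


def scaffold(root: Path, issue: str) -> list[Path]:
    """Create data/eval/issue-<issue>/ and data/profiles/issue-<issue>/ under root.

    Existing folders and README files are left untouched.
    """
    created = []
    for area in ("eval", "profiles"):
        folder = root / "data" / area / f"issue-{issue}"
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        if not readme.exists():
            # Stub text only; describe what each config or profile measures
            # and the commands used to produce it.
            readme.write_text(f"issue-{issue} ({area}): what each file measures, commands\n")
        created.append(folder)
    return created
```

Called from the repository root as `scaffold(Path("."), "NNN")`, with your issue number in place of NNN, it returns the two folder paths.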
Checklist
- [ ] GitHub issue exists with objective and constraints
- [ ] Hypotheses written and posted on the issue
- [ ] Dedicated folders created in data/eval/ and data/profiles/
- [ ] Quality baseline measured (eval with --cost-report)
- [ ] Performance baseline measured (profile freeze)
- [ ] Hypotheses validated with numbers, posted on the issue
- [ ] Problem definition documented (quality + cost + time)
- [ ] Solution implemented behind opt-in
- [ ] Solution validated against baseline (eval + profile + diff)
- [ ] Decision posted on the GitHub issue
Tooling quick reference
| What you need | Tool / guide |
|---|---|
| Run an experiment (quality + cost) | Experiment Guide |
| Understand metrics (ROUGE, gates, cost) | Experiment Guide (Step 4) |
| Capture a frozen profile (wall time, CPU%, RSS) | Performance Profile Guide |
| Watch a run in real time (stage, RSS, CPU) | Live Pipeline Monitor |
| Compare providers (cost, quality, speed) | AI Provider Comparison Guide |
| General performance tips (caching, audio) | Performance Guide |
| Promote a run to baseline or silver | Experiment Guide (Step 5) |
| Eval system mechanics (datasets, refs) | Experiment Guide |