# GIL: ML vs OpenAI — outcome benchmark (WIP note)
**Status:** Working note — methodology for comparing grounded-insights outcomes on the same episodes, not for aligning YAML parameter numbers across stacks.

**Related:** `GROUNDED_INSIGHTS_GUIDE.md`, `scripts/tools/gil_quality_metrics.py`, PRD-017 / `src/podcast_scraper/gi/quality_metrics.py`.
## Goal
- Use OpenAI (or another LLM evidence path) as a reference outcome on a fixed episode set.
- Improve the ML evidence stack (extractive QA + local NLI, thresholds, models) so that artifacts and metrics get as close as practical to that reference.
- Do not tune ML so that `gi_qa_score_min`/`gi_nli_entailment_min` "match" OpenAI numerically — scores are not comparable across implementations (e.g. OpenAI `extract_quotes` uses `qa_score=1.0` after span resolution; ML uses real SQuAD-style probabilities and CrossEncoder entailment).
## Compare outcomes, not hyperparameters
**Same inputs**
- Same feed and same episode list (or frozen episode IDs / transcript hashes).
- Where possible, same insight text for both runs (e.g. reuse summary bullets from one summary generation, or document explicitly if insights differ).
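Freezing the episode set can be as simple as hashing the cached transcripts once and checking both runs against the manifest. A minimal sketch, assuming transcripts are plain-text files named by episode ID; the helper name and directory layout are illustrative, not the project's actual layout:

```python
import hashlib
import json
from pathlib import Path


def freeze_benchmark(transcript_dir: str, out_file: str) -> dict:
    """Record episode IDs and transcript content hashes so both runs
    can verify they are scoring exactly the same inputs."""
    manifest = {}
    for path in sorted(Path(transcript_dir).glob("*.txt")):
        # episode ID (file stem) -> SHA-256 of the transcript bytes
        manifest[path.stem] = hashlib.sha256(path.read_bytes()).hexdigest()
    Path(out_file).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Re-running the helper before each benchmark run and diffing the manifest catches silent transcript drift (re-downloads, ASR re-runs) before it contaminates the comparison.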
**Two runs**
- Reference: e.g. `quote_extraction_provider`/`entailment_provider: openai` (and/or LLM `summary_provider` if comparing the full stack).
- Candidate: best-effort ML path (`transformers`/`hybrid_ml` evidence).
**Diff the products**
- Per run: `metadata/*.gi.json`, run-level `metrics.json`.
- Compare grounding coverage, quotes per insight, and verbatim span validity — not config literals.
## Operational metrics (suggested)
Use a small fixed benchmark (e.g. 20–50 episodes) and report side by side.
| Level | Metric | Notes |
|---|---|---|
| Episode | % episodes with ≥1 grounded quote | Coverage vs reference |
| Insight | % insights with ≥1 quote when reference has one | Recall-like vs OpenAI |
| Quote | Span overlap (IoU) or exact `[char_start, char_end)` match | Same insight index / aligned row |
| Quality | Verbatim slice check; optional human or embedding sanity | Precision / semantics |
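The quote-level span-overlap metric in the table is standard IoU over half-open character ranges; a minimal sketch:

```python
def span_iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection-over-union of two half-open character spans [start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0
```

Exact `[char_start, char_end)` match is then just `span_iou(a, b) == 1.0`; a looser threshold (e.g. ≥ 0.5) tolerates boundary jitter between the two evidence paths.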
**Existing tooling**
- Run-level aggregates: `make gil-quality-metrics DIR=<run_root>` or `scripts/tools/gil_quality_metrics.py` (see GROUNDED_INSIGHTS_GUIDE).
**Gap / follow-up**
- A small paired-diff script (load two run dirs, match by `episode_id`, align insights by order or stable text hash, compute overlap / agreement rates) would make "distance to OpenAI" explicit. Not required to start — manual comparison works initially.
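The paired-diff idea above can be sketched in a few lines. This assumes a `.gi.json` schema with `episode_id` and an `insights` list carrying `text` and `quotes` — the field names are illustrative and should be checked against the real artifacts:

```python
import hashlib
import json
from pathlib import Path


def _insight_key(text: str) -> str:
    # Stable hash of whitespace-normalized, lowercased insight text,
    # used to align "the same" insight across two runs.
    return hashlib.sha1(" ".join(text.lower().split()).encode()).hexdigest()


def load_run(run_dir: str) -> dict:
    """Map episode_id -> {insight_key: has_quote} for one run.
    Assumes metadata/*.gi.json files with 'episode_id', 'insights',
    'text', and 'quotes' fields (schema is an assumption)."""
    episodes = {}
    for path in Path(run_dir, "metadata").glob("*.gi.json"):
        doc = json.loads(path.read_text())
        episodes[doc["episode_id"]] = {
            _insight_key(i["text"]): bool(i.get("quotes"))
            for i in doc.get("insights", [])
        }
    return episodes


def agreement(ref_dir: str, cand_dir: str) -> float:
    """Fraction of reference insights with a quote where the candidate,
    matched by episode and insight-text hash, also produced one."""
    ref, cand = load_run(ref_dir), load_run(cand_dir)
    hits = total = 0
    for ep, insights in ref.items():
        for key, ref_has in insights.items():
            if not ref_has:
                continue
            total += 1
            hits += int(cand.get(ep, {}).get(key, False))
    return hits / total if total else 0.0
```

Text-hash alignment only works when both runs share insight text (the "Align insight text" step); otherwise fall back to positional alignment and report how many insights failed to match at all.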
## Closing the ML ↔ OpenAI gap (order of leverage)
- Lock the benchmark — frozen episode list + cached transcripts/summaries so comparisons are reproducible.
- Align insight text — same bullets (or same provider-generated insights) for both evidence paths so QA/NLI see identical hypotheses where intended.
- Improve the stack — QA/NLI model choice, windowing, ASR cleanup for evidence only, etc. Re-score agreement vs reference after each change.
- Tune thresholds last — adjust
gi_qa_score_min/gi_nli_entailment_minon the dev benchmark to maximize outcome agreement (or F1 vs reference), subject to floors on invalid / non-verbatim spans. - Hybrid ceiling — ML summaries + LLM evidence (
quote_extraction_provider/entailment_provider: openai) is a practical way to approach OpenAI-like behavior without re-summarizing on the API.
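The "tune thresholds last" step amounts to a small grid search. A sketch, assuming you can export per-candidate `(qa_score, nli_score, agrees_with_reference)` tuples from an ML run scored against the reference — that export format is an assumption, not an existing artifact:

```python
from itertools import product


def tune_thresholds(candidates: list[tuple[float, float, bool]],
                    ref_total: int) -> tuple[float, float]:
    """Grid-search gi_qa_score_min / gi_nli_entailment_min on a dev
    benchmark. `candidates` holds (qa_score, nli_score, agrees) for every
    ML quote candidate; `ref_total` is the reference quote count.
    Returns the lowest (qa_min, nli_min) pair maximizing F1 vs reference."""
    best = (-1.0, (0.0, 0.0))
    grid = [round(x * 0.05, 2) for x in range(20)]  # 0.00 .. 0.95
    for qa_min, nli_min in product(grid, grid):
        kept = [c for c in candidates if c[0] >= qa_min and c[1] >= nli_min]
        tp = sum(1 for c in kept if c[2])
        precision = tp / len(kept) if kept else 0.0
        recall = tp / ref_total if ref_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        if f1 > best[0]:  # strict '>' keeps the lowest thresholds on ties
            best = (f1, (qa_min, nli_min))
    return best[1]
```

The span-validity floors mentioned above would enter as an extra filter on `kept` (reject threshold pairs whose surviving set exceeds an invalid-span budget); that constraint is omitted here for brevity.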
## Expectations
- You can systematically shrink the outcome gap and report how close ML gets on the benchmark.
- Perfect per-insight parity with OpenAI on hard paraphrase + noisy ASR is not guaranteed with a purely local stack; some LLM in the evidence path (or richer ML) may be needed for a tight ceiling.
## Implementation note (CLI + YAML)
- As of the fix tracked in this branch, `_build_config()` in `cli.py` must forward GIL tuning fields from the merged argparse namespace into `Config`; otherwise YAML `gi_qa_score_min`/`gi_nli_entailment_min`/window fields are ignored on CLI runs. When running paired benchmarks, confirm `Config` reflects the file (e.g. via a log line or `config_snapshot` in metadata).
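The shape of that forwarding fix, as a minimal sketch — `Config` and the namespace here are simplified stand-ins, not the project's real classes, and the field tuple covers only the names mentioned in this note:

```python
from argparse import Namespace
from dataclasses import dataclass

# GIL tuning fields that must survive the namespace -> Config handoff
# (names from this note; the real list is longer).
GIL_FIELDS = ("gi_qa_score_min", "gi_nli_entailment_min", "gi_embedding_model")


@dataclass
class Config:
    gi_qa_score_min: float = 0.5
    gi_nli_entailment_min: float = 0.5
    gi_embedding_model: str = "default"


def build_config(ns: Namespace) -> Config:
    cfg = Config()
    # Forward every GIL field present on the merged argparse namespace.
    # Skipping this loop is exactly the bug: YAML values silently ignored.
    for field in GIL_FIELDS:
        value = getattr(ns, field, None)
        if value is not None:
            setattr(cfg, field, value)
    return cfg
```

Logging the effective values right after construction (as the v1 housekeeping below does) makes a regression here immediately visible in paired-benchmark runs.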
## Implemented (v1 housekeeping)
- Paired configs: `config/manual/gil_paired_benchmark_openai_evidence.yaml` (OpenAI evidence, ML summaries) and `config/manual/gil_paired_benchmark_ml_evidence.yaml` (local evidence). Same feed and `max_episodes`; distinct `output_dir` under `.test_outputs/benchmark/`.
- Compare script: `make compare-gil-runs REF=<run_a> CAND=<run_b>` or `scripts/tools/compare_gil_runs.py` (logic in `src/podcast_scraper/gi/compare_runs.py`) — per-episode quote/grounded counts and an episode-level "both have quotes" summary.
- CLI: `_build_config` forwards `gi_embedding_model` (and other GIL keys); INFO log line for effective `gi_qa_score_min`/`gi_nli_entailment_min`/evidence providers/`gi_embedding_model` when `generate_gi` is true.
## Next steps (optional)
- [ ] Run both benchmark configs on the same machine; record the two `run_*` roots and archive metrics.
- [ ] Add span-level IoU or insight-text alignment in `compare_runs` (beyond quote counts).