Experiment Guide

This guide explains how to run AI experiments using the podcast_scraper benchmarking framework. Experiments allow you to test different models, prompts, and parameters on canonical datasets and compare results against frozen baselines.

For frozen resource-cost profiles (peak RSS, CPU%, wall time per pipeline stage) at release time, see the Performance Profile Guide (RFC-064). That track is separate from quality evaluation here.


Getting Started

Workflow order:

  1. Prepare Source Data (Step 0) -- Generate metadata and source indexes from RSS XML files
  2. Create a Dataset (Step 1) -- Create a canonical dataset from your eval data
  3. Materialize Dataset (Step 1a) -- Validate and materialize the dataset (optional but recommended)
  4. Create a Baseline (Step 2) -- Create a baseline using that dataset
  5. Run Experiments (Step 3) -- Run experiments that compare against the baseline
  6. Evaluate Results (Step 4) -- Review the metrics and baseline deltas the runner computes
  7. Promote (Step 5) -- Promote a run to baseline or reference when it earns that role

Why this order?

  • Datasets require source data with transcripts
  • Baselines require a dataset to know which episodes to process
  • Experiments require both a dataset (for input) and a baseline (for comparison)
  • Materialization validates dataset integrity before use

Overview

The experiment system consists of several components:

  1. Source Data -- Raw RSS XML files, transcripts, and metadata in data/eval/sources/
  2. Datasets -- Canonical, frozen sets of episodes with transcripts and golden references
  3. Materialized Datasets -- Validated, copied datasets in data/eval/materialized/
  4. Runs -- Execution results in data/eval/runs/ (temporary, disposable)
  5. Baselines -- Promoted runs that serve as comparison points in data/eval/baselines/
  6. References -- Promoted runs that serve as "truth" for evaluation metrics in data/eval/references/

Key Concepts

  • Source ID: Identifier for a source directory (e.g. curated_5feeds_raw_v1)
  • Dataset ID: Identifier for a canonical dataset (e.g. curated_5feeds_smoke_v1)
  • Run ID: Identifier for an execution result (e.g. run_2026-01-16_11-52-03)
  • Baseline ID: Identifier for a promoted baseline (e.g. baseline_ml_prod_authority_v1)
  • Reference ID: Identifier for a promoted reference (e.g. silver_sonnet46_smoke_v1)

Runs, baselines, and references

Core principle: execution is neutral. Meaning is assigned afterward.

Every baseline, reference, and experiment is produced by the same runner -- same code, same provider abstraction, same fingerprinting, same artifacts (predictions.jsonl, metrics.json, fingerprint.json). The distinction is post-run promotion, not execution.

Run (execution only)

A run is the result of executing the experiment runner. It lives in data/eval/runs/ and has no special meaning yet.

  • Temporary (can be deleted)
  • No governance rules
  • Just execution results

Baseline (promoted run)

A baseline is a run promoted to serve as a comparison point.

  • Required for experiments (experiments must specify a baseline)
  • Can block CI (regressions against this baseline can fail CI)
  • Immutable (cannot be overwritten)
  • Used as comparison point (not as "truth")
  • A baseline is the contract for default app behavior -- frozen, explicit, and intentional

Storage: data/eval/baselines/{baseline_id}/

Reference (promoted run)

A reference is a run promoted to serve as "truth" for evaluation metrics (e.g. ROUGE, embedding similarity).

  • Not required for experiments (experiments can run without references)
  • Cannot block CI (references are informational)
  • Rarely updated (only when truth changes)
  • Immutable (cannot be overwritten)

Storage:

  • Silver: data/eval/references/silver/{reference_id}/
  • Gold NER: data/eval/references/gold/ner_entities/{reference_id}/
  • Gold summarization: data/eval/references/gold/summarization/{reference_id}/

Reference quality:

  • Silver -- machine-generated by a strong model (e.g. Claude Sonnet 4.6)
  • Gold -- human-verified summaries or entity annotations

Baselines vs references vs experiments

Role              Purpose                                  Immutable  CI gating
Run               Execution result                         No         No
Baseline          Comparison point (default app behavior)  Yes        Yes
Silver reference  Approximate "truth" for metrics          Yes        No
Gold reference    Exact "truth" for metrics                Yes        Yes
Experiment        Test new approaches                      No         No

From baseline to app default

The workflow for promoting a baseline to app default:

  1. Run evaluation with your intended default config
  2. Validate metrics meet acceptance criteria
  3. Promote the run to a baseline (make run-promote)
  4. Promote the baseline into the Model Registry (RFC-044): make registry-promote
  5. Set summary_mode_id in config to the promoted mode
  6. Verify app behavior matches baseline

The app runtime never imports data/eval/. Proven baseline configs are promoted into the code registry as modes:

make registry-promote \
  BASELINE_ID=baseline_ml_prod_authority_v1 \
  MODE_ID=ml_prod_authority_v1

Step 0: Prepare Source Data

Prerequisites: You need RSS XML files and transcript files in data/eval/sources/.

Before creating datasets, you should:

  1. Generate episode metadata from RSS XML files
  2. Generate source indexes for inventory management

Generate Episode Metadata

Generate metadata JSON files from RSS XML files:

make metadata-generate INPUT_DIR=data/eval/sources

This will:

  • Scan data/eval/sources/ recursively for RSS XML files
  • Parse each RSS feed and extract episode metadata
  • Generate *.metadata.json files next to each XML file

Output format:

Each {episode_id}.metadata.json contains:

{
  "source_episode_id": "p01_e01",
  "feed_name": "Singletrack Sessions",
  "feed_url": "http://localhost/",
  "episode_title": "Episode 1: Building Trails That Last...",
  "published_at": "2025-09-01",
  "duration_seconds": 630,
  "language": "en",
  "scraped_at": "2026-01-13T12:07:56.657450Z"
}

Optional parameters:

make metadata-generate \
  INPUT_DIR=data/eval/sources \
  OUTPUT_DIR=data/eval/metadata \
  LOG_LEVEL=DEBUG

Generate Source Index

Create an inventory index for each source directory:

make source-index SOURCE_DIR=data/eval/sources/curated_5feeds_raw_v1

This will:

  • Scan the source directory for feed subdirectories
  • Find all transcript and metadata files
  • Compute SHA256 hashes for transcripts
  • Generate index.json in the source directory

Output format:

The index.json contains:

{
  "source_id": "curated_5feeds_raw_v1",
  "created_at": "2026-01-13T12:11:46.314843Z",
  "episodes": [
    {
      "source_episode_id": "p01_e01",
      "feed": "feed-p01",
      "transcript_path": "feed-p01/p01_e01.txt",
      "transcript_sha256": "a650e729cc8b7379c94fd5b29c092bcd32a8c7e4c2086f1321d6ed496718b9b4",
      "meta_path": "feed-p01/p01_e01.metadata.json"
    }
  ]
}

Process all sources:

make source-index SOURCE_DIR=data/eval/sources ALL=1

Benefits of source indexes:

  • Programmatic dataset generation
  • Drift detection (hash changes)
  • Dataset definition validation
  • Avoid ad-hoc directory scanning
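
As a rough illustration of drift detection, two index.json snapshots can be compared by transcript hash. This is a minimal sketch, not the project's implementation: the function name diff_indexes and the shape of its return value are assumptions; only the fields episodes, source_episode_id, and transcript_sha256 come from the index format shown above.

```python
from typing import Dict, List


def diff_indexes(old_index: dict, new_index: dict) -> Dict[str, List[str]]:
    """Compare two index.json payloads by transcript hash (hypothetical helper)."""
    old = {e["source_episode_id"]: e["transcript_sha256"] for e in old_index["episodes"]}
    new = {e["source_episode_id"]: e["transcript_sha256"] for e in new_index["episodes"]}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same episode ID but a different hash: the transcript content drifted.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

An empty "changed" list means every transcript still matches the hash recorded when the earlier index was generated.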

Step 1: Create a Dataset

Prerequisites: You need evaluation data in data/eval/ with transcript files (.txt files in any subdirectory).

Datasets are canonical, frozen sets of episodes stored as JSON files. The script recursively finds all .txt files in subdirectories and treats each as a transcript.

Quick Start: Predefined Datasets

For the curated 5 feeds source, we have three predefined datasets:

Smoke Test Dataset (first episode per feed):

make dataset-smoke

Creates data/eval/datasets/curated_5feeds_smoke_v1.json with 5 episodes.

Benchmark Dataset (first 2 episodes per feed):

make dataset-benchmark

Creates data/eval/datasets/curated_5feeds_benchmark_v1.json with 10 episodes.

Raw Dataset (all episodes):

make dataset-raw

Creates data/eval/datasets/curated_5feeds_raw_v1.json with all episodes.

Custom Dataset Creation

Using the Make Command (Recommended):

make dataset-create \
  DATASET_ID=indicator_v1 \
  EVAL_DIR=data/eval \
  DESCRIPTION="Lenny's Podcast evaluation episodes (interview style)"

Default values:

  • EVAL_DIR defaults to data/eval (can be omitted)
  • OUTPUT_DIR defaults to benchmarks/datasets (can be omitted)
  • DESCRIPTION defaults to "Dataset {DATASET_ID}" (can be omitted)

With all options:

make dataset-create \
  DATASET_ID=indicator_v1 \
  EVAL_DIR=data/eval \
  OUTPUT_DIR=data/eval/datasets \
  DESCRIPTION="Lenny's Podcast evaluation episodes (interview style)" \
  CONTENT_REGIME=narrative \
  MAX_EPISODES_PER_FEED=2

Filtering episodes:

Use MAX_EPISODES_PER_FEED to limit episodes per feed:

  • MAX_EPISODES_PER_FEED=1 - First episode per feed (smoke test)
  • MAX_EPISODES_PER_FEED=2 - First 2 episodes per feed (benchmark)
  • Omit parameter - All episodes (full dataset)
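
The per-feed limit can be pictured with the sketch below. It rests on two assumptions that the guide does not confirm: the feed is taken from the transcript's parent directory name, and "first" means first in sorted path order. The function name limit_per_feed is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import List, Optional


def limit_per_feed(transcript_paths: List[str], max_per_feed: Optional[int] = None) -> List[str]:
    """Keep at most max_per_feed transcripts per feed directory (None keeps all)."""
    by_feed = defaultdict(list)
    for path in transcript_paths:
        # Assumption: the parent directory name identifies the feed.
        by_feed[PurePosixPath(path).parent.name].append(path)
    selected: List[str] = []
    for feed in sorted(by_feed):
        episodes = sorted(by_feed[feed])  # Assumption: "first" means sorted order.
        selected.extend(episodes if max_per_feed is None else episodes[:max_per_feed])
    return selected
```

With five feeds, max_per_feed=1 yields 5 paths and max_per_feed=2 yields up to 10, which matches the sizes of the smoke and benchmark datasets described earlier.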

Using the Script Directly

python scripts/eval/data/create_dataset_json.py \
  --dataset-id indicator_v1 \
  --eval-dir data/eval \
  --output-dir data/eval/datasets \
  --description "Lenny's Podcast evaluation episodes (interview style)" \
  --max-episodes-per-feed 2

How it works:

  • Recursively scans data/eval/ for all .txt files
  • Derives episode IDs from filenames (without extension)
  • Looks for associated files:
      • {episode_id}.metadata.json - Episode metadata (new format)
      • metadata.json - Episode metadata (old format)
      • {episode_id}.raw.txt - Raw transcript
      • {episode_id}.summary.gold.long.txt - Long golden summary
      • {episode_id}.summary.gold.short.txt - Short golden summary
  • Computes SHA256 hashes for transcripts
  • Creates dataset JSON with all episode information

Dataset JSON Structure

A dataset JSON looks like this:

{
  "dataset_id": "curated_5feeds_smoke_v1",
  "version": "1.0",
  "description": "Smoke test dataset: first episode per feed from curated_5feeds_raw_v1",
  "created_at": "2026-01-13T12:22:41.258855Z",
  "content_regime": "explainer",
  "num_episodes": 5,
  "episodes": [
    {
      "episode_id": "p01_e01",
      "title": "Episode 1: Building Trails That Last (with Liam Verbeek)",
      "transcript_path": "data/eval/sources/curated_5feeds_raw_v1/feed-p01/p01_e01.txt",
      "transcript_hash": "a650e729cc8b7379c94fd5b29c092bcd32a8c7e4c2086f1321d6ed496718b9b4",
      "preprocessing_profile": "cleaning_v3",
      "duration_minutes": 10.5
    }
  ]
}

Manual Dataset Creation

You can also create dataset JSONs manually. Each episode must have:

  • episode_id: Unique identifier
  • transcript_path: Path to cleaned transcript file
  • transcript_hash: SHA256 hash of transcript content

Optional fields:

  • title: Episode title
  • preprocessing_profile: Profile used for cleaning (see Preprocessing Profiles Guide)
  • transcript_raw_path: Path to raw transcript
  • golden_summary_long_path: Path to long golden summary
  • golden_summary_short_path: Path to short golden summary
  • duration_minutes: Episode duration in minutes

Step 1a: Materialize Dataset

Prerequisites: You must have created a dataset first (Step 1).

Materialization validates dataset integrity and creates a clean, reproducible copy of all transcripts.

Why Materialize?

Materialization proves:

  • Dataset JSON is correct
  • Paths resolve correctly
  • Hashes match expected values
  • Materialization is reproducible

Materializing a Dataset

Using the Make Command (Recommended):

make dataset-materialize DATASET_ID=curated_5feeds_smoke_v1

With custom output directory:

make dataset-materialize \
  DATASET_ID=curated_5feeds_smoke_v1 \
  OUTPUT_DIR=data/eval/materialized

Using the Script Directly:

python scripts/eval/data/materialize_dataset.py \
  --dataset-id curated_5feeds_smoke_v1 \
  --output-dir data/eval/materialized

What Materialization Does

  1. Validates dataset JSON - Checks that all required fields are present
  2. Resolves paths - Verifies all transcript files exist
  3. Validates hashes - Computes SHA256 and compares to expected hash
  4. Copies transcripts - Creates clean copies in materialized directory
  5. Creates metadata - Generates episode and dataset metadata files

Hash validation:

If a transcript hash doesn't match, materialization fails with a clear error:

ERROR: Episode p01_e01: HASH MISMATCH - transcript file has been modified!
  Expected hash: abc123...
  Actual hash:   def456...
  File:          data/eval/sources/curated_5feeds_raw_v1/feed-p01/p01_e01.txt
  This indicates the transcript file has changed since the dataset was created.
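
The first three materialization checks (required fields, path resolution, hash comparison) can be sketched as follows. This is an illustration under stated assumptions, not the code in materialize_dataset.py: the helper name validate_episode, the root argument, and the message wording are hypothetical, and the hash is assumed to be SHA256 over the raw file bytes.

```python
import hashlib
from pathlib import Path
from typing import List

# Required episode fields, as listed under "Manual Dataset Creation".
REQUIRED_FIELDS = ("episode_id", "transcript_path", "transcript_hash")


def validate_episode(entry: dict, root: Path = Path(".")) -> List[str]:
    """Return a list of problems for one dataset episode (empty list means valid)."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        return [f"missing required fields: {', '.join(missing)}"]
    transcript = root / entry["transcript_path"]
    if not transcript.is_file():
        return [f"{entry['episode_id']}: transcript not found: {transcript}"]
    # Assumption: the stored hash covers the raw bytes of the transcript file.
    actual = hashlib.sha256(transcript.read_bytes()).hexdigest()
    if actual != entry["transcript_hash"]:
        return [f"{entry['episode_id']}: hash mismatch "
                f"(expected {entry['transcript_hash']}, actual {actual})"]
    return []
```

Returning messages instead of raising lets a caller collect problems across all episodes before deciding to fail.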

Materialized Dataset Structure

data/eval/materialized/curated_5feeds_smoke_v1/
├── meta.json                    # Dataset-level metadata
├── p01_e01.txt                  # Copied transcript
├── p01_e01.meta.json            # Episode metadata
├── p02_e01.txt
├── p02_e01.meta.json
└── ...

Dataset metadata (meta.json):

{
  "dataset_id": "curated_5feeds_smoke_v1",
  "source_dataset_file": "data/eval/datasets/curated_5feeds_smoke_v1.json",
  "num_episodes": 5,
  "materialized_at": "2026-01-13T12:22:41.258855Z",
  "episodes": [
    {
      "episode_id": "p01_e01",
      "transcript_path": "p01_e01.txt",
      "meta_path": "p01_e01.meta.json"
    }
  ]
}

Episode metadata ({episode_id}.meta.json):

{
  "episode_id": "p01_e01",
  "transcript_path": "p01_e01.txt",
  "transcript_hash": "a650e729cc8b7379c94fd5b29c092bcd32a8c7e4c2086f1321d6ed496718b9b4",
  "source_transcript_path": "/path/to/source/p01_e01.txt",
  "preprocessing_profile": "cleaning_v3",
  "title": "Episode 1: Building Trails That Last...",
  "duration_minutes": 10.5
}

Reproducibility

Materialization is reproducible - you can delete the materialized directory and regenerate it byte-for-byte:

rm -rf data/eval/materialized/curated_5feeds_smoke_v1
make dataset-materialize DATASET_ID=curated_5feeds_smoke_v1

Step 2: Create a Baseline

Prerequisites: You must have created a dataset first (Step 1). The baseline will use that dataset to know which episodes to process.

Baselines are frozen reference results from a known system state. They serve as comparison points for experiments.

Use the make command to materialize a baseline:

make baseline-create \
  BASELINE_ID=bart_led_baseline_v1 \
  DATASET_ID=curated_5feeds_smoke_v1

With optional experiment config:

make baseline-create \
  BASELINE_ID=bart_led_baseline_v1 \
  DATASET_ID=curated_5feeds_smoke_v1 \
  EXPERIMENT_CONFIG=data/eval/configs/baseline_config.yaml \
  PREPROCESSING_PROFILE=cleaning_v3

Creating a Baseline with the Script

Alternatively, you can call the script directly:

python scripts/eval/data/materialize_baseline.py \
  --baseline-id bart_led_baseline_v1 \
  --dataset-id curated_5feeds_smoke_v1 \
  --experiment-config data/eval/configs/baseline_config.yaml \
  --preprocessing-profile cleaning_v3

This will:

  • Load the dataset JSON (created in Step 1)
  • Process each episode using the specified configuration
  • Save predictions to benchmarks/baselines/{baseline_id}/predictions/
  • Generate metadata, fingerprints, and metrics

Important: Baselines are immutable. You cannot overwrite an existing baseline.

Baseline Structure

A baseline directory contains:

benchmarks/baselines/bart_led_baseline_v1/
├── metadata.json          # Baseline metadata (dataset_id, git commit, stats)
├── fingerprint.json       # System fingerprint (model, version, device)
├── metrics.json           # Aggregate metrics
├── config.yaml            # Experiment config used (if provided)
├── predictions/           # Individual episode predictions
│   ├── ep01.json
│   ├── ep02.json
│   └── ...
└── artifacts/             # Additional artifacts (if any)

Baseline Metadata

The metadata.json includes:

  • baseline_id: Unique identifier
  • dataset_id: Dataset used
  • created_at: Timestamp
  • git_commit: Git commit SHA when baseline was created
  • git_is_dirty: Whether repo had uncommitted changes
  • provider_type: Provider used (e.g., "OpenAIProvider")
  • model_name: Model name
  • preprocessing_profile: Preprocessing profile ID (see Preprocessing Profiles Guide)
  • stats: Processing statistics (num_episodes, avg_time, compression, etc.)

Step 3: Run an Experiment

Experiments test new configurations against datasets and compare results to baselines.

Creating an Experiment Config

Create a YAML file (e.g., data/eval/configs/my_experiment.yaml):

id: "summarization_openai_long_v2"
task: "summarization"

backend:
  type: "openai"
  model: "gpt-4o-mini"

prompts:
  system: "summarization/system_v1"
  user: "summarization/long_v2_more_narrative"
  params:
    paragraphs_min: 3
    paragraphs_max: 6

data:
  dataset_id: "curated_5feeds_smoke_v1"  # Use dataset-based mode (recommended)

params:
  max_output_tokens: 900
  temperature: 0.7

# Contract fields (RFC-015)
dataset_id: "curated_5feeds_smoke_v1"
baseline_id: "bart_led_baseline_v1"
golden_required: true
golden_ref: "data/eval"  # Path to golden references

Grounded insights (GIL) and knowledge graph (KG) experiments

For transcript-only evaluation on a materialized dataset (no RSS run), use separate configs and runs—one task per YAML:

  • task: grounded_insights — GI extraction + evidence grounding. Supports all backends: eval_stub (skip inference), openai, gemini, anthropic, deepseek, mistral, grok, ollama. With LLM backends, the evidence stack (QA + NLI) auto-aligns to the summary provider for quote extraction and entailment scoring. Sample: data/eval/configs/gil_gemini_benchmark_v2_provider.yaml.
  • task: knowledge_graph — KG topic/entity extraction. Same backend support. Sample: data/eval/configs/kg_gemini_benchmark_v2_provider.yaml.

Key config params for GI/KG experiments:

params:
  gi_insight_source: provider     # "provider" (recommended) or "summary_bullets"
  gi_max_insights: 12             # sweet spot from autoresearch
  gi_require_grounding: true
  kg_extraction_source: provider  # "provider" (recommended) or "summary_bullets"
  kg_max_topics: 10               # sweet spot from autoresearch

NER experiments

task: ner_entities supports all backend types: spacy_local (original), openai, gemini, anthropic, deepseek, mistral, grok, ollama. LLM backends use a structured NER prompt to extract PERSON + ORG entities. Score against gold references in data/eval/references/gold/ner_entities/. Use entity_set scoring mode for name-level comparison (position-agnostic).

Sample: data/eval/configs/ner/ner_openai_gpt4omini_smoke_v1.yaml.

Data Configuration Modes

The experiment runner supports two data configuration modes:

Dataset-Based Mode (Recommended)

data:
  dataset_id: "curated_5feeds_smoke_v1"

This loads episode information from data/eval/datasets/curated_5feeds_smoke_v1.json (or benchmarks/datasets/ if not found). Episode IDs are taken directly from the dataset JSON.

Glob-Based Mode (Legacy)

data:
  episodes_glob: "data/episodes/ep*/transcript.txt"
  id_from: "parent_dir"  # or "stem"

This uses glob patterns to discover files. Episode IDs are derived from paths using the id_from rule.

Note: You cannot specify both dataset_id and episodes_glob in the same config.
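
The two id_from rules and the mutual-exclusion note can be illustrated with the sketch below. The interpretation is an assumption: "parent_dir" is read as the name of the directory containing the file and "stem" as the filename without its extension. Both function names are hypothetical.

```python
from pathlib import PurePosixPath


def episode_id_from_path(path: str, id_from: str) -> str:
    """Derive an episode ID from a transcript path (assumed semantics)."""
    p = PurePosixPath(path)
    if id_from == "parent_dir":
        return p.parent.name  # data/episodes/ep01/transcript.txt -> ep01
    if id_from == "stem":
        return p.stem  # feed-p01/p01_e01.txt -> p01_e01
    raise ValueError(f"unsupported id_from rule: {id_from!r}")


def check_data_config(data: dict) -> None:
    """Reject configs that set both dataset_id and episodes_glob, or neither."""
    has_dataset = "dataset_id" in data
    has_glob = "episodes_glob" in data
    if has_dataset == has_glob:
        raise ValueError("specify exactly one of dataset_id or episodes_glob")
```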

Running an Experiment

Prerequisites: You must have created both a dataset (Step 1) and a baseline (Step 2). The experiment will use the dataset for input and compare against the baseline.

export OPENAI_API_KEY="your-key-here"
make experiment-run CONFIG=data/eval/configs/my_experiment.yaml

With custom log level:

make experiment-run CONFIG=data/eval/configs/my_experiment.yaml LOG_LEVEL=DEBUG

Running an Experiment with the Script

Alternatively, you can call the script directly:

export OPENAI_API_KEY="your-key-here"
python scripts/eval/experiment/run_experiment.py data/eval/configs/my_experiment.yaml

The experiment runner will:

  1. Validate the experiment contract (dataset_id, baseline_id, etc.)
  2. Load the dataset and discover input files
  3. Process each episode with the specified provider
  4. Save predictions to results/{experiment_id}/predictions.jsonl
  5. Generate metadata, fingerprints, and statistics

Experiment Results

Results are saved to results/{experiment_id}/:

results/summarization_openai_long_v2/
├── predictions.jsonl      # One JSON object per episode (input/output/hashes/timing)
├── run_metadata.json      # Experiment metadata (config, stats, contract info)
└── fingerprint.json       # System fingerprint

Understanding Predictions

Each line in predictions.jsonl contains:

{
  "episode_id": "p01_e01",
  "input_path": "data/eval/sources/curated_5feeds_raw_v1/feed-p01/p01_e01.txt",
  "input_hash": "a650e729cc8b7379c94fd5b29c092bcd32a8c7e4c2086f1321d6ed496718b9b4",
  "output": "Summary text here...",
  "output_hash": "abc123...",
  "processing_time_seconds": 2.5,
  "input_length_chars": 50000,
  "output_length_chars": 500
}
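
Because predictions.jsonl holds one JSON object per line, it can be read with the standard library alone. The sketch below is illustrative: parse_predictions and compression_ratio are hypothetical helpers, and defining compression as input characters divided by output characters is an assumption rather than the project's documented formula.

```python
import json
from typing import Iterable, List


def parse_predictions(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines into prediction records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def compression_ratio(record: dict) -> float:
    """Assumed definition: input length in characters over output length."""
    return record["input_length_chars"] / record["output_length_chars"]
```

An open file object can be passed directly, since iterating a text file yields its lines.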

Step 4: Evaluate Results

Note: This section covers experiment and eval metrics only. For pytest, coverage, and GitHub Pages test metrics, see Test dashboard (GitHub Pages).

Evaluation is handled by the experiment runner. When you run an experiment with the --baseline and/or --reference flags, the system automatically:

  1. Computes intrinsic metrics (gates, length, performance, cost)
  2. Computes vs_reference metrics (ROUGE, embedding similarity) if references are provided
  3. Computes deltas vs baseline if baseline is provided

Metrics Calculation Flow

experiment-run → run_experiment.py → score_run() → metrics.json
  1. Run Experiment: scripts/eval/experiment/run_experiment.py processes episodes and generates predictions.jsonl
  2. Compute Metrics: score_run() in src/podcast_scraper/evaluation/scorer.py reads predictions and computes metrics
  3. Save Results: Metrics are saved to data/eval/runs/<run_id>/metrics.json and metrics_report.md

Running Experiments with Evaluation

To run an experiment with full evaluation, use the --baseline and/or --reference flags:

make experiment-run \
  CONFIG=experiments/my_experiment.yaml \
  BASELINE=bart_led_baseline_v1 \
  REFERENCE=silver_gpt52_v1,gold_human_v1

Arguments:

  • CONFIG (required) - Experiment config YAML
  • BASELINE (optional) - Baseline ID for comparison
  • REFERENCE (optional, comma-separated) - Reference IDs for evaluation (can be silver/gold)
  • LOG_LEVEL (optional) - Logging level

Evaluation Architecture

The evaluation system consists of three separate roles that work together:

  1. Runner - Produces outputs (predictions + fingerprint + run metadata)
  2. Scorer - Computes metrics (gates, stability, cost/latency, and optionally "vs reference" metrics)
  3. Comparator - Computes deltas vs baseline

These roles are kept separate in code, even though they can be wired together in one script.

Runner (Execution)

The runner executes the experiment and produces:

  • predictions.jsonl - Model outputs for all episodes
  • fingerprint.json - System fingerprint (reproducibility)
  • run_metadata.json - Experiment metadata

Location: scripts/eval/experiment/run_experiment.py (runner phase)

Scorer (Metrics)

The scorer computes metrics from predictions. Metrics are divided into two categories:

Intrinsic Metrics

Intrinsic metrics are computed from predictions alone and don't require reference summaries. They include:

1. Quality Gates

Detect common issues in generated summaries:

  • boilerplate_leak_rate: Fraction of episodes with promotional/sponsor content leaks
      • Patterns detected: "subscribe to our newsletter", "follow us on", "rate and review", etc.
  • speaker_label_leak_rate: Fraction of episodes with speaker labels leaking through (FAIL gate)
      • Patterns detected: "Host:", "Guest:", "Speaker 1:", "Interviewer:", etc.
      • This is the main summarization gate - should be 0.0
  • truncation_rate: Fraction of episodes that appear truncated
      • Detected by truncation markers ("...", "[TRUNCATED]") or suspiciously short outputs
  • failed_episodes: List of episode IDs that failed quality gates

Warnings (Not Gates):

  • speaker_name_leak_rate: Fraction of episodes with actual speaker names leaking through (WARN only)
      • Detects actual names from metadata (e.g., "Alice", "Bob") appearing in summaries
      • This is tracked for monitoring but does not cause gate failures
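
A pattern-based gate of this kind might look like the sketch below. It is not the scorer's actual code: the two regular expressions only cover the example phrases quoted above, matching speaker labels at the start of a line is an assumption, and the truncation gate and name-leak warning are left out for brevity. The name gate_rates is hypothetical.

```python
import re
from typing import Dict

# Illustrative patterns only, built from the example phrases listed above.
BOILERPLATE = re.compile(
    r"subscribe to our newsletter|follow us on|rate and review", re.IGNORECASE
)
# Assumption: a speaker label appears at the start of a line.
SPEAKER_LABEL = re.compile(r"^\s*(Host|Guest|Speaker \d+|Interviewer):", re.MULTILINE)


def gate_rates(summaries: Dict[str, str]) -> dict:
    """Compute leak rates as the fraction of episodes matching each pattern."""
    boiler = [eid for eid, text in summaries.items() if BOILERPLATE.search(text)]
    labels = [eid for eid, text in summaries.items() if SPEAKER_LABEL.search(text)]
    total = len(summaries) or 1  # Avoid division by zero on an empty run.
    return {
        "boilerplate_leak_rate": len(boiler) / total,
        "speaker_label_leak_rate": len(labels) / total,
        "failed_episodes": sorted(set(boiler) | set(labels)),
    }
```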

2. Length Metrics

Token-based length statistics:

  • avg_tokens: Average number of tokens per summary (estimated as chars/4)
  • min_tokens: Minimum tokens across all summaries
  • max_tokens: Maximum tokens across all summaries
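
The chars/4 estimate is simple enough to show directly. In this sketch the function name length_metrics is hypothetical, and rounding down with integer division is an assumption, since the guide only states the chars/4 ratio.

```python
from typing import Iterable


def length_metrics(summaries: Iterable[str]) -> dict:
    """Estimate token counts as len(text) // 4 and aggregate them."""
    tokens = [len(text) // 4 for text in summaries]  # Assumption: round down.
    if not tokens:
        return {"avg_tokens": 0, "min_tokens": 0, "max_tokens": 0}
    return {
        "avg_tokens": sum(tokens) / len(tokens),
        "min_tokens": min(tokens),
        "max_tokens": max(tokens),
    }
```

Because this is a character-based estimate, the values will differ from counts produced by a model's real tokenizer.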

3. Performance Metrics

Latency measurements:

  • avg_latency_ms: Average processing time per episode in milliseconds
      • Extracted from metadata.processing_time_seconds in predictions

4. Cost Metrics (OpenAI Only)

Note: Cost metrics are only included for OpenAI runs. ML model runs skip this section entirely.

  • avg_cost_usd: Average cost per episode in USD
  • total_cost_usd: Total cost for all episodes in USD

Cost is computed from:

  • metadata.cost_usd (if directly provided by provider)
  • metadata.usage (token counts) with model-specific pricing:
      • GPT-4o-mini: $0.15/1M input, $0.60/1M output
      • GPT-4o: $2.50/1M input, $10.00/1M output

Location: src/podcast_scraper/evaluation/scorer.py
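
The cost rule described above can be sketched as follows. The prices are the ones listed in this section; everything else is an assumption made for illustration: the usage keys input_tokens and output_tokens, the precedence of cost_usd over usage, and the function name episode_cost_usd.

```python
# USD per 1M tokens as (input, output), copied from the prices listed above.
PRICING = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}


def episode_cost_usd(metadata: dict, model: str) -> float:
    """Return the provider-reported cost, else derive it from token usage."""
    if "cost_usd" in metadata:
        return metadata["cost_usd"]  # Assumption: a reported cost takes precedence.
    usage = metadata["usage"]  # Assumption: keys are input_tokens / output_tokens.
    input_price, output_price = PRICING[model]
    return (
        usage["input_tokens"] * input_price + usage["output_tokens"] * output_price
    ) / 1_000_000
```

For example, 1,000,000 input tokens and 500,000 output tokens on gpt-4o-mini come to roughly $0.15 + $0.30 = $0.45.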

Comparator (Deltas)

The comparator computes deltas between experiment and baseline:

  • Cost deltas
  • Latency deltas
  • Gate regressions
  • ROUGE deltas (if both have same references)

Location: src/podcast_scraper/evaluation/comparator.py

Reference Model

References are optional evaluation targets. You can have:

  • Baseline (optional but usually required for experiments) - for regression detection
  • Silver references (optional) - machine-generated, higher quality
  • Gold references (optional) - human-verified summaries

Key principle: A reference is anything that looks like a run output (predictions.jsonl + fingerprint.json + baseline.json).

vs_reference Metrics

vs_reference metrics compare your predictions against reference summaries (golden or silver standards). These are optional and only computed when references are provided.

When is vs_reference null?

vs_reference is null when:

  • No references were provided via --reference CLI argument or REFERENCE_IDS Makefile variable
  • The experiment was run without reference evaluation

This is the normal state for most runs - references are optional and only needed when you want to compare against golden/silver standards.

How to provide references

# Single reference via Makefile
make experiment-run CONFIG=... REFERENCE_IDS=golden_v1

# Multiple references via Makefile
make experiment-run CONFIG=... REFERENCE_IDS="golden_v1 silver_v2"

# Via CLI
python scripts/eval/experiment/run_experiment.py config.yaml --reference golden_v1 --reference silver_v2

Reference Structure

References can be:

  • Baselines: data/eval/baselines/<baseline_id>/
  • References:
      • Silver: data/eval/references/silver/<reference_id>/
      • Gold NER: data/eval/references/gold/ner_entities/<reference_id>/
      • Gold Summarization: data/eval/references/gold/summarization/<reference_id>/
      • Gold GIL: data/eval/references/gold/gil/<reference_id>/ ({episode_id}.json per episode)
      • Gold KG: data/eval/references/gold/kg/<reference_id>/ ({episode_id}.json per episode)
  • Legacy baselines: benchmarks/baselines/<baseline_id>/

Reference payloads: Silver references and summarization-style gold often include predictions.jsonl with the same episode IDs as your run. Gold NER, GIL, and KG may instead use per-episode JSON files only (no predictions.jsonl); rescore_baseline and baseline materialization accept either pattern when resolving references.

vs_reference Metrics Computed

When references are provided, the following metrics are computed:

  1. reference_quality: Metadata about the reference (episode count, quality level, etc.)

  2. ROUGE Scores (requires rouge-score package):
      • rouge1_f1: ROUGE-1 F1 score (unigram overlap) - measures coverage
      • rouge2_f1: ROUGE-2 F1 score (bigram overlap) - measures local coherence
      • rougeL_f1: ROUGE-L F1 score (longest common subsequence) - measures structural similarity

  3. BLEU Score (requires nltk package):
      • bleu: BLEU score (n-gram precision with brevity penalty)

  4. WER (Word Error Rate) (requires jiwer package):
      • wer: Word-level edit distance normalized by reference length

  5. Embedding Similarity (requires sentence-transformers package):
      • embedding_similarity: Cosine similarity between embeddings of predictions and references

  6. numbers_retained: Fraction of reference numbers retained in predictions (average over episodes); implemented in evaluation/scorer.py (_extract_numbers, _compute_numbers_retained). Omitted when the reference has no numbers.
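
To make the wer definition concrete, here is a from-scratch sketch of word-level edit distance divided by reference length. It is for illustration only: the scorer relies on the jiwer package, whose text normalization may give different values, and splitting on whitespace is an assumption. The function name word_error_rate is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance normalized by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1  # Reference word missing from hypothesis.
            insertion = curr[j - 1] + 1  # Extra word in hypothesis.
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Lower is better: 0.0 means the word sequences are identical, and values above 1.0 are possible when the hypothesis is much longer than the reference.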

Example vs_reference Structure

{
  "vs_reference": {
    "golden_v1": {
      "reference_quality": {
        "episode_count": 5,
        "quality_level": "gold"
      },
      "rouge1_f1": 0.45,
      "rouge2_f1": 0.32,
      "rougeL_f1": 0.42,
      "bleu": 0.38,
      "wer": 0.15,
      "embedding_similarity": 0.87
    },
    "silver_v2": {
      "reference_quality": {
        "episode_count": 5,
        "quality_level": "silver"
      },
      "rouge1_f1": 0.42,
      "rouge2_f1": 0.19,
      "rougeL_f1": 0.39,
      "bleu": 0.35,
      "wer": 0.18,
      "embedding_similarity": 0.85
    }
  }
}

Key points:

  • Each reference ID becomes a key in the vs_reference dictionary
  • All metrics are computed independently for each reference
  • Missing dependencies (e.g., rouge-score not installed) will result in null values for those metrics
  • You can compare against multiple references in a single run

Metrics Structure

metrics.json

The scorer generates a metrics.json file with the following structure:

{
  "dataset_id": "curated_5feeds_benchmark_v1",
  "run_id": "run_2026-01-16_12-10-03",
  "episode_count": 10,

  "intrinsic": {
    "gates": {
      "speaker_label_leak_rate": 0.0,
      "boilerplate_leak_rate": 0.0,
      "truncation_rate": 0.0,
      "failed_episodes": []
    },
    "length": {
      "avg_tokens": 420,
      "min_tokens": 310,
      "max_tokens": 560
    },
    "performance": {
      "avg_latency_ms": 1800
    },
    "cost": {
      "total_cost_usd": 0.14,
      "avg_cost_usd": 0.014
    }
  },

  "vs_reference": null
}

Key points:

  • intrinsic - Always present (computed from predictions alone)
  • vs_reference - null when no references provided, or a dictionary with reference IDs as keys when references are provided
  • Cost section is only included for OpenAI runs (ML models skip it entirely)
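
Code that consumes metrics.json needs to handle both optional parts: the cost section and a null vs_reference. A minimal sketch, assuming the structure shown above (the helper name summarize_metrics and its output keys are hypothetical):

```python
def summarize_metrics(metrics: dict) -> dict:
    """Pull a few headline values out of a metrics.json payload."""
    intrinsic = metrics["intrinsic"]
    cost = intrinsic.get("cost")  # Absent for non-OpenAI runs.
    vs_reference = metrics.get("vs_reference") or {}  # null when no references.
    return {
        "gates_clean": not intrinsic["gates"]["failed_episodes"],
        "total_cost_usd": cost["total_cost_usd"] if cost else None,
        "reference_ids": sorted(vs_reference),
    }
```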

metrics_report.md

Human-readable markdown report with formatted metrics, suitable for viewing in GitHub or documentation. Includes formatted tables and summaries of all computed metrics.

comparisons/vs_{baseline_id}.json

The comparator generates comparison files with deltas:

{
  "baseline_id": "baseline_prod_authority_v1",
  "dataset_id": "curated_5feeds_benchmark_v1",
  "experiment_run_id": "run_2026-01-16_12-10-03",
  "deltas": {
    "cost_total_usd": -0.05,
    "avg_latency_ms": 120,
    "gate_regressions": [],
    "rougeL_f1_vs_silver_gpt52_v1": 0.01
  }
}

Key points:

  • Deltas are computed as: experiment_value - baseline_value
  • gate_regressions is a list of gate names that regressed
  • ROUGE deltas are included if both experiment and baseline have the same reference
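
The delta rule (experiment_value - baseline_value) can be sketched over two metrics.json payloads. This is an illustration, not the code in comparator.py: treating a gate as regressed when its rate is higher than the baseline's is an assumption, ROUGE deltas are omitted, and the names compute_deltas and GATE_KEYS are hypothetical.

```python
GATE_KEYS = ("speaker_label_leak_rate", "boilerplate_leak_rate", "truncation_rate")


def compute_deltas(experiment: dict, baseline: dict) -> dict:
    """Compute experiment minus baseline for latency, cost, and gate rates."""
    exp, base = experiment["intrinsic"], baseline["intrinsic"]
    deltas = {
        "avg_latency_ms": exp["performance"]["avg_latency_ms"]
        - base["performance"]["avg_latency_ms"],
        # Assumption: a gate regresses when its rate rises above the baseline's.
        "gate_regressions": [k for k in GATE_KEYS if exp["gates"][k] > base["gates"][k]],
    }
    # Cost is only present for some runs, so compare it only when both have it.
    if "cost" in exp and "cost" in base:
        deltas["cost_total_usd"] = exp["cost"]["total_cost_usd"] - base["cost"]["total_cost_usd"]
    return deltas
```

A negative cost or latency delta means the experiment is cheaper or faster than the baseline.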

Reference Validation

For every reference (baseline/silver/gold), the system enforces:

  1. Episode ID match: Episode IDs match exactly (no missing/extra)
  2. Immutable: Reference is write-once (cannot be overwritten)

If either check fails, scoring refuses to run.
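
The episode ID check amounts to a set comparison. A minimal sketch, with a hypothetical function name and error message:

```python
from typing import Iterable


def check_episode_ids(run_ids: Iterable[str], reference_ids: Iterable[str]) -> None:
    """Raise ValueError unless the run and the reference cover the same episodes."""
    run_set, ref_set = set(run_ids), set(reference_ids)
    missing = sorted(ref_set - run_set)  # In the reference but not in the run.
    extra = sorted(run_set - ref_set)  # In the run but not in the reference.
    if missing or extra:
        raise ValueError(
            f"episode ID mismatch: missing from run={missing}, not in reference={extra}"
        )
```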

Reference Pack Structure

A reference pack should contain at minimum:

# Silver references
references/silver/{reference_id}/
├── predictions.jsonl      # Reference text per episode
├── fingerprint.json        # How reference was generated
├── baseline.json           # Reference metadata (reference_quality)
└── config.yaml             # Config used (optional)

# Gold NER references
references/gold/ner_entities/{reference_id}/
├── index.json              # Index of episodes
├── {episode_id}.json       # Gold entities per episode
└── README.md               # Reference documentation

# Gold summarization references
references/gold/summarization/{reference_id}/
├── predictions.jsonl      # Gold summaries per episode
└── README.md              # Reference documentation

# Gold GIL / KG references (eval vs_reference)
references/gold/gil/{reference_id}/
├── {episode_id}.json      # Gold GIL payload per episode (same shape as output.gil)
└── README.md              # Optional

references/gold/kg/{reference_id}/
├── {episode_id}.json      # Gold KG payload per episode (same shape as output.kg)
└── README.md              # Optional

Note: A baseline can also be promoted to a reference pack, because a reference is anything that looks like a run output.

Evaluation Results

When you run an experiment with evaluation, results are saved to results/{experiment_id}/:

results/summarization_openai_long_v2/
├── predictions.jsonl      # Model outputs for all episodes
├── fingerprint.json       # System fingerprint
├── run_metadata.json      # Experiment metadata
├── metrics.json           # Intrinsic + vs_reference metrics
└── comparisons/
    └── vs_baseline_prod_authority_v1.json  # Deltas vs baseline

Key Design Decisions

1. Separation of Concerns

  • Runner = execution only
  • Scorer = metrics computation
  • Comparator = delta computation

This allows:

  • Recomputing metrics without re-running inference
  • Recomputing comparisons without re-running inference
  • Testing each component independently

2. Optional References

References are optional because:

  • You can do rigorous evaluation without goldens (Phase 1)
  • You can add references incrementally (Phase 2/3)
  • Different experiments may need different references

3. Reference as "Anything"

A reference is anything that looks like a run output:

  • Baseline can be a reference
  • Silver reference can be a reference
  • Gold reference can be a reference

This keeps the system flexible.

4. Metrics vs Comparisons

  • Metrics = absolute facts about this run (+ vs reference scores)
  • Comparisons = deltas between two runs

This separation allows recomputing comparisons later without re-running inference.

Pipeline run metrics (download resilience)

The main podcast pipeline (CLI run_pipeline, service mode) writes a different metrics.json under each run directory (see Pipeline and Workflow). For download hardening observability it includes:

| Field | Meaning |
| --- | --- |
| http_urllib3_retry_events | Count of urllib3 retry scheduling events (transient connection/read/status errors) since the downloader was last configured (pipeline start); process-wide |
| episode_download_retries | Number of application-level episode download retries (after urllib3 exhausted), when episode_retry_max > 0 |
| episode_download_retry_sleep_seconds | Sum of configured backoff sleeps before those episode-level retries |
| host_throttle_wait_seconds | Time spent in per-host throttle or Retry-After alignment waits (Issue #522) |
| host_throttle_events | Count of those wait episodes |
| retry_after_events | Count of Retry-After-driven sleeps recorded in policy metrics |
| retry_after_total_sleep_seconds | Sum of those sleeps |
| circuit_breaker_trips | Times a circuit opened |
| circuit_breaker_open_feeds | Scope keys that opened (feed URL or host label) |
| rss_conditional_hit | RSS responses served from cache after HTTP 304 |
| rss_conditional_miss | RSS full-body 200 responses under conditional GET |

Semantics: http_urllib3_retry_events is shared across all threads in the process. A normal single run_pipeline is fine. Do not run two full pipelines concurrently in one process if you need that field to represent one run only (use separate processes). See CONFIGURATION.md -- Download resilience (threading and metrics) and WIP: concurrent pipelines and HTTP retry metrics.

Configuration: CONFIGURATION.md -- Download resilience. ADR-028 documents LLM/API provider retry metrics, not these HTTP download counters.
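
For triage it is often enough to list the counters that fired. The sketch below assumes the fields above are top-level keys of the pipeline's metrics.json, which the text does not confirm; nonzero_download_counters is a hypothetical helper and the sample values are made up.

```python
RETRY_FIELDS = (
    "http_urllib3_retry_events",
    "episode_download_retries",
    "host_throttle_events",
    "retry_after_events",
    "circuit_breaker_trips",
)


def nonzero_download_counters(metrics: dict) -> dict:
    """Return only the resilience counters that fired during the run."""
    return {name: metrics[name] for name in RETRY_FIELDS if metrics.get(name)}


# Hypothetical run: two urllib3 retries, one episode-level retry, nothing else.
sample = {
    "http_urllib3_retry_events": 2,
    "episode_download_retries": 1,
    "host_throttle_events": 0,
    "circuit_breaker_trips": 0,
}
fired = nonzero_download_counters(sample)
print(fired)
```

Counters that are absent or zero are skipped, so an empty result suggests the run never had to retry or wait.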


Complete Workflow Example

Here's a complete example workflow:

# Step 0: Prepare source data
make metadata-generate INPUT_DIR=data/eval/sources
make source-index SOURCE_DIR=data/eval/sources/curated_5feeds_raw_v1

# Step 1: Create datasets
make dataset-smoke      # Creates curated_5feeds_smoke_v1 (5 episodes)
make dataset-benchmark  # Creates curated_5feeds_benchmark_v1 (10 episodes)
make dataset-raw        # Creates curated_5feeds_raw_v1 (all episodes)

# Step 1a: Materialize dataset (validate integrity)
make dataset-materialize DATASET_ID=curated_5feeds_smoke_v1

# Step 2: Create baseline
make baseline-create \
  BASELINE_ID=bart_led_baseline_v1 \
  DATASET_ID=curated_5feeds_smoke_v1

# Step 3: Run experiment
export OPENAI_API_KEY="your-key-here"
make experiment-run CONFIG=data/eval/configs/my_experiment.yaml

# Step 4: Run experiment with evaluation
make experiment-run \
  CONFIG=data/eval/configs/my_experiment.yaml \
  BASELINE=bart_led_baseline_v1 \
  REFERENCE=silver_sonnet46_smoke_v1

# Results are automatically computed:
# - data/eval/runs/<run_id>/metrics.json (intrinsic + vs_reference)
# - data/eval/runs/<run_id>/comparisons/vs_<baseline_id>.json (deltas)

# Step 5: Promote if results are good
make run-promote \
  RUN_ID=run_2026-04-12_14-30-00 \
  AS=baseline \
  PROMOTED_ID=baseline_prod_authority_v2 \
  REASON="Improved preprocessing, all gates pass"

Best Practices

Source Data Management

  • Generate metadata first: Always generate metadata from RSS XML before creating datasets
  • Create source indexes: Use indexes for inventory management and drift detection
  • Freeze source data: Once datasets are created, avoid modifying source transcripts

Dataset Management

  • Freeze datasets: Once created, datasets should be immutable
  • Version datasets: Use versioned IDs (e.g., curated_5feeds_smoke_v1, curated_5feeds_smoke_v2)
  • Document datasets: Include clear descriptions and content regime
  • Materialize datasets: Always materialize datasets to validate integrity before use
  • Use appropriate sizes: Use smoke datasets for quick tests, benchmark datasets for evaluation, raw datasets for comprehensive analysis

Baseline Management

  • Create baselines on clean commits: Avoid creating baselines with uncommitted changes
  • Document baseline purpose: Use descriptive baseline IDs
  • Version baselines: Use versioned IDs (e.g., bart_led_baseline_v1, bart_led_baseline_v2)
  • Never overwrite: Baselines are immutable - create new ones for changes

Experiment Management

  • Use descriptive IDs: Include model, task, and version in experiment ID
  • Always specify baseline: Experiments must compare against a baseline
  • Validate contracts: Ensure dataset_id matches between experiment and baseline
  • Track golden references: Use golden_required: true when evaluation is needed

Download resilience (when the pipeline fetches RSS or transcripts)

Eval workflows often use materialized or local inputs so HTTP is out of the path. When an experiment or acceptance run does hit the network (real RSS, transcript URLs), align retry policy with the job:

  • Long batch / production-like ingestion: Rely on defaults or raise http_retry_total, rss_retry_total, and episode_retry_max if feeds are flaky. See CONFIGURATION.md -- Download resilience and recommended presets (canonical).
  • Polite HTTP toward CDNs (high parallelism, same host for RSS and media): Use the Polite HTTP preset in CONFIGURATION.md -- Recommended presets (copy keys into your YAML).
  • Fast CI or smoke: Lower retries (episode_retry_max: 0, small http_retry_total) so failures fail quickly; use the Fast fail preset / minimal-retry YAML in the same CONFIGURATION section.
  • Triage: Use failure_summary in run.json and http_urllib3_retry_events / episode_download_retries in metrics.json after the run (see Pipeline run metrics above).

CLI overrides: CLI.md -- Control Options (--http-retry-total, --episode-retry-max, etc.).

Workflow

  1. Prepare source data -- make metadata-generate then make source-index
  2. Create dataset -- make dataset-smoke / make dataset-benchmark / make dataset-raw
  3. Materialize dataset -- make dataset-materialize DATASET_ID=... (recommended)
  4. Create baseline -- make baseline-create BASELINE_ID=... DATASET_ID=...
  5. Run experiment -- make experiment-run CONFIG=... BASELINE=... REFERENCE=...
  6. Promote -- make run-promote RUN_ID=... AS=baseline PROMOTED_ID=... REASON="..."

Experiment Lifecycle Management

When iterating on ML models and preprocessing, you'll make many small changes. Having a clear strategy for what to keep and what to delete prevents clutter while preserving important reference points.

The General Strategy

┌─────────────────────────────────────────────────────────────────────────┐
│                    EXPERIMENT LIFECYCLE                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   KEEP                              DELETE                              │
│   ────                              ──────                              │
│   • Configs you might reuse         • Most intermediate runs            │
│   • One "before" run per major      • Most exploratory configs          │
│     change (frozen)                 • Failed experiment attempts        │
│   • All promoted baselines          • Superseded parameter sweeps       │
│   • Committed references                                                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

What to Keep

1. Configs You Might Reuse

Archive useful configurations before deleting them:

# Archive all current configs with today's date
make configs-archive

# Creates: data/eval/configs/_archive/configs_YYYY-MM-DD/

This preserves your parameter sweep configurations for reference without cluttering the active configs directory.

2. One "Before" Run Per Major Change (Frozen)

Before implementing a significant change (new preprocessing profile, model switch, etc.), freeze one representative run:

# Freeze a run as a baseline comparison point
make run-freeze RUN_ID=baseline_bart_v1 REASON="Pre-cleanup baseline for comparison"

# Creates: data/eval/runs/_frozen_pre_cleanup/<run_id>/
# Adds: NOTE.md with reason and date

Why freeze runs?

  • Quantify improvement after the change
  • Track metrics like: repetition rate, garbage tokens, coherence, speaker label leakage
  • Provides rollback reference if the change regresses quality
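
On disk, a freeze comes down to a copy plus a note. The sketch below only illustrates that idea and is not the make run-freeze implementation: whether the real target copies or moves the run is not stated here, and freeze_run and the NOTE.md wording are assumptions.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def freeze_run(runs_dir: Path, run_id: str, reason: str) -> Path:
    """Copy a run into _frozen_pre_cleanup/ and record why (illustrative only)."""
    target = runs_dir / "_frozen_pre_cleanup" / run_id
    shutil.copytree(runs_dir / run_id, target)  # raises if the run is already frozen
    note = f"Frozen: {date.today().isoformat()}\nReason: {reason}\n"
    (target / "NOTE.md").write_text(note, encoding="utf-8")
    return target


# Demo against a throwaway directory rather than real eval data.
with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp)
    (runs / "baseline_bart_v1").mkdir()
    (runs / "baseline_bart_v1" / "metrics.json").write_text("{}", encoding="utf-8")
    frozen = freeze_run(runs, "baseline_bart_v1", "Pre-cleanup baseline for comparison")
    frozen_files = sorted(p.name for p in frozen.iterdir())
    print(frozen_files)
```

Letting copytree fail when the target exists fits the idea that a frozen run should not be silently overwritten.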

3. All Promoted Baselines/References

Baselines that become app defaults and reference summaries (gold/silver) should be:

  • Committed to version control
  • Never deleted (they're immutable)
  • Located in benchmarks/baselines/ (promoted) or data/eval/references/ (gold/silver)

What to Delete

1. Most Intermediate Runs

During parameter sweeps, you'll generate many runs. Delete all except the best performer:

# Delete multiple runs at once
make runs-delete RUN_IDS="run_v2 run_v3 run_v4 run_v5"

# Keep only the winning configuration

2. Most Exploratory Configs

After a parameter sweep, archive then clean:

# First archive for reference
make configs-archive

# Then clean, optionally keeping one
make configs-clean KEEP=baseline_bart_best.yaml

Makefile Commands Reference

| Command | Purpose | Example |
| --- | --- | --- |
| make configs-archive | Archive all baseline_*.yaml configs | make configs-archive |
| make configs-clean | Delete baseline_*.yaml configs | make configs-clean KEEP=best.yaml |
| make run-freeze | Freeze a run for baseline comparison | make run-freeze RUN_ID=my_run REASON="Pre-X baseline" |
| make runs-delete | Delete multiple runs | make runs-delete RUN_IDS="run1 run2 run3" |
| make experiment-run FORCE=1 | Re-run experiment, deleting existing results | make experiment-run CONFIG=... FORCE=1 |
| make report-multi-run | Generate multi-run comparison report (baseline + N runs) | See Multi-run comparison report below |

Multi-run comparison report

The multi-run comparison report builds a single markdown table from one optional baseline and any number of experiment runs. It reads each run's metrics.json and reports the same vs_reference metrics (ROUGE, BLEU, embedding, coverage, WER) plus latency. Use it to compare baseline vs tier1 vs tier2, or any set of runs, in one view.

Make target: report-multi-run

  • Default: With no arguments, uses baseline baseline_ml_prod_authority_smoke_v1, runs hybrid_ml_tier1_smoke_v1 and hybrid_ml_tier2_qwen25_7b_smoke_v1, reference silver_gpt4o_smoke_v1, and writes docs/wip/multi_run_comparison.md.

make report-multi-run

  • Tier 2 (32B): For larger hardware, eval config hybrid_ml_tier2_qwen25_32b_smoke_v1 is available (ollama pull qwen2.5:32b). Add it to RUN_IDS when comparing against 7B or tier1.
  • Custom baseline + runs: Specify baseline, comma-separated run IDs, and reference. Output path is optional.

make report-multi-run \
  BASELINE_ID=baseline_ml_prod_authority_smoke_v1 \
  RUN_IDS=hybrid_ml_tier1_smoke_v1,hybrid_ml_tier2_qwen25_7b_smoke_v1 \
  REFERENCE_ID=silver_gpt4o_smoke_v1 \
  OUTPUT=docs/wip/my_comparison.md

  • Runs only (no baseline): Omit BASELINE_ID and pass only RUN_IDS and REFERENCE_ID.

make report-multi-run \
  RUN_IDS=run_a,run_b,run_c \
  REFERENCE_ID=silver_gpt4o_smoke_v1

Options (make variables):

| Option | Required | Description |
| --- | --- | --- |
| REFERENCE_ID | Yes (or default) | Reference ID for vs_reference metrics (e.g. silver_gpt4o_smoke_v1). Default: silver_gpt4o_smoke_v1 when using default baseline/runs. |
| BASELINE_ID | No | Baseline ID; included as first row. Looked up in data/eval/baselines/. |
| RUN_IDS | No* | Comma-separated run IDs. Looked up in data/eval/runs/. *At least one of BASELINE_ID or RUN_IDS required. |
| OUTPUT | No | Output markdown path. Default: docs/wip/multi_run_comparison.md. |
| TITLE | No | Report title (default: "Multi-Run Comparison"). |
| LABELS | No | Comma-separated labels for each row (same order: baseline first, then runs). Default: use ID. |
| DATASET_ID | No | Dataset ID for report subtitle (default: from first metrics). |
| BASELINES_DIR | No | Baselines directory (default: data/eval/baselines). |
| RUNS_DIR | No | Runs directory (default: data/eval/runs). |

Direct script usage:

python scripts/eval/smoke_three_way_report.py \
  --reference-id silver_gpt4o_smoke_v1 \
  --baseline-id baseline_ml_prod_authority_smoke_v1 \
  --run-ids hybrid_ml_tier1_smoke_v1,hybrid_ml_tier2_qwen25_7b_smoke_v1 \
  --output docs/wip/smoke_three_way_comparison.md \
  [--title "Smoke comparison"] [--labels "Prod,Tier1,Tier2"]

All script options are documented in the script's help: python scripts/eval/smoke_three_way_report.py --help.
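
To make the shape of the report concrete, here is a minimal sketch of rendering one markdown row per run. It is not the logic of smoke_three_way_report.py; render_comparison_table, the flat metric names, and the values are hypothetical.

```python
def render_comparison_table(
    rows: list[tuple[str, dict[str, float]]], columns: list[str]
) -> str:
    """Render one markdown row per run; metrics a run lacks show as '-'."""
    header = "| Run | " + " | ".join(columns) + " |"
    divider = "|" + " --- |" * (len(columns) + 1)
    lines = [header, divider]
    for label, metrics in rows:
        cells = [f"{metrics[c]:.3f}" if c in metrics else "-" for c in columns]
        lines.append(f"| {label} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Baseline first, then runs, matching the row order the report uses.
rows = [
    ("baseline", {"rougeL_f1": 0.25, "avg_latency_ms": 1500.0}),
    ("tier1", {"rougeL_f1": 0.5}),
]
table = render_comparison_table(rows, ["rougeL_f1", "avg_latency_ms"])
print(table)
```

Printing a placeholder for missing metrics keeps every row aligned even when a run was scored without latency data.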

Typical Workflow: Parameter Sweep

# 1. Create multiple experiment configs
# baseline_bart_v1.yaml, baseline_bart_v2.yaml, ...

# 2. Run experiments
make experiment-run CONFIG=data/eval/configs/baseline_bart_v1.yaml
make experiment-run CONFIG=data/eval/configs/baseline_bart_v2.yaml
# ...

# 3. Compare results, pick winner (e.g., v3)

# 4. Archive configs before cleanup
make configs-archive

# 5. Clean configs, keeping winner
make configs-clean KEEP=baseline_bart_v3.yaml

# 6. Delete non-winning runs
make runs-delete RUN_IDS="baseline_bart_v1 baseline_bart_v2 baseline_bart_v4"

# 7. Optionally freeze winning run if it's a major milestone
make run-freeze RUN_ID=baseline_bart_v3 REASON="Best params before preprocessing change"

Typical Workflow: Major Change

# 1. Freeze current best run as "before"
make run-freeze RUN_ID=baseline_bart_current REASON="Pre-cleaning_v4 baseline"

# 2. Implement the change (e.g., new preprocessing profile)

# 3. Run new experiment
make experiment-run CONFIG=data/eval/configs/baseline_bart_cleaning_v4.yaml

# 4. Compare frozen "before" vs new "after"
# - Check metrics.json for improvements
# - Verify no regressions in gates

# 5. If improved: promote new run, delete old intermediates
# If regressed: investigate, iterate, compare against frozen baseline

Directory Structure After Cleanup

data/eval/
├── configs/
│   ├── _archive/
│   │   └── configs_2026-01-30/        # Archived parameter sweeps
│   │       ├── baseline_bart_v1.yaml
│   │       └── ...
│   └── baseline_bart_best.yaml        # Current best config
├── runs/
│   ├── _frozen_pre_cleanup/           # Frozen baseline runs
│   │   └── baseline_bart_v1/
│   │       ├── NOTE.md                # Why it was frozen
│   │       ├── metrics.json
│   │       └── ...
│   ├── baseline_bart_best/            # Current best run
│   └── README.md
└── ...

Key Principles

  1. Always archive before delete -- you may need to reference old configs
  2. Freeze before major changes -- enables quantitative comparison
  3. Keep promoted baselines forever -- they are your quality contracts
  4. Delete aggressively otherwise -- clutter obscures signal
  5. Document frozen runs -- the NOTE.md explains why they matter

Step 5: Promote a run

Promotion moves a run from data/eval/runs/ into baselines/ or references/ and stamps it with governance metadata. A run's role is determined by where it lives and how it is referenced, not by how it was executed.

Promote to baseline

make run-promote \
  RUN_ID=run_2026-01-16_11-52-03 \
  AS=baseline \
  PROMOTED_ID=baseline_prod_authority_v2 \
  REASON="New production baseline with improved preprocessing"

This:

  1. Moves artifacts from runs/ to baselines/
  2. Assigns a stable ID (baseline_prod_authority_v2)
  3. Marks as immutable (cannot overwrite)
  4. Creates README.md explaining purpose
  5. Updates baseline.json with promotion metadata
  6. Removes the source run (promotion is one-way)

Promote to reference

make run-promote \
  RUN_ID=run_2026-01-16_14-30-00 \
  AS=reference \
  PROMOTED_ID=silver_sonnet46_smoke_v1 \
  REASON="Silver reference using Sonnet 4.6 for smoke dataset" \
  REFERENCE_QUALITY=silver

This:

  1. Auto-detects task type from the run's metrics.json or predictions.jsonl
  2. Moves artifacts to the appropriate location (references/silver/, references/gold/{task_type}/)
  3. Assigns a stable ID and marks as immutable
  4. Creates README.md and updates baseline.json with promotion metadata
  5. Removes the source run

Promotion metadata

When a run is promoted, baseline.json is updated:

{
  "baseline_id": "baseline_prod_authority_v2",
  "dataset_id": "curated_5feeds_smoke_v1",
  "created_at": "2026-01-16T11:52:03Z",
  "promoted_from": "run_2026-01-16_11-52-03",
  "promoted_at": "2026-01-16T12:00:00Z",
  "promoted_as": "baseline",
  "promoted_id": "baseline_prod_authority_v2",
  "fingerprint_ref": "fingerprint.json"
}

For references, it also includes "reference_quality": "silver" (or "gold").
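
The fields above can be produced by merging promotion details into the run's existing metadata. This is a sketch under that assumption, not the code behind make run-promote; promotion_metadata and its parameters are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def promotion_metadata(
    run_meta: dict,
    run_id: str,
    promoted_as: str,
    promoted_id: str,
    reference_quality: Optional[str] = None,
    now: Optional[datetime] = None,
) -> dict:
    """Merge promotion fields into a copy of the run's existing metadata."""
    if promoted_as not in ("baseline", "reference"):
        raise ValueError(f"unknown role: {promoted_as}")
    if promoted_as == "reference" and reference_quality not in ("silver", "gold"):
        raise ValueError("references need reference_quality 'silver' or 'gold'")
    moment = now or datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    meta = dict(run_meta)  # leave the caller's dict untouched
    meta.update(
        promoted_from=run_id,
        promoted_at=stamp,
        promoted_as=promoted_as,
        promoted_id=promoted_id,
    )
    if promoted_as == "reference":
        meta["reference_quality"] = reference_quality
    return meta


meta = promotion_metadata(
    {"dataset_id": "curated_5feeds_smoke_v1", "created_at": "2026-01-16T11:52:03Z"},
    run_id="run_2026-01-16_11-52-03",
    promoted_as="baseline",
    promoted_id="baseline_prod_authority_v2",
    now=datetime(2026, 1, 16, 12, 0, tzinfo=timezone.utc),
)
print(meta["promoted_at"])
```

Passing now explicitly keeps the function deterministic in tests; in normal use it falls back to the current UTC time.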

Promotion rules

| Rule | Baseline | Reference |
| --- | --- | --- |
| Required for experiments | Yes | No |
| Can block CI | Yes | No |
| Updated often | Sometimes | Rarely |
| Replaced silently | No | No |
| Used as "truth" | No | Yes (approximate or exact) |

Legacy support

make baseline-create still works for backward compatibility -- it creates a run and auto-promotes it to a baseline in one step. For new workflows, prefer the explicit create-then-promote path.

Promotion best practices

  1. Always review before promoting -- do not auto-promote without checking results
  2. Use descriptive IDs -- baseline_prod_authority_v2 is better than baseline_v2
  3. Document the reason -- the REASON parameter is important for future reference
  4. Do not promote every experiment -- only promote runs that serve a clear purpose
  5. Keep runs clean -- delete runs that are not useful (they are temporary)

Visual run comparison (RFC-047)

To compare experiment or baseline runs side by side (artifact status, KPI tiles, deltas vs a chosen baseline, token/latency charts, optional map/reduce diagnostics, and per-episode diffs), use the Streamlit tool described in the compare tool README in the repository.

pip install -e '.[compare]'
make run-compare

Optional BASELINE picks the default row in the Baseline (for deltas) dropdown when that run is selected (see the README). On load, all runs matching the category filter are selected; use Select all / Deselect all in the sidebar as needed. This complements text reports such as make runs-compare and make report-multi-run.


Troubleshooting

"Dataset definition not found"

  • Check that data/eval/datasets/{dataset_id}.json or benchmarks/datasets/{dataset_id}.json exists
  • Verify the dataset_id in your experiment config matches the JSON filename

"Baseline not found"

  • Check that data/eval/baselines/{baseline_id}/ or benchmarks/baselines/{baseline_id}/ exists
  • Verify the baseline_id in your experiment config is correct
  • Create the baseline first using make baseline-create

"Dataset mismatch"

  • The experiment's dataset_id must match the baseline's dataset_id
  • Check benchmarks/baselines/{baseline_id}/metadata.json to see which dataset was used

"No input files found"

  • For dataset mode: Verify transcript paths in the dataset JSON exist
  • For glob mode: Check that the glob pattern matches files in your directory

"Episode not found in dataset"

  • The transcript path in the dataset JSON must match the actual file path
  • Use absolute paths or paths relative to the project root

"Hash mismatch" (during materialization)

  • The transcript file has been modified since the dataset was created
  • Regenerate the dataset or restore the original transcript file
  • Check data/eval/sources/ for the original files
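
To confirm that a transcript really changed, you can hash it yourself and compare against the value recorded for the dataset. The sketch below assumes SHA-256, which this guide does not specify, so check the dataset JSON for the actual algorithm; sha256_of_file is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large transcripts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: record a hash, edit the file, and observe that the hash no longer matches.
with tempfile.TemporaryDirectory() as tmp:
    transcript = Path(tmp) / "episode.txt"
    transcript.write_bytes(b"hello\n")
    recorded = sha256_of_file(transcript)        # stand-in for the stored hash
    transcript.write_bytes(b"hello, edited\n")   # simulate a later modification
    changed = sha256_of_file(transcript) != recorded
    print(changed)
```

Even a whitespace-only edit changes the digest, which is why frozen source transcripts should not be touched after dataset creation.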

"Materialized directory already exists"

  • The script will automatically remove and recreate the directory
  • This ensures reproducible materialization

References

| Doc | Role |
| --- | --- |
| RFC-015 | AI experiment pipeline design |
| RFC-041 | Benchmarking framework |
| RFC-044 | Model registry and mode promotion |
| Performance Profile Guide | Resource-cost profiles (RSS, CPU%, wall time) |
| Optimization Workflow | End-to-end optimization workflow |
| data/eval/README.md | Eval directory contract |
| data/eval/references/silver/README.md | Active silver references |