Promotion Workflow Guide

Overview

The promotion workflow implements the principle: Execution is neutral. Meaning is assigned afterward.

Every baseline, reference, and experiment is produced by the same runner:

  • Same code
  • Same provider abstraction
  • Same fingerprinting
  • Same artifacts (predictions.jsonl, metrics.json, fingerprint.json)

The distinction is post-run promotion, not execution.


Key Concepts

1. Run (Execution Only)

A run is the result of executing the materialization script. It's stored in data/eval/runs/ and has no special meaning yet.

Characteristics:

  • Temporary (can be deleted)
  • No special rules
  • Just execution results

2. Baseline (Promoted Run)

A baseline is a run that has been promoted to serve as a comparison point for experiments.

Characteristics:

  • Required for experiments (every experiment must specify a baseline)
  • Can block CI (regressions against the baseline can fail CI)
  • Updated occasionally (by promoting a new run under a new ID when you want a new comparison point)
  • Immutable (an existing baseline cannot be overwritten)
  • Used as a comparison point (not as "truth")

Storage: data/eval/baselines/{baseline_id}/

3. Reference (Promoted Run)

A reference is a run that has been promoted to serve as "truth" for evaluation metrics (e.g., ROUGE).

Characteristics:

  • Not required for experiments (experiments can run without references)
  • Cannot block CI (references are informational)
  • Rarely updated (only when truth changes)
  • Immutable (cannot be overwritten)
  • Used as "truth" for absolute quality assessment

Storage:

  • Silver: data/eval/references/silver/{reference_id}/
  • Gold NER: data/eval/references/gold/ner_entities/{reference_id}/
  • Gold Summarization: data/eval/references/gold/summarization/{reference_id}/

Reference Quality:

  • Silver: Machine-generated by a strong model (e.g., GPT-5)
  • Gold: Human-verified summaries or entity annotations

Workflow

Step 1: Create a Run

Create a run using the materialization script:

make run-create \
  RUN_ID=run_2026-01-16_11-52-03 \
  DATASET_ID=curated_5feeds_smoke_v1 \
  EXPERIMENT_CONFIG=data/eval/configs/baseline_bart_small_led_long_fast.yaml

This creates:

data/eval/runs/run_2026-01-16_11-52-03/
├── predictions.jsonl
├── metrics.json
├── fingerprint.json
├── baseline.json
└── config.yaml

These are the same artifacts that baselines and references contain; only the location differs.
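Before reviewing or promoting a run, it can help to confirm that all expected artifacts are present. The helper below is a hypothetical sketch (not part of the actual Makefile targets); the artifact names come from the guide above.

```python
from pathlib import Path

# Artifact names as listed in this guide.
REQUIRED_ARTIFACTS = ["predictions.jsonl", "metrics.json", "fingerprint.json", "baseline.json"]

def missing_artifacts(run_dir):
    """Return the required artifacts that are absent from a run directory."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (run_dir / name).exists()]
```

A run with an empty `missing_artifacts()` result is ready for review; anything else suggests the materialization step failed partway.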

Step 2: Review Results

Look at the results and decide:

  • "This is our new prod baseline" → Promote to baseline
  • "This is our silver reference" → Promote to reference
  • "Just an experiment" → Leave as run (can delete later)

Step 3: Promote (If Needed)

Promote to Baseline

make run-promote \
  RUN_ID=run_2026-01-16_11-52-03 \
  --as baseline \
  PROMOTED_ID=baseline_prod_authority_v2 \
  REASON="New production baseline with improved preprocessing"

This:

  1. Moves artifacts from runs/ to baselines/
  2. Assigns stable ID (baseline_prod_authority_v2)
  3. Marks as immutable (cannot overwrite)
  4. Creates README.md explaining purpose
  5. Updates baseline.json with promotion metadata
  6. Removes source run (promotion is one-way)
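The six steps above can be sketched in Python. This is an illustrative approximation of what `run-promote` does, not its actual implementation; the function name and signature are assumptions.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def promote_to_baseline(run_id, promoted_id, reason, eval_root="data/eval"):
    """Sketch of one-way baseline promotion: copy artifacts, record
    promotion metadata, write a README, then remove the source run."""
    root = Path(eval_root)
    src = root / "runs" / run_id
    dst = root / "baselines" / promoted_id
    if dst.exists():
        # Baselines are immutable: never overwrite an existing promoted ID.
        raise FileExistsError(f"{promoted_id} already exists; baselines are immutable")
    shutil.copytree(src, dst)

    # Update baseline.json with promotion metadata.
    meta_path = dst / "baseline.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    meta.update({
        "promoted_from": run_id,
        "promoted_at": datetime.now(timezone.utc).isoformat(),
        "promoted_as": "baseline",
        "promoted_id": promoted_id,
    })
    meta_path.write_text(json.dumps(meta, indent=2))

    # README.md makes the promotion explicit and reviewable.
    (dst / "README.md").write_text(
        f"# {promoted_id}\n\nPromoted from {run_id}.\nReason: {reason}\n"
    )
    shutil.rmtree(src)  # promotion is one-way
    return dst
```

Note that the immutability check happens before anything is copied, so a failed promotion leaves both the run and any existing baseline untouched.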

Promote to Reference

make run-promote \
  RUN_ID=run_2026-01-16_11-52-03 \
  --as reference \
  PROMOTED_ID=silver_gpt5_2_v1 \
  REASON="Silver reference using GPT-5 for benchmark dataset" \
  REFERENCE_QUALITY=silver

This:

  1. Auto-detects task type from run's metrics.json or predictions.jsonl
  2. Moves artifacts to the appropriate location:
     • Silver: references/silver/{reference_id}/
     • Gold: references/gold/{task_type}/{reference_id}/
  3. Assigns stable ID (silver_gpt5_2_v1)
  4. Marks as immutable
  5. Creates README.md explaining purpose
  6. Updates baseline.json with promotion metadata
  7. Removes source run

Note: DATASET_ID is no longer required. Task type is automatically detected from the run.
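The auto-detection step might look like the sketch below. The metric key names (`rouge*`, `entity_f1`, etc.) are illustrative assumptions, not the project's actual metrics schema; only the idea of inferring task type from metrics.json comes from the guide.

```python
import json
from pathlib import Path

def detect_task_type(run_dir):
    """Guess a run's task type from the keys in its metrics.json.
    The key names checked here are hypothetical examples."""
    metrics = json.loads((Path(run_dir) / "metrics.json").read_text())
    if any(key.startswith("rouge") for key in metrics):
        return "summarization"
    if any(key in metrics for key in ("entity_f1", "entity_precision", "entity_recall")):
        return "ner_entities"
    raise ValueError("could not detect task type from metrics.json")
```

Raising on an unrecognized schema is deliberate: promotion should fail loudly rather than file a reference under the wrong task directory.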


Directory Structure

data/eval/
├── runs/                    # Temporary runs (can be deleted)
│   └── run_2026-01-16_11-52-03/
│       ├── predictions.jsonl
│       ├── metrics.json
│       ├── fingerprint.json
│       └── baseline.json
│
├── baselines/               # Promoted baselines (comparison points)
│   └── baseline_prod_authority_v2/
│       ├── predictions.jsonl
│       ├── metrics.json
│       ├── fingerprint.json
│       ├── baseline.json
│       ├── config.yaml
│       └── README.md        # Explains why this baseline exists
│
└── references/              # Promoted references (truth)
    ├── silver/              # Silver references (machine-generated)
    │   └── silver_gpt5_2_v1/
    │       ├── predictions.jsonl
    │       ├── metrics.json
    │       ├── fingerprint.json
    │       ├── baseline.json
    │       ├── config.yaml
    │       └── README.md
    └── gold/                 # Gold references (human-verified)
        ├── ner_entities/     # Gold NER references
        │   └── ner_entities_smoke_gold_v1/
        │       ├── index.json
        │       ├── p01_e01.json
        │       └── README.md
        └── summarization/    # Gold summarization references
            └── summarization_gold_v1/
                ├── predictions.jsonl
                └── README.md

The One Rule That Keeps This Sane

A run's role is determined by where it lives and how it's referenced, not by how it was executed.

If you follow that rule, everything stays clean.
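Because role follows location, resolving a run's role never requires inspecting its contents. A minimal sketch of that lookup (the function is hypothetical; the directory names are the ones this guide uses):

```python
from pathlib import PurePosixPath

def role_of(artifact_path):
    """Resolve an artifact's role purely from where it lives under data/eval/."""
    parts = PurePosixPath(artifact_path).parts
    if "runs" in parts:
        return "run"
    if "baselines" in parts:
        return "baseline"
    if "references" in parts:
        return "reference"
    return "unknown"
```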


Promotion Rules

Baseline Rules

Rule                       Baseline
Required for experiments   Yes
Can block CI               Yes
Updated often              Sometimes
Replaced silently          No
Used as "truth"            No
Compared against           Yes
Compared to baseline       N/A

Reference Rules

Rule                       Reference
Required for experiments   No
Can block CI               No
Updated often              Rarely
Replaced silently          No
Used as "truth"            Approximate (silver) or Exact (gold)
Compared against           Yes
Compared to baseline       Yes

Why This Is Powerful

Because you get:

  • One execution path (less code, fewer bugs)
  • Late binding of meaning (decide role after seeing results)
  • Reproducibility (same fingerprint everywhere)
  • Auditability (promotion is explicit and reviewable)

This separation of execution from promotion mirrors how mature ML platforms work internally.


Legacy Support

The baseline-create command still works for backward compatibility:

make baseline-create \
  BASELINE_ID=baseline_prod_authority_v2 \
  DATASET_ID=curated_5feeds_smoke_v1

What it does:

  1. Creates a run with timestamp ID
  2. Auto-promotes it to baseline
  3. Uses the provided BASELINE_ID as the promoted ID

For new workflows, prefer:

  1. make run-create (explicit)
  2. Review results
  3. make run-promote (explicit promotion)

Examples

Example 1: Create and Promote Baseline

# Step 1: Create run
make run-create \
  RUN_ID=run_2026-01-16_11-52-03 \
  DATASET_ID=curated_5feeds_smoke_v1 \
  EXPERIMENT_CONFIG=data/eval/configs/baseline_bart_small_led_long_fast.yaml

# Step 2: Review results in data/eval/runs/run_2026-01-16_11-52-03/

# Step 3: Promote to baseline
make run-promote \
  RUN_ID=run_2026-01-16_11-52-03 \
  --as baseline \
  PROMOTED_ID=baseline_prod_authority_v2 \
  REASON="New production baseline with improved preprocessing profile"

Example 2: Create and Promote Reference

# Step 1: Create run
make run-create \
  RUN_ID=run_2026-01-16_14-30-00 \
  DATASET_ID=curated_5feeds_benchmark_v1 \
  EXPERIMENT_CONFIG=experiments/silver_reference_gpt5.yaml

# Step 2: Review results

# Step 3: Promote to reference
make run-promote \
  RUN_ID=run_2026-01-16_14-30-00 \
  --as reference \
  PROMOTED_ID=silver_gpt5_2_v1 \
  REASON="Silver reference using GPT-5 for benchmark dataset" \
  REFERENCE_QUALITY=silver

Example 3: Leave as Run (No Promotion)

# Create run
make run-create \
  RUN_ID=run_2026-01-16_15-00-00 \
  DATASET_ID=curated_5feeds_smoke_v1 \
  EXPERIMENT_CONFIG=experiments/test_new_prompt.yaml

# Review results - decide it's just an experiment
# No promotion needed - can delete later if not useful

Promotion Metadata

When a run is promoted, the baseline.json file is updated with promotion metadata:

{
  "baseline_id": "baseline_prod_authority_v2",
  "dataset_id": "curated_5feeds_smoke_v1",
  "created_at": "2026-01-16T11:52:03Z",
  "promoted_from": "run_2026-01-16_11-52-03",
  "promoted_at": "2026-01-16T12:00:00Z",
  "promoted_as": "baseline",
  "promoted_id": "baseline_prod_authority_v2",
  "fingerprint_ref": "fingerprint.json"
}

For references, it also includes:

{
  "reference_quality": "silver"
}
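A promoted run's metadata can be validated against the fields shown above. The field names follow the example JSON; the validation rules themselves are an assumption about what the project enforces.

```python
def validate_promotion_metadata(meta):
    """Check promotion metadata for the fields shown in the examples above."""
    required = ["promoted_from", "promoted_at", "promoted_as", "promoted_id"]
    missing = [key for key in required if key not in meta]
    if missing:
        raise ValueError(f"missing promotion fields: {missing}")
    # References must additionally record their quality tier.
    if meta["promoted_as"] == "reference" and "reference_quality" not in meta:
        raise ValueError("references must record reference_quality (silver or gold)")
    if meta.get("reference_quality") not in (None, "silver", "gold"):
        raise ValueError("reference_quality must be 'silver' or 'gold'")
    return True
```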

README.md Template

Each promoted run gets a README.md that explains:

  • What it is
  • Why it was promoted
  • How it should be used

This makes promotion explicit and reviewable.


Best Practices

  1. Always review before promoting - Don't auto-promote without checking results
  2. Use descriptive IDs - baseline_prod_authority_v2 is better than baseline_v2
  3. Document the reason - The REASON parameter is important for future reference
  4. Don't promote experiments - Only promote runs that serve a clear purpose
  5. Keep runs clean - Delete runs that aren't useful (they're temporary)

Summary

Execution is neutral. Meaning is assigned afterward.

  • Create runs with make run-create
  • Review results
  • Promote with make run-promote if needed
  • Same artifacts, different governance

This keeps the system clean, auditable, and flexible.