Promotion Workflow Guide

Overview

The promotion workflow implements the principle: Execution is neutral. Meaning is assigned afterward.

Every baseline, reference, and experiment is produced by the same runner:

  • Same code
  • Same provider abstraction
  • Same fingerprinting
  • Same artifacts (predictions.jsonl, metrics.json, fingerprint.json)

The distinction is post-run promotion, not execution.


Key Concepts

1. Run (Execution Only)

A run is the result of executing the materialization script. It's stored in data/eval/runs/ and has no special meaning yet.

Characteristics:

  • Temporary (can be deleted)
  • No special rules
  • Just execution results

2. Baseline (Promoted Run)

A baseline is a run that has been promoted to serve as a comparison point for experiments.

Characteristics:

  • Required for experiments (every experiment must specify a baseline)
  • Can block CI (regressions against the baseline can fail CI)
  • Updated occasionally (by promoting a new run under a new ID when you want a new comparison point)
  • Immutable (an existing baseline cannot be overwritten)
  • Used as a comparison point (not as "truth")

Storage: data/eval/baselines/{baseline_id}/

3. Reference (Promoted Run)

A reference is a run that has been promoted to serve as "truth" for evaluation metrics (e.g., ROUGE).

Characteristics:

  • Not required for experiments (experiments can run without references)
  • Cannot block CI (references are informational)
  • Rarely updated (only when truth changes)
  • Immutable (cannot be overwritten)
  • Used as "truth" for absolute quality assessment

Storage:

  • Silver: data/eval/references/silver/{reference_id}/
  • Gold NER: data/eval/references/gold/ner_entities/{reference_id}/
  • Gold Summarization: data/eval/references/gold/summarization/{reference_id}/

Reference Quality:

  • Silver: Machine-generated by a strong model (e.g., GPT-5)
  • Gold: Human-verified summaries or entity annotations

Workflow

Step 1: Create a Run

Create a run using the materialization script:

make run-create \
  RUN_ID=run_2026-01-16_11-52-03 \
  DATASET_ID=curated_5feeds_smoke_v1 \
  EXPERIMENT_CONFIG=data/eval/configs/baseline_bart_small_led_long_fast.yaml

This creates:

data/eval/runs/run_2026-01-16_11-52-03/
├── predictions.jsonl
├── metrics.json
├── fingerprint.json
├── baseline.json
└── config.yaml

These are the same artifacts that baselines and references contain; only the location differs.
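Before reviewing or promoting a run, it can help to confirm that all expected artifacts are present. The helper below is a hypothetical sketch (not part of the actual Makefile targets); the artifact names come from the guide above.

```python
from pathlib import Path

# Artifact names as listed in this guide.
REQUIRED_ARTIFACTS = ["predictions.jsonl", "metrics.json", "fingerprint.json", "baseline.json"]

def missing_artifacts(run_dir):
    """Return the required artifacts that are absent from a run directory."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (run_dir / name).exists()]
```

A run with an empty `missing_artifacts()` result is ready for review; anything else suggests the materialization step failed partway.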

Step 2: Review Results

Look at the results and decide:

  • "This is our new prod baseline" → Promote to baseline
  • "This is our silver reference" → Promote to reference
  • "Just an experiment" → Leave as run (can delete later)

Step 3: Promote (If Needed)

Promote to Baseline

make run-promote \
  RUN_ID=run_2026-01-16_11-52-03 \
  --as baseline \
  PROMOTED_ID=baseline_prod_authority_v2 \
  REASON="New production baseline with improved preprocessing"

This:

  1. Moves artifacts from runs/ to baselines/
  2. Assigns stable ID (baseline_prod_authority_v2)
  3. Marks as immutable (cannot overwrite)
  4. Creates README.md explaining purpose
  5. Updates baseline.json with promotion metadata
  6. Removes source run (promotion is one-way)
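The six steps above can be sketched in Python. This is an illustrative approximation of what `run-promote` does, not its actual implementation; the function name and signature are assumptions.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def promote_to_baseline(run_id, promoted_id, reason, eval_root="data/eval"):
    """Sketch of one-way baseline promotion: copy artifacts, record
    promotion metadata, write a README, then remove the source run."""
    root = Path(eval_root)
    src = root / "runs" / run_id
    dst = root / "baselines" / promoted_id
    if dst.exists():
        # Baselines are immutable: never overwrite an existing promoted ID.
        raise FileExistsError(f"{promoted_id} already exists; baselines are immutable")
    shutil.copytree(src, dst)

    # Update baseline.json with promotion metadata.
    meta_path = dst / "baseline.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    meta.update({
        "promoted_from": run_id,
        "promoted_at": datetime.now(timezone.utc).isoformat(),
        "promoted_as": "baseline",
        "promoted_id": promoted_id,
    })
    meta_path.write_text(json.dumps(meta, indent=2))

    # README.md makes the promotion explicit and reviewable.
    (dst / "README.md").write_text(
        f"# {promoted_id}\n\nPromoted from {run_id}.\nReason: {reason}\n"
    )
    shutil.rmtree(src)  # promotion is one-way
    return dst
```

Note that the immutability check happens before anything is copied, so a failed promotion leaves both the run and any existing baseline untouched.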

Promote to Reference

make run-promote \
  RUN_ID=run_2026-01-16_11-52-03 \
  --as reference \
  PROMOTED_ID=silver_gpt5_2_v1 \
  REASON="Silver reference using GPT-5 for benchmark dataset" \
  REFERENCE_QUALITY=silver

This:

  1. Auto-detects task type from run's metrics.json or predictions.jsonl
  2. Moves artifacts to the appropriate location:
     • Silver: references/silver/{reference_id}/
     • Gold: references/gold/{task_type}/{reference_id}/
  3. Assigns stable ID (silver_gpt5_2_v1)
  4. Marks as immutable
  5. Creates README.md explaining purpose
  6. Updates baseline.json with promotion metadata
  7. Removes source run

Note: DATASET_ID is no longer required. Task type is automatically detected from the run.
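The auto-detection step might look like the sketch below. The metric key names (`rouge*`, `entity_f1`, etc.) are illustrative assumptions, not the project's actual metrics schema; only the idea of inferring task type from metrics.json comes from the guide.

```python
import json
from pathlib import Path

def detect_task_type(run_dir):
    """Guess a run's task type from the keys in its metrics.json.
    The key names checked here are hypothetical examples."""
    metrics = json.loads((Path(run_dir) / "metrics.json").read_text())
    if any(key.startswith("rouge") for key in metrics):
        return "summarization"
    if any(key in metrics for key in ("entity_f1", "entity_precision", "entity_recall")):
        return "ner_entities"
    raise ValueError("could not detect task type from metrics.json")
```

Raising on an unrecognized schema is deliberate: promotion should fail loudly rather than file a reference under the wrong task directory.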


Directory Structure

data/eval/
├── runs/                    # Temporary runs (can be deleted)
│   └── run_2026-01-16_11-52-03/
│       ├── predictions.jsonl
│       ├── metrics.json
│       ├── fingerprint.json
│       └── baseline.json
│
├── baselines/               # Promoted baselines (comparison points)
│   └── baseline_prod_authority_v2/
│       ├── predictions.jsonl
│       ├── metrics.json
│       ├── fingerprint.json
│       ├── baseline.json
│       ├── config.yaml
│       └── README.md        # Explains why this baseline exists
│
└── references/              # Promoted references (truth)
    ├── silver/              # Silver references (machine-generated)
    │   └── silver_gpt5_2_v1/
    │       ├── predictions.jsonl
    │       ├── metrics.json
    │       ├── fingerprint.json
    │       ├── baseline.json
    │       ├── config.yaml
    │       └── README.md
    └── gold/                 # Gold references (human-verified)
        ├── ner_entities/     # Gold NER references
        │   └── ner_entities_smoke_gold_v1/
        │       ├── index.json
        │       ├── p01_e01.json
        │       └── README.md
        └── summarization/    # Gold summarization references
            └── summarization_gold_v1/
                ├── predictions.jsonl
                └── README.md

The One Rule That Keeps This Sane

A run's role is determined by where it lives and how it's referenced, not by how it was executed.

If you follow that rule, everything stays clean.
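Because role follows location, resolving a run's role never requires inspecting its contents. A minimal sketch of that lookup (the function is hypothetical; the directory names are the ones this guide uses):

```python
from pathlib import PurePosixPath

def role_of(artifact_path):
    """Resolve an artifact's role purely from where it lives under data/eval/."""
    parts = PurePosixPath(artifact_path).parts
    if "runs" in parts:
        return "run"
    if "baselines" in parts:
        return "baseline"
    if "references" in parts:
        return "reference"
    return "unknown"
```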


Promotion Rules

Baseline Rules

Rule                       Baseline
Required for experiments   Yes
Can block CI               Yes
Updated often              Sometimes
Replaced silently          No
Used as "truth"            No
Compared against           Yes
Compared to baseline       N/A

Reference Rules

Rule                       Reference
Required for experiments   No
Can block CI               No
Updated often              Rarely
Replaced silently          No
Used as "truth"            Approximate (silver) or Exact (gold)
Compared against           Yes
Compared to baseline       Yes

Why This Is Powerful

Because you get:

  • One execution path (less code, fewer bugs)
  • Late binding of meaning (decide role after seeing results)
  • Reproducibility (same fingerprint everywhere)
  • Auditability (promotion is explicit and reviewable)

This separation of execution from promotion mirrors how mature ML platforms work internally.


Legacy Support

The baseline-create command still works for backward compatibility:

make baseline-create \
  BASELINE_ID=baseline_prod_authority_v2 \
  DATASET_ID=curated_5feeds_smoke_v1

What it does:

  1. Creates a run with timestamp ID
  2. Auto-promotes it to baseline
  3. Uses the provided BASELINE_ID as the promoted ID

For new workflows, prefer:

  1. make run-create (explicit)
  2. Review results
  3. make run-promote (explicit promotion)

Examples

Example 1: Create and Promote Baseline

# Step 1: Create run
make run-create \
  RUN_ID=run_2026-01-16_11-52-03 \
  DATASET_ID=curated_5feeds_smoke_v1 \
  EXPERIMENT_CONFIG=data/eval/configs/baseline_bart_small_led_long_fast.yaml

# Step 2: Review results in data/eval/runs/run_2026-01-16_11-52-03/

# Step 3: Promote to baseline
make run-promote \
  RUN_ID=run_2026-01-16_11-52-03 \
  --as baseline \
  PROMOTED_ID=baseline_prod_authority_v2 \
  REASON="New production baseline with improved preprocessing profile"

Example 2: Create and Promote Reference

# Step 1: Create run
make run-create \
  RUN_ID=run_2026-01-16_14-30-00 \
  DATASET_ID=curated_5feeds_benchmark_v1 \
  EXPERIMENT_CONFIG=experiments/silver_reference_gpt5.yaml

# Step 2: Review results

# Step 3: Promote to reference
make run-promote \
  RUN_ID=run_2026-01-16_14-30-00 \
  --as reference \
  PROMOTED_ID=silver_gpt5_2_v1 \
  REASON="Silver reference using GPT-5 for benchmark dataset" \
  REFERENCE_QUALITY=silver

Example 3: Leave as Run (No Promotion)

# Create run
make run-create \
  RUN_ID=run_2026-01-16_15-00-00 \
  DATASET_ID=curated_5feeds_smoke_v1 \
  EXPERIMENT_CONFIG=experiments/test_new_prompt.yaml

# Review results - decide it's just an experiment
# No promotion needed - can delete later if not useful

Promotion Metadata

When a run is promoted, the baseline.json file is updated with promotion metadata:

{
  "baseline_id": "baseline_prod_authority_v2",
  "dataset_id": "curated_5feeds_smoke_v1",
  "created_at": "2026-01-16T11:52:03Z",
  "promoted_from": "run_2026-01-16_11-52-03",
  "promoted_at": "2026-01-16T12:00:00Z",
  "promoted_as": "baseline",
  "promoted_id": "baseline_prod_authority_v2",
  "fingerprint_ref": "fingerprint.json"
}

For references, it also includes:

{
  "reference_quality": "silver"
}
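A promoted run's metadata can be validated against the fields shown above. The field names follow the example JSON; the validation rules themselves are an assumption about what the project enforces.

```python
def validate_promotion_metadata(meta):
    """Check promotion metadata for the fields shown in the examples above."""
    required = ["promoted_from", "promoted_at", "promoted_as", "promoted_id"]
    missing = [key for key in required if key not in meta]
    if missing:
        raise ValueError(f"missing promotion fields: {missing}")
    # References must additionally record their quality tier.
    if meta["promoted_as"] == "reference" and "reference_quality" not in meta:
        raise ValueError("references must record reference_quality (silver or gold)")
    if meta.get("reference_quality") not in (None, "silver", "gold"):
        raise ValueError("reference_quality must be 'silver' or 'gold'")
    return True
```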

README.md Template

Each promoted run gets a README.md that explains:

  • What it is
  • Why it was promoted
  • How it should be used

This makes promotion explicit and reviewable.


Best Practices

  1. Always review before promoting - Don't auto-promote without checking results
  2. Use descriptive IDs - baseline_prod_authority_v2 is better than baseline_v2
  3. Document the reason - The REASON parameter is important for future reference
  4. Don't promote experiments - Only promote runs that serve a clear purpose
  5. Keep runs clean - Delete runs that aren't useful (they're temporary)

Summary

Execution is neutral. Meaning is assigned afterward.

  • Create runs with make run-create
  • Review results
  • Promote with make run-promote if needed
  • Same artifacts, different governance

This keeps the system clean, auditable, and flexible.