# Promotion Workflow Guide

## Overview
The promotion workflow implements the principle: Execution is neutral. Meaning is assigned afterward.
Every baseline, reference, and experiment is produced by the same runner:
- Same code
- Same provider abstraction
- Same fingerprinting
- Same artifacts (`predictions.jsonl`, `metrics.json`, `fingerprint.json`)
The distinction is post-run promotion, not execution.
## Key Concepts

### 1. Run (Execution Only)

A run is the result of executing the materialization script. It's stored in `data/eval/runs/` and has no special meaning yet.
Characteristics:
- Temporary (can be deleted)
- No special rules
- Just execution results
### 2. Baseline (Promoted Run)
A baseline is a run that has been promoted to serve as a comparison point for experiments.
Characteristics:
- Required for experiments (every experiment must specify a baseline)
- Can block CI (regressions against this baseline can fail CI)
- Updated sometimes (when you want a new comparison point)
- Immutable (cannot be overwritten)
- Used as comparison point (not as "truth")
Storage: `data/eval/baselines/{baseline_id}/`
### 3. Reference (Promoted Run)
A reference is a run that has been promoted to serve as "truth" for evaluation metrics (e.g., ROUGE).
Characteristics:
- Not required for experiments (experiments can run without references)
- Cannot block CI (references are informational)
- Rarely updated (only when truth changes)
- Immutable (cannot be overwritten)
- Used as "truth" for absolute quality assessment
Storage:

- Silver: `data/eval/references/silver/{reference_id}/`
- Gold NER: `data/eval/references/gold/ner_entities/{reference_id}/`
- Gold summarization: `data/eval/references/gold/summarization/{reference_id}/`
Reference Quality:
- Silver: Machine-generated by a strong model (e.g., GPT-5)
- Gold: Human-verified summaries or entity annotations
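To make the reference-as-truth role concrete, here is a minimal sketch of scoring a run's predictions against a reference's. Plain token-overlap F1 is a deliberately simple stand-in for the real evaluation metrics (e.g., ROUGE), and the `id`/`prediction` field names are assumptions about the `predictions.jsonl` schema, not a documented contract:

```python
import json


def token_f1(pred: str, truth: str) -> float:
    """Token-overlap F1 -- an illustrative stand-in for ROUGE."""
    pred_tokens, truth_tokens = pred.split(), truth.split()
    common = len(set(pred_tokens) & set(truth_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def score_against_reference(run_lines, reference_lines) -> float:
    """Join run and reference predictions.jsonl lines on `id`, average F1."""
    truths = {r["id"]: r["prediction"] for r in map(json.loads, reference_lines)}
    scores = [
        token_f1(p["prediction"], truths[p["id"]])
        for p in map(json.loads, run_lines)
        if p["id"] in truths
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because runs, baselines, and references all share the `predictions.jsonl` format, the same scoring code works regardless of which role the two sides play.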
## Workflow

### Step 1: Create a Run

Create a run using the materialization script:

    make run-create \
        RUN_ID=run_2026-01-16_11-52-03 \
        DATASET_ID=curated_5feeds_smoke_v1 \
        EXPERIMENT_CONFIG=data/eval/configs/baseline_bart_small_led_long_fast.yaml
This creates:

    data/eval/runs/run_2026-01-16_11-52-03/
    ├── predictions.jsonl
    ├── metrics.json
    ├── fingerprint.json
    ├── baseline.json
    └── config.yaml

Same artifacts as baselines/references, just in a different location.
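Before reviewing or promoting, it can be useful to check that a run directory actually contains the shared artifact set. A minimal sketch, assuming the artifact names listed above (the helper itself is hypothetical, not part of the toolchain):

```python
from pathlib import Path

# Artifact set shared by runs, baselines, and references.
REQUIRED_ARTIFACTS = {
    "predictions.jsonl",
    "metrics.json",
    "fingerprint.json",
    "baseline.json",
}


def missing_artifacts(run_dir: Path) -> set:
    """Return the required artifact names absent from a run directory."""
    present = {p.name for p in run_dir.iterdir()} if run_dir.is_dir() else set()
    return REQUIRED_ARTIFACTS - present
```

An empty result means the run is structurally complete and ready for review.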
### Step 2: Review Results
Look at the results and decide:
- "This is our new prod baseline" → Promote to baseline
- "This is our silver reference" → Promote to reference
- "Just an experiment" → Leave as run (can delete later)
### Step 3: Promote (If Needed)

#### Promote to Baseline

    make run-promote \
        RUN_ID=run_2026-01-16_11-52-03 \
        --as baseline \
        PROMOTED_ID=baseline_prod_authority_v2 \
        REASON="New production baseline with improved preprocessing"

This:

- Moves artifacts from `runs/` to `baselines/`
- Assigns a stable ID (`baseline_prod_authority_v2`)
- Marks the baseline as immutable (cannot be overwritten)
- Creates a `README.md` explaining its purpose
- Updates `baseline.json` with promotion metadata
- Removes the source run (promotion is one-way)
#### Promote to Reference

    make run-promote \
        RUN_ID=run_2026-01-16_11-52-03 \
        --as reference \
        PROMOTED_ID=silver_gpt5_2_v1 \
        REASON="Silver reference using GPT-5 for benchmark dataset" \
        REFERENCE_QUALITY=silver

This:

- Auto-detects the task type from the run's `metrics.json` or `predictions.jsonl`
- Moves artifacts to the appropriate location:
    - Silver: `references/silver/{reference_id}/`
    - Gold: `references/gold/{task_type}/{reference_id}/`
- Assigns a stable ID (`silver_gpt5_2_v1`)
- Marks the reference as immutable
- Creates a `README.md` explaining its purpose
- Updates `baseline.json` with promotion metadata
- Removes the source run

Note: `DATASET_ID` is no longer required; the task type is detected automatically from the run.
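The guide doesn't spell out how task-type auto-detection works internally; one plausible sketch, assuming task-specific metric keys in `metrics.json` (the key prefixes here are guesses, and the real promotion script may inspect `predictions.jsonl` instead):

```python
import json
from pathlib import Path


def detect_task_type(run_dir: Path) -> str:
    """Guess the task type from metrics.json keys.

    Hypothetical heuristic: summarization runs carry ROUGE metrics,
    NER runs carry entity-level metrics.
    """
    metrics = json.loads((run_dir / "metrics.json").read_text())
    if any(key.startswith("rouge") for key in metrics):
        return "summarization"
    if any(key.startswith(("entity_", "ner_")) for key in metrics):
        return "ner_entities"
    raise ValueError(f"Cannot infer task type from {run_dir / 'metrics.json'}")
```

The returned value would then select the `references/gold/{task_type}/` destination for gold promotions.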
## Directory Structure

    data/eval/
    ├── runs/                          # Temporary runs (can be deleted)
    │   └── run_2026-01-16_11-52-03/
    │       ├── predictions.jsonl
    │       ├── metrics.json
    │       ├── fingerprint.json
    │       └── baseline.json
    │
    ├── baselines/                     # Promoted baselines (comparison points)
    │   └── baseline_prod_authority_v2/
    │       ├── predictions.jsonl
    │       ├── metrics.json
    │       ├── fingerprint.json
    │       ├── baseline.json
    │       ├── config.yaml
    │       └── README.md              # Explains why this baseline exists
    │
    └── references/                    # Promoted references (truth)
        ├── silver/                    # Silver references (machine-generated)
        │   └── silver_gpt5_2_v1/
        │       ├── predictions.jsonl
        │       ├── metrics.json
        │       ├── fingerprint.json
        │       ├── baseline.json
        │       ├── config.yaml
        │       └── README.md
        └── gold/                      # Gold references (human-verified)
            ├── ner_entities/          # Gold NER references
            │   └── ner_entities_smoke_gold_v1/
            │       ├── index.json
            │       ├── p01_e01.json
            │       └── README.md
            └── summarization/         # Gold summarization references
                └── summarization_gold_v1/
                    ├── predictions.jsonl
                    └── README.md
## The One Rule That Keeps This Sane

A run's role is determined by where it lives and how it's referenced, not by how it was executed.

If you follow that rule, everything stays clean.
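That rule can be expressed directly in code: a sketch that derives an artifact's role from its path alone, never from execution details (the helper is hypothetical, but the directory names come from the structure above):

```python
from pathlib import Path


def role_of(artifact_dir: Path) -> str:
    """Derive an artifact's role purely from where it lives under data/eval/."""
    parts = artifact_dir.parts
    if "baselines" in parts:
        return "baseline"
    if "references" in parts:
        return "reference"
    if "runs" in parts:
        return "run"
    raise ValueError(f"{artifact_dir} is not under a known data/eval/ subtree")
```

Nothing in the artifacts themselves needs to change at promotion time; moving the directory is what changes the answer this function gives.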
## Promotion Rules

### Baseline Rules
| Rule | Baseline |
|---|---|
| Required for experiments | Yes |
| Can block CI | Yes |
| Updated often | Sometimes |
| Replaced silently | No |
| Used as "truth" | No |
| Compared against | Yes |
| Compared to baseline | N/A |
### Reference Rules
| Rule | Reference |
|---|---|
| Required for experiments | No |
| Can block CI | No |
| Updated often | Rarely |
| Replaced silently | No |
| Used as "truth" | Approximate (silver) or Exact (gold) |
| Compared against | Yes |
| Compared to baseline | Yes |
## Why This Is Powerful
Because you get:
- One execution path (less code, fewer bugs)
- Late binding of meaning (decide role after seeing results)
- Reproducibility (same fingerprint everywhere)
- Auditability (promotion is explicit and reviewable)
This is how mature ML platforms work internally.
## Legacy Support

The `baseline-create` command still works for backward compatibility:

    make baseline-create \
        BASELINE_ID=baseline_prod_authority_v2 \
        DATASET_ID=curated_5feeds_smoke_v1
What it does:

- Creates a run with a timestamp ID
- Auto-promotes it to a baseline
- Uses the provided `BASELINE_ID` as the promoted ID

For new workflows, prefer:

1. `make run-create` (explicit run creation)
2. Review results
3. `make run-promote` (explicit promotion)
## Examples

### Example 1: Create and Promote Baseline

    # Step 1: Create run
    make run-create \
        RUN_ID=run_2026-01-16_11-52-03 \
        DATASET_ID=curated_5feeds_smoke_v1 \
        EXPERIMENT_CONFIG=data/eval/configs/baseline_bart_small_led_long_fast.yaml

    # Step 2: Review results in data/eval/runs/run_2026-01-16_11-52-03/

    # Step 3: Promote to baseline
    make run-promote \
        RUN_ID=run_2026-01-16_11-52-03 \
        --as baseline \
        PROMOTED_ID=baseline_prod_authority_v2 \
        REASON="New production baseline with improved preprocessing profile"
### Example 2: Create and Promote Reference

    # Step 1: Create run
    make run-create \
        RUN_ID=run_2026-01-16_14-30-00 \
        DATASET_ID=curated_5feeds_benchmark_v1 \
        EXPERIMENT_CONFIG=experiments/silver_reference_gpt5.yaml

    # Step 2: Review results

    # Step 3: Promote to reference
    make run-promote \
        RUN_ID=run_2026-01-16_14-30-00 \
        --as reference \
        PROMOTED_ID=silver_gpt5_2_v1 \
        REASON="Silver reference using GPT-5 for benchmark dataset" \
        REFERENCE_QUALITY=silver
### Example 3: Leave as Run (No Promotion)

    # Create run
    make run-create \
        RUN_ID=run_2026-01-16_15-00-00 \
        DATASET_ID=curated_5feeds_smoke_v1 \
        EXPERIMENT_CONFIG=experiments/test_new_prompt.yaml

    # Review results - decide it's just an experiment
    # No promotion needed - can delete later if not useful
## Promotion Metadata

When a run is promoted, the `baseline.json` file is updated with promotion metadata:

    {
      "baseline_id": "baseline_prod_authority_v2",
      "dataset_id": "curated_5feeds_smoke_v1",
      "created_at": "2026-01-16T11:52:03Z",
      "promoted_from": "run_2026-01-16_11-52-03",
      "promoted_at": "2026-01-16T12:00:00Z",
      "promoted_as": "baseline",
      "promoted_id": "baseline_prod_authority_v2",
      "fingerprint_ref": "fingerprint.json"
    }

For references, it also includes:

    {
      "reference_quality": "silver"
    }
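A sketch of how promotion might merge this metadata into an existing `baseline.json` (the field names follow the example above, but the helper itself is hypothetical, not the actual promotion script):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_promotion(baseline_json: Path, run_id: str, promoted_as: str,
                     promoted_id: str, reference_quality=None) -> dict:
    """Merge promotion metadata into an existing baseline.json file."""
    meta = json.loads(baseline_json.read_text())
    meta.update({
        "promoted_from": run_id,
        "promoted_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "promoted_as": promoted_as,
        "promoted_id": promoted_id,
    })
    if reference_quality is not None:  # only set when promoting to a reference
        meta["reference_quality"] = reference_quality
    baseline_json.write_text(json.dumps(meta, indent=2))
    return meta
```

Updating the file in place (rather than replacing it) preserves the original `created_at` and `dataset_id`, keeping the promotion trail auditable.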
## README.md Template

Each promoted run gets a `README.md` that explains:
- What it is
- Why it was promoted
- How it should be used
This makes promotion explicit and reviewable.
## Best Practices

- Always review before promoting: don't auto-promote without checking results.
- Use descriptive IDs: `baseline_prod_authority_v2` is better than `baseline_v2`.
- Document the reason: the `REASON` parameter is important for future reference.
- Don't promote experiments: only promote runs that serve a clear purpose.
- Keep runs clean: delete runs that aren't useful (they're temporary).
## Summary

Execution is neutral. Meaning is assigned afterward.

- Create runs with `make run-create`
- Review results
- Promote with `make run-promote` if needed
- Same artifacts, different governance

This keeps the system clean, auditable, and flexible.