RFC-041: Podcast ML Benchmarking Framework¶
Status¶
🟢 Phase 0-1 Complete - Dataset materialization, baseline creation, and metrics structure implemented. CI integration pending.
RFC Number¶
041
Authors¶
Podcast Scraper Team
Date¶
2026-01-08
Updated¶
2026-01-16
Related ADRs¶
- ADR-014: Codified Comparison Baselines
- ADR-040: Explicit Golden Dataset Versioning
- ADR-041: Multi-Tiered Benchmarking Strategy
- ADR-042: Heuristic-Based Quality Gates
Related RFCs¶
- RFC-025: Test Metrics and Health Tracking
- RFC-027: Pipeline Metrics Improvements
- RFC-028: ML Model Preloading and Caching
Motivation¶
As the podcast processing pipeline evolves (audio preprocessing, ASR, chunking, LLM summarization), we need a repeatable, objective benchmarking system to ensure improvements do not introduce regressions in quality, latency, stability, or cost.
This RFC proposes a lightweight, engineering-focused benchmarking framework designed for:
- Fast iteration
- Automation in CI
- Clear regression signals
- Low operational overhead
The framework avoids academic complexity and focuses on metrics that directly impact product and cost.
Goals¶
- Detect regressions in ASR quality, latency, and cost
- Measure the impact of audio preprocessing decisions
- Track chunking and summarization stability over time
- Enable safe provider and model swaps
- Provide a frozen, reproducible benchmark dataset
Non-Goals¶
- Human-level transcription evaluation at scale
- Academic benchmarking or leaderboard comparisons
- Real-time streaming evaluation (future work)
Benchmark Datasets¶
Primary Dataset (Clean Baseline)¶
Podcast: The Indicator from Planet Money
Rationale:
- Highly consistent structure
- Short episodes (8–12 min)
- Studio-grade audio
- Minimal ads and sound effects
Selection Criteria:
- 20 fixed episodes
- Mix of older and recent episodes
- Single-topic episodes only
This dataset is versioned and frozen (e.g., indicator_v1).
Secondary Dataset (Medium Noise)¶
Podcast: Short Wave
Rationale:
- Slightly more conversational
- Moderate sound design
- Still structurally consistent
Used for stress-testing beyond the clean baseline.
Pipeline Stages Under Test¶
Each stage is benchmarked independently to isolate regressions.
```text
→ Audio Preprocessing
→ ASR
→ Text Post-processing
→ Chunking
→ LLM (Summarization / Embeddings)
```
Each stage emits structured metrics and artifacts.
---
## Metrics
### 1. Audio Preprocessing Metrics
| Metric | Description |
| -------- | ------------- |
| Original duration | Raw audio length |
| Processed duration | Post-trimming duration |
| File size delta | Compression effectiveness |
| Sample rate | Output consistency |
| Channels | Mono vs stereo |
Output: `audio_metrics.json`
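A minimal sketch of how these fields could be collected, assuming pydub (listed under the `benchmarks` extra later in this RFC); the file paths and key names are illustrative rather than the shipped schema.

```python
# Sketch: collect audio preprocessing metrics into audio_metrics.json.
# Paths and key names are illustrative assumptions.
import json
from pathlib import Path

from pydub import AudioSegment


def collect_audio_metrics(original: Path, processed: Path) -> dict:
    raw = AudioSegment.from_file(str(original))
    out = AudioSegment.from_file(str(processed))
    return {
        "original_duration_sec": raw.duration_seconds,
        "processed_duration_sec": out.duration_seconds,
        "file_size_delta_bytes": processed.stat().st_size - original.stat().st_size,
        "sample_rate": out.frame_rate,
        "channels": out.channels,
    }


if __name__ == "__main__":
    metrics = collect_audio_metrics(Path("ep001_raw.mp3"), Path("ep001_processed.mp3"))
    Path("audio_metrics.json").write_text(json.dumps(metrics, indent=2))
```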
---
### 2. ASR Metrics
| Metric | Description |
| -------- | ------------- |
| WER | Compared to reference |
| Tokens per minute | Cost proxy |
| Real-time factor | Performance (speed) |
| ASR latency | Wall-clock time |
**Reference Strategy:**
- Use Whisper-large-v3 as a frozen baseline when human transcripts are unavailable
- **Note:** WER is relative to Whisper-large-v3, not absolute human ground truth
- **Future improvement:** Obtain 5-10 human-verified transcripts from NPR for gold standard validation
Output: `asr_metrics.json`
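A minimal sketch of the per-episode calculation, assuming jiwer for WER against the frozen reference transcript; the whitespace tokenization is only a rough cost proxy, not the shipped tokenizer, and the function name is illustrative.

```python
# Sketch: per-episode ASR metrics (WER vs the frozen Whisper-large-v3
# reference, tokens/minute as a cost proxy, real-time factor, latency).
import jiwer


def asr_metrics(reference: str, hypothesis: str,
                audio_minutes: float, asr_wall_clock_sec: float) -> dict:
    tokens = len(hypothesis.split())  # crude cost proxy, not the real tokenizer
    return {
        "wer": jiwer.wer(reference, hypothesis),
        "tokens_per_minute": tokens / audio_minutes,
        "real_time_factor": asr_wall_clock_sec / (audio_minutes * 60),
        "asr_latency_sec": asr_wall_clock_sec,
    }
```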
---
### 3. Chunking Metrics
| Metric | Description |
| -------- | ------------- |
| Avg chunk size | Tokens |
| Chunk variance | Stability |
| Overlap percentage | Redundancy |
| Sentence boundary accuracy | Readability |
Output: `chunk_metrics.json`
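A minimal sketch of how these values could be derived, assuming the chunker already reports per-chunk token counts, overlap tokens, and sentence-boundary hits; key names are illustrative.

```python
# Sketch: chunking stability metrics from the chunker's own bookkeeping.
# Inputs (token counts, overlap, boundary hits) are assumed to be available.
import statistics


def chunk_metrics(chunk_token_counts: list[int], overlap_tokens: int,
                  boundary_hits: int, total_boundaries: int) -> dict:
    return {
        "avg_chunk_size_tokens": statistics.mean(chunk_token_counts),
        "chunk_variance": statistics.pvariance(chunk_token_counts),
        "overlap_percentage": 100 * overlap_tokens / sum(chunk_token_counts),
        "sentence_boundary_accuracy": boundary_hits / total_boundaries,
    }
```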
---
### 4. LLM Output Metrics
#### Summarization
| Metric | Description |
| -------- | ------------- |
| Summary token count | Cost |
| Summary latency | UX |
| Semantic similarity | Drift detection |
#### Stability Test
Run identical input 3× with **different random seeds** to measure:
- Temperature-induced variance (for LLMs with temperature > 0)
- Non-deterministic behavior (even at temperature=0, some APIs vary)
**Metrics:**
- Embedding cosine similarity (pairwise avg)
- Length variance (tokens)
- BLEU score between outputs
**Embedding Model:** `sentence-transformers/all-MiniLM-L6-v2` (local, no API costs)
**Threshold:** Cosine similarity < 0.90 triggers instability alert
Output: `llm_metrics.json`
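A minimal sketch of the stability calculation using the local all-MiniLM-L6-v2 embedder named above; the BLEU comparison (via nltk) is omitted for brevity, and field names are illustrative.

```python
# Sketch: 3-run stability check for one episode using local embeddings.
# The 0.90 alert threshold matches the rule stated above.
import statistics
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def stability_metrics(summaries: list[str]) -> dict:
    """Compare N summaries of the same episode (the RFC runs 3)."""
    embeddings = _model.encode(summaries, convert_to_tensor=True)
    sims = [float(util.cos_sim(embeddings[i], embeddings[j]))
            for i, j in combinations(range(len(summaries)), 2)]
    lengths = [len(s.split()) for s in summaries]
    avg_sim = sum(sims) / len(sims)
    return {
        "pairwise_cosine_similarity_avg": avg_sim,
        "length_variance_tokens": statistics.pvariance(lengths),
        "instability_alert": avg_sim < 0.90,
    }
```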
---
## Golden Signals
Each benchmark run produces the following **top-level signals**:
1. ASR WER
2. ASR latency (sec / min audio)
3. Tokens per episode
4. Average chunk size
5. Summary latency
6. Total estimated cost
These are tracked as time series.
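A minimal sketch of appending one run's signals to the `history.jsonl` time series referenced in the directory structure below; the record layout is an assumption.

```python
# Sketch: append one run's golden signals as a JSONL record for trend tracking.
import json
import time
from pathlib import Path

HISTORY = Path("benchmarks/reports/history.jsonl")


def record_golden_signals(signals: dict) -> None:
    """Append one benchmark run's golden signals to the time series."""
    entry = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), **signals}
    HISTORY.parent.mkdir(parents=True, exist_ok=True)
    with HISTORY.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
```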
---
## Cost Calculation Methodology
Cost includes:
- **ASR cost:** GPU time × compute rate (for Whisper) OR API cost (for OpenAI/Deepgram)
- **LLM cost:** (Input tokens × $X) + (Output tokens × $Y)
- **Storage cost:** Audio files + transcripts + artifacts
**Pricing Assumptions:**
- Whisper local: $0.50/hour GPU compute (A100 equivalent)
- OpenAI Whisper API: $0.006/minute audio
- GPT-4: $30/1M input tokens, $60/1M output tokens
- Claude 3.5 Sonnet: $3/1M input tokens, $15/1M output tokens
- Storage: $0.023/GB/month (S3 Standard)
**Cost Formula:**
```python
total_cost = (
asr_cost +
(input_tokens * input_price_per_million / 1_000_000) +
(output_tokens * output_price_per_million / 1_000_000) +
(storage_gb * storage_price_per_gb_month * days / 30)
)
```
Output: `cost.json`
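A minimal sketch of the `asr_cost` term under the pricing assumptions above; the mode names and function signature are illustrative.

```python
# Sketch: the asr_cost term from the formula above, for the two deployment
# modes listed under Pricing Assumptions. Rates are the assumed values from
# this section, not live pricing.
def asr_cost(audio_minutes: float, mode: str, gpu_seconds: float = 0.0) -> float:
    if mode == "local_whisper":        # $0.50/hour GPU compute (A100 equivalent)
        return gpu_seconds / 3600 * 0.50
    if mode == "openai_whisper_api":   # $0.006/minute of audio
        return audio_minutes * 0.006
    raise ValueError(f"unknown mode: {mode}")
```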
---
## Regression Rules
A benchmark run fails if:
- WER increases by >3%
- Token usage increases by >10%
- Latency increases by >15%
- Chunk size variance increases by >20%
- Cost increases without justification
Failures block merges unless explicitly overridden.
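A minimal sketch of the regression check, interpreting the thresholds above as relative increases over the baseline; metric keys and the baseline layout are assumptions, and the cost rule is left to human review rather than automated failure.

```python
# Sketch: percentage-based regression check over the golden signals.
# Thresholds mirror the rules above; metric keys are illustrative.
THRESHOLDS = {                 # max allowed relative increase vs baseline
    "wer": 0.03,
    "tokens_per_episode": 0.10,
    "latency": 0.15,
    "chunk_size_variance": 0.20,
}


def regression_failures(current: dict, baseline: dict) -> list[str]:
    failures = []
    for metric, max_increase in THRESHOLDS.items():
        if baseline.get(metric):
            delta = (current[metric] - baseline[metric]) / baseline[metric]
            if delta > max_increase:
                failures.append(f"{metric} increased {delta:.1%} (max {max_increase:.0%})")
    return failures
```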
---
## Baseline Establishment
### Initial Baseline
1. Run benchmark suite on current `main` branch
2. Manual review of results for sanity
3. Tag as `benchmark-baseline-v1`
4. Store golden artifacts in `benchmarks/baselines/v1/`
### Updating Baselines
Baselines are updated only when:
1. Intentional improvements are merged (e.g., new model upgrade)
2. Manual approval from team
3. Documented in `benchmarks/CHANGELOG.md`
### Baseline Storage
- Store baseline metrics in git (JSON files)
- Store baseline artifacts (transcripts, summaries) in git-lfs or S3
- Never delete old baselines (for historical comparison)
---
## CI Integration Strategy
### When Benchmarks Run
**Option A (Comprehensive):**
- Run on all PRs to `main`
- Run nightly on `main`
- ~15-30 min duration (20 episodes × 10 min avg)
**Option B (Lightweight - Recommended):**
- Run smoke test (3 episodes) on all PRs (~5 min)
- Run full benchmark (20 episodes) nightly + on release branches
- This balances speed vs coverage
**Recommendation:** Start with Option B
### Workflow
```yaml
name: ML Benchmarks

on:
  pull_request:
    paths:
      - 'podcast_scraper/transcription/**'
      - 'podcast_scraper/summarization/**'
      - 'podcast_scraper/audio_processing/**'
      - 'benchmarks/**'
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC

jobs:
  smoke-test:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run smoke test (3 episodes)
        run: make benchmark-smoke
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-smoke
          path: benchmarks/runs/

  full-benchmark:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4
      - name: Run full benchmark (20 episodes)
        run: make benchmark-full
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-full
          path: benchmarks/runs/
      - name: Publish metrics
        run: python scripts/upload_benchmark_metrics.py
```
### Artifacts
- Upload benchmark reports as GitHub Actions artifacts
- Publish results to metrics dashboard (reuse existing dashboard from RFC-026)
- Post summary to PR comments (for quick feedback)
---
## Directory Structure
```text
benchmarks/
  datasets/
    indicator_v1.json        # 20 episode metadata + download URLs
    shortwave_v1.json        # Secondary dataset
  baselines/
    v1/
      audio_metrics.json
      asr_metrics.json
      chunk_metrics.json
      llm_metrics.json
      cost.json
  runs/
    2026-01-08-pr-123/
      audio_metrics.json
      asr_metrics.json
      chunk_metrics.json
      llm_metrics.json
      cost.json
      summary.md
    2026-01-09-nightly/
      audio_metrics.json
      asr_metrics.json
      chunk_metrics.json
      llm_metrics.json
      cost.json
      summary.md
  reports/
    latest.md                # Human-readable report
    history.jsonl            # Time series data
  CHANGELOG.md               # Baseline update log
```
---
## Example Benchmark Report
```text
=== Benchmark Run: 2026-01-08 ===
Dataset: indicator_v1 (20 episodes)
Trigger: PR #123 (feat: add silence trimming)
Golden Signals:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Metric Current Baseline Delta Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ASR WER 3.2% 3.1% +0.1% ⚠️ WARN
ASR latency 0.42s/min 0.40s/min +5% ✅ PASS
Tokens/episode 2,450 2,400 +2.1% ⚠️ WARN
Avg chunk size 512 tok 510 tok +0.4% ✅ PASS
Summary latency 1.8s 1.9s -5.3% ✅ PASS
Total cost/episode $0.35 $0.33 +6% ⚠️ WARN
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Regression Check: ⚠️ WARN (3 metrics increased but within thresholds)
Details:
- ASR: Silence trimming reduced audio duration by 8%, slightly increased WER
- Cost: Higher token count due to preserving more context at boundaries
- Recommendation: Acceptable trade-off for reduced processing time
Episodes Processed: 20/20
Total Duration: 187 minutes
Total Runtime: 14.3 minutes (real-time factor: 0.08)
```
---
## Implementation Timeline
### Phase 0: Infrastructure Setup (Week 1)
- Set up `benchmarks/` directory structure
- Create dataset collection script (`scripts/collect_benchmark_dataset.py`)
- Download and version The Indicator episodes (20 episodes)
- Document dataset in `benchmarks/datasets/indicator_v1.json`
**Deliverables:**
- [ ] Directory structure created
- [ ] Dataset collected and committed to git-lfs
- [ ] README.md in `benchmarks/` with usage instructions
### Phase 1: Audio Preprocessing Benchmarks (Week 2)
- Implement audio preprocessing metrics collection
- Baseline establishment for stereo vs mono, silence trimming, sample rate
- Run initial experiments and document findings
**Deliverables:**
- [ ] `scripts/benchmark_audio.py`
- [ ] `benchmarks/baselines/v1/audio_metrics.json`
- [ ] Initial findings report
### Phase 2: ASR Benchmarks (Week 3)
- Implement ASR benchmarks (WER, latency, tokens)
- Establish Whisper-large-v3 as reference
- Compare local vs API Whisper
**Deliverables:**
- [ ] `scripts/benchmark_asr.py`
- [ ] `benchmarks/baselines/v1/asr_metrics.json`
- [ ] WER baseline established
### Phase 3: Chunking + LLM Benchmarks (Week 4)
- Implement chunking metrics
- Implement LLM stability tests
- Cost calculation automation
**Deliverables:**
- [ ] `scripts/benchmark_chunking.py`
- [ ] `scripts/benchmark_llm.py`
- [ ] `benchmarks/baselines/v1/llm_metrics.json`
### Phase 4: CI Integration + Dashboard (Week 5)
- Create GitHub Actions workflow
- Integrate with metrics dashboard
- PR comment automation
**Deliverables:**
- [ ] `.github/workflows/ml-benchmarks.yml`
- [ ] Dashboard integration (reuse RFC-026 dashboard)
- [ ] PR comment bot
**Total Effort:** ~5 weeks (1 person, part-time)
---
## Dependencies
### New Python Packages
```toml
[project.optional-dependencies]
benchmarks = [
    "jiwer>=3.0.0",                   # WER calculation
    "scipy>=1.11.0",                  # Audio processing metrics
    "sentence-transformers>=2.2.0",   # Semantic similarity (local)
    "nltk>=3.8.0",                    # Text processing for BLEU
    "pydub>=0.25.0",                  # Audio file handling
]
```
External Services (Optional)¶
- OpenAI API (for embeddings, if not using sentence-transformers)
- AWS S3 / GCS (for large artifact storage, ~5-10 GB)
Infrastructure¶
- GitHub Actions minutes: +30 min/day (nightly runs) + 5 min/PR (smoke tests)
- Storage: 5-10 GB for dataset + artifacts (recommend git-lfs or S3)
- GPU access (optional): For faster local Whisper benchmarking
Initial Experiments¶
The first benchmarking phase focuses on:
- Stereo vs mono audio
  - Hypothesis: Mono reduces Whisper inference time by 30-40%
  - Metrics: ASR latency, WER, file size
- Silence trimming on/off
  - Hypothesis: Trimming reduces tokens by 5-10%
  - Metrics: Tokens/episode, cost, WER
- Sample rate: 16 kHz vs 44.1 kHz
  - Hypothesis: 16 kHz is sufficient for speech and reduces processing time
  - Metrics: ASR latency, WER, file size
These experiments are expected to yield measurable gains in:
- ASR speed (target: 20-30% improvement)
- Token usage (target: 5-10% reduction)
- Cost predictability (target: consistent costs within ±5%)
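A minimal sketch of producing the three preprocessing variants for these experiments with pydub; the silence-detection parameters are assumptions for illustration, not tuned values.

```python
# Sketch: build the mono, 16 kHz, and silence-trimmed variants of one episode
# so the experiments above can compare them on the same source audio.
from pydub import AudioSegment
from pydub.silence import split_on_silence


def make_variants(src: str) -> dict:
    audio = AudioSegment.from_file(src)
    mono = audio.set_channels(1)                 # stereo vs mono experiment
    resampled = audio.set_frame_rate(16000)      # 16 kHz vs 44.1 kHz experiment
    # Silence trimming experiment: drop pauses longer than ~0.7 s.
    chunks = split_on_silence(audio, min_silence_len=700,
                              silence_thresh=audio.dBFS - 16)
    trimmed = sum(chunks, AudioSegment.empty())
    return {"mono": mono, "resampled_16k": resampled, "silence_trimmed": trimmed}
```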
Out of Scope (Explicitly Excluded from v1)¶
The following are intentionally NOT benchmarked in v1:
- Ad detection accuracy - No ground truth available yet
- Speaker name accuracy - Requires manual annotation
- Metadata extraction - Too subjective for automated testing
- End-to-end user workflows - Covered by E2E tests instead (see RFC-019)
- Real podcast variability - Using controlled datasets intentionally
- Cost optimization strategies - Focus is on measurement, not optimization
These may be added in future versions as the system matures.
Alternatives Considered¶
Option A: Use Existing Benchmarking Frameworks¶
Tools: MLflow, Weights & Biases (W&B), DVC
Pros:
- Rich ecosystem, established patterns
- Built-in visualization and experiment tracking
- Industry-standard tools
Cons:
- Too heavyweight for our needs
- Requires infrastructure (MLflow server, W&B account)
- Steep learning curve for team
- Vendor lock-in risk (W&B)
- Overkill for simple regression detection
Decision: Rejected - build lightweight custom framework
Option B: Manual Testing Only¶
Pros:
- No infrastructure needed
- Flexible, human judgment
Cons:
- Not scalable
- No regression prevention
- High risk of missing regressions
- Inconsistent methodology
Decision: Rejected - automation is critical
Option C: Academic Benchmarks (LibriSpeech, CommonVoice)¶
Pros:
- Established baselines
- Large datasets
- Comparable to other systems
Cons:
- Not representative of podcast audio
- No long-form episodes (real podcasts run 30-90 min)
- Different domain (audiobooks vs podcasts)
- No chunking/summarization context
Decision: Rejected - use real podcast data for domain-specific insights
Future Work¶
- Ad detection and removal metrics
- Speaker diarization benchmarking
- Real-time / streaming ASR evaluation
- Human-evaluated gold transcripts (partnership with NPR?)
- Multi-language podcast support
- Podcast-specific metrics (music detection, intro/outro detection)
🚀 Evolution & Improvements (2026-01-10 Update)¶
Critical Enhancements for Phase 0 Implementation¶
Based on lessons learned from RFC-015 and RFC-016, the following improvements are critical before Phase 0 implementation.
1. Align Dataset Definitions with Experiment Runner¶
Problem: RFC-041 defines indicator_v1.json datasets. RFC-015 defines episode globs. Two competing sources of truth will cause confusion.
Solution: Make dataset JSON the canonical definition.
Dataset JSON Format¶
```json
{
  "dataset_id": "indicator_v1",
  "version": "1.0",
  "description": "NPR Planet Money: The Indicator episodes (explainer style)",
  "created_at": "2026-01-10T14:00:00Z",
  "content_regime": "explainer",
  "episodes": [
    {
      "episode_id": "ep001",
      "title": "Why gas prices are so high",
      "duration_minutes": 8,
      "audio_path": "data/eval/datasets/indicator_v1/audio/ep001.mp3",
      "transcript_path": "data/eval/datasets/indicator_v1/transcripts/ep001.txt",
      "golden_summary_path": "data/eval/golden/indicator_v1/ep001.txt",
      "content_hash": "abc123...",
      "preprocessing_profile": "cleaning_v3"
    }
  ]
}
```
Materialization Script¶
```python
# scripts/eval/materialize_dataset.py
import json
import shutil
from pathlib import Path


def materialize_dataset(dataset_json: Path, output_dir: Path):
    """Materialize dataset JSON into episode folders."""
    dataset = json.loads(dataset_json.read_text())
    for episode in dataset["episodes"]:
        ep_dir = output_dir / dataset["dataset_id"] / episode["episode_id"]
        ep_dir.mkdir(parents=True, exist_ok=True)
        # Copy/link files
        shutil.copy(episode["transcript_path"], ep_dir / "transcript.txt")
        if episode.get("golden_summary_path"):
            shutil.copy(episode["golden_summary_path"], ep_dir / "golden.txt")
        # Write metadata
        (ep_dir / "metadata.json").write_text(json.dumps({
            "episode_id": episode["episode_id"],
            "dataset_id": dataset["dataset_id"],
            "content_hash": episode["content_hash"],
            "preprocessing_profile": episode["preprocessing_profile"],
        }, indent=2))
```
Why: Single source of truth. Experiment runner reads dataset JSON, not globs. Prevents mismatched comparisons.
2. Add Summarization Quality Gates¶
Problem: Current regression rules only cover ASR/chunking (WER, latency, cost). Missing summarization-specific failures.
Solution: Add quality gates that match known podcast summary issues.
Summarization-Specific Regression Rules¶
```yaml
# benchmarks/regression_rules.yaml
summarization_gates:
  # Boilerplate leak (MUST be zero)
  boilerplate_leak_rate:
    baseline: 0.0
    max_delta: 0.0          # Zero tolerance
    severity: "critical"

  # Repetition score (n-gram duplication)
  repetition_score:
    baseline: 0.15          # 15% duplicate trigrams (from baseline measurement)
    max_delta: 0.05         # Allow up to 20% (15% + 5%)
    severity: "major"

  # Truncation rate (ellipsis, incomplete sentences)
  truncation_rate:
    baseline: 0.02          # 2% of sentences incomplete
    max_delta: 0.03         # Allow up to 5%
    severity: "major"

  # Numbers retained (preserve quantitative data)
  numbers_retained:
    baseline: 0.85          # 85% of numbers from reference
    min_threshold: 0.80     # Never drop below 80%
    severity: "minor"

  # Speaker label leak
  speaker_label_leak_rate:
    baseline: 0.0
    max_delta: 0.0          # Zero tolerance
    severity: "critical"

  # Summary length variance (stability)
  summary_length_variance:
    baseline_mean: 450      # chars
    baseline_std: 50        # chars
    max_std_delta: 20       # Allow std up to 70 chars
    severity: "minor"
```
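A minimal sketch of the heuristics behind two of these gates: repetition_score as a duplicate-trigram ratio and truncation_rate as an incomplete-sentence ratio. These are illustrative approximations, not the shipped implementation.

```python
# Sketch: heuristic calculations for repetition_score and truncation_rate.
import re


def repetition_score(summary: str) -> float:
    """Fraction of trigrams that are duplicates (0.0 = no repetition)."""
    tokens = summary.lower().split()
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    return 1 - len(set(trigrams)) / len(trigrams)


def truncation_rate(summary: str) -> float:
    """Fraction of sentences that end in an ellipsis or lack terminal punctuation."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return 0.0
    incomplete = sum(1 for s in sentences
                     if s.endswith("...") or s[-1] not in ".!?")
    return incomplete / len(sentences)
```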
Gate Evaluation¶
```python
# src/podcast_scraper/benchmarks/gates.py
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class GateViolation:
    # Minimal definition inferred from the usage below.
    metric: str
    severity: str
    current: float
    threshold: float
    message: str
    baseline: Optional[float] = None
    delta: Optional[float] = None


def evaluate_quality_gates(metrics: Dict[str, float], rules: Dict[str, Any]) -> List[GateViolation]:
    """Evaluate quality gates and return violations."""
    violations = []
    for metric_name, rule in rules.items():
        current_value = metrics.get(metric_name)
        if current_value is None:
            continue
        baseline = rule.get("baseline")
        max_delta = rule.get("max_delta")
        min_threshold = rule.get("min_threshold")
        severity = rule["severity"]
        # Check delta from baseline
        if max_delta is not None and baseline is not None:
            delta = current_value - baseline
            if delta > max_delta:
                violations.append(GateViolation(
                    metric=metric_name,
                    severity=severity,
                    baseline=baseline,
                    current=current_value,
                    delta=delta,
                    threshold=max_delta,
                    message=f"{metric_name} increased by {delta:.3f} (max allowed: {max_delta})",
                ))
        # Check absolute threshold
        if min_threshold is not None and current_value < min_threshold:
            violations.append(GateViolation(
                metric=metric_name,
                severity=severity,
                current=current_value,
                threshold=min_threshold,
                message=f"{metric_name} is {current_value:.3f} (min required: {min_threshold})",
            ))
    return violations
```
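A minimal usage sketch wiring the gate evaluator to the YAML rules above; PyYAML is assumed (it is not part of the `benchmarks` extra listed earlier), the import path follows the file comment above, and the metric values are illustrative.

```python
# Usage sketch: load the YAML gates and evaluate one run's metrics.
import yaml

from podcast_scraper.benchmarks.gates import evaluate_quality_gates

with open("benchmarks/regression_rules.yaml") as fh:
    rules = yaml.safe_load(fh)["summarization_gates"]

run_metrics = {"repetition_score": 0.22, "numbers_retained": 0.78}  # example values
for violation in evaluate_quality_gates(run_metrics, rules):
    print(f"[{violation.severity}] {violation.message}")
```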
Why: Catches real regressions that ROUGE misses. Zero-tolerance for critical failures (boilerplate, speaker labels).
3. Add Content Regime Datasets (Feed-Style Buckets)¶
Problem: Current plan uses Indicator (explainer) + Short Wave (science). Missing narrative journalism stress case.
Solution: Add The Journal as third dataset to cover narrative regime.
Three Content Regimes¶
```python
CONTENT_REGIMES = {
    "explainer": {
        "datasets": ["indicator_v1"],
        "characteristics": [
            "Short episodes (5-10 min)",
            "Single concept deep-dives",
            "Data/stats heavy",
            "Educational tone",
        ],
        "stress_tests": ["Numbers retention", "Concept clarity"],
    },
    "science": {
        "datasets": ["shortwave_v1"],
        "characteristics": [
            "Medium episodes (15-20 min)",
            "Scientific topics",
            "Interview format",
            "Technical vocabulary",
        ],
        "stress_tests": ["Technical term preservation", "Interview structure"],
    },
    "narrative": {
        "datasets": ["journal_v1"],  # NEW
        "characteristics": [
            "Long episodes (20-30 min)",
            "Story-driven journalism",
            "Multiple speakers/sources",
            "Chronological narrative",
        ],
        "stress_tests": ["Narrative flow", "Multi-speaker attribution", "Chronology preservation"],
    },
}
```
Dataset Sizing¶
```yaml
# Phase 0 (Baseline Establishment)
datasets:
  indicator_v1:
    episodes: 10          # Representative explainer set
    content_regime: "explainer"
  shortwave_v1:
    episodes: 10          # Representative science set
    content_regime: "science"
  journal_v1:             # NEW
    episodes: 5-10        # Smaller but critical stress case
    content_regime: "narrative"
```
Why: The Journal is the real-world stress case for long-form narrative. Don't skip it.
4. Baseline Integration (Shared with RFC-015/016)¶
Problem: RFC-041 regression rules reference "baseline" but don't define the artifact structure.
Solution: Use shared baseline_id concept from RFC-015.
Baseline Reference in Benchmarks¶
```yaml
# benchmarks/benchmark_config.yaml
benchmark_id: "podcast_ml_v1"
baseline_id: "bart_led_baseline_v2"            # Shared with RFC-015
baseline_path: "benchmarks/baselines/bart_led_baseline_v2/"

datasets:
  - indicator_v1
  - shortwave_v1
  - journal_v1

regression_rules:
  baseline: "bart_led_baseline_v2"             # Reference by ID
  asr_gates: {...}
  summarization_gates: {...}
```
Why: Single baseline artifact used by both experiment runner (RFC-015) and benchmarking (RFC-041). No duplication.
Implementation Status¶
✅ Phase 0: Dataset Freezing + Baseline Artifacts (Complete)¶
- ✅ Dataset JSON format implemented (`data/eval/datasets/` and `benchmarks/datasets/`)
- ✅ Dataset creation scripts (`scripts/eval/create_dataset_json.py`)
- ✅ Dataset materialization (`scripts/eval/materialize_dataset.py`) with hash validation
- ✅ Source data inventory (`scripts/eval/generate_source_index.py`, `scripts/eval/generate_episode_metadata.py`)
- ✅ Baseline creation (`scripts/eval/materialize_baseline.py`) with comprehensive fingerprinting
- ✅ Baseline storage structure (`data/eval/baselines/`) with `predictions.jsonl`, `metrics.json`, `fingerprint.json`, `baseline.json`
- ✅ Metrics structure (`metrics.json`) with intrinsic and vs_reference sections
- ✅ README governance layer for all artifact types
✅ Phase 1: Integration with RFC-015 (Complete)¶
- ✅ Experiment runner reads dataset JSONs (`scripts/eval/run_experiment.py` supports `dataset_id`)
- ✅ Experiment runner references `baseline_id` and optional `reference_ids`
- ✅ Quality gates evaluated automatically (intrinsic metrics: gates, length, performance, cost)
- ✅ Regression detection via comparison deltas (`comparisons/vs_{baseline_id}.json`)
- ✅ Reference model implemented (baseline, silver, gold references)
- ✅ Promotion workflow (`scripts/eval/promote_run.py`, `make run-promote`)
🟡 Phase 2: CI Integration (Pending)¶
- ⏳ Smoke tests (3 episodes from curated datasets)
- ⏳ Nightly full benchmarks (all datasets)
- ⏳ Regression alerts (Slack/email)
Timeline: Phase 0-1 complete. Phase 2 depends on RFC-015 Phase 4 (CI integration).
Conclusion¶
This framework provides fast feedback, objective guardrails, and long-term visibility into the health of the podcast ML pipeline. It is intentionally simple, automation-friendly, and designed to evolve as the system grows.
By focusing on engineering-relevant metrics (cost, latency, quality) and avoiding academic complexity, we can iterate quickly while maintaining confidence in our changes.
Current Status:
- ✅ Phase 0-1 complete (dataset materialization, baseline creation, metrics structure, promotion workflow)
- ✅ Integration with RFC-015 complete (experiment runner uses dataset JSONs, baseline/reference support)
- ⏳ Phase 2 pending (CI integration - depends on RFC-015 Phase 4)
Next Steps:
- Complete RFC-015 Phase 4 (CI integration)
- Deploy smoke tests to CI
- Deploy nightly benchmarks
- Set up regression alerts