RFC-066: Run Comparison Tool — Performance Tab¶
- Status: Completed (v2.6.0) — Streamlit app
tools/run_compare/app.pyexposes Performance (?page=performance) joining run comparison data with frozen RFC-064 profiles (split from RFC-064). - Authors: Podcast Scraper Team
- Stakeholders: Maintainers, operators comparing eval runs and release profiles
- Related PRDs:
- PRD-007: AI Quality & Experimentation Platform — experiment / comparison tooling (RFC-047, RFC-041)
- Related ADRs:
- ADR-076: Streamlit for operator run comparison and performance views — eval / Performance UI stays in Streamlit
- ADR-075: Frozen YAML performance profiles — profile artifact pattern consumed here
- ADR-014: Codified Comparison Baselines
- ADR-027: Unified Provider Metrics Contract
- ADR-040: Explicit Golden Dataset Versioning
- Related RFCs:
- RFC-064: Performance profiling and release freeze — frozen profiles (data source)
- RFC-065: Live pipeline monitor — sibling split from RFC-064
- RFC-047: Run comparison visual tool — Streamlit app this RFC extends
- RFC-041: Podcast ML benchmarking framework — quality side of run comparison
- Related Documents:
- Performance Profile Guide
- GitHub #511 — tracking issue
- GitHub #510 — RFC-064 epic (profiles prerequisite)
- Updated: 2026-04-09 (stub), 2026-04-10 (expanded)
Abstract¶
Extend the existing RFC-047 Streamlit run comparison tool
(tools/run_compare/app.py) with a Performance navigation page that
displays resource cost metrics from frozen profiles (RFC-064) alongside the
existing quality comparison views.
Where RFC-047 answers "did quality change between runs?", this extension adds
"did resource cost change?" — joining quality evaluation data (data/eval/)
and performance profiles (data/profiles/) by release tag into a single visual
tool.
Split from RFC-064 to keep the release freeze framework focused on its core deliverable (frozen profiles + diff tool) and to avoid mixing Streamlit UI design with profiling infrastructure.
Problem Statement¶
After RFC-064 shipped the freeze_profile.py script and diff_profiles.py
terminal diff, we have frozen YAML profiles accumulating in data/profiles/.
The terminal diff is useful for quick two-profile comparisons, but it cannot:
- Show trends across many releases at a glance.
- Overlay quality metrics (ROUGE, success rate) on the same view as resource cost, so the developer can see whether a quality improvement came at a resource cost or vice versa.
- Let the developer interactively filter by hostname, dataset, or stage.
The run comparison tool already provides interactive Streamlit UI, Plotly charts, and a nav structure. Adding a Performance page is the natural extension.
Design¶
Join Key¶
Quality runs in data/eval/runs/ and performance profiles in
data/profiles/*.yaml are joined by release tag. Both artifact types carry
the release tag in their metadata:
- Frozen profiles: top-level
releasefield (e.g.v2.6.0). - Eval runs: the run directory name or
fingerprint.jsonreleasefield.
The tool matches them automatically when both exist for the same tag. When only one side exists, the tool shows what it has and marks the other as "missing".
Profile Discovery¶
A new function discover_profiles() in tools/run_compare/data.py scans
data/profiles/*.yaml, parses each YAML, and returns a list of
ProfileEntry dataclass objects:
@dataclass(frozen=True)
class ProfileEntry:
release: str
date: str
dataset_id: str
hostname: str
path: Path
stages: Dict[str, Dict[str, float]]
totals: Dict[str, float]
environment: Dict[str, Any]
Profiles are sorted by date (oldest first) for trend charts.
Joined Data Model¶
A JoinedRelease dataclass pairs an optional RunEntry (quality) with an
optional ProfileEntry (performance) by release tag:
@dataclass(frozen=True)
class JoinedRelease:
release: str
eval_entry: Optional[RunEntry]
profile_entry: Optional[ProfileEntry]
The join is a full outer join: releases with only quality data appear (profile column blank), and releases with only a profile appear (quality column blank).
Navigation¶
Add Performance to the existing nav bar:
Home · KPIs · Delta · Episodes · Performance
The ?page=performance slug routes to the new page. Existing pages are
unchanged.
Performance Page Layout¶
The page has four sections, rendered top to bottom:
1. KPI Tiles (Top Row)¶
For each selected release that has a profile, display a row of metric tiles:
| Tile | Source |
|---|---|
| Peak RSS (MB) | totals.peak_rss_mb |
| Total wall time (s) | totals.wall_time_s |
| Avg wall time / episode (s) | totals.avg_wall_time_per_episode_s |
| Episodes processed | episodes_processed |
| Hostname | environment.hostname |
When two or more releases are selected, show delta arrows (green/red) between
baseline and candidates, reusing the same delta_direction_good pattern from
the quality delta table (lower is better for RSS and wall time).
2. Stage Delta Table¶
Same colored-delta pattern as RFC-047's quality delta table, applied to per-stage resource metrics:
| Column | Description |
|---|---|
| Stage | Pipeline stage name (e.g. summarization) |
| Metric | wall_time_s, peak_rss_mb, or avg_cpu_pct |
| Baseline | Value from the baseline release |
| Candidate | Value from the candidate release |
| Delta | Absolute difference |
| Delta % | Percentage change |
| Good? | Green check / red X (lower is better for all three) |
The baseline is selected from the sidebar dropdown (same pattern as quality baseline). When more than two releases are selected, each non-baseline release gets its own candidate column.
3. Trend Chart (Multi-Release Line Chart)¶
Line chart with releases on the x-axis (ordered by date) and one line per stage. Two sub-charts stacked vertically:
- Wall time per stage — y-axis in seconds.
- Peak RSS per stage — y-axis in MB.
Stages that are absent in some releases show gaps (no interpolation). A hostname filter in the sidebar restricts the chart to profiles from the same machine (with a warning banner when mixed-machine profiles are displayed together).
Chart style follows docs/guides/TUFTE_CHART_CRITIQUE.md: white background,
minimal chrome, small multiples where appropriate, no bubble-area distortion.
4. Quality vs Cost Scatter¶
A two-axis scatter plot where each dot is one release:
- X-axis: user-selectable quality metric from a dropdown:
- ROUGE-L F1 (default, from
metrics.jsonvs_reference) - Success rate
- Avg latency (s)
- Y-axis: user-selectable resource metric from a dropdown:
- Total wall time (s) (default)
- Peak RSS (MB)
- Avg wall time per episode (s)
Only releases that have both a quality eval run and a frozen profile are plotted. Dots are labeled with the release tag. Hover shows full details.
This chart answers the key question: "Did the quality improvement come at a resource cost?"
Cross-Machine Handling¶
Profiles carry environment.hostname. When the selected set includes profiles
from more than one hostname:
- A warning banner appears at the top of the Performance page: "Selected profiles come from different machines — absolute RSS/CPU values are not directly comparable. Filter by hostname in the sidebar for meaningful comparisons."
- The sidebar gains a Hostname filter (multi-select, default: all).
- Trend charts and delta tables still render but with the warning visible.
Missing Stages in Trend Charts¶
Some pipeline stages may be disabled in certain releases (e.g. gi_generation
skipped when GI is off). The trend chart handles this by:
- Showing a gap in the line for that stage at that release.
- Not interpolating across gaps.
- A footnote: "Gaps indicate stages not present in that release's profile."
Minimum Profile Count¶
- 1 profile: KPI tiles and stage breakdown render (no trend, no delta).
- 2 profiles: Delta table and quality-vs-cost scatter become available.
- 3+ profiles: Trend chart becomes meaningful (line chart with 2 points is technically renderable but not very useful; still shown).
Cross-Dataset Profiles¶
Profiles carry dataset_id. The sidebar gains a Dataset filter
(multi-select, default: all). When comparing profiles from different datasets,
a warning appears: "Profiles use different datasets — wall times are not
directly comparable across different episode counts/sizes."
Data Layer Changes (tools/run_compare/data.py)¶
New functions:
| Function | Purpose |
|---|---|
discover_profiles(profiles_root) |
Scan data/profiles/*.yaml, return List[ProfileEntry] |
load_profile(path) |
Parse one YAML into ProfileEntry |
join_releases(runs, profiles) |
Full outer join by release tag → List[JoinedRelease] |
profile_delta_rows(base, candidates) |
Stage-level delta rows for the delta table |
profile_trend_rows(profiles) |
Long-form rows for Plotly trend chart |
No changes to existing quality data loading functions.
Extension: RFC-065 monitor companions¶
ProfileEntry may include optional fields from sibling <stem>.monitor.log and
rfc065_monitor inside <stem>.stage_truth.json (same directory as the frozen YAML).
load_profile attaches: monitor_log_path, monitor_trace_lines, monitor_trace_bytes,
rfc065_monitor. Helper profile_has_rfc065_trace tests whether the UI should surface a
trace. The Performance page adds a Monitor traces section (table + download expander) and
includes Monitor traces in the Jump row when any selected profile has a trace. See
Performance Profile Guide (live monitor during freeze).
UI Layer Changes (tools/run_compare/app.py)¶
- Import new data functions and
ProfileEntry/JoinedRelease. - Add
"Performance"to_NAV_PAGESand slug mappings. - New
_render_performance_page(...)function containing the four sections. - Sidebar additions: hostname filter, dataset filter (only shown when Performance page is active).
- Baseline dropdown extended to work with
JoinedReleaseobjects.
Dependencies¶
No new dependencies. Everything needed is already in [compare]:
streamlit— UI frameworkplotly— chartspandas— data manipulationpyyaml— YAML parsing (already a core dependency)
Frozen profile YAML files from RFC-064 (data/profiles/*.yaml) are the data
source.
Implementation Plan¶
Phase 1: Data Layer¶
- Add
ProfileEntrydataclass anddiscover_profiles()todata.py. - Add
load_profile()with YAML parsing and validation. - Add
join_releases()full outer join. - Add
profile_delta_rows()andprofile_trend_rows()helpers. - Unit tests for all new data functions.
Phase 2: Performance Page¶
- Add Performance to nav bar and slug routing.
- Implement KPI tiles section.
- Implement stage delta table.
- Implement trend chart (wall time + RSS).
- Implement quality-vs-cost scatter.
Phase 3: Filters and Warnings¶
- Hostname filter in sidebar.
- Dataset filter in sidebar.
- Cross-machine warning banner.
- Cross-dataset warning banner.
- Missing-stage gap handling in trend chart.
Testing Strategy¶
- Unit tests (
tests/unit/tools/test_run_compare_profiles.py): testdiscover_profiles,load_profile, RFC-065 companions,join_releases, delta/trend row helpers with fixture YAML files. - Manual QA: run
make run-comparewith 2+ frozen profiles indata/profiles/and verify all four sections render correctly. - No E2E Playwright tests — the run comparison tool is a developer-facing Streamlit app, not the GI/KG viewer.
Open Questions (Resolved)¶
| # | Question | Resolution |
|---|---|---|
| 1 | New tab or separate app? | New nav page in existing app — consistent UX, shared sidebar filters, single make run-compare entry point. |
| 2 | Quality metric on scatter x-axis? | User-selectable dropdown (ROUGE-L F1 default, plus success rate and avg latency). |
| 3 | Cross-machine profiles? | Hostname filter in sidebar + warning banner when mixed. |
| 4 | Per-episode or stage aggregates? | Stage aggregates only — frozen profiles are per-stage; per-episode breakdown belongs in the quality pages. |
| 5 | Missing stages in trend charts? | Gaps (no interpolation) with footnote. |
| 6 | Minimum profiles for trend chart? | 1 profile = tiles only; 2 = delta + scatter; 3+ = trend lines. |
| 7 | Cross-dataset comparison? | Dataset filter in sidebar + warning when mixed. |
References¶
- Parent RFC:
docs/rfc/RFC-064-performance-profiling-release-freeze.md - Extended Tool:
tools/run_compare/app.py(RFC-047) - Quality Data:
data/eval/runs/,data/eval/baselines/ - Performance Data:
data/profiles/*.yaml - Chart Style:
docs/guides/TUFTE_CHART_CRITIQUE.md - Tracking Issue: #511