RFC-064: Performance Profiling and Release Freeze Framework
- Status: Completed (v2.6.0): data/profiles/*.yaml, scripts/eval/profile/freeze_profile.py, scripts/eval/profile/diff_profiles.py, Makefile profile-freeze / profile-diff, and profile fields consumed by RFC-066 / tools/run_compare/. Refreshing profiles each release is operational hygiene, not an open design gap.
- Authors: Podcast Scraper Team
- Stakeholders: Maintainers, release owners, operators comparing resource cost across releases
- Related PRDs:
  - PRD-007: AI Quality & Experimentation Platform (benchmarking and eval context; pair with RFC-041)
  - PRD-016: Operational Observability & Pipeline Intelligence (release-time visibility; sibling to the live monitor, RFC-065)
- Related ADRs:
  - ADR-075: Frozen YAML performance profiles for release baselines (ratifies the data/profiles/ + freeze/diff pattern)
  - ADR-014: Codified Comparison Baselines
  - ADR-027: Unified Provider Metrics Contract
  - ADR-040: Explicit Golden Dataset Versioning
- Related RFCs:
  - RFC-041: Podcast ML benchmarking framework (quality side; this RFC is the resource cost sibling)
  - RFC-047: Run comparison visual tool (Streamlit shell; RFC-066 adds the Performance page on top of frozen profiles)
  - RFC-057: AutoResearch optimization loop (overnight runs benefit from per-stage resource visibility; ADR-073 closure)
  - RFC-063: Multi-feed corpus append & resume (v1 freeze assumes single-feed; multi-feed profiling is a follow-on)
  - RFC-065: Live pipeline monitor (live --monitor view; complements frozen profiles in this RFC)
  - RFC-066: Run compare, Performance tab (visual join of eval runs and data/profiles/)
- Related Documents:
  - Performance Profile Guide (capture workflow, methodology, E2E vs real RSS, troubleshooting)
  - GitHub #510 (epic / prerequisite tracking, profiles family)
- Updated: 2026-04-09 (initial), 2026-04-11 (front matter + links + Problem Statement alignment)
Abstract
This RFC introduces a performance profiling and release freeze framework for podcast_scraper — the resource cost sibling to the quality benchmarking system established in RFC-041.
Where RFC-041 answers "did output quality change?", this RFC answers "did resource cost change?" — specifically peak memory (RSS), wall time per stage, and CPU utilization. The framework builds on top of the project's existing per-episode timing infrastructure (workflow.metrics.Metrics, EpisodeStageTimings), the ProviderFingerprint environment capture, and the acceptance benchmark reporting (generate_performance_benchmark.py). It adds what those systems lack: a frozen, versioned, cross-release profile artifact that makes resource cost comparable across releases under identical conditions.
Operator documentation: Performance Profile Guide — capture workflow, methodology, E2E vs real RSS, troubleshooting.
The framework has two components:
- data/profiles/ (frozen release profiles): YAML snapshots of peak RSS, wall time, and CPU utilization per pipeline stage, committed at release time and versioned by release tag.
- scripts/eval/profile/freeze_profile.py (profile capture and diff tooling): a headless script that runs the reference dataset, captures resource metrics via psutil alongside the existing Metrics collector, and writes the frozen profile. A companion diff script compares any two profiles.
Split to sibling RFCs (not part of the frozen-profile capture contract here):
- RFC-065: live --monitor (rich / .monitor.log), optional .[monitor] (py-spy / memray); auto terminal split deferred
- RFC-066: Streamlit Performance page joining quality and data/profiles/ by release tag
Problem Statement
As podcast_scraper evolves — new providers, heavier models, additional pipeline stages (GI, KG, vector search) — resource cost can drift silently. A new release might improve summary quality while doubling peak RAM, or reduce WER while tripling Whisper wall time. Without a performance baseline system, these regressions are invisible until they cause operational problems during overnight autoresearch runs or exceed memory limits on the development machine.
RFC-041 already handles quality regression detection with frozen baselines and regression gates. This RFC applies the same discipline to resource cost.
What Exists Today
The project already has significant performance instrumentation. This RFC builds on top of it rather than replacing it:
| Existing System | What It Captures | What It Lacks |
|---|---|---|
| workflow.metrics.Metrics | Per-stage wall time, per-episode timing, device tracking, LLM token/cost | No memory (RSS), no CPU utilization, no cross-release comparison, no frozen artifact |
| EpisodeStageTimings | Per-episode breakdown: download, transcribe, extract names, cleaning, summarize | Timing only, no resource measurement, no freeze |
| ProviderFingerprint | Model name/version/hash, device, precision, git commit, library versions | ML reproducibility oriented, not machine-level resource profiling |
| generate_performance_benchmark.py | Provider-grouped timing from acceptance runs, baseline comparison | Tied to acceptance sessions, not release versions; no memory/CPU |
| profile_e2e_test_memory.py | Per-test-case memory profiling via psutil | Test-scoped, not pipeline-scoped; no artifact format |
| generate_metrics.py | JUnit/pytest timings, pipeline performance in HTML dashboards | CI/test metrics, not pipeline resource profiling |
The gap this RFC fills: None of the above systems produce a frozen, versioned artifact that captures memory + CPU + timing for the full pipeline under controlled conditions, comparable across releases. That is what data/profiles/ provides.
Goals
- Detect resource regressions (RSS memory, wall time, CPU utilization) across releases under identical conditions
- Build on existing Metrics and EpisodeStageTimings infrastructure: reuse timing data, add resource dimensions
- Reuse ProviderFingerprint for environment capture rather than inventing a new fingerprint format
- Produce a simple, YAML-first, git-committable artifact per release
- Support retroactive freeze of past releases using existing reference datasets
- Provide a terminal diff tool for quick release-over-release comparison
Non-Goals
- Live resource dashboards during a run: covered by RFC-065 (--monitor, rich / .monitor.log, optional .[monitor]); this RFC stays on frozen YAML + diff
- Streamlit multi-release charts and run/profile joins: RFC-066 (Performance page); this RFC does not define UI
- Real-time alerting or fleet-wide dashboards
- Cloud infrastructure or persistent storage
- Profiling of LLM API calls (network latency is not a local resource cost; token/cost tracking already exists in Metrics)
- Hard CI gates on resource metrics (data collection first; enforcement is optional and comes later)
- Per-process disk I/O tracking (not available via psutil on macOS; see Measurement Methodology)
Design
1. Frozen Release Profiles (data/profiles/)
Concept
A frozen profile is a YAML snapshot of resource usage captured by running the fixed reference dataset (indicator_v1 — same primary dataset used in RFC-041) through the full pipeline at a specific release. It is committed to the repo alongside the release tag and never modified after commit.
The freeze is a single entry point: make profile-freeze VERSION=v2.6.0 PIPELINE_CONFIG=config/profiles/profile_freeze.yaml. It runs headless, captures metrics, and writes the artifact.
Directory Structure
data/profiles/
  v2.5.0.yaml
  v2.4.0.yaml
  v2.3.0.yaml
  ...
  README.md
One file per release, flat structure, easy to scan and diff.
Artifact Format
The artifact combines three data sources:
- Resource metrics (new): peak RSS and CPU utilization, captured via psutil polling during the run
- Stage timings (existing): wall time per stage, sourced from workflow.metrics.Metrics and the saved metrics.json
- Environment fingerprint (existing): machine and software details, sourced from ProviderFingerprint plus additional machine-level fields
# data/profiles/v2.6.0.yaml
release: "v2.6.0"
date: "2026-04-09T14:32:00Z"
dataset_id: "indicator_v1"
episodes_processed: 20

# Environment: extends ProviderFingerprint with machine-level fields
environment:
  hostname: "marko-m4pro"
  cpu: "Apple M4 Pro"
  cpu_cores_physical: 14
  cpu_cores_logical: 14
  ram_total_gb: 48
  os: "macOS 15.3"
  python: "3.12.2"
  package_version: "2.5.0"
  git_sha: "a3f8c21"
  git_dirty: false
  device: "mps"
  precision: "fp32"
  library_versions:
    torch: "2.5.1"
    transformers: "4.46.3"
    whisper: "20231117"
    spacy: "3.8.4"

# Per-stage resource profile
# Stages match the actual pipeline architecture (episode-loop, not batch)
# Timings sourced from metrics.json; RSS/CPU from psutil polling
stages:
  rss_feed_fetch:
    wall_time_s: 2.1
    peak_rss_mb: 280
    avg_cpu_pct: 12.3
  media_download:
    wall_time_s: 18.4
    peak_rss_mb: 310
    avg_cpu_pct: 8.5
  audio_preprocessing:
    wall_time_s: 4.2
    peak_rss_mb: 312
    avg_cpu_pct: 45.2
  transcription:
    wall_time_s: 193.5
    peak_rss_mb: 3100
    avg_cpu_pct: 68.4
    note: "Includes model load + inference across all episodes"
  speaker_detection:
    wall_time_s: 8.3
    peak_rss_mb: 890
    avg_cpu_pct: 55.1
  transcript_cleaning:
    wall_time_s: 12.6
    peak_rss_mb: 680
    avg_cpu_pct: 22.4
  summarization:
    wall_time_s: 94.3
    peak_rss_mb: 2200
    avg_cpu_pct: 42.8
  gi_generation:
    wall_time_s: 45.2
    peak_rss_mb: 1800
    avg_cpu_pct: 38.5
  kg_extraction:
    wall_time_s: 32.1
    peak_rss_mb: 1650
    avg_cpu_pct: 35.2
  vector_indexing:
    wall_time_s: 3.8
    peak_rss_mb: 920
    avg_cpu_pct: 62.1

totals:
  peak_rss_mb: 3100
  wall_time_s: 414.5
  avg_wall_time_per_episode_s: 20.7
Stage Model: Aligned with the Actual Pipeline
The pipeline processes episodes in a loop (not in batch-per-stage). The stage model in the profile reflects this reality. Each stage aggregates timing across all episodes:
- rss_feed_fetch: maps to Metrics.time_scraping + time_parsing
- media_download: aggregated from Metrics.download_media_time_by_episode
- audio_preprocessing: aggregated from Metrics.preprocessing_times
- transcription: aggregated from Metrics.transcribe_time_by_episode; includes model load time (first episode) and inference (all episodes), because the model stays loaded across the episode loop
- speaker_detection: aggregated from Metrics.extract_names_time_by_episode
- transcript_cleaning: aggregated from Metrics.cleaning_time_by_episode
- summarization: aggregated from Metrics.summarize_time_by_episode
- gi_generation: aggregated from Metrics.gi_times
- kg_extraction: aggregated from Metrics.kg_times
- vector_indexing: wall time of maybe_index_corpus() (not currently tracked in Metrics; the freeze script measures this directly)
Wall time per stage comes from the saved metrics.json file (written by the pipeline to the output directory on every run). The freeze script adds the resource dimensions (RSS, CPU) by running a psutil polling thread alongside the pipeline and correlating samples to stages using timing boundaries.
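The correlation step can be sketched as follows. This is a minimal illustration, not the freeze script's actual code: it assumes each stage can be reduced to a (start, end) offset in seconds from the start of the run, which this RFC calls "timing boundaries" but does not specify. The names Sample and summarize_stage are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One polling sample (hypothetical shape)."""
    t: float        # seconds since the start of the measured run
    rss_mb: float   # resident set size at sample time
    cpu_pct: float  # process CPU percent since the previous sample


def summarize_stage(samples, start_s, end_s):
    """Return peak RSS and mean CPU for samples with start_s <= t < end_s."""
    window = [s for s in samples if start_s <= s.t < end_s]
    if not window:
        # Stage shorter than the polling interval: nothing to report.
        return None
    return {
        "peak_rss_mb": max(s.rss_mb for s in window),
        "avg_cpu_pct": round(mean(s.cpu_pct for s in window), 1),
    }


# Illustrative data only; real boundaries would come from metrics.json.
samples = [
    Sample(0.0, 280, 10.0),
    Sample(1.0, 300, 14.0),
    Sample(2.0, 3100, 70.0),
    Sample(3.0, 2900, 66.0),
]
windows = {"rss_feed_fetch": (0.0, 2.0), "transcription": (2.0, 4.0)}
profile = {name: summarize_stage(samples, a, b) for name, (a, b) in windows.items()}
print(profile)
```

A stage that finishes between two polls yields no samples, which is why the helper returns None instead of inventing a value; with a 1-second interval this mostly affects very short stages.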
Stages that are disabled in the reference config (e.g., if GI/KG is off) are omitted from the profile. The profile records what actually ran.
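A sketch of how per-episode timings might be folded into stage wall times, dropping stages that did not run. It assumes metrics.json uses the same key names as the Metrics attributes listed above and stores per-episode values either as a list or as a mapping keyed by episode; both are assumptions, and STAGE_SOURCES and stage_wall_times are hypothetical names. rss_feed_fetch and vector_indexing are left out because they are not per-episode aggregates.

```python
import json

# Profile stage name -> assumed metrics.json key (mirrors the Metrics attributes).
STAGE_SOURCES = {
    "media_download": "download_media_time_by_episode",
    "audio_preprocessing": "preprocessing_times",
    "transcription": "transcribe_time_by_episode",
    "speaker_detection": "extract_names_time_by_episode",
    "transcript_cleaning": "cleaning_time_by_episode",
    "summarization": "summarize_time_by_episode",
    "gi_generation": "gi_times",
    "kg_extraction": "kg_times",
}


def stage_wall_times(metrics: dict) -> dict:
    """Sum per-episode timings per stage; omit stages with no recorded data."""
    stages = {}
    for stage, key in STAGE_SOURCES.items():
        values = metrics.get(key)
        if not values:
            continue  # stage disabled or absent in this run
        per_episode = values.values() if isinstance(values, dict) else values
        stages[stage] = round(sum(per_episode), 1)
    return stages


# Illustrative payload; a real one would be read from the run's output directory.
raw = """{
  "transcribe_time_by_episode": {"ep1": 100.0, "ep2": 93.5},
  "summarize_time_by_episode": [50.0, 44.3],
  "gi_times": []
}"""
wall_times = stage_wall_times(json.loads(raw))
print(wall_times)
```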
Note on totals: totals.wall_time_s is the end-to-end wall clock time of the pipeline run, not the sum of stage wall times. Stage times can overlap (e.g., concurrent download and transcription threads) or have gaps (setup, cleanup, thread synchronization), so the total may differ from the sum of stages.
Environment Fingerprint: Reusing ProviderFingerprint
The environment block reuses ProviderFingerprint fields (git commit, library versions, device, precision) and adds machine-level fields that ProviderFingerprint does not capture (hostname, CPU model, core count, total RAM, OS version). The freeze script calls generate_provider_fingerprint() and merges the result with platform module data.
All profiles captured on the same machine are directly comparable. Profiles from different machines are valid for trend analysis within that machine's history but should not be cross-compared for absolute values.
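The merge described above might look like the following sketch. The fingerprint is represented by a plain dict because the return type of generate_provider_fingerprint() is not shown in this RFC; machine_fields and build_environment are hypothetical names, and the psutil-backed fields are skipped when psutil is not installed.

```python
import os
import platform
import socket


def machine_fields() -> dict:
    """Collect machine-level fields that the provider fingerprint does not carry."""
    fields = {
        "hostname": socket.gethostname(),
        "cpu": platform.processor() or platform.machine(),
        "cpu_cores_logical": os.cpu_count(),
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
    }
    try:
        import psutil  # optional here; physical cores and RAM need it
    except ImportError:
        return fields
    fields["cpu_cores_physical"] = psutil.cpu_count(logical=False)
    fields["ram_total_gb"] = round(psutil.virtual_memory().total / 1024**3)
    return fields


def build_environment(fingerprint: dict) -> dict:
    """Merge fingerprint fields with machine fields into one environment block."""
    return {**fingerprint, **machine_fields()}


# Stand-in for the dict form of a provider fingerprint (illustrative values).
environment = build_environment({"git_sha": "a3f8c21", "device": "mps", "precision": "fp32"})
print(sorted(environment))
```

The values reported by platform are coarser than the example artifact (for instance, platform.release() returns a kernel release, not a marketing OS version), so a real implementation would likely need platform-specific lookups for fields such as cpu and os.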
Retroactive Freeze
Past releases can be frozen retroactively by checking out the release tag and running the freeze script against the reference dataset:
for tag in v2.3.0 v2.4.0 v2.5.0; do
  git checkout "$tag"
  make profile-freeze VERSION="$tag"
done
git checkout main
The artifact is honest about its provenance — the environment fingerprint tells the full story.
2. Freeze Script and Diff Tool
2a. Freeze Script (scripts/eval/profile/freeze_profile.py)
Single entry point for profile capture. Runs headless.
make profile-freeze VERSION=v2.6.0 PIPELINE_CONFIG=config/profiles/profile_freeze.yaml
Equivalent to:
.venv/bin/python3 scripts/eval/profile/freeze_profile.py \
  --version v2.6.0 \
  --pipeline-config config/profiles/profile_freeze.yaml \
  --dataset-id indicator_v1 \
  --output data/profiles/v2.6.0.yaml
Pipeline YAML files live under config/profiles/ (alongside data/profiles/ outputs). That keeps capture configs separate from quality evaluation configs under data/eval/ / config/eval/ — profiling and eval are parallel tracks (see Relationship to RFC-041).
For repeatable, offline-friendly RSS, use the same placeholder URL pattern as
config/acceptance/fragments/feeds_single.yaml / MAIN_ACCEPTANCE_CONFIG.yaml (example.invalid /
e2e-placeholder): freeze_profile.py then starts E2EHTTPServer and points
the config at fixture feed podcast1_mtb by default (override via --e2e-feed
or E2E_FEED= in the Makefile). Cloud LLM API calls are still real unless you
point provider bases at the mock server separately.
What the script does:
- Loads a podcast_scraper Config from the pipeline YAML (copy from config/profiles/profile_freeze.example.yaml; you may keep multiple configs there for different machines or scenarios)
- Calls generate_provider_fingerprint() and collects machine-level metadata via platform and psutil
- Starts a psutil polling thread (1-second interval) that records RSS and CPU% for the process
- Runs a warm-up pass (see Measurement Methodology)
- Calls run_pipeline(cfg), the real pipeline rather than a mock
- On pipeline completion, reads the saved metrics.json from the output directory for per-stage wall times (the pipeline writes this file automatically via Metrics.save_to_file())
- Correlates psutil samples to stages using the timing boundaries from metrics.json
- Writes the YAML artifact to data/profiles/<version>.yaml
Accessing pipeline metrics: run_pipeline() returns Tuple[int, str] (count and summary), not the Metrics object directly. The freeze script reads timing data from the metrics.json file that the pipeline writes to the output directory on every run. This is the same file used by the dashboard and analysis scripts — no new data path is needed.
The polling thread is lightweight (one psutil.Process call per second using oneshot() for efficiency). Its overhead on timing is negligible — well under 1% of wall time for any stage longer than a few seconds.
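A minimal sketch of such a polling thread, under stated assumptions. ResourcePoller and psutil_reader are hypothetical names, not the freeze script's API. The sampler is injected as a callable so the thread logic can be exercised without psutil; psutil_reader shows how the real reader could be built with Process.oneshot(), memory_info() and cpu_percent(interval=None).

```python
import threading
import time


class ResourcePoller:
    """Background thread that records (elapsed_s, rss_mb, cpu_pct) samples."""

    def __init__(self, read_sample, interval_s=1.0):
        self._read_sample = read_sample      # callable returning (rss_mb, cpu_pct)
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples = []

    def _run(self):
        start = time.monotonic()
        while not self._stop.is_set():
            rss_mb, cpu_pct = self._read_sample()
            self.samples.append((time.monotonic() - start, rss_mb, cpu_pct))
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self._interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False


def psutil_reader():
    """Build a reader for the current process; imports psutil lazily."""
    import psutil

    proc = psutil.Process()
    proc.cpu_percent(interval=None)  # priming call: the first value is not meaningful

    def read():
        with proc.oneshot():  # batch the underlying system calls
            rss_mb = proc.memory_info().rss / (1024 * 1024)
            cpu_pct = proc.cpu_percent(interval=None)
        return rss_mb, cpu_pct

    return read


# Intended use (requires psutil):
#     with ResourcePoller(psutil_reader()) as poller:
#         run_pipeline(cfg)
#     samples = poller.samples
```

Using threading.Event.wait for the delay lets the context manager stop the thread promptly at pipeline completion instead of waiting out a full polling interval.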
2b. Profile Diff (scripts/eval/profile/diff_profiles.py)
make profile-diff FROM=v2.5.0 TO=v2.6.0
Prints a structured terminal diff of two YAML profiles:
Profile diff: v2.5.0 → v2.6.0 (indicator_v1, marko-m4pro)
Stage          Metric        v2.5.0   v2.6.0   Δ
────────────────────────────────────────────────────
transcription  peak_rss_mb   2980     3100     +4.0%
transcription  wall_time_s   201.2    193.5    -3.8%
summarization  peak_rss_mb   2200     2200     +0.0%
summarization  wall_time_s   94.3     94.3     +0.0%
totals         peak_rss_mb   3050     3100     +1.6%
totals         wall_time_s   398.2    414.5    +4.1%
The diff tool is a pure YAML comparison — no dependencies beyond PyYAML and rich (for terminal formatting). It reads two profile files, computes deltas, and prints a table.
If data/profiles/regression_rules.yaml exists, the diff tool annotates rows that exceed thresholds. If the rules file does not exist, the diff still works — it just shows raw deltas without annotations.
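The delta computation behind that table can be sketched as below. This is an illustration, not the shipped tool: it works on already-parsed dicts (in practice loaded with yaml.safe_load), prints with plain print instead of rich, and the names pct_delta and diff_profiles are hypothetical. The sample values are taken from the diff shown above.

```python
def pct_delta(old, new):
    """Percentage change from old to new, or None when old is zero."""
    if old == 0:
        return None
    return round((new - old) / old * 100, 1)


def diff_profiles(old, new, metrics=("peak_rss_mb", "wall_time_s")):
    """Return (section, metric, old_value, new_value, delta_pct) rows.

    Only stages present in both profiles are compared; totals come last.
    """
    shared = [s for s in new.get("stages", {}) if s in old.get("stages", {})]
    sections = [(s, old["stages"][s], new["stages"][s]) for s in shared]
    sections.append(("totals", old.get("totals", {}), new.get("totals", {})))
    rows = []
    for name, a, b in sections:
        for metric in metrics:
            if metric in a and metric in b:
                rows.append((name, metric, a[metric], b[metric],
                             pct_delta(a[metric], b[metric])))
    return rows


old_profile = {
    "stages": {"transcription": {"peak_rss_mb": 2980, "wall_time_s": 201.2}},
    "totals": {"peak_rss_mb": 3050, "wall_time_s": 398.2},
}
new_profile = {
    "stages": {"transcription": {"peak_rss_mb": 3100, "wall_time_s": 193.5}},
    "totals": {"peak_rss_mb": 3100, "wall_time_s": 414.5},
}

rows = diff_profiles(old_profile, new_profile)
for stage, metric, a, b, delta in rows:
    delta_txt = "n/a" if delta is None else f"{delta:+.1f}%"
    print(f"{stage:<15}{metric:<14}{a:<9}{b:<9}{delta_txt}")
```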
3. Regression Rules (Optional)
Regression rules are not shipped in v1. The first priority is data collection — freeze profiles for several releases and observe natural variance before setting thresholds.
When ready, rules can be added to data/profiles/regression_rules.yaml:
# data/profiles/regression_rules.yaml
# Optional: the diff tool works without this file
thresholds:
  peak_rss_mb:
    max_delta_pct: 20
    severity: warning
  wall_time_s:
    max_delta_pct: 15
    severity: warning
  avg_wall_time_per_episode_s:
    max_delta_pct: 15
    severity: warning
stages:
  transcription:
    peak_rss_mb:
      max_delta_pct: 10
      severity: warning
The diff tool reads this file if present and annotates violations. Thresholds are always advisory (warnings) — never hard gates. Enforcement can be added in a future phase once baseline history demonstrates what "normal" variance looks like on the target machine.
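One way the annotation could work, as a sketch only. Two behaviors here are assumptions that the RFC does not state: a stage-specific rule takes precedence over the global threshold for the same metric, and only increases count as violations. The function name annotate is hypothetical, and the rules dict mirrors the YAML above after parsing.

```python
def annotate(rows, rules):
    """Attach a severity to each (stage, metric, delta_pct) row, or None.

    Assumptions: stage rules override global thresholds; only increases
    beyond max_delta_pct are flagged.
    """
    annotated = []
    for stage, metric, delta in rows:
        rule = (rules.get("stages", {}).get(stage, {}).get(metric)
                or rules.get("thresholds", {}).get(metric))
        flagged = (rule is not None and delta is not None
                   and delta > rule["max_delta_pct"])
        annotated.append((stage, metric, delta, rule["severity"] if flagged else None))
    return annotated


rules = {
    "thresholds": {
        "peak_rss_mb": {"max_delta_pct": 20, "severity": "warning"},
        "wall_time_s": {"max_delta_pct": 15, "severity": "warning"},
    },
    "stages": {
        "transcription": {"peak_rss_mb": {"max_delta_pct": 10, "severity": "warning"}},
    },
}
# Illustrative deltas, not measured values.
deltas = [
    ("transcription", "peak_rss_mb", 12.0),   # above the stage rule (10)
    ("summarization", "peak_rss_mb", 12.0),   # below the global rule (20)
    ("totals", "wall_time_s", 16.0),          # above the global rule (15)
    ("transcription", "wall_time_s", -30.0),  # improvement, never flagged
]
annotated = annotate(deltas, rules)
for row in annotated:
    print(row)
```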
Why deferred: Without 3-5 release profiles to establish natural variance, any threshold is arbitrary. Wall time on a developer laptop varies 5-15% between runs due to background processes, thermal throttling, and memory pressure. Setting thresholds before understanding this variance risks false positives that erode trust in the system.
4. Measurement Methodology
Resource profiling on a developer machine is inherently noisy. This section documents the methodology to maximize comparability.
Capture Conditions
- Close resource-heavy applications before running make profile-freeze (browsers, IDEs, Docker containers not needed for the run). The profile artifact does not enforce this; it is a discipline documented in data/profiles/README.md.
- Run on AC power (laptops throttle CPU on battery).
- Single run per freeze in v1. Averaging multiple runs is a future enhancement. A single run under controlled conditions is sufficient for detecting large regressions (>15-20% delta).
Warm-Up
The freeze script performs a warm-up pass before the measured run. This ensures model weights are loaded into memory, filesystem caches are warm, and MPS shaders are compiled. The warm-up works as follows:
- The freeze script runs run_pipeline(cfg) once with max_episodes=1 (overriding the reference config)
- The warm-up run's output and metrics are written to a temporary directory and discarded
- The measured run then processes the full reference dataset (20 episodes for indicator_v1) with the psutil polling thread active
This two-pass approach works because run_pipeline() is a standalone function call — the freeze script simply calls it twice with different configs. Model weights loaded during warm-up remain in memory (the freeze script does not unload them between passes).
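The two-pass flow can be sketched as follows. The Config dataclass here is a hypothetical stand-in with only the two fields the sketch needs; the real podcast_scraper Config may be built and overridden differently. freeze_run is also a hypothetical name, and the pipeline is passed in as a callable so the control flow can be shown without the real run_pipeline.

```python
import tempfile
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for the pipeline config."""
    output_dir: str
    max_episodes: int


def freeze_run(cfg, run_pipeline):
    """Run a discarded one-episode warm-up, then the measured full run."""
    with tempfile.TemporaryDirectory() as scratch:
        warm_cfg = replace(cfg, output_dir=scratch, max_episodes=1)
        run_pipeline(warm_cfg)  # output and metrics vanish with the scratch dir
    # Models loaded during warm-up stay resident in this process.
    return run_pipeline(cfg)


calls = []


def fake_pipeline(cfg):
    """Test double that returns (count, summary) like run_pipeline is described to."""
    calls.append(cfg.max_episodes)
    return cfg.max_episodes, "ok"


result = freeze_run(Config(output_dir="out/profile", max_episodes=20), fake_pipeline)
print(calls, result)
```

In the real script the polling thread would wrap only the second call, so warm-up samples never reach the artifact.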
What RSS Measures (and Does Not Measure)
psutil.Process.memory_info().rss captures the process's resident set size — physical memory pages currently in RAM. This is the correct metric for "how much RAM does this pipeline need?"
Known limitation — MPS (Metal Performance Shaders): On Apple Silicon with device=mps, GPU memory allocations for model weights and intermediate tensors are not reflected in RSS. They appear in the system's "wired" memory but not in the process's RSS. This means:
- Whisper and summarization model weights loaded on MPS are partially invisible to RSS measurement
- The RSS value underestimates true memory pressure on MPS
- For accurate GPU memory tracking on macOS, sudo powermetrics or Instruments would be needed; this is out of scope for v1
The profile artifact documents the device (environment.device: "mps") so consumers know that RSS on MPS is a lower bound. On device=cpu, RSS is accurate. On device=cuda, torch.cuda.memory_allocated() could supplement RSS — a future enhancement.
CPU Measurement
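CPU utilization is sampled with psutil.Process.cpu_percent(interval=None) at 1-second intervals and averaged per stage. With interval=None the call does not block and reports utilization since the previous call, so the very first call returns a meaningless 0.0 and should be discarded. Values can exceed 100% when the process runs threads on several cores. On Apple Silicon with efficiency/performance cores, the figure reflects the blend of core types used. It is useful for trend comparison (did CPU utilization increase?) but not for absolute core-pinning analysis.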
Disk I/O: Not Available on macOS
psutil.Process.io_counters() is not supported on macOS — the macOS kernel does not expose per-process I/O statistics to userspace. This is a platform limitation, not a psutil bug.
Consequently, per-stage disk I/O fields (disk_read_mb, disk_write_mb) are not included in profiles captured on macOS. On Linux or Windows, where io_counters() is available, the freeze script will include these fields automatically.
System-wide disk I/O (psutil.disk_io_counters()) is available on macOS but cannot be attributed to the pipeline process specifically, so it is not used.
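A sketch of how the freeze script could include disk fields only where the platform supports them. psutil does not define Process.io_counters on platforms without support, so an attribute check is enough. The helper name disk_io_fields and the fake process classes are hypothetical; the fakes stand in for psutil.Process so the example runs anywhere. The counters are cumulative since process start, so per-stage values would need a difference between two readings.

```python
from collections import namedtuple

MB = 1024 * 1024

# Stand-in for the tuple psutil returns; only the two byte counters are used.
IoSample = namedtuple("IoSample", "read_bytes write_bytes")


def disk_io_fields(proc) -> dict:
    """Return disk_read_mb / disk_write_mb if proc exposes io_counters(), else {}."""
    io_counters = getattr(proc, "io_counters", None)
    if io_counters is None:
        return {}  # e.g. macOS: omit the fields entirely
    counters = io_counters()
    return {
        "disk_read_mb": round(counters.read_bytes / MB, 1),
        "disk_write_mb": round(counters.write_bytes / MB, 1),
    }


class FakeLinuxProcess:
    """Mimics a process object on a platform with io_counters support."""
    def io_counters(self):
        return IoSample(read_bytes=10 * MB, write_bytes=5 * MB // 2)


class FakeMacProcess:
    """Mimics a process object without io_counters."""


linux_fields = disk_io_fields(FakeLinuxProcess())
mac_fields = disk_io_fields(FakeMacProcess())
print(linux_fields, mac_fields)
```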
Relationship to Acceptance Benchmarks
The acceptance test infrastructure (scripts/acceptance/generate_performance_benchmark.py) produces performance reports grouped by provider/model configuration. These reports serve a different purpose than frozen profiles:
| Concern | Acceptance Benchmarks | Frozen Profiles (this RFC) |
|---|---|---|
| Purpose | Compare providers/models within a single run | Compare the same config across releases |
| Trigger | After acceptance test sessions | At release time |
| Metrics | Wall time per provider, pass/fail rates | RSS, CPU, wall time per stage |
| Grouping | By provider/model | By pipeline stage |
| Artifact | Session-scoped JSON/markdown report | Release-versioned YAML committed to git |
| Baseline | Optional session-to-session comparison | Release-to-release comparison |
The two systems are complementary. Acceptance benchmarks answer "which provider is faster for this run?" Frozen profiles answer "is this release heavier than the last one?"
A future enhancement could feed acceptance run timing data into the profile format for richer cross-validation, but v1 keeps them independent.
Relationship to RFC-041
| Concern | RFC-041 | RFC-064 |
|---|---|---|
| What it measures | Output quality (WER, ROUGE, gates) | Resource cost (RAM, CPU, wall time) |
| Freeze artifact | data/eval/baselines/<id>/ | data/profiles/<version>.yaml |
| Freeze trigger | Baseline promotion workflow | make profile-freeze VERSION=x |
| Reference dataset | indicator_v1 (primary), shortwave_v1 (secondary) | indicator_v1 (primary dataset only for consistency) |
| Fingerprinting | fingerprint.json per run | environment: block reusing ProviderFingerprint |
| Visual comparison | RFC-047 quality tab | Terminal diff (make profile-diff); RFC-066 Performance page |
The two systems are intentionally separate runs. Profiling overhead (psutil polling) is negligible, but quality evaluation concerns (scoring, regression gates) are orthogonal to resource measurement. They share the same reference episodes and fingerprinting conventions but execute independently.
Why only indicator_v1: RFC-041 uses both indicator_v1 (clean baseline) and shortwave_v1 (medium noise) for quality evaluation because noise characteristics affect quality metrics differently. For resource profiling, the dataset content matters less — what matters is consistent episode count, audio duration, and pipeline config. Using a single dataset simplifies the profile artifact and avoids doubling the freeze time. If a second dataset proves valuable for resource profiling, it can be added later.
Directory Structure
config/profiles/
  README.md
  profile_freeze.example.yaml   # Copy to e.g. profile_freeze.yaml (not necessarily committed)
  freeze/*.yaml                 # E2E mock RSS presets (merged with freeze/_defaults.yaml)
data/profiles/
  v2.3.0.yaml
  v2.4.0.yaml
  v2.5.0.yaml
  README.md
scripts/eval/profile/
  freeze_profile.py             # Profile capture script
  diff_profiles.py              # Profile diff tool
No new src/ modules are introduced in v1. The freeze script is a standalone script in scripts/eval/profile/ that imports from podcast_scraper.workflow and podcast_scraper.evaluation.fingerprint. The psutil polling logic lives in the freeze script itself; it does not need to be a reusable library module until the live monitor (RFC-065) requires it.
Dependencies
No New Python Packages
- psutil: already in [dev] dependencies (psutil>=5.9.0,<8.0.0)
- rich: already a core dependency (used for CLI output formatting)
- PyYAML: already a core dependency
No new optional dependency group is needed for v1.
No New Infrastructure
All artifacts are local YAML files committed to git. No backend services, no database, no cloud storage.
Implementation Phases
Phase 0: Frozen profiles + freeze script (this RFC, v1 scope)
- data/profiles/ directory and README.md (capture conditions, methodology)
- scripts/eval/profile/freeze_profile.py: headless profile capture with warm-up pass
- scripts/eval/profile/diff_profiles.py: terminal diff tool
- make profile-freeze and make profile-diff Makefile targets
- Add vector_indexing timing to Metrics (currently not tracked; needed for complete stage coverage)
- Retroactive freeze of recent releases
- Integration with future release guide (see Forward References)
Phase 1: Regression rules (after baseline history exists)
- data/profiles/regression_rules.yaml with thresholds derived from observed variance
- Diff tool annotates threshold violations
- Still advisory — no hard CI gates
Sibling deliverables (split from this RFC)
- RFC-065 (Live pipeline monitor): --monitor, rich or .monitor.log, optional py-spy (f) and memray via .[monitor]; automatic terminal split deferred
- RFC-066 (Run compare, Performance tab): Streamlit Performance page joining quality and resource metrics by release tag
Forward References
Release Guide
This RFC assumes a release guide will be created that documents the full release checklist. The profile freeze step (make profile-freeze) is one part of that checklist, alongside:
- Full CI pass (make ci)
- Quality evaluation reports in data/eval/
- Performance profiles in data/profiles/
- Release notes draft (make release-docs-prep)
The release guide is a separate deliverable — not part of this RFC. This RFC defines the profiling artifact and tooling; the release guide defines when and how to use them.
Multi-Feed Profiling (RFC-063)
The freeze script in v1 assumes single-feed pipeline execution (run_pipeline() called once with the reference dataset). When RFC-063 ships, multi-feed runs will call run_pipeline() N times sequentially, with potential model reloads between feeds and different resource profiles per feed. Profiling multi-feed runs — per-feed profiles, aggregate corpus-level metrics, model reload overhead — is a natural extension but is out of scope for v1. The single-feed freeze provides the baseline; multi-feed profiling can build on the same artifact format with a feed_id dimension added later.
Key Decisions
Build on existing infrastructure, don't replace it. The freeze script reads timing data from the pipeline's metrics.json output and environment data from ProviderFingerprint. It adds resource dimensions (RSS, CPU) via psutil polling. No existing code is modified — the freeze script is additive.
Stage model matches the episode-loop architecture. The pipeline processes episodes sequentially (or in parallel batches), not in batch-per-stage. Stages like "transcription" represent the aggregate of all episode transcriptions, including model load on first use. The profile documents this clearly.
All pipeline stages are represented. The stage list includes media download, audio preprocessing, transcription, speaker detection, transcript cleaning, summarization, GI generation, KG extraction, and vector indexing — matching the full set of per-episode operations tracked in Metrics. Stages that are disabled in the reference config are omitted from the profile.
Single entry point, headless. make profile-freeze is the only way to create a profile in v1. No interactive mode, no CLI flags on the main podcast_scraper command. This keeps the scope minimal and the artifact format unambiguous.
Metrics accessed via metrics.json, not in-process. The freeze script reads timing data from the pipeline's saved metrics.json file rather than accessing the Metrics object in-process. This avoids modifying run_pipeline() to return the metrics object and reuses the existing data path used by dashboards and analysis scripts.
No thresholds in v1. Data collection first. Thresholds are meaningless without baseline history showing natural variance. The diff tool works without regression_rules.yaml — it shows raw deltas. Rules are added when the data justifies them.
Separate runs for quality and performance. Mixing quality evaluation and resource profiling in one run would complicate both. They share reference datasets but execute independently.
RSS underestimates on MPS — documented, not solved. GPU memory on Apple Silicon is not visible to psutil. The profile documents the device so consumers know RSS is a lower bound on MPS. Solving this requires platform-specific tooling (Instruments, powermetrics) that belongs in the live monitor RFC, not here.
No per-process disk I/O on macOS. psutil.Process.io_counters() is not supported on macOS. Disk I/O fields are included only when the platform supports them (Linux, Windows). This is a platform limitation, not a design choice.
Conclusion
RFC-064 adds the missing resource cost dimension to the existing evaluation framework. It builds on the project's existing timing infrastructure (Metrics, EpisodeStageTimings), environment fingerprinting (ProviderFingerprint), and reference datasets (RFC-041) rather than creating parallel systems. The result is a frozen, versioned profile artifact per release that makes resource regressions visible — the first step toward a complete release-time picture of quality and performance.