# RFC-025: Test Metrics and Health Tracking
- Status: ✅ Completed
- Related ADRs:
  - ADR-022: Flaky Test Defense
- Authors:
- Stakeholders: Maintainers, developers, CI/CD pipeline maintainers
- Related PRDs:
  - docs/prd/PRD-001-transcript-pipeline.md (core pipeline)
- Related RFCs:
  - docs/rfc/RFC-018-test-structure-reorganization.md (test structure - foundation)
  - docs/rfc/RFC-019-e2e-test-improvements.md (E2E test infrastructure)
  - docs/rfc/RFC-020-integration-test-improvements.md (integration test improvements)
  - docs/rfc/RFC-023-readme-acceptance-tests.md (acceptance tests)
  - docs/rfc/RFC-024-test-execution-optimization.md (test execution - complementary)
  - docs/rfc/RFC-026-metrics-consumption-and-dashboards.md (metrics consumption - complementary)
- Related Documents:
  - docs/architecture/TESTING_STRATEGY.md - Overall testing strategy and test categories
  - docs/guides/DEVELOPMENT_GUIDE.md - Development workflow and testing requirements
  - .github/workflows/python-app.yml - CI test jobs
## Abstract
This RFC defines a strategy for tracking test metrics and monitoring codebase health over time. The strategy focuses on:
- Metrics collection: What to track on every test run
- CI integration: How to collect and display metrics automatically
- Trend tracking: How to monitor health over time
- Flaky test detection: How to identify and track unstable tests
Key Principle: Metrics enable improvement. Track runtime, coverage, flakiness, and trends over time to enable data-driven optimization decisions.
## System Overview
This RFC is part of a three-RFC system (RFC-024, RFC-025, RFC-026) that optimizes test execution, metrics collection, and consumption. The complete flow:

```text
Test Execution (RFC-024)
├─ PR: Fast tests (Tier 0 + Tier 1 fast)
├─ Main: All tests (Tier 0 + Tier 1 + Tier 2)
└─ Nightly: Full suite + comprehensive metrics
        ↓
Artifacts Generated (RFC-025)
├─ JUnit XML (test results, timing)
├─ Coverage reports (XML, HTML, terminal)
└─ JSON metrics (structured data)
        ↓
Consumption Methods (RFC-026)
├─ Job Summary (PR authors, 0s)
├─ metrics.json (automation, 5s)
└─ Dashboard (maintainers, 10s)
```
**See also:**
- RFC-024: Test execution optimization (pytest + markers → CI tiers)
- RFC-026: Metrics consumption (consumption methods)
## Core Principles
These principles are shared across RFC-024, RFC-025, and RFC-026:
- **Developer flow > completeness** - Fast feedback loops protect developer state and enable rapid iteration
- **Metrics must be cheap to collect** - Automated collection with zero manual work required
- **Humans consume summaries, machines consume JSON** - Job summaries for quick checks, JSON API for automation
## Problem Statement
**Current Issues:**
1. **No Systematic Metrics Tracking**
- No visibility into test runtime trends
- No tracking of slowest tests over time
- No historical coverage trends
- No flaky test identification and tracking
2. **No Data-Driven Optimization**
- Difficult to identify performance regressions
- No visibility into which tests are slowing down
- No historical context for test health
**Impact:**
- Slow tests accumulate without visibility
- No data-driven decisions about test optimization
- Difficult to identify performance regressions
- Flaky tests go undetected
## Goals
### Primary Goal
**Codebase Health Tracking:**
- Monitor key metrics on every test run
- Track trends over time (runtime, coverage, flakiness)
- Identify slow tests and performance regressions
- Enable data-driven optimization decisions
### Success Criteria
- ✅ Metrics automatically collected in CI (no manual work)
- ✅ Historical trends visible (runtime, coverage, flakiness)
- ✅ Slowest tests identified and tracked
- ✅ Flaky tests automatically detected and reported
- ✅ GitHub Actions job summaries display key metrics
## Test Metrics & Monitoring
### Metrics to Track
Track these metrics on every test run to monitor codebase health. Metrics are categorized as **collected** (facts) or **derived** (interpretations) to avoid future debates about what to track.
#### Collected Metrics (Facts)
**1. Runtime Metrics** (Ownership: Test maintainers)
- Total runtime per tier (unit / integration / e2e)
- Individual test runtime
- Top 20 slowest tests
**2. Test Health Metrics** (Ownership: Test maintainers)
- Pass/fail/skip status per test
- Total passed count
- Total failed count
- Total skipped count
- Flaky test count (tests that pass on rerun)
**3. Coverage Metrics** (Ownership: Code owners)
- Overall coverage percentage
- Coverage by module
- Uncovered lines identification
**4. Resource Usage Metrics** (Ownership: CI/infrastructure owners)
- CPU usage
- Memory usage
#### Derived Metrics (Interpretations)
**1. Performance Metrics**
- Test execution speed (tests/second) - derived from runtime and test count
- Parallel execution efficiency - derived from sequential vs parallel runtime
- Runtime trends over time - derived from historical runtime data
**2. Health Trends**
- Pass rate (passed / total) - derived from pass/fail counts
- Failure rate - derived from pass/fail counts
- Coverage trends over time - derived from historical coverage data
- Flaky test rerun rate - derived from flaky test count and total runs
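To make the split concrete, the sketch below (hypothetical names, not project code) computes the derived interpretations purely from collected facts:

```python
# Minimal sketch: derived metrics are pure functions of collected facts.
# All names here are illustrative, not part of the codebase.
from dataclasses import dataclass


@dataclass
class CollectedMetrics:
    total_runtime_s: float  # total runtime for one tier
    passed: int
    failed: int
    skipped: int
    flaky_count: int        # tests that passed only on rerun
    total_runs: int         # runs considered for flakiness


def derive(m: CollectedMetrics) -> dict:
    total = m.passed + m.failed + m.skipped
    return {
        # tests/second: derived from runtime and test count
        "tests_per_second": total / m.total_runtime_s if m.total_runtime_s else 0.0,
        # pass rate: passed / total
        "pass_rate": m.passed / total if total else 1.0,
        "failure_rate": m.failed / total if total else 0.0,
        # flaky rerun rate: flaky count over total runs
        "flaky_rerun_rate": m.flaky_count / m.total_runs if m.total_runs else 0.0,
    }
```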
### Machine-Readable Artifacts
**Minimum Set (Always Emit in CI):**
1. **JUnit XML** (`--junitxml=reports/junit.xml`)
- Pass/fail/skip status
- Test timing information
- Enables aggregation and trend analysis
2. **Coverage Reports**
- Terminal summary (`--cov-report=term-missing`)
- HTML report (`--cov-report=html`) for deep inspection
- XML report (`--cov-report=xml`) for CI integration
3. **Slowest Tests** (`--durations=20`)
- Identifies performance bottlenecks
- Enables targeted optimization
**Recommended CI Command:**
```bash
pytest -m "unit or integration" \
--durations=20 \
--junitxml=reports/junit.xml \
--cov=podcast_scraper \
--cov-report=xml \
--cov-report=term \
--cov-report=html
```

**With `pytest-json-report` (structured metrics, used in nightly runs):**

```bash
pytest \
    --json-report --json-report-file=reports/pytest.json \
    --durations=20 \
    --junitxml=reports/junit.xml \
    --cov=podcast_scraper --cov-report=xml
```

- Enables automated trend analysis
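As an illustration of the aggregation these artifacts enable, a standard-library sketch like the following could pull the headline numbers and slowest tests out of `reports/junit.xml`:

```python
# Sketch: extract headline numbers and slowest tests from reports/junit.xml.
# Uses only the standard library; paths match the CI commands above.
import xml.etree.ElementTree as ET

root = ET.parse("reports/junit.xml").getroot()
# pytest emits either <testsuite> or a <testsuites> wrapper depending on config.
suite = root if root.tag == "testsuite" else root.find("testsuite")

print("total:", suite.get("tests"))
print("failures:", suite.get("failures"))
print("skipped:", suite.get("skipped"))
print("runtime:", suite.get("time"), "s")

# Top 20 slowest tests, mirroring pytest's --durations=20 output.
cases = sorted(
    suite.iter("testcase"),
    key=lambda c: float(c.get("time", 0)),
    reverse=True,
)
for case in cases[:20]:
    print(f'{case.get("classname")}::{case.get("name")}: {case.get("time")}s')
```

The same numbers are available more directly from `reports/pytest.json` when `pytest-json-report` is enabled.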
## GitHub Actions Integration
### Pull Requests (Layer 1)
**Tier 1 Tests (Fast Confidence):**
- Run unit + integration tests
- Upload JUnit XML + coverage HTML as artifacts
- Print slowest 20 tests in logs
- Generate GitHub Actions job summary
**Job Summary Example:**
- Total tests: 250
- Passed: 248, Failed: 0, Skipped: 2
- Total runtime: 2m 15s
- Coverage: 65.3%
- Slowest test: `test_full_pipeline` (12.3s)
### Main Branch (Layer 2)
**Tier 1 + Tier 2 Tests (Full Validation):**
- Run all tests (unit + integration + E2E)
- Track flaky tests (reruns enabled)
- Generate basic metrics (JUnit XML, coverage reports)
- Upload artifacts for download
- Job summaries with key metrics
**Current Implementation:**
- `lint` job: Fast checks (1-2 min)
- `test-unit` job: Unit tests only (2-5 min)
- `test-integration` job: All integration tests with re-runs (5-10 min)
- `test-e2e` job: All E2E tests with re-runs and network guard (20-30 min)
- `docs` job: Documentation build (2-3 min)
- `build` job: Package build (1-2 min)
- All jobs run in parallel
### Nightly / Scheduled (Layer 3)
**Full Suite + Comprehensive Analysis:**
- **Complete test suite**: Everything that main branch does (lint, test-unit, test-integration, test-e2e, docs, build)
- **Comprehensive metrics collection**:
- JUnit XML for all test tiers
- Coverage XML/HTML reports
- Slowest tests identification (`--durations=20`)
- pytest-json-report for structured metrics
- **Trend tracking**:
- Append metrics to history file (CSV/JSONL)
- Store on dedicated branch or gh-pages
- Enable trend visualization
- **Reporting**:
- Generate comprehensive job summaries
- Create metrics dashboards
- Performance regression detection
- Flaky test analysis and reporting
- **Additional requirements**:
- Full artifact preservation
- Historical data aggregation
- Automated trend analysis
- Regression alerts (optional)
**Implementation Plan:**
- Scheduled workflow (nightly at 2 AM UTC)
- Runs all main branch jobs plus metrics collection
- Generates comprehensive reports
- Stores metrics for trend tracking
### GitHub Actions Job Summary
Use `$GITHUB_STEP_SUMMARY` to create automatic dashboards:
```bash
echo "## Test Results" >> $GITHUB_STEP_SUMMARY
echo "- Total: $(jq '.summary.total' reports/pytest.json)" >> $GITHUB_STEP_SUMMARY
echo "- Passed: $(jq '.summary.passed' reports/pytest.json)" >> $GITHUB_STEP_SUMMARY
echo "- Failed: $(jq '.summary.failed' reports/pytest.json)" >> $GITHUB_STEP_SUMMARY
echo "- Runtime: $(jq '.duration' reports/pytest.json)s" >> $GITHUB_STEP_SUMMARY
echo "- Coverage: $(coverage report --format=total)" >> $GITHUB_STEP_SUMMARY
```

## Trend Tracking

### Option A: CI Artifacts

**Implementation:**
- Upload JUnit XML and coverage reports as workflow artifacts
- No additional infrastructure required
**Pros:**
- Zero maintenance
- No external dependencies
- Historical data preserved
**Cons:**
- Manual comparison required
- No automatic trend visualization
### Option B: Lightweight History File (Recommended)
**Implementation:**
- Append one row per run to `metrics/history.csv` (or JSONL)
- Store on `gh-pages` branch or dedicated metrics branch
- Columns: `date`, `commit`, `tier0_runtime`, `tier1_runtime`, `tier2_runtime`, `passed`, `failed`, `skipped`, `coverage`, `flaky_count`
**Example CSV:**
```csv
date,commit,tier0_runtime,tier1_runtime,tier2_runtime,passed,failed,skipped,coverage,flaky_count
2024-12-28T19:00:00Z,abc123,2.1,33.6,0,248,0,2,65.3,0
2024-12-28T20:00:00Z,def456,2.2,34.1,0,249,0,2,65.5,0
```

**Pros:**
- Minimal maintenance
- Version-controlled history
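A hypothetical append script could look like this; column names follow the example above, and real values would come from the parsed JUnit XML and coverage reports (with the commit taken from `$GITHUB_SHA`):

```python
# Hypothetical script to append one row per run to metrics/history.csv.
import csv
from datetime import datetime, timezone
from pathlib import Path

FIELDS = [
    "date", "commit", "tier0_runtime", "tier1_runtime", "tier2_runtime",
    "passed", "failed", "skipped", "coverage", "flaky_count",
]


def append_row(path: Path, row: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # header only on first run
        writer.writerow(row)


append_row(Path("metrics/history.csv"), {
    "date": datetime.now(timezone.utc).isoformat(),
    "commit": "abc123",  # e.g. from $GITHUB_SHA
    "tier0_runtime": 2.1, "tier1_runtime": 33.6, "tier2_runtime": 0,
    "passed": 248, "failed": 0, "skipped": 2,
    "coverage": 65.3, "flaky_count": 0,
})
```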
### Option C: External Metrics Service (Future)
**Future Enhancement:**
- Integrate with metrics service (e.g., Datadog, Grafana)
- Automatic dashboards and alerts
- Advanced analytics and regression detection
**When to Consider:**
- When project scales significantly
- When multiple contributors need metrics access
- When automated alerts are needed
## Flaky Test Detection
### Definition
**Flaky Test:** A test is flaky if it:
- Fails and passes on rerun without code changes, OR
- Has < 95% pass rate over last N runs
This explicit definition enables automation and clear identification of unstable tests.
### Detection Methods
**Method 1: Rerun-on-Failure (Current)**
- Use `pytest-rerunfailures` with `--reruns 2 --reruns-delay 1`
- Tests that pass on rerun are counted as flaky
- Track flaky count in metrics
**Method 2: Quarantine Marker**
- Mark known flaky tests with `@pytest.mark.flaky`
- Track number of quarantined tests
- Monitor failure rate of quarantined tests over time
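A minimal sketch of the quarantine marker, using the `flaky` marker that `pytest-rerunfailures` provides (the test itself is illustrative):

```python
# Sketch of Method 2: quarantine a known-unstable test with the `flaky`
# marker from pytest-rerunfailures. The test name and body are illustrative.
import pytest


@pytest.mark.flaky(reruns=2, reruns_delay=1)
def test_feed_fetch_under_load():
    ...  # known timing-sensitive assertion would live here
```

The quarantined set can then be counted with `pytest -m flaky --collect-only -q`.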
**Method 3: Historical Analysis**
- Track test pass/fail history over multiple runs
- Identify tests with inconsistent results
- Flag tests with < 95% pass rate over last N runs as flaky (matches definition)
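A sketch of that historical check (the history data here is illustrative; a real implementation would aggregate it from stored JUnit XML):

```python
# Sketch: flag tests whose pass rate over the last N runs is below 95%,
# matching the definition above. History maps test id -> recent outcomes.
N = 50
THRESHOLD = 0.95

history: dict[str, list[bool]] = {
    "tests/unit/test_parser.py::test_parse_feed": [True] * 50,
    "tests/e2e/test_pipeline.py::test_full_pipeline": [True] * 46 + [False] * 4,
}  # illustrative data

for test_id, outcomes in history.items():
    recent = outcomes[-N:]
    rate = sum(recent) / len(recent)
    if rate < THRESHOLD:
        print(f"FLAKY {test_id}: pass rate {rate:.0%} over last {len(recent)} runs")
```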
### Reporting
**Include in Metrics:**
- Total flaky test count
- Flaky test names
- Flaky test failure rate
- Trend over time (improving or degrading)
## CI Layer Strategy
### Layer 1: Pull Requests (Fast Feedback)
**Purpose:** Quick validation for PRs
**What runs:**
- Fast tests only (unit + fast integration + fast e2e)
- Basic metrics (JUnit XML, coverage)
- Job summaries
**Current Status:** ✅ Implemented
### Layer 2: Main Branch (Full Validation)
**Purpose:** Complete validation on merge to main
**What runs:**
- All tests (unit + integration + e2e, including slow/ml_models)
- Basic metrics (JUnit XML, coverage)
- Artifact uploads
- Job summaries
**Current Status:** ✅ Implemented
### Layer 3: Nightly Builds (Comprehensive Analysis)
**Purpose:** Comprehensive metrics, reporting, and trend tracking
**What runs:**
- **Everything from Layer 2** (lint, test-unit, test-integration, test-e2e, docs, build)
- **Comprehensive metrics collection**:
- JUnit XML for all test tiers
- Coverage XML/HTML reports
- Slowest tests (`--durations=20`)
- pytest-json-report for structured metrics
- **Trend tracking**:
- Metrics history file (CSV/JSONL)
- Historical data aggregation
- Trend visualization
- **Reporting**:
- Comprehensive job summaries
- Metrics dashboards
- Performance regression detection
- Flaky test analysis
- **Additional requirements**:
- Full artifact preservation
- Automated trend analysis
- Regression alerts (optional)
**Current Status:** 🚧 **To Be Implemented**
## Implementation Plan
### Phase 1: Basic Metrics Collection (Layer 2 Enhancement)
**Goal:** Add basic metrics to main branch runs
- [ ] Ensure JUnit XML generation in CI
- [ ] Ensure coverage XML/HTML generation in CI
- [ ] Add `--durations=20` to CI test commands
- [ ] Upload test artifacts in GitHub Actions
**Estimated Time:** 1-2 days
**Status:** 🚧 In Progress
### Phase 2: GitHub Actions Job Summary (Layer 2 Enhancement)
**Goal:** Display key metrics in job summaries
- [ ] Create job summary script (see the sketch after this list)
- [ ] Extract metrics from JUnit XML and coverage reports
- [ ] Format and display in GitHub Actions summary
- [ ] Test on PR and main branch runs
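One possible shape for that script (hypothetical; the values would come from parsing `reports/junit.xml` and `coverage.xml`):

```python
# Hypothetical job summary script: appends Markdown to the file that
# GitHub Actions exposes through the GITHUB_STEP_SUMMARY environment variable.
import os


def write_summary(total: int, passed: int, failed: int, skipped: int,
                  runtime_s: float, coverage_pct: float) -> None:
    with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as f:
        f.write("## Test Results\n")
        f.write(f"- Total tests: {total}\n")
        f.write(f"- Passed: {passed}, Failed: {failed}, Skipped: {skipped}\n")
        f.write(f"- Total runtime: {runtime_s:.1f}s\n")
        f.write(f"- Coverage: {coverage_pct:.1f}%\n")


if __name__ == "__main__":
    # Example values mirroring the job summary shown earlier.
    write_summary(total=250, passed=248, failed=0, skipped=2,
                  runtime_s=135.0, coverage_pct=65.3)
```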
**Estimated Time:** 1-2 days
**Status:** 🚧 To Be Implemented
### Phase 3: Layer 3 - Nightly Builds (Comprehensive Analysis)
**Goal:** Implement comprehensive nightly builds with full metrics and reporting
**Tasks:**
- [ ] **Create nightly workflow**
- Scheduled trigger (nightly at 2 AM UTC)
- Runs all Layer 2 jobs plus metrics collection
- Comprehensive reporting
- [ ] **Enhanced metrics collection**
- Add `pytest-json-report` for structured metrics
- Generate comprehensive JUnit XML for all tiers
- Generate coverage XML/HTML for all tiers
- Collect slowest tests from all tiers
- [ ] **Trend tracking implementation**
- Create metrics history file (CSV/JSONL)
- Script to append metrics per run
- Store on dedicated branch or gh-pages
- Enable trend visualization
- [ ] **Comprehensive reporting**
- Generate detailed job summaries
- Create metrics dashboards
- Performance regression detection
- Flaky test analysis and reporting
- [ ] **Additional requirements**
- Full artifact preservation (extended retention)
- Historical data aggregation
- Automated trend analysis
- Regression alerts (optional)
**Estimated Time:** 3-5 days
**Status:** 🚧 To Be Implemented
### Phase 4: Enhanced Metrics (Layer 3 Enhancement)
**Goal:** Advanced metrics and analysis
- [ ] Implement flaky test detection and reporting
- [ ] Create automated trend analysis
- [ ] Set up alerts for regressions (optional; see the sketch after this phase)
- [ ] Advanced visualization dashboards
**Estimated Time:** 2-3 days
**Status:** 🚧 Future Enhancement
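As a sketch, regression detection could compare the current run against the Option B history file; the 20% tolerance below is an assumption, not project policy:

```python
# Sketch: flag a runtime regression against metrics/history.csv (Option B).
# The window and tolerance are illustrative assumptions.
import csv
import statistics


def runtime_regressed(history_csv: str, current_runtime: float,
                      window: int = 10, tolerance: float = 1.2) -> bool:
    with open(history_csv, newline="") as f:
        runtimes = [float(row["tier1_runtime"]) for row in csv.DictReader(f)]
    if len(runtimes) < window:
        return False  # not enough history to judge yet
    baseline = statistics.mean(runtimes[-window:])
    return current_runtime > baseline * tolerance


if runtime_regressed("metrics/history.csv", current_runtime=41.0):
    print("Tier 1 runtime regression detected")
```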
## Design Decisions
### 1. Metrics Collection Strategy
**Decision:** Start with artifacts, add history file later
**Rationale:**
- Artifacts: Zero maintenance, immediate value
- History file: Enables trends without external services
- Future: Can add external service if needed
### 2. Flaky Test Detection
**Decision:** Use rerun-on-failure + quarantine markers
**Rationale:**
- Rerun-on-failure: Automatic detection
- Quarantine markers: Explicit acknowledgment
- Historical analysis: Future enhancement
## Benefits
### Codebase Health
- ✅ **Visibility**: Key metrics tracked on every run
- ✅ **Trends**: Historical data enables improvement tracking
- ✅ **Regression detection**: Identify slow tests and performance issues
- ✅ **Data-driven decisions**: Metrics guide optimization efforts
### CI/CD Integration
- ✅ **Automated metrics**: No manual work required
- ✅ **Artifact preservation**: Historical data available
- ✅ **Job summaries**: Always-visible health snapshots
- ✅ **Trend analysis**: Track progress over time
## Related Files
- `.github/workflows/python-app.yml`: CI test jobs
- `docs/architecture/TESTING_STRATEGY.md`: Overall testing strategy
- `docs/guides/DEVELOPMENT_GUIDE.md`: Development workflow
- `pyproject.toml`: Pytest configuration and markers
## Current Implementation Status
### ✅ Layer 1 (PRs) - Implemented
- Fast feedback jobs (`test-fast` and `test`)
- Basic test execution
- Parallel execution for speed
### ✅ Layer 2 (Main Branch) - Implemented
- All test jobs (test-unit, test-integration, test-e2e)
- Complete validation
- Parallel execution for speed
### 🚧 Layer 3 (Nightly Builds) - To Be Implemented
- Comprehensive metrics collection
- Trend tracking
- Reporting and dashboards
- Performance regression detection
## Notes
- Metrics collection should be low-friction (automated, no manual work)
- Trend tracking can start simple (artifacts) and evolve (history file, external service)
- Flaky test detection improves over time with historical data
- Layer 3 (nightly builds) provides comprehensive analysis without slowing down PR or main branch CI
- See `RFC-024-test-execution-optimization.md` for test execution optimization strategy