Skip to content

RFC-027: Pipeline Metrics Improvements

  • Status: Draft
  • Authors:
  • Stakeholders: Maintainers, developers, performance analysts
  • Related Issues:
  • GitHub Issue #120 - Implementation tracking
  • Related PRDs:
  • docs/prd/PRD-001-transcript-pipeline.md (core pipeline)
  • Related RFCs:
  • docs/rfc/RFC-025-test-metrics-and-health-tracking.md (test metrics - different domain)
  • docs/rfc/RFC-026-metrics-consumption-and-dashboards.md (metrics consumption - different domain)
  • Related Documents:
  • src/podcast_scraper/metrics.py - Current metrics implementation
  • src/podcast_scraper/workflow/orchestration.py - Pipeline orchestration
  • docs/wip/METRICS_REVIEW.md - Original analysis (to be deleted after RFC creation)

Abstract

This RFC defines improvements to the pipeline metrics system that tracks application performance during podcast scraping operations. The current system provides comprehensive metrics but has issues with duplicate output, missing metrics, inconsistent formatting, and lack of export capabilities.

Key Principle: Pipeline metrics should provide clear, actionable insights without cluttering logs, and should be exportable for analysis and monitoring.

Problem Statement

Current Issues:

  1. Duplicate Metric Output
  2. Both log_metrics() and _generate_pipeline_summary() output metrics
  3. Creates redundant information in logs
  4. log_metrics() is too verbose for normal use (all metrics at INFO level)

  5. Missing Metrics

  6. No separate tracking of RSS fetch time (combined with scraping)
  7. No model loading times (Whisper, summarization models)
  8. No cache hit/miss rates for speaker detection
  9. No parallel processing efficiency metrics
  10. No memory usage tracking

  11. Inconsistent Formatting

  12. log_metrics() uses "Title Case" (e.g., "Run Duration Seconds")
  13. _generate_pipeline_summary() uses lowercase with dashes (e.g., "transcripts_saved")
  14. Makes comparison and parsing difficult

  15. No Export Capabilities

  16. Metrics only available in logs
  17. No JSON/CSV export for programmatic analysis
  18. No integration with monitoring systems (Prometheus, etc.)

Impact:

  • Logs are cluttered with duplicate metric information
  • Difficult to analyze performance trends over time
  • Missing visibility into important performance bottlenecks (model loading, cache efficiency)
  • No way to integrate metrics into monitoring dashboards

Current Implementation

Metrics Collected

Per-Run Metrics:

  • run_duration_seconds: Total pipeline execution time
  • episodes_scraped_total: Total episodes found in RSS feed
  • episodes_skipped_total: Episodes skipped (e.g., due to --skip-existing)
  • errors_total: Total errors encountered
  • bytes_downloaded_total: Total bytes downloaded

Processing Statistics:

  • transcripts_downloaded: Direct transcript downloads
  • transcripts_transcribed: Whisper transcriptions performed
  • episodes_summarized: Episodes with summaries generated
  • metadata_files_generated: Metadata files created

Per-Stage Timing:

  • time_scraping: Time spent scraping RSS feed
  • time_parsing: Time spent parsing RSS feed
  • time_normalizing: Time spent normalizing episode data
  • time_writing_storage: Time spent writing files

Per-Episode Operation Timing:

  • download_media_times: List of media download times per episode
  • transcribe_times: List of transcription times per episode
  • extract_names_times: List of speaker detection times per episode
  • summarize_times: List of summary generation times per episode

Collection Points

  1. Media Download (episode_processor.py:267)
  2. Records time when media is downloaded for transcription
  3. Only records if dl_elapsed > 0 and pipeline_metrics is available

  4. Transcription (episode_processor.py:430)

  5. Records Whisper transcription time
  6. Only records if pipeline_metrics is available

  7. Speaker Detection (workflow.py:580)

  8. Records time spent extracting speaker names
  9. Only records if pipeline_metrics is available

  10. Summary Generation (metadata.py:778)

  11. Records time spent generating summaries
  12. Only records if summary_elapsed > 0 and pipeline_metrics is available

Current Reporting

Two Separate Outputs:

  1. pipeline_metrics.log_metrics() (metrics.py:158)
  2. Logs ALL metrics from finish() dictionary
  3. Format: "Pipeline finished:" with all metrics listed
  4. Called in workflow.py:537 before summary generation
  5. Uses INFO level logging

  6. _generate_pipeline_summary() (workflow.py:1759)

  7. Generates a selective summary with key metrics
  8. Includes: counts, averages, errors, skipped episodes
  9. Returns formatted string that is logged by cli.py:718
  10. Uses different formatting (lowercase with dashes)

Goals

Primary Goals

  1. Eliminate Duplicate Output
  2. Single source of truth for metric reporting
  3. Clear separation between detailed and summary metrics
  4. Configurable verbosity

  5. Add Missing Metrics

  6. RSS fetch time (separate from parsing)
  7. Model loading times (Whisper, summarization)
  8. Cache hit rates (speaker detection)
  9. Parallel processing efficiency
  10. Memory usage (peak memory)

  11. Standardize Formatting

  12. Consistent key naming (snake_case)
  13. Consistent units (seconds, bytes, counts)
  14. Consistent number formatting (decimals, rounding)

  15. Enable Export

  16. JSON export for programmatic analysis
  17. CSV export for spreadsheet analysis
  18. Optional integration with monitoring systems

Success Criteria

  • ✅ No duplicate metric output in logs
  • ✅ Detailed metrics available at DEBUG level
  • ✅ Summary metrics at INFO level
  • ✅ All recommended metrics tracked
  • ✅ Consistent formatting across all outputs
  • ✅ JSON/CSV export available
  • ✅ Metrics can be integrated into monitoring systems

Solution: Two-Tier Output with Export

Approach

Option B: Two-Tier Output (Recommended)

  • Keep log_metrics() for detailed metrics (DEBUG level)
  • Keep _generate_pipeline_summary() for summary (INFO level)
  • Make log_metrics() conditional on log level
  • Add export capabilities (JSON/CSV)
  • Standardize formatting across both outputs

Benefits:

  • ✅ Provides flexibility without cluttering normal logs
  • ✅ Detailed metrics available when needed (DEBUG level)
  • ✅ Summary metrics always visible (INFO level)
  • ✅ No breaking changes to existing behavior
  • ✅ Enables export for analysis

Implementation Plan

Phase 1: Consolidate Output (1-2 days)

Tasks:

  • [ ] Modify log_metrics() to use DEBUG level instead of INFO
  • [ ] Update workflow.py to conditionally call log_metrics() based on log level
  • [ ] Standardize formatting in both log_metrics() and _generate_pipeline_summary()
  • [ ] Use consistent key naming (snake_case) in both outputs
  • [ ] Ensure consistent units and number formatting

Deliverables:

  • src/podcast_scraper/metrics.py - Updated log_metrics() method
  • src/podcast_scraper/workflow/orchestration.py - Conditional logging logic
  • Consistent formatting across all metric outputs

Example Changes:

# metrics.py

def log_metrics(self) -> None:
    """Log detailed metrics at DEBUG level."""
    metrics_dict = self.finish()
    summary_lines = ["Pipeline finished (detailed metrics):"]
    for key, value in metrics_dict.items():

        # Use snake_case consistently

        summary_lines.append(f"  - {key}: {value}")
    summary = "\n".join(summary_lines)
    logger.debug(summary)  # Changed from logger.info()
```python

- [ ] Add RSS fetch time tracking (separate from scraping)
- [ ] Add model loading time tracking (Whisper, summarization)
- [ ] Add cache hit/miss tracking for speaker detection
- [ ] Add parallel processing efficiency metrics
- [ ] Add memory usage tracking (peak memory)

**Deliverables:**

- `src/podcast_scraper/metrics.py` - New metric fields and methods
- `src/podcast_scraper/workflow/orchestration.py` - Metric collection points
- `src/podcast_scraper/workflow/episode_processor.py` - Model loading time tracking
- `src/podcast_scraper/speaker_detection.py` - Cache hit/miss tracking

**New Metrics:**

```python
@dataclass
class Metrics:

    # ... existing metrics ...

    # New metrics

    time_rss_fetch: float = 0.0  # RSS fetch time (separate from parsing)
    whisper_model_loading_time: float = 0.0  # Whisper model loading time
    summarization_model_loading_time: float = 0.0  # Summarization model loading time
    speaker_detection_cache_hits: int = 0  # Cache hits for speaker detection
    speaker_detection_cache_misses: int = 0  # Cache misses for speaker detection
    parallel_processing_efficiency: float = 0.0  # Efficiency of parallel processing
    peak_memory_mb: float = 0.0  # Peak memory usage in MB
```text

- [ ] Add `export_json()` method to `Metrics` class
- [ ] Add `export_csv()` method to `Metrics` class
- [ ] Add CLI flag `--export-metrics` with format option (json/csv)
- [ ] Add optional Prometheus integration (future work)

**Deliverables:**

- `src/podcast_scraper/metrics.py` - Export methods
- `src/podcast_scraper/cli.py` - CLI flag and export logic
- Export functionality for JSON and CSV formats

**Example Export:**

```python

# metrics.py

def export_json(self, filepath: str) -> None:
    """Export metrics to JSON file."""
    metrics_dict = self.finish()
    with open(filepath, 'w') as f:
        json.dump(metrics_dict, f, indent=2)

def export_csv(self, filepath: str) -> None:
    """Export metrics to CSV file."""
    metrics_dict = self.finish()
    with open(filepath, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=metrics_dict.keys())
        writer.writeheader()
        writer.writerow(metrics_dict)
```text

- [ ] Test metrics collection in dry-run mode
- [ ] Test metrics accuracy when operations are skipped
- [ ] Test metrics collection with parallel processing
- [ ] Validate metric calculations (averages, totals)
- [ ] Update documentation with new metrics and export options

**Deliverables:**

- Test coverage for new metrics
- Documentation updates
- Examples of exported metrics

## Design Decisions

### 1. Two-Tier Output vs Single Output

**Decision:** Use two-tier output (DEBUG for detailed, INFO for summary)

**Rationale:**

- Provides flexibility without cluttering normal logs
- Detailed metrics available when needed (DEBUG level)
- Summary metrics always visible (INFO level)
- No breaking changes to existing behavior

**Alternative Considered:** Single comprehensive output (Option A)

**Why Not:** Would require removing one of the existing outputs, potentially breaking existing workflows or scripts that parse logs.

### 2. Export Format

**Decision:** Support both JSON and CSV export

**Rationale:**

- JSON: Machine-readable, easy to parse programmatically
- CSV: Easy to import into spreadsheets for analysis
- Both formats are standard and widely supported

**Future:** Prometheus integration can be added later if needed

### 3. Metric Collection Timing

**Decision:** Keep conditional collection (only record if `pipeline_metrics` is available and `elapsed > 0`)

**Rationale:**

- Prevents false metrics (zero times when operations didn't occur)
- Correct behavior - only record actual operations
- Should be documented but not changed

### 4. Formatting Standardization

**Decision:** Use snake_case consistently across all outputs

**Rationale:**

- Matches Python naming conventions
- Consistent with existing codebase patterns
- Easier to parse programmatically

## Benefits

### Developer Experience

-  Cleaner logs (no duplicate metrics)
-  Detailed metrics available when needed (DEBUG level)
-  Summary metrics always visible (INFO level)
-  Consistent formatting makes logs easier to read

### Analysis Capabilities

-  Export to JSON/CSV for programmatic analysis
-  Additional metrics provide better visibility
-  Can track performance trends over time
-  Can identify bottlenecks (model loading, cache efficiency)

### Monitoring Integration

-  JSON export enables integration with monitoring systems
-  Future Prometheus integration possible
-  Metrics can be tracked in dashboards

## Related Files

- `src/podcast_scraper/metrics.py` - Metrics collection and reporting
- `src/podcast_scraper/workflow/orchestration.py` - Pipeline orchestration and metric logging
- `src/podcast_scraper/workflow/episode_processor.py` - Episode processing and metric collection
- `src/podcast_scraper/workflow/metadata_generation.py` - Metadata generation and metric collection
- `src/podcast_scraper/cli.py` - CLI interface and export options

## Notes

- This RFC focuses on **pipeline/application metrics**, not test metrics (see RFC-025 for test metrics)
- Conditional collection behavior is correct and should be documented, not changed
- Export capabilities enable future integration with monitoring systems
- Two-tier output provides flexibility without breaking existing workflows