
PRD-016: Operational Observability & Pipeline Intelligence

  • Status: 📋 Draft
  • Authors: Podcast Scraper Team
  • Related RFCs: RFC-025 (Health), RFC-026 (Dashboards), RFC-027 (Metrics)

Summary

The Operational Observability & Pipeline Intelligence platform is a unified system for managing the health, efficiency, and predictability of the podcast processing system. It transforms CI/CD and production runs from "Black Boxes" into a source of intelligence, where every run contributes to a historical record of system performance and operational reliability.

By combining Test Health Tracking (RFC-025), Unified Dashboards (RFC-026), and Precision Pipeline Metrics (RFC-027), we establish the capability to manage the "Hidden Cost" of operations—latency, flakiness, and resource consumption.

Background & Context

As the project scales toward a production-grade service, we face three primary "observability" challenges:

  1. Silent Slowdown: Without historical tracking, the pipeline can gradually become 2x slower over several months without any single "breaking" event. We need to identify these trends to drive performance improvements.
  2. Flakiness Erosion: "Flaky" tests or jobs erode trust in the engineering process. We need to manage flakiness by treating it as a first-class metric to be reduced.
  3. Data Blindness: When a job fails or slows down, management needs to know where in the pipeline the bottleneck is (e.g., RSS Parsing vs. Whisper Transcription).

The Solution: An integrated observability platform that turns raw logs into actionable intelligence. It provides the "Radar" needed to manage the system's operational health at scale.

Goals

  1. Unified Health Score: Establish a single, objective "Health Score" for the project based on test pass rates, flakiness, and runtime stability (see the illustrative sketch after this list).
  2. Bottleneck Identification: Provide 100% visibility into the timing of every pipeline stage (e.g., "Downloader is taking 40% of total runtime"). When GIL (Grounded Insight Layer) is enabled, the GIL stage is included as an instrumented stage.
  3. Data-Driven Infrastructure: Use resource usage metrics to drive decisions on CI hardware allocation or API rate limit tiers.
  4. Historical Accountability: Maintain a year-long record of system performance to ensure progress isn't just a "point in time" event.
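As an illustrative sketch only, the three inputs named in Goal 1 could be folded into a single score roughly as follows; the weights and the compute_health_score helper are assumptions for discussion, not a committed formula:

```python
# Illustrative sketch of a composite Health Score (Goal 1).
# The weights and field names below are assumptions, not the committed formula.
from dataclasses import dataclass


@dataclass
class RunStats:
    pass_rate: float          # fraction of tests passing, 0.0-1.0
    flake_rate: float         # fraction of tests that needed a retry, 0.0-1.0
    runtime_stability: float  # 1.0 = runtime at baseline, lower = slower/noisier


def compute_health_score(stats: RunStats) -> float:
    """Combine pass rate, flakiness, and runtime stability into one 0-100 score."""
    score = (
        0.5 * stats.pass_rate
        + 0.3 * (1.0 - stats.flake_rate)
        + 0.2 * stats.runtime_stability
    )
    return round(100 * score, 1)


print(compute_health_score(RunStats(pass_rate=0.98, flake_rate=0.05, runtime_stability=0.9)))
# -> 95.5
```

Whatever the final weighting, the point is that the score is computed from data every run already produces, so it can be tracked historically with no extra manual input.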

Core Pillars

1. The Sensor (Raw Intelligence)

  • RFC Reference: RFC-025, RFC-027
  • Capability: Fine-grained metrics collection across tests and orchestration stages.
  • Driving Value: Collects the high-fidelity data needed to manage the system objectively.

2. The Radar (Unified Visibility)

  • RFC Reference: RFC-026
  • Capability: Aggregated, multi-source dashboards (GitHub Pages).
  • Driving Value: Provides the management "Cockpit" to see the entire system's health in one view.

3. The Analyst (Insight Engine)

  • RFC Reference: RFC-026 (Historical Tracking)
  • Capability: Trend analysis and historical comparisons.
  • Driving Value: Drives the project by identifying long-term regressions or improvements that single runs miss.

Personas & Use Cases

UC1: Performance Budgeting (The "Efficiency" Case)

Persona: Infrastructure Lead
Scenario: "Our nightly runs are now taking 2 hours, up from 1 hour last month."
Action: The Lead pulls the Historical Bottleneck Report from the dashboard.
Result: The platform identifies that the summarization stage has slowed down by 50% due to a specific model version update. The Lead drives a fix or model swap.

UC2: Trust Restoration (The "Reliability" Case)

Persona: Developer Devin
Scenario: "I'm tired of the CI failing randomly on my PRs."
Action: Devin checks the Flakiness Leaderboard on the dashboard.
Result: The platform identifies test_rss_parser_timeout as the #1 flaky test. Devin manages the fix for that specific test, restoring trust for all contributors.

UC3: Resource Planning (The "Capacity" Case)

Persona: Project Manager
Scenario: "Do we need to upgrade to a faster CI runner?"
Action: The Manager reviews the Resource Consumption metrics.
Result: The platform shows that the bottleneck is CPU-bound during ASR, not disk or network. The Manager drives the upgrade to compute-optimized runners.

Product Requirements

PR1: High-Precision Data Collection

  • FR1.1: Test Lifecycle Metrics: Track pass/fail, retry count (flakiness), and duration for every test in the suite.
  • FR1.2: Stage-Level Instrumentation: Measure wall-clock time and success rates for every major module (RSS, Downloader, ML); see the sketch below.
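A minimal sketch of what FR1.2 could look like in practice, assuming a Python pipeline; the timed_stage helper and the record fields are illustrative, not the instrumentation interface defined in RFC-027:

```python
# Minimal sketch of stage-level instrumentation (FR1.2). The helper name and
# record layout are illustrative assumptions, not the RFC-027 interface.
import json
import time
from contextlib import contextmanager

stage_records: list[dict] = []


@contextmanager
def timed_stage(name: str):
    """Record wall-clock time and success/failure for one pipeline stage."""
    start = time.monotonic()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise
    finally:
        stage_records.append({
            "stage": name,
            "duration_s": round(time.monotonic() - start, 3),
            "success": ok,
        })


with timed_stage("rss_parsing"):
    time.sleep(0.1)  # stand-in for the real RSS parsing work

print(json.dumps(stage_records, indent=2))
```

A context manager keeps the timing logic out of the stage code itself and guarantees a record is written even when the stage raises.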

PR2: Management Dashboards

  • FR2.1: Unified View: A single GitHub Pages dashboard that consolidates metrics from PRs, Nightly runs, and Release branches.
  • FR2.2: Trend Visualization: Interactive charts showing Runtime and Flakiness over time.

PR3: Intelligent History

  • FR3.1: JSON API: Export all metrics as a versioned JSON API for integration with external analysis tools (illustrated in the sketch after this list).
  • FR3.2: Baseline Selector: Allow users to compare current "Build Health" against any arbitrary point in history (e.g., "vs v2.0.0").
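To make FR3.1 and FR3.2 concrete, the sketch below shows one possible shape for the exported JSON and a baseline comparison; every field name, the schema_version value, and the compare_to_baseline helper are assumptions for illustration:

```python
# Illustrative shape for the exported metrics JSON (FR3.1) and a baseline
# diff (FR3.2). Schema fields and helper names are assumptions.
import json

current_run = {
    "schema_version": "1.0",
    "run_id": "nightly-2024-06-01",
    "stages": {"rss_parsing": 42.0, "download": 310.5, "transcription": 1800.2},
    "tests": {"passed": 512, "failed": 1, "flaky": 3},
}

baseline_run = {
    "schema_version": "1.0",
    "run_id": "v2.0.0",
    "stages": {"rss_parsing": 40.1, "download": 295.0, "transcription": 1200.8},
    "tests": {"passed": 498, "failed": 0, "flaky": 7},
}


def compare_to_baseline(current: dict, baseline: dict) -> dict:
    """Return per-stage runtime deltas (percent) versus an arbitrary baseline."""
    deltas = {}
    for stage, duration in current["stages"].items():
        base = baseline["stages"].get(stage)
        if base:
            deltas[stage] = round(100 * (duration - base) / base, 1)
    return deltas


print(json.dumps(compare_to_baseline(current_run, baseline_run), indent=2))
# transcription shows roughly +49.9%, flagging it as the regression to investigate
```

Versioning the export (schema_version) lets external tools detect format changes instead of silently misreading older records.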

Success Criteria

  • Operational Awareness: 100% of pipeline stages are instrumented and visible on the dashboard (including the GIL stage when GIL is enabled).
  • Flakiness Reduction: 0 "Unknown" flaky tests; all flakiness is identified and tracked by the system.
  • Stability Gating: Ability to drive a release based on "Historical Stability" rather than just a single passing run.
  • Management Efficiency: Identifying the root cause of a CI slowdown takes < 5 minutes of analysis.