ADR-032: Unified Retry Policy with Metrics¶
- Status: Accepted
- Date: 2026-02-05
- Authors: Podcast Scraper Team
- Related Issues: #399 (Provider-level concerns to harden)
Context & Problem Statement¶
API-based providers (OpenAI, Gemini, Anthropic, etc.) can experience transient failures (429 rate limits, 5xx errors, connection resets). Each provider initially implemented its own retry logic with different behaviors, delays, and logging formats. This led to:
- Inconsistent retry behavior across providers
- "Provider A is flaky" myths when it was actually retry policy differences
- Duplicated retry logic in every provider
- No unified way to track retry counts and rate limit sleep time
- Inconsistent logging (some providers logged retries, others didn't)
Decision¶
We adopt a Unified Retry Policy with Metrics using the retry_with_metrics utility function.
- All API-based providers use
retry_with_metrics()for all API calls (transcription, summarization, speaker detection). - Exponential backoff: Starts at
initial_delay(default 1.0s), doubles each retry, capped atmax_delay(default 30.0s). - Rate limit detection: Automatically detects 429 errors and respects
retry_afterheaders when available. - Unified logging: Logs compact retry line:
provider_retry: provider=openai attempt=2 sleep=4.0 reason=429. - Metrics tracking: Tracks retry counts and rate limit sleep time via
ProviderCallMetrics.record_retry().
Rationale¶
- Consistent Behavior: All providers retry the same way, preventing "provider A is flaky" myths
- Centralized Logic: Retry logic is maintained in one place, not duplicated across providers
- Visibility: Unified logging format makes retries visible and debuggable
- Metrics Integration: Retry counts and sleep time are tracked in metrics for analysis
- Rate Limit Handling: Automatic detection and respect for
retry_afterheaders prevents unnecessary retries
Alternatives Considered¶
- Provider-Specific Retry Logic: Rejected as it leads to inconsistency and duplication.
- Library-Based Retry (tenacity, backoff): Rejected as it doesn't integrate with our metrics system and doesn't provide unified logging.
- No Retry Logic: Rejected as transient failures are common with API providers and users expect automatic retries.
Consequences¶
- Positive:
- Consistent retry behavior across all providers
- Unified logging format for easy debugging
- Metrics tracking for retry analysis
- Automatic rate limit handling
- Centralized maintenance (one place to update retry logic)
- Negative:
- All providers must use the same retry policy (can't customize per-provider)
- Slight overhead for providers that don't need retries (but minimal)
- Neutral:
- Requires one-time migration of all providers to use
retry_with_metrics
Implementation Notes¶
- Module:
src/podcast_scraper/utils/provider_metrics.py-retry_with_metricsfunction - Pattern: Decorator-like wrapper function (not a decorator to allow metrics parameter)
- Default Parameters:
max_retries=3initial_delay=1.0max_delay=30.0retryable_exceptions=(Exception,)(can be customized per call)- Rate Limit Detection: Checks for "429", "rate limit", "quota", "resource exhausted" in error messages
- All API Providers: OpenAI, Gemini, Anthropic, Mistral, Grok, DeepSeek, Ollama
- Logging Format:
provider_retry: provider={name} attempt={n} sleep={sec} reason={reason}
References¶
- Issue #399: Provider-level concerns to harden
src/podcast_scraper/utils/provider_metrics.py- Implementation- ADR-027: Unified Provider Metrics Contract - Related metrics contract