Testing Strategy¶
Document Structure:
- This document - High-level strategy, test pyramid, decision criteria
- Testing Guide - Quick reference, test execution commands
- Unit Testing Guide - Unit test mocking patterns and isolation
- Integration Testing Guide - Integration test mocking guidelines
- E2E Testing Guide - E2E server, real ML models, OpenAI mocking
- Critical Path Testing Guide - What to test and prioritization
Overview¶
This document defines the testing strategy for the podcast scraper codebase. It establishes the test pyramid approach, decision criteria for choosing test types, and high-level testing patterns.
The system supports 9 providers (1 local ML + 1 hybrid ML + 7 LLM: OpenAI,
Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama)
with per-provider unit, integration, and E2E tests.
All LLM providers use versioned Jinja2 prompt
templates managed by PromptStore (RFC-017).
GIL: The Grounded Insight Layer (GIL, PRD-017) has testing for insight/quote extraction,
grounding contract validation, and gi.json schema compliance. See GIL Testing (Implemented) below.
For detailed implementation guides per test layer, see the layer-specific guides linked above.
Problem Statement¶
Testing requirements and strategies were previously scattered across individual RFCs, making it difficult to:
- Understand the overall testing approach
- Ensure consistent testing patterns across modules
- Plan new test infrastructure
- Onboard new contributors to testing practices
- Track testing coverage and requirements
This unified testing strategy document provides a single source of truth for all testing decisions and requirements.
Test Pyramid¶
The testing strategy follows a three-tier pyramid:
/\
/E2E\ ← Few, realistic end-to-end tests
/------\
/Integration\ ← Moderate, focused integration tests
/------------\
/ Unit \ ← Many, fast unit tests
/----------------\
| Layer | Scope | Entry Point | HTTP | Fixtures | ML Models |
|---|---|---|---|---|---|
| Unit | Individual functions/modules | Function/class level | Mocked | Mocked | Mocked |
| Integration | Component interactions | Component level | Local test server (or mocked) | Test fixtures | Real (optional) |
| E2E | Complete user workflows | User level (CLI/API) | Real HTTP client (local server) | Real data files | Real (in workflow) |
Volume and default bias¶
Goal: Most tests are unit tests (~80%); integration (~14%) and E2E (~6%) are complementary layers—not copies of the same scenarios.
Current distribution (~3,770 tests total): ~3,000 unit | ~530 integration | ~230 E2E.
| Layer | Expectation |
|---|---|
| Unit | Default for new work. Cover branches, errors, config, parsing, and provider behavior with mocks. This layer should have the highest test count (~80%) and run on every PR (fast). |
| Integration | Cross-boundary validation. Integration tests validate cross-module boundaries and provider interactions (factories, stages, HTTP client against a local server). Avoid re-testing every unit case again here. |
| E2E | Non-redundant with integration. E2E tests are kept non-redundant with integration tests—each covers a distinct full workflow (CLI / run_pipeline / critical paths) that proves the stack end-to-end. Do not add E2E for every new flag or branch—use unit tests for those. |
Anti-pattern: Asserting the same behavior three times (unit + integration + E2E) without a distinct guarantee at each layer. Prefer one strong unit test plus integration or E2E only where the layer adds real value (real I/O, multi-component flow, or user entry point).
Decision Questions¶
- Am I testing a complete user workflow? (CLI command, library API call, service API call)
- YES → E2E Test
-
NO → Continue to question 2
-
Am I testing how multiple components work together? (RSS parser → Episode → Provider → File)
- YES → Integration Test
-
NO → Continue to question 3
-
Am I testing a single function/module in isolation?
- YES → Unit Test
- NO → Review test scope and purpose
Common Patterns¶
- Component workflow (RSS → Episode → Provider) → Integration Test
- Complete CLI command (
podcast-scraper <url>) → E2E Test - Library API call (
run_pipeline(config)) → E2E Test - Error handling in pipeline → Integration Test (if focused) or E2E Test (if complete workflow)
- HTTP client behavior → Integration Test (if isolated) or E2E Test (if in workflow)
Unit Tests¶
- Purpose: Test individual functions/modules in isolation
- Speed: Fast (< 100ms each)
- Scope: Single module, mocked dependencies
- Coverage: High (target: ≥70% code coverage)
- Examples: Config validation, filename sanitization, URL normalization
Integration Tests¶
- Purpose: Test interactions between multiple modules/components (component interactions, data flow)
- Speed: Moderate (< 5s each for fast tests)
- Scope: Multiple modules working together, real internal implementations
- Entry Point: Component-level (functions, classes, not user-facing APIs)
- I/O Policy:
- ✅ Allowed: Real filesystem I/O (temp directories), real component interactions
- ❌ Mocked: External services (HTTP APIs, external APIs) - mocked for speed/reliability
- ⚠️ ML models: Mocked by default for speed. Use real models with
@pytest.mark.ml_modelsfor ML workflow integration tests (excluded from fast suite, see Integration Testing Guide) - ✅ Optional: Local HTTP server for HTTP client testing in isolation
- Coverage: Critical paths and edge cases, component interactions
- Examples: Provider factory → provider implementation, RSS parser → Episode → Provider → File output, HTTP client with local test server
- Key Distinction: Tests how components work together, not complete user workflows. Mock ML
models for fast tests; use real models with
@pytest.mark.ml_modelsfor ML workflow tests.
End-to-End Tests¶
- Purpose: Test complete user workflows from entry point to final output (CLI commands, library API calls, service API calls)
- Speed: Slow (< 60s each, may be minutes for full workflows)
- Scope: Full pipeline from entry point to output, real HTTP client, real data files, real ML models
- Entry Point: User-level (CLI commands,
run_pipeline(),service.run()) - I/O Policy:
- ✅ Allowed: Real HTTP client with local HTTP server (no external network), real filesystem I/O, real data files
- ✅ Real implementations: Use actual HTTP clients (no mocking), real file operations, real model loading
- ✅ Real ML models: Use real Whisper, spaCy, and Transformers models (NO mocks)
- ✅ Real data files: RSS feeds, transcripts,
audio files from
tests/fixtures/e2e_server/ - ❌ No external network: All HTTP calls go to local server (network guard prevents external calls)
- ❌ No mocks: E2E tests use real implementations throughout (no Whisper mocks, no ML model mocks)
- Provider E2E tests: Dedicated E2E test files
exist for each of the 9 providers:
test_ml_models_e2e.py,test_hybrid_ml_provider_e2e.py,test_openai_provider_e2e.py,test_gemini_provider_e2e.py,test_anthropic_provider_e2e.py,test_mistral_provider_e2e.py,test_deepseek_provider_e2e.py,test_grok_provider_e2e.py,test_ollama_provider_e2e.py. LLM providers use mock API endpoints served by the E2E HTTP server. - Coverage: Complete user workflows, production-like scenarios
- Examples: CLI command
(
podcast-scraper <rss_url>) → Full pipeline → Output files, Library API (run_pipeline(config)) → Full pipeline → Output files - Key Distinction: Tests complete user workflows with real implementations. NO mocks allowed — tests the system as users would use it.
Decision Criteria¶
The decision questions above provide a quick way to determine test type. For critical path prioritization, see Critical Path Testing Guide. For detailed implementation guidelines, see Testing Guide.
Quick Reference:
- Unit Test: Single function/module in isolation, all dependencies mocked
- Integration Test: Multiple components working together, real internal implementations, mocked external services (including ML models for speed)
- E2E Test: Complete user workflow from entry point to output, real HTTP client, real data files, real ML models (NO mocks)
Critical Path Priority: If your test covers the critical path (RSS → Parse → Download/Transcribe → NER → Summarization → Metadata → Files), prioritize it. See Critical Path Testing Guide for details.
Test Categories¶
1. Unit Tests¶
Configuration & Validation (config.py)¶
- RFC-008: Validate coercion logic, error messages, alias handling
- RFC-007: Test argument parsing edge cases (invalid speaker counts, unknown config keys)
- Test Cases:
- Type coercion (string → int, validation failures)
- Config file loading (JSON/YAML, invalid formats)
- Default value application
- Alias resolution (
rssvsrss_url)
Filesystem Operations (filesystem.py)¶
- RFC-004: Sanitization edge cases, output derivation logic
- Test Cases:
- Filename sanitization (special characters, reserved names)
- Output directory derivation and validation
- Run suffix generation
- Path normalization across platforms
RSS Parsing (rss_parser.py)¶
- RFC-002: Varied RSS shapes, namespace differences, missing attributes
- Test Cases:
- Namespace handling (Podcasting 2.0, standard RSS)
- Relative URL resolution
- Missing optional fields
- Malformed XML handling
- Edge cases (uppercase tags, mixed namespaces)
Transcript Downloads (downloader.py, episode_processor.py)¶
- RFC-003: Extension derivation edge cases
- Test Cases:
- URL normalization (encoding, special characters)
- Extension inference (from URL, Content-Type, declared type)
- HTTP retry logic (unit test with mocked responses)
- Transcript type preference ordering
Whisper Integration (providers/ml/whisper_utils.py)¶
- RFC-005: Mock Whisper library, loading paths, error handling
- RFC-006: Screenplay formatting with synthetic segments
- Test Cases:
- Model loading (success, missing dependency, invalid model)
- Screenplay formatting (gap handling, speaker rotation, aggregation)
- Language parameter propagation
- Model selection logic (
.envariants for English)
Progress Reporting (progress.py)¶
- RFC-009: Noop factory,
set_progress_factorybehavior - Test Cases:
- Factory registration and replacement
- Progress update calls
- Context manager behavior
Speaker Detection (speaker_detection.py) - RFC-010¶
- RFC-010: NER extraction scenarios, host/guest distinction
- Test Cases:
- Title-only detection (
"Alice interviews Bob") - Description-rich detection (multiple guest names)
- Feed-level host inference
- CLI override precedence
- spaCy missing/disabled scenarios
- Name capping when too many detected
Summarization (summarizer.py) - RFC-012¶
- RFC-012: Local transformer model integration, summary generation with map-reduce strategy
- Test Cases:
- Model selection (explicit, auto-detection for MPS/CUDA/CPU)
- Model loading and initialization on different devices
- Model integration tests (marked as
@pytest.mark.slowand@pytest.mark.integration):- Verify all models in
DEFAULT_SUMMARY_MODELScan be loaded when configured - Test each model individually:
bart-large,bart-small,long,long-fast - Catch dependency issues (e.g., missing protobuf for PEGASUS models)
- Verify model and tokenizer are properly initialized
- Test model unloading after loading
- Verify all models in
- Map-reduce strategy:
- Map phase: chunking (word-based and token-based), chunk summarization
- Reduce phase decision logic: single abstractive (≤800 tokens), mini map-reduce (800-4000 tokens), extractive (>4000 tokens)
- Mini map-reduce: re-chunking combined summaries into 3-5 sections (650 words each), second map phase (summarize each section), final abstractive reduce
- Extractive fallback behavior for extremely large combined summaries
- Summary generation with various text lengths
- Key takeaways extraction
- Text chunking for long transcripts
- Safe summarization error handling (OOM, missing dependencies)
- Memory optimization (CUDA/MPS)
- Model unloading and cleanup
- Integration with metadata generation pipeline
Provider System (RFC-013, RFC-029)¶
- RFC-013/RFC-029: Protocol-based provider architecture for transcription, speaker detection, and summarization with 9 unified providers.
- Provider Coverage: Each provider has a consistent test structure:
| Provider | Unit Tests | Integration | E2E |
|---|---|---|---|
| MLProvider | test_ml_provider.py, _lifecycle.py |
test_provider_real_models.py |
test_ml_models_e2e.py |
| HybridMLProvider | test_hybrid_ml_provider.py, _lifecycle.py |
test_hybrid_ml_providers.py |
test_hybrid_ml_provider_e2e.py |
| OpenAIProvider | test_openai_provider.py, _factory.py, _lifecycle.py |
test_openai_providers.py |
test_openai_provider_e2e.py |
| GeminiProvider | test_gemini_provider.py, _factory.py, _lifecycle.py |
test_gemini_providers.py |
test_gemini_provider_e2e.py |
| AnthropicProvider | test_anthropic_provider.py, _factory.py, _lifecycle.py |
test_anthropic_providers.py |
test_anthropic_provider_e2e.py |
| MistralProvider | test_mistral_provider.py, _factory.py, _lifecycle.py |
test_mistral_providers.py |
test_mistral_provider_e2e.py |
| DeepSeekProvider | test_deepseek_provider.py, _factory.py, _lifecycle.py |
test_deepseek_providers.py |
test_deepseek_provider_e2e.py |
| GrokProvider | test_grok_provider.py, _factory.py, _lifecycle.py |
test_grok_providers.py |
test_grok_provider_e2e.py |
| OllamaProvider | test_ollama_provider.py, _factory.py, _lifecycle.py |
test_ollama_providers.py |
test_ollama_provider_e2e.py |
-
Provider Capabilities:
test_capabilities.py(unit) andtest_capabilities_e2e.py(E2E) testProviderCapabilities— which providers support which features (JSON mode, tool calls, streaming). -
Test Cases:
- Unit Tests (Standalone Provider): Each provider tested directly with all dependencies mocked. Tests: creation, initialization, protocol method implementation, error handling, cleanup, configuration validation.
- Unit Tests (Factory): Factory tests verify correct provider instantiation per config, protocol compliance, error handling.
- Unit Tests (Lifecycle): Provider lifecycle tests verify initialization, cleanup, resource management.
- Integration Tests: Real provider implementations with mocked external services. Tests: provider factory, protocol compliance, component interactions, provider switching, error handling in workflow context.
- E2E Tests: Real providers in full pipeline. LLM providers use E2E server mock endpoints. MLProvider uses real ML models (Whisper, spaCy, Transformers). Tests: complete workflows, error scenarios (API failures, rate limits).
Prompt Management (RFC-017)¶
- Prompt Store:
test_prompt_store.py(unit) tests versioned Jinja2 prompt template management. - Test Cases:
- Template loading per provider/task/version
- Jinja2 variable rendering
- Missing template error handling
- Template version selection
- All 9 providers have prompt directories
(
prompts/<provider>/ner/,summarization/) - Template validation (syntax, required variables)
Service API (service.py)¶
- Public API: Service interface for daemon/non-interactive use
- Test Cases:
ServiceResultdataclass (success/failure states, attributes)service.run()with valid Config (success path)service.run()with logging configurationservice.run()exception handling (returns failed ServiceResult)service.run_from_config_file()with JSON configservice.run_from_config_file()with YAML configservice.run_from_config_file()with missing file (returns failed ServiceResult)service.run_from_config_file()with invalid config (returns failed ServiceResult)service.run_from_config_file()with Path objectsservice.main()CLI entry point (success/failure exit codes)service.main()version flag handlingservice.main()missing config argument handling- Service API importability via
__getattr__ - ServiceResult equality and string representation
- Integration with public API (
Config,load_config_file,run_pipeline)
Reproducibility & Operational Hardening (Issue #379)¶
- Determinism: Seed-based reproducibility for
torch,numpy, andtransformers. Test that seeds are set correctly and outputs are consistent across runs. - Run Tracking: Test run manifest creation (system state capture), run summary generation (manifest + metrics), and episode index generation (episode status tracking).
- Failure Handling: Test
--fail-fastand--max-failuresflags, episode failure tracking in metrics, and exit code behavior. - Retry Policies: Test exponential backoff retry for transient errors (network failures, model loading errors), retry counts and delays, and fallback to cache clearing.
- Timeout Enforcement: Test transcription and summarization timeout enforcement, timeout error handling, and timeout configuration.
- Security: Test path validation (directory traversal prevention), model allowlist validation, safetensors format preference, and
trust_remote_code=Falseenforcement. - Structured Logging: Test JSON log formatting, log aggregation compatibility, and structured log fields.
- Diagnostics: Test
doctorcommand checks (Python version, ffmpeg, write permissions, model cache, network connectivity).
For detailed unit test execution commands, test file descriptions, fixtures, requirements, and coverage, see Unit Testing Guide.
2. Integration Tests¶
CLI Integration (cli.py + workflow.py)¶
- RFC-001: Success, dry-run, concurrency edge cases, error handling
- RFC-007: CLI happy path, invalid args, config file precedence
- Test Cases:
- CLI argument parsing → Config validation → pipeline execution
- Config file loading and precedence (CLI > config file)
- Dry-run mode (no disk writes)
- Error handling and exit codes
- Dependency injection hooks (
apply_log_level_fn,run_pipeline_fn)
Workflow Orchestration (workflow.py)¶
- RFC-001: End-to-end coordination, concurrency, cleanup
- Test Cases:
- RSS fetch → episode parsing → transcript download → file writing
- Concurrent transcript downloads (ThreadPoolExecutor)
- Whisper job queuing and sequential processing
- Temp directory cleanup
- Skip-existing and clean-output flags
Episode Processing (episode_processor.py)¶
- RFC-003: HTTP response simulation with various headers
- RFC-004: Directory management interactions
- Test Cases:
- Transcript download with various formats (VTT, SRT, JSON)
- Media download for Whisper transcription
- File naming and storage
- Error handling (network failures, missing files)
Config + CLI + Workflow¶
- RFC-008: CLI + config files → Config instantiation
- Test Cases:
- Config file loading → validation → pipeline execution
- Config override precedence
- Invalid config error handling
Whisper + Screenplay Formatting¶
- RFC-006: Screenplay flags → formatting → file output
- RFC-010: Detected speaker names → screenplay formatting
- Test Cases:
- Whisper transcription → screenplay formatting with detected names
- Speaker name override precedence
- Language-aware model selection
For detailed integration test execution commands, test file descriptions, fixtures, requirements, and coverage, see Integration Testing Guide.
3. End-to-End Tests¶
E2E Test Coverage Goals¶
Critical Path Priority: The critical path must have E2E tests for all three entry points (CLI, Library API, Service API). See Critical Path Testing Guide for details.
Every major user-facing entry point should have at least one E2E test:
- CLI Commands - Each main CLI command should have E2E tests
- Library API Endpoints - Each public API function should have E2E tests
- Critical User Scenarios - Important workflows should have E2E tests
What Doesn't Need E2E Tests:
- Not every CLI flag combination needs an E2E test
- Every possible configuration value (tested in integration/unit tests)
- Edge cases in specific components (tested in integration tests)
Rule of Thumb: E2E tests should cover "as a user, I want to..." scenarios, not every possible configuration combination.
For detailed E2E test execution commands and implementation, see E2E Testing Guide.
E2E Test Tiers (Code Quality vs Data Quality)¶
E2E tests are organized into three tiers to balance fast CI feedback with comprehensive validation:
| Tier | Purpose | Episodes | Models | When | Makefile Target |
|---|---|---|---|---|---|
| Tier 1: Fast | Code quality, critical path | 1 | Test (tiny/base) | Every PR | test-e2e-fast |
| Tier 2: Data Quality | Volume validation | 3 | Test (tiny/base) | Nightly | test-e2e-data-quality |
| Tier 3: Nightly Full | Production validation | 15 (5×3) | Production (base, BART-large, LED-large) | Nightly | test-nightly |
Tier 1: Fast E2E Tests (@pytest.mark.e2e + @pytest.mark.critical_path)
- 1 episode per test for fast feedback
- Test models (Whisper tiny.en, BART-base, LED-base)
- Run on every PR and push to main
- Focus: Does the code work correctly?
Tier 2: Data Quality Tests (@pytest.mark.e2e + @pytest.mark.data_quality)
- 3 episodes per test for volume validation
- Test models (same as Tier 1)
- Run in nightly builds only
- Focus: Does data processing work with volume?
Tier 3: Nightly Full Suite (@pytest.mark.nightly)
- 15 episodes across 5 podcasts (p01-p05)
- Production models (Whisper base, BART-large-cnn, LED-large-16384)
- Run in nightly builds only
- Focus: Production-quality validation with real models
- Memory / CI:
make test-nightlyruns sequentially (no pytest-xdist). Parallel execution was removed because exit-code mismatches triggered a fallback path that re-ran the full suite, doubling wall time from ~75 min to ~3 h. Sequential also avoids OOM onubuntu-latestwhere two workers each loading Whisper + large HF models exceeded available RAM.
Key Principle: Code quality tests (Tier 1) run on every PR. Data quality and nightly tests (Tiers 2-3) run only in nightly builds to avoid slowing down CI/CD feedback.
LLM/OpenAI Exclusion: Tests marked @pytest.mark.llm are excluded from nightly builds
(-m "not llm") to avoid API costs. See issue #183.
Test Infrastructure¶
Test Framework¶
- Primary: pytest (with unittest compatibility)
- Fixtures: pytest fixtures for test setup and teardown
- Markers: pytest markers for test categorization
Mocking Strategy¶
- Unit Tests: Mock all external dependencies (HTTP, ML models, file system, API clients)
- Integration Tests: Mock external services (HTTP APIs, external APIs) and ML models (Whisper, spaCy, Transformers), use real internal implementations (real providers, real Config, real
workflow logic)
- E2E Tests: Use real implementations (HTTP client, ML models, file system) with local test server. For API providers, use E2E server mock endpoints instead of direct API calls. ML models
(Whisper, spaCy, Transformers) are REAL - no mocks allowed.
Provider Testing Strategy (9 providers):
- Unit Tests (Standalone Provider): Each of the
9 providers tested directly with all dependencies
mocked (API clients, ML models). Dedicated test
files:
test_<provider>_provider.py,_factory.py,_lifecycle.py. - Unit Tests (Factory): Test factories create
correct unified providers for each
<provider>_providerconfig value. Verify protocol compliance, test factory error handling. - Integration Tests: Per-provider integration
files (
test_<provider>_providers.py) use real provider implementations with mocked external services. ML models mocked for speed. - E2E Tests: Per-provider E2E files
(
test_<provider>_provider_e2e.py) use E2E server mock endpoints for LLM providers, real ML models for MLProvider. - Key Principle: Always verify protocol compliance, not class names. All providers implement the same protocol interfaces.
- Prompt Templates: LLM providers load prompts
via
PromptStore. Unit tests mockPromptStore; integration/E2E tests use real templates. - Test Organization: See
docs/wip/PROVIDER_TEST_STRATEGY.mdfor detailed test organization and separation.
Test Organization¶
The test suite is organized into three main categories:
tests/unit/- Unit tests per module (fast, isolated, fully mocked)tests/integration/- Integration tests (component interactions, real internal implementations, mocked external services)tests/e2e/- E2E tests (complete workflows, real HTTP client, real ML models)
Test Markers¶
@pytest.mark.unit- Unit tests (fast, isolated)@pytest.mark.integration- Integration tests (component interactions)@pytest.mark.e2e- End-to-end workflow tests@pytest.mark.critical_path- Critical path tests (run in fast suite)@pytest.mark.serial- Tests that must run sequentially (resource conflicts)-
@pytest.mark.ml_models- Requires ML dependencies (whisper, spacy, transformers) or uses real ML models -
@pytest.mark.slow- Slow-running tests @pytest.mark.network- Tests that hit external network@pytest.mark.multi_episode- Multi-episode tests (nightly)@pytest.mark.data_quality- Data quality validation (nightly only, 3 episodes)@pytest.mark.nightly- Comprehensive nightly tests (15 episodes, production models)@pytest.mark.llm- Tests using LLM APIs (excluded from nightly to avoid costs)@pytest.mark.openai- Tests using OpenAI API specifically (subset of llm)@pytest.mark.gil- GIL extraction tests@pytest.mark.grounding- Grounding contract validation tests
Execution Pattern: Tests marked serial run first sequentially, then remaining tests run in
parallel with -n auto. All tests use network isolation via
--disable-socket --allow-hosts=127.0.0.1,localhost.
For detailed test infrastructure information, see Testing Guide.
ML Quality Evaluation¶
Beyond functional testing, the project uses objective metrics to evaluate the quality of ML-generated outputs. These evaluations are performed using dedicated scripts and a "golden" dataset.
Evaluation Layers¶
| Layer | Focus | Metrics | Tools |
|---|---|---|---|
| Cleaning | Effective removal of ads/outro | Removal %, Brand detection | Automatic (via experiment runner) |
| Summarization | Accuracy and synthesis quality | ROUGE-1/2/L, Compression ratio | Automatic (via experiment runner) |
| GIL | Insight/quote extraction quality | Grounding rate, Quote verbatim %, Insight coverage | Per-provider comparison |
Golden Datasets¶
Evaluation is performed against human-verified ground
truth data stored in data/eval/. This dataset is
versioned and frozen to provide a stable baseline for
comparison. See
ADR-040
for details.
GIL Golden Dataset: A golden dataset for GIL evaluation includes human-annotated insights, quotes, and grounding links for a representative set of episodes. This enables comparison across extraction tiers (ML-only, Hybrid, Cloud LLM).
Continuous Improvement¶
Quality evaluation is integrated into the AI Quality & Experimentation Platform (PRD-007), which uses these metrics to gate new model deployments and configuration changes. The complete evaluation loop (runner, scorer, comparator) is documented in the Experiment Guide (Step 4: Evaluate Results).
CI/CD Integration¶
Continuous Integration Strategy¶
The CI/CD pipeline (GitHub Actions) implements a multi-layered validation strategy to balance fast feedback with comprehensive quality control.
1. Every Pull Request (Per-Commit Feedback)¶
- lint: Fast formatting, code linting, markdown linting, and type checking (~2 min)
- test-unit: All unit tests with coverage, verified network isolation (~3-5 min)
- test-integration-fast: Critical path integration tests using test ML models (~5-8 min)
- test-e2e-fast: Critical path E2E tests (Tier 1) using test ML models (~8-12 min)
- build: Package build validation (~2 min)
- docs: Documentation build validation (~3 min)
2. Main Branch & Release Branches (Full Validation)¶
- Unified Coverage: Combines unit, integration, and E2E coverage reports into a single artifact.
- test-integration: Complete integration test suite.
- test-e2e: Complete E2E test suite (Tier 1).
- Automated Metrics: Collection of test health, code quality, and pipeline performance metrics.
- Unified Dashboard: Deployment of the Metrics Dashboard to GitHub Pages.
3. Nightly Comprehensive (Deep Analysis)¶
- nightly-only-tests: Full validation (Tier 3) using production-quality ML models (Whisper base, BART-large, LED-large).
- Data Quality (Tier 2): Volume validation with multiple episodes.
- Module Dependency Analysis: Automated graph generation and coupling analysis.
- Trend Tracking: Historical metrics accumulation for long-term health monitoring.
Test Execution Pattern¶
- Parallel Execution: Most jobs run in parallel with reserved cores for system stability (
auto - 2). - Flaky Test Resilience: Unit, integration, and E2E jobs in GitHub Actions use automatic reruns (
--reruns 2 --reruns-delay 1) alongsidepytest-json-reportwhere JSON is emitted. - Network Isolation: Enforced via
pytest-socketacross all test tiers. - LLM Exclusion: API-based tests (OpenAI) are excluded from nightly runs to avoid costs.
For detailed test execution commands, parallel execution, and coverage configuration, see Testing Guide.
Testing Patterns¶
Dependency Injection¶
- CLI Testing: Use
cli.main()override callables (apply_log_level_fn,run_pipeline_fn,logger) - Workflow Testing: Mock
run_pipelinefor CLI-focused tests - Benefit: Test CLI behavior without executing full pipeline
Mocking External Dependencies¶
- HTTP: Mock
requests.Sessionand responses (unit/integration tests), use E2E server for E2E tests - Whisper: Mock
whisper.load_model()andwhisper.transcribe()(unit tests), use real models (integration/E2E tests) - ML Dependencies (spacy, torch, transformers):
- Unit Tests: Mock in
sys.modulesbefore importing dependent modules - Integration Tests: Real ML dependencies required
- Verification: CI runs
scripts/tools/check_unit_test_imports.pyto ensure modules can import without ML deps - File System: Use
tempfilefor isolated test environments - API Providers (OpenAI, Gemini, Anthropic, Mistral, DeepSeek, Grok, Ollama):
- Unit Tests: Mock API clients (e.g.,
OpenAI,genai,anthropic.Anthropicclasses) - Integration Tests: Mock API clients or use E2E server mock endpoints
- E2E Tests: Use E2E server mock endpoints
(real HTTP client, mock API responses). Dedicated
mock fixtures:
openai_mock.py,gemini_mock_client.py
Test Isolation¶
- Each test uses
tempfile.TemporaryDirectoryfor output - Tests clean up after themselves
- No shared state between tests
- Mock external services (HTTP, file system)
- No network calls in unit tests (enforced by pytest plugin)
- No filesystem I/O in unit tests (enforced by pytest plugin, except tempfile operations)
Error Testing¶
- Test validation errors (invalid config, malformed RSS)
- Test network failures (timeouts, connection errors)
- Test missing dependencies (Whisper, spaCy unavailable)
- Test edge cases (empty feeds, missing transcripts, invalid URLs)
Test Requirements by Module¶
cli.py¶
- [ ] Argument parsing (valid, invalid, edge cases)
- [ ] Config file loading and precedence
- [ ] Error handling and exit codes
- [ ] Dependency injection hooks
- [ ] Version flag behavior
config.py¶
- [ ] Type coercion and validation
- [ ] Default value application
- [ ] Config file loading (JSON/YAML)
- [ ] Alias resolution
- [ ] Error messages
workflow.py¶
- [ ] Pipeline orchestration
- [ ] Concurrent downloads
- [ ] Whisper job queuing
- [ ] Cleanup operations
- [ ] Dry-run mode
rss_parser.py¶
- [ ] RSS parsing (various formats)
- [ ] Namespace handling
- [ ] URL resolution
- [ ] Episode creation
- [ ] Error handling (malformed XML)
downloader.py¶
- [ ] HTTP session configuration
- [ ] Retry logic
- [ ] URL normalization
- [ ] Streaming downloads
- [ ] Error handling
episode_processor.py¶
- [ ] Transcript download
- [ ] Media download
- [ ] File naming
- [ ] Whisper job creation
- [ ] Error handling
filesystem.py¶
- [ ] Filename sanitization
- [ ] Output directory derivation
- [ ] Run suffix generation
- [ ] Path validation
providers/ml/whisper_utils.py¶
- [ ] Model loading
- [ ] Transcription invocation
- [ ] Screenplay formatting
- [ ] Language handling
- [ ] Error handling (missing dependency)
speaker_detection.py (RFC-010)¶
- [ ] NER extraction
- [ ] Host/guest distinction
- [ ] CLI override precedence
- [ ] Fallback behavior
- [ ] Caching logic
progress.py¶
- [ ] Factory registration
- [ ] Progress updates
- [ ] Context manager behavior
summarizer.py (RFC-012)¶
- [x] Model selection logic (explicit, auto-detection for MPS/CUDA/CPU)
- [x] Model loading and initialization
- [x] Model integration tests (all models in
DEFAULT_SUMMARY_MODELScan be loaded) - [ ] Summary generation
- [ ] Key takeaways generation
- [x] Text chunking for long transcripts
- [ ] Safe summarization with error handling
- [ ] Memory optimization (CUDA/MPS)
- [x] Model unloading
- [ ] Integration with metadata generation
service.py (Public API)¶
- [x]
ServiceResultdataclass - [x]
service.run()with valid Config - [x]
service.run()with logging configuration - [x]
service.run()exception handling - [x]
service.run_from_config_file()JSON/YAML - [x]
service.run_from_config_file()error cases - [x]
service.main()CLI entry point - [x]
service.main()version/missing config - [x] Service API importability via
__getattr__ - [x] ServiceResult equality and string repr
- [x] Integration with public API
providers/ (All 9 Providers)¶
- [x] MLProvider: unit, lifecycle, integration, E2E
- [x] HybridMLProvider: unit, lifecycle, integration, E2E
- [x] OpenAIProvider: unit, factory, lifecycle, integration, E2E
- [x] GeminiProvider: unit, factory, lifecycle, integration, E2E
- [x] AnthropicProvider: unit, factory, lifecycle, integration, E2E
- [x] MistralProvider: unit, factory, lifecycle, integration, E2E
- [x] DeepSeekProvider: unit, factory, lifecycle, integration, E2E
- [x] GrokProvider: unit, factory, lifecycle, integration, E2E
- [x] OllamaProvider: unit, factory, lifecycle, integration, E2E
- [x] Provider capabilities (
test_capabilities.py) - [x] Provider params (
test_provider_params.py)
prompts/store.py (Prompt Management)¶
- [x] Template loading per provider/task/version
- [x] Jinja2 variable rendering
- [x] Missing template error handling
- [x] Template version selection
schemas/summary_schema.py¶
- [x] Summary schema validation
(
test_summary_schema.py)
GIL Modules (RFC-049 — Implemented)¶
- [x]
gi/schema.py—gi.jsonvalidation (unit:test_schema.py) - [x]
gi/io.py— gi.json read/write (unit:test_io.py) - [x] Standalone schema validation:
scripts/tools/validate_gi_schema.pyandmake validate-gi-schema [ARTIFACTS_DIR=path]; E2E tests that produce gi.json run strict validation inline (ci-fast gates on it). - [x] KG artifacts:
scripts/tools/validate_kg_schema.pyandmake validate-kg-schema [ARTIFACTS_DIR=path]; unit teststest_kg_pipeline.py,test_kg_llm_extract.py,test_kg_schema.py,test_kg_contracts.py; integrationtest_kg_integration.py,test_kg_metadata_integration.py; E2Etest_kg_cli_e2e.py(fixture + pipeline stub + provider mocks: OpenAI, Anthropic, Gemini, Grok, DeepSeek); acceptanceconfig/acceptance/kg/*.yaml(includingacceptance_planet_money_kg_grok.yamlfor Grok + provider KG). - [x]
gi/load.py— load artifact, evidence spans, find by episode/insight id (unit:test_load.py) - [x]
gi/explore.py— scan, collect, topic filter (unit:test_explore.py) - [x]
gi/pipeline.py— build_artifact stub and grounded (unit:test_pipeline.py) - [x]
gi/grounding.py— find_grounded_quotes (unit:test_grounding.py; optional integration with real models) - [x] GIL in workflow: metadata_generation writes gi.json when generate_gi true (unit: test_metadata_generation; E2E: test_gi_cli_e2e)
- [x] CLI
gi inspect,gi show-insight,gi explore(unit: test_cli TestGiSubcommand; E2E: test_gi_cli_e2e) - [x] Three-tier extraction (ML-only, Hybrid, Cloud) — implemented (transformers, hybrid_ml, LLM providers)
- [ ] GIL extraction latency per tier — benchmarking planned
Test Pyramid Status¶
Note: Test distribution numbers should be verified periodically by running test collection and counting tests by layer. Historical progress tracking was documented in
docs/wip/TEST_PYRAMID_ANALYSIS.mdanddocs/wip/TEST_PYRAMID_PLAN.md(now consolidated here).
Current State vs. Ideal Distribution¶
The test pyramid has reached a healthy shape that matches the ideal targets.
Current Distribution (~3,770 tests, verified Apr 2026):
- Unit Tests: ~3,000 (~80%) ✅ Target 70-80%
- Integration Tests: ~530 (~14%) ✅ Target 15-20%
- E2E Tests: ~230 (~6%) ✅ Target 5-10%
Visual Representation:
╱╲
╱ ╲ E2E: ~6% (~230 tests)
╱ ╲
╱ ╲ Integration: ~14% (~530 tests)
╱ ╲
╱ ╲ Unit: ~80% (~3,000 tests)
╱____________╲
Total: ~3,770 tests
Ideal Pyramid:
╱╲
╱ ╲ E2E: 5-10%
╱ ╲
╱ ╲ Integration: 15-20%
╱ ╲
╱ ╲ Unit: 70-80%
╱____________╲
The pyramid is now well-balanced. Historical issues (misclassified tests, missing unit coverage) have been addressed through the improvement phases below. Remaining maintenance items:
- Unit layer (~80%): Healthy. All core modules have unit tests. Continue adding unit tests as default for new work.
- Integration layer (~14%): Validates cross-module boundaries and provider interactions across all 9 providers.
- E2E layer (~6%): Covers complete user workflows (CLI, Library API, Service API) without duplicating integration-level scenarios.
Goals and Targets¶
Target Distribution (achieved as of Apr 2026):
- Unit Tests: 70-80% (~3,000 tests) ✅
- Integration Tests: 15-20% (~530 tests) ✅
- E2E Tests: 5-10% (~230 tests) ✅
Success Metrics:
- Unit test execution time: < 30 seconds
- Integration test execution time: < 5 minutes
- E2E test execution time: < 20 minutes
- Test coverage: Maintain ≥70%
Improvement Strategy (Completed)¶
All four phases have been executed. The pyramid now matches the target distribution.
Phase 1: Reclassify Misplaced Tests ✅ Done
- Reclassified summarizer tests from E2E to unit.
- Reviewed and reclassified E2E tests that violated testing strategy definitions.
Phase 2: Add Missing Unit Tests ✅ Done
- Added unit tests for all core modules (
workflow.py,episode_processor.py,summarizer.py,speaker_detection.py,preprocessing.py,progress.py,metrics.py,filesystem.py,cli.py,service.py).
Phase 3: Optimize Integration Layer ✅ Done
- Moved component interaction tests from E2E to integration.
- Added integration tests for cross-module boundaries.
Phase 4: Reduce E2E to True E2E ✅ Done
- E2E layer now contains only true end-to-end user workflow tests.
- Final Result: Unit: ~80%, Integration: ~14%, E2E: ~6% ✅
Unit Test Coverage by Module¶
All previously-identified coverage gaps have been addressed. Key modules and their unit test status:
Core Modules (covered):
summarizer.py: Text cleaning, chunking, validation functions ✅workflow.py: Pipeline orchestration helpers ✅episode_processor.py: Episode processing logic ✅
Supporting Modules (covered):
speaker_detection.py: Detection and scoring logic ✅cli.pyandservice.py: Argument parsing and service logic ✅preprocessing.py,progress.py,metrics.py,filesystem.py: Utility functions ✅
Future Testing Enhancements¶
E2E Test Infrastructure Improvements (Issue #14)¶
- [x] Local HTTP test server
- [x] Test audio file fixtures
- [x] Real Whisper integration tests
- [x] Test markers and CI integration
Library API Tests (Issue #16)¶
- [x]
run_pipeline()E2E tests - [x]
load_config_file()tests - [x] Error handling tests
- [x] Return value validation
GIL Testing (Implemented — PRD-017, RFC-049/050)¶
Testing for the Grounded Insight Layer follows the established test pyramid. Current coverage:
Provider-based evidence: The pipeline uses find_grounded_quotes_via_providers (in gi/grounding.py), which calls provider methods extract_quotes and score_entailment. These are unit-tested with mocked dependencies in test_grounding.py and test_pipeline.py; provider implementations (ML and LLM) are covered in test_ml_provider.py, test_openai_provider.py, and other provider test modules. Evidence stack integration (embedding, QA, NLI) is covered in test_evidence_stack_integration.py.
Unit Tests:
- [x]
gi/schema.py— JSON schema validation (test_schema.py) - [x]
gi/io.py— read/write gi.json,collect_gi_paths_from_inputs(test_io.py) - [x]
gi/quality_metrics.py— PRD-017 aggregates (test_gil_quality_metrics.py) - [x]
gi/corpus.py— NDJSON / merged export (test_gi_corpus.py) - [x]
kg/quality_metrics.py— PRD-019 aggregates (test_quality_metrics.py) - [x]
gi/load.py— load artifact, evidence spans, find by episode/insight id (test_load.py) - [x]
gi/explore.py— scan, collect, topic filter (test_explore.py) - [x]
gi/pipeline.py— build_artifact stub, grounded via providers (includingcreate_gil_evidence_providers/ injected quote + entailment providers) (test_pipeline.py) - [x]
gi/grounding.py— find_grounded_quotes with mocked QA/NLI; find_grounded_quotes_via_providers with mock extract_quotes/score_entailment; pipeline_metrics evidence call counters (test_grounding.py) - [x] Providers: extract_quotes and score_entailment (ML and LLM) unit-tested with mocked dependencies (
test_ml_provider.py,test_openai_provider.py, etc.). Provider path and evidence method behaviour are covered intest_grounding.pyandtest_pipeline.py. - [x] Workflow: generate_episode_metadata passes quote_extraction_provider and entailment_provider into build_artifact when generate_gi and gi_require_grounding true (
test_metadata_generation.py) - [x] CLI gi subcommand: parse, validate, export, inspect, show-insight, explore, query, exit codes (
test_cli.py); config logging warnings for GIL stub insights and API summary + local evidence hybrid (TestLogConfigurationGiStubWarning,TestLogConfigurationGilHybridWarning) - [x] CI fixtures:
tests/fixtures/gil_kg_ci_enforce— GIL + KG quality metrics enforce (GitHub Actions +make quality-metrics-ci)
GIL and KG CI quality gates¶
make quality-metrics-ci— Runsgil_quality_metrics.pyandkg_quality_metrics.pywith--enforce(and strict schema flags) ontests/fixtures/gil_kg_ci_enforce. Listed inmake help; depended on bymake ci-fastin the repositoryMakefile.- GitHub Actions — The
test-unitjob includes GIL and KG quality metrics on CI fixtures (PRD-017 / PRD-019) (same fixture tree; see.github/workflows/python-app.yml). - Optional local gates —
make gil-quality-metrics DIR=<run_root>andmake kg-quality-metrics DIR=<run_root>withARGS='--enforce …'for release or regression checks over a real run directory. - Critical-path E2E —
tests/e2e/test_gi_cli_e2e.pyandtests/e2e/test_kg_cli_e2e.pycarry@pytest.mark.critical_pathon GI/KG CLI scenarios (included inmake test-fast/make ci-fastper project defaults).
Integration Tests:
- [x] Evidence stack: load embedding, QA, NLI; encode, answer, entailment_score (
test_evidence_stack_integration.py) - [x] Optional: find_grounded_quotes with real models (skip when offline)
- [ ]
gi.json→ Postgres export (RFC-051) — not yet implemented
E2E Tests:
- [x] Full pipeline with generate_gi true → gi inspect, show-insight, explore (
test_gi_cli_e2e.py) - [x] Pre-built artifact dir → gi inspect, show-insight, explore (exit 0)
- [x] Invalid args: gi explore without --output-dir (exit 2); gi show-insight without --id (exit 2)
Quality Evaluation (Planned):
- Quote verbatim accuracy, grounding rate, insight coverage
- Cross-provider comparison using golden dataset
Model Registry Testing (RFC-044, Implemented)¶
ModelRegistryinitialization and model lookupModelCapabilitiesvalidation (token limits, device defaults)- Registry used by summarizer and GIL extractor
- Invalid model name error handling
Hybrid ML Platform Testing (RFC-042, Implemented)¶
- Hybrid MAP-REDUCE pipeline (LED map → FLAN-T5 reduce)
- Extractive QA for quote extraction
- NLI for grounding validation
- Sentence embeddings for topic deduplication
StructuredExtractorprotocol compliance
Performance Testing¶
- [ ] Benchmark large feed processing (1000+ episodes)
- [ ] Measure Whisper transcription performance
- [ ] Profile memory usage
- [ ] Test concurrent download limits
- [ ] GIL extraction latency per tier (benchmarking planned)
Property-Based Testing¶
- [ ] Generate random RSS feeds
- [ ] Test filename sanitization with fuzzing
- [ ] Test URL normalization with edge cases
- [ ]
gi.jsonschema fuzzing
References¶
- Critical Path Testing Guide — What to test, prioritization
- Testing Guide — Test execution, fixtures, coverage
- Unit Testing Guide — Mocking patterns and isolation
- Integration Testing Guide — Integration test mocking guidelines
- E2E Testing Guide — E2E server, real ML models, API mocking
- Test structure reorganization:
docs/rfc/RFC-018-test-structure-reorganization.md - CI workflow:
.github/workflows/python-app.yml - Related RFCs: RFC-001–RFC-018 (testing strategies), RFC-029 (unified providers), RFC-017 (prompt store)
- Implemented RFCs: RFC-042 (Hybrid ML), RFC-044 (Model Registry), RFC-049 (GIL Schema), RFC-050 (GIL Pipeline); Draft: RFC-051 (DB Projection)
- Related Issues: #14 (E2E testing), #16 (Library API E2E tests), #94 (src/ layout), #98 (Test structure reorganization)
- Architecture:
docs/architecture/ARCHITECTURE.md - Non-Functional Requirements:
docs/architecture/NON_FUNCTIONAL_REQUIREMENTS.md(quality attributes validated by tests) - Contributing guide:
CONTRIBUTING.md