Testing Guide¶

Document Structure:

Testing Strategy - High-level philosophy, test pyramid, decision criteria

This document - Quick reference, test execution commands

Unit Testing Guide - Unit test mocking patterns and isolation

Integration Testing Guide - Integration test mocking guidelines

E2E Testing Guide - E2E server, real ML models, OpenAI mocking; E2E feeds and server options (feeds per mode, error injection, URLs); chaos tests (e.g. 404 audio) assert run index records failed episodes

Critical Path Testing Guide - What to test and prioritization

Quick Reference¶

Layer	Speed	Scope	Mocking
Unit	< 100ms	Single function	All dependencies mocked
Integration	< 5s	Component interactions	External services mocked
E2E	< 60s	Complete workflow	No mocking (real everything)

Decision Tree:

Testing a complete user workflow? → E2E Test
Testing component interactions? → Integration Test
Testing a single function? → Unit Test

Running Tests¶

Default Commands¶

# Unit tests (parallel, network isolated)

make test-unit

# Integration tests (parallel, with reruns)

make test-integration

# E2E tests (parallel, with reruns)

make test-e2e

# All tests

make test

Fast Variants (Critical Path Only)¶

make test-fast              # Unit + critical path integration + e2e
make test-integration-fast  # Critical path integration tests
make test-e2e-fast          # Critical path E2E tests

Sequential (For Debugging)¶

For debugging test failures, run tests sequentially using pytest directly:

# Run all tests sequentially

pytest tests/ -n 0

# Run unit tests sequentially

pytest tests/unit/ -n 0

# Run integration tests sequentially

pytest tests/integration/ -n 0

# Run E2E tests sequentially

pytest tests/e2e/ -n 0

Specific Tests¶

# Run specific test file

pytest tests/unit/podcast_scraper/test_config.py -v

# Run with marker

pytest tests/integration/ -m "integration and critical_path" -v

# Run with coverage

pytest tests/unit/ --cov=podcast_scraper --cov-report=term-missing

Test Markers¶

Marker	Purpose
`@pytest.mark.unit`	Unit tests
`@pytest.mark.integration`	Integration tests
`@pytest.mark.e2e`	End-to-end tests
`@pytest.mark.critical_path`	Critical path tests (run in fast suite)
`@pytest.mark.nightly`	Nightly-only tests (excluded from regular CI)
`@pytest.mark.flaky`	May fail intermittently (gets reruns)
`@pytest.mark.serial`	Must run sequentially (rarely needed)
`@pytest.mark.ml_models`	Requires ML dependencies
`@pytest.mark.slow`	Slow-running tests
`@pytest.mark.network`	Hits external network
`@pytest.mark.llm`	Uses LLM APIs (excluded from nightly to avoid costs)
`@pytest.mark.openai`	Uses OpenAI specifically (subset of `llm`)

Network Isolation¶

Tests use network isolation appropriate to their layer:

Unit tests: Full socket blocking (--disable-socket --allow-hosts=127.0.0.1,localhost)
Integration/E2E: Host allowlist only (--allow-hosts=127.0.0.1,localhost)

Parallel Execution¶

Tests run in parallel by default using pytest-xdist:

All tests run in parallel by default with memory-aware worker calculation. The @pytest.mark.serial marker is registered but currently unused.

The Makefile automatically calculates the optimal number of workers based on:

Available system memory
CPU core count
Test type (unit/integration/e2e have different memory requirements)
Platform (more conservative on macOS)

See TROUBLESHOOTING.md for details on memory-aware worker calculation.

Note: The @pytest.mark.serial marker is rarely needed now. Global state cleanup fixtures in conftest.py reset shared state between tests, allowing most tests to run in parallel safely. Only use serial for tests with genuine resource conflicts.

Memory Cleanup Best Practices¶

Automatic Cleanup:

The test suite includes an automatic cleanup fixture (cleanup_ml_resources_after_test) that:

Limits PyTorch thread pools to prevent excessive thread spawning
Cleans up the global preloaded ML provider
Finds and cleans up ALL SummaryModel and provider instances created during tests (Issue #351)
Forces garbage collection after integration/E2E tests

Explicit Cleanup (Recommended):

While automatic cleanup handles most cases, explicit cleanup is recommended for clarity and immediate memory release:

from tests.conftest import cleanup_model, cleanup_provider

def test_something():
    # Create model directly
    model = summarizer.SummaryModel(...)
    try:
        # test code
    finally:
        cleanup_model(model)  # Explicit cleanup

def test_with_provider():
    # Create provider directly
    provider = create_summarization_provider(cfg)
    try:
        # test code
    finally:
        cleanup_provider(provider)  # Explicit cleanup

Why Explicit Cleanup?

Immediate memory release - Models are unloaded as soon as the test completes
Clarity - Makes it obvious that cleanup is happening
Defensive - Works even if automatic cleanup has issues
Best practice - Matches the pattern of resource management (try/finally)

Helper Functions:

cleanup_model(model) - Unloads a SummaryModel instance
cleanup_provider(provider) - Cleans up a provider instance (MLProvider, etc.)

Both functions are idempotent (safe to call multiple times) and handle None gracefully.

⚠️ Warning: `-s` Flag and Parallel Execution¶

Do not use -s (no capture) with parallel tests — it causes hangs due to tqdm progress bars competing for terminal access.

# DON'T DO THIS (hangs)

pytest tests/e2e/ -s -n auto

# DO THIS INSTEAD

pytest tests/e2e/ -v -n auto     # Use -v for verbose output
pytest tests/e2e/ -s -n 0        # Or disable parallelism
make test-e2e-sequential         # Or use sequential target

See Issue #176 for details.

Flaky Test Reruns¶

Integration and E2E tests use reruns:

pytest --reruns 3 --reruns-delay 1

Flaky Test Markers¶

Some tests are marked with @pytest.mark.flaky to indicate they may fail intermittently due to inherent non-determinism. These tests get automatic reruns.

Why Some Tests Are Flaky¶

Category	Tests	Root Cause
Whisper Transcription	4	ML inference variability - audio transcription has natural variation
Full Pipeline + Whisper	2	Whisper timing + audio file I/O
OpenAI Mock Integration	2	Mock response parsing timing
Full Pipeline + OpenAI	7	Complex multi-component timing

Current Flaky Test Count: 15¶

File	Count	Category
`test_basic_e2e.py`	7	Full pipeline with OpenAI mocks
`test_whisper_e2e.py`	4	Whisper inference variability
`test_full_pipeline_e2e.py`	2	Whisper transcription
`test_openai_provider_e2e.py`	2	OpenAI mock responses

Tests That Are NOT Flaky¶

The following categories are now stable and don't need flaky markers:

Transformers/spaCy model loading - Uses offline mode (HF_HUB_OFFLINE=1)
ML model tests - Explicit summary_reduce_model prevents cache misses
HTTP integration tests - Explicit server waits prevent timing issues
Parallel execution - Global state cleanup prevents race conditions

Reducing Flakiness¶

If you encounter a flaky test, check these common causes:

Network access - Should be blocked via pytest-socket
Model cache - Run make preload-ml-models first
Global state - Ensure cleanup fixtures reset shared state
Progress bars - TQDM_DISABLE=1 is set automatically in tests

See Issue #177 for investigation details.

E2E Test Modes¶

Set via E2E_TEST_MODE environment variable:

Mode	Episodes	Use Case
`fast`	1 per test	Quick feedback
`multi_episode`	5 per test	Full validation
`data_quality`	Multiple	Nightly only

ML Model Preloading¶

Tests require models to be pre-cached:

make preload-ml-models

See E2E Testing Guide for model defaults.

E2E Acceptance Tests¶

Acceptance tests allow you to run multiple configuration files sequentially, collect structured data (logs, outputs, timing, resource usage), and compare results against baselines. This is useful for:

Running the same configs across different code versions to detect regressions
Testing multiple configs with different RSS feeds or settings
Validating system acceptance of different provider/model configurations
Comparing performance metrics across runs

Setting up acceptance configs¶

The project expects your acceptance configs to live in a config/acceptance/. That folder is gitignored so you can keep local, feed-specific configs out of the repo.

Create the folder: mkdir -p config/acceptance (at project root).
Copy example configs: Use config/examples/config.example.yaml (or any example) as a template:
cp config/examples/config.example.yaml config/acceptance/config.my.myshow.yaml (or a name that fits your feeds).
Adjust for your definition of acceptance: Edit the copied file(s)—RSS feed URLs, providers, model names, output paths, etc.—so they match what you consider “acceptance” for your use case. You can add multiple configs (e.g. one per show or per provider) and run them all with a pattern like config/acceptance/*.yaml.

Optional: use config/playground/ (also gitignored) for ad-hoc or one-off configs; run them with e.g. make test-acceptance CONFIGS="config/playground/config.my.*.yaml".

Running Acceptance Tests¶

# Run a single config file
make test-acceptance CONFIGS="config/examples/config.example.yaml"

# Run multiple configs (using glob patterns)
make test-acceptance CONFIGS="config/acceptance/*.yaml"

# Save current runs as a baseline for future comparison
make test-acceptance CONFIGS="config/examples/config.example.yaml" SAVE_AS_BASELINE=baseline_v1

# Compare against an existing baseline
make test-acceptance CONFIGS="config/examples/config.example.yaml" COMPARE_BASELINE=baseline_v1

# Use fixture feeds (mock data) instead of real RSS feeds
make test-acceptance CONFIGS="config/examples/config.example.yaml" USE_FIXTURES=1

# Disable real-time log streaming (only save to files)
make test-acceptance CONFIGS="config/examples/config.example.yaml" NO_SHOW_LOGS=1

# Disable automatic analysis and benchmark reports
make test-acceptance CONFIGS="config/examples/config.example.yaml" NO_AUTO_ANALYZE=1 NO_AUTO_BENCHMARK=1

Understanding Sessions vs Runs¶

Session = One execution of the acceptance test tool

Triggered by a single command invocation
Can process multiple config files sequentially
Has a unique session_id (timestamp-based)
Contains a summary of all runs in that session

Run = One execution of a single config file within a session

Each config file you pass creates one run
Has its own run_id, timing, exit code, logs, and outputs
Runs execute sequentially within the session

Example:

If you run:

make test-acceptance CONFIGS="config1.yaml config2.yaml config3.yaml"

You get:

1 Session (with session_id = 20260208_101601)
3 Runs (one for each config file)
- run_20260208_101601_123 (config1.yaml)
- run_20260208_101601_456 (config2.yaml)
- run_20260208_101601_789 (config3.yaml)

Output Structure¶

Results are saved to .test_outputs/acceptance/ by default:

.test_outputs/acceptance/
├── sessions/
│   └── session_20260208_101601/          ← ONE SESSION
│       ├── session.json                   ← Summary of all runs
│       └── runs/
│           ├── run_20260208_101601_123/  ← RUN #1 (config1)
│           │   ├── config.original.yaml  ← Original config for this run
│           │   ├── config.yaml           ← Modified config used for execution
│           │   ├── run_data.json
│           │   ├── stdout.log
│           │   ├── stderr.log
│           │   └── ... (service outputs)
│           ├── run_20260208_101601_456/  ← RUN #2 (config2)
│           │   ├── config.original.yaml  ← Original config for this run
│           │   ├── config.yaml
│           │   └── ...
│           └── run_20260208_101601_789/  ← RUN #3 (config3)
│               ├── config.original.yaml  ← Original config for this run
│               ├── config.yaml
│               └── ...
└── baselines/
    └── baseline_v1/                       ← Saved baselines
        ├── baseline.json
        └── run_20260208_101601_123/      ← Copied run data

Analyzing Results¶

Use the analysis script to generate reports:

# Basic analysis
make analyze-acceptance SESSION_ID=20260208_101601

# Comprehensive analysis with baseline comparison
make analyze-acceptance SESSION_ID=20260208_101601 MODE=comprehensive COMPARE_BASELINE=baseline_v1

# Or use the script directly
python scripts/acceptance/analyze_bulk_runs.py \
    --session-id 20260208_101601 \
    --output-dir .test_outputs/acceptance \
    --mode comprehensive \
    --compare-baseline baseline_v1

Performance Benchmarking¶

Generate performance benchmarking reports that group runs by provider/model configuration:

# Generate benchmark report
make benchmark-acceptance SESSION_ID=20260208_101601

# Generate benchmark report with baseline comparison
make benchmark-acceptance SESSION_ID=20260208_101601 COMPARE_BASELINE=baseline_v1

# Or use the script directly
python scripts/acceptance/generate_performance_benchmark.py \
    --session-id 20260208_101601 \
    --output-dir .test_outputs/acceptance \
    --compare-baseline baseline_v1

The benchmark report includes:

Summary table comparing all provider/model configurations
Performance metrics per configuration (time per episode, throughput, memory)
Detailed analysis for each configuration
Performance comparison (fastest vs. slowest, memory usage)
Baseline comparison (if --compare-baseline is provided):
Performance changes vs. baseline (time, throughput, memory)
Regression detection (20% slower, 100MB more memory)
Improvement detection (10% faster, 50MB less memory)
Detailed per-configuration comparison

Baseline Comparison Features:

Compares provider/model configurations between current run and baseline
Detects regressions (performance degradation)
Detects improvements (performance gains)
Shows percentage changes for all metrics
Groups comparisons by provider/model (not just config name)

Reports are generated in both Markdown and JSON formats for easy review and programmatic analysis.

Test Organization¶

tests/
├── unit/                    # Unit tests (fast, isolated)
│   ├── conftest.py          # Network/filesystem isolation
│   └── podcast_scraper/     # Per-module tests
├── integration/             # Integration tests
│   ├── conftest.py          # Shared fixtures
│   └── test_*.py            # Component interaction tests
├── e2e/                     # E2E tests
│   ├── fixtures/            # E2E server, HTTP server
│   └── test_*.py            # Complete workflow tests
└── conftest.py              # Shared fixtures, ML cleanup

Coverage Thresholds¶

Per-tier thresholds enforced in CI (prevents regression):

Tier	Threshold	Current
Unit	70%	~74%
Integration	40%	~42%
E2E	40%	~50%
Combined	70%	~71%+

Note: Local make targets now run with coverage:

make test-unit          # includes --cov
make test-integration   # includes --cov
make test-e2e           # includes --cov

Test Count Targets¶

Unit tests: 200+
Integration tests: 50+
E2E tests: 100+

Layer-Specific Guides¶

For detailed implementation patterns:

Unit Testing Guide - What to mock, isolation enforcement, test structure
Integration Testing Guide - Mock vs real decisions, ML model usage
E2E Testing Guide - E2E server, OpenAI mocking, network guard

What to Test¶

Critical Path Testing Guide - What to test based on the critical path, test prioritization, fast vs slow tests, @pytest.mark.critical_path marker

Domain-Specific Testing¶

Provider Implementation Guide - Provider testing across all tiers (unit, integration, E2E), E2E server mock endpoints, testing checklist

References¶

Testing Strategy - Overall testing philosophy
Critical Path Testing Guide - Prioritization
CI/CD Documentation - GitHub Actions workflows
Architecture - Testing Notes section