Testing Guide¶
Document Structure:
- Testing Strategy - High-level philosophy, test pyramid, decision criteria; Browser UI E2E (Playwright) as an additive layer
- This document - Quick reference, test execution commands
- Unit Testing Guide - Unit test mocking patterns and isolation
- Integration Testing Guide - Integration test mocking guidelines
- RSS and feed ingestion - How RSS is fetched and parsed (helps when testing scraping vs transcript download)
- E2E Testing Guide - E2E server, real ML models, OpenAI mocking; E2E feeds and server options (feeds per mode, error injection, URLs); chaos tests (e.g. 404 audio) assert run index records failed episodes; download vs multi-feed resilience split (Download resilience E2E); browser E2E for the GI/KG Vue viewer (
make test-ui-e2e)- Critical Path Testing Guide - What to test and prioritization
Quick Reference¶
| Layer | Speed | Scope | ML/AI | Mocking |
|---|---|---|---|---|
| Unit | < 100ms | Single function | Mocked | All dependencies mocked |
| Integration | < 5s | Component interactions | Mocked | External services + ML/AI mocked |
| E2E | < 60s | Complete workflow | Real | No mocking (real everything) |
| Browser UI E2E | ~1-3 min (suite) | Vue viewer in Firefox (Playwright) | N/A | Vite + route/API mocks in specs |
Unit tests and pyproject extras: tests/unit/ must only depend on [dev] — never on [ml], [llm], or [compare] (and keep FastAPI route tests out of tests/unit/ per policy). CI test-unit installs .[dev] only, so any test requiring a non-[dev] extra will be silently skipped and never validated. If a test needs real torch, spaCy, faiss, etc., move it to tests/integration/ (where CI installs .[dev,ml,llm]). Do not use pytest.importorskip() in tests/unit/ to work around missing extras. See Unit Testing Guide -- Pyproject extras and Testing Strategy -- Unit tests and optional extras.
Automated policy enforcement: Two scripts run in make ci and make ci-fast to
catch testing-policy violations before they reach CI:
| Script | Make target | What it checks |
|---|---|---|
check_unit_test_imports.py |
make check-unit-imports |
Library modules import without ML deps at import time |
check_test_policy.py |
make check-test-policy |
3-tier ML/AI boundary rules (see table below) |
check_test_policy.py enforces four rules:
| Rule ID | Scope | Violation |
|---|---|---|
| U1-importorskip | tests/unit/ |
pytest.importorskip() -- move to integration/ or mock |
| U2-available-guard | tests/unit/ |
*_AVAILABLE skip guards -- mock ML deps instead of skipping |
| I1-ml-models-marker | tests/integration/ |
@pytest.mark.ml_models -- real ML belongs in tests/e2e/ |
| G1-empty-test-file | all tests/ |
Zero test_ methods -- delete or add tests |
Run make check-test-policy locally after adding or moving tests. Pass --fix-hint
for remediation suggestions. Both scripts live in scripts/tools/.
Decision Tree:
- Testing the GI/KG viewer UI in a real browser (graph, search shell, keyboard, theme)? -->
make test-ui-e2e(Playwright). See Browser E2E (GI / KG Viewer v2) and Testing Strategy -- Browser UI E2E. - Testing a complete CLI / library / service workflow? --> E2E Test (pytest)
- Testing component interactions? --> Integration Test
- Testing a single function? --> Unit Test
Running Tests¶
Default Commands¶
# Unit tests (parallel, network isolated)
make test-unit
# Same dependency set as CI test-unit (.[dev] only) — separate venv, leaves .venv unchanged
make venv-dev-init && make test-unit-dev-venv
# Integration tests (parallel, with reruns)
make test-integration
# E2E tests (parallel, with reruns)
make test-e2e
# All tests
make test
Browser E2E (GI / KG Viewer v2)¶
Playwright drives a real browser against the Vue SPA. This stack is orthogonal to pytest:
specs are not collected by pytest, and make test does not run them. Strategically, it is
documented as an additive layer on the test pyramid — see
Testing Strategy — Browser UI E2E (Playwright)
and ADR-066.
Where this lives in the repo: npm commands and package.json are under web/gi-kg-viewer/;
make test-ui / make test-ui-e2e run from the root. See
Polyglot repository guide.
E2E surface map: E2E_SURFACE_MAP.md lists
viewer surfaces, entry paths, and stable selectors. Use it when debugging Playwright failures,
manual repros, or agent-driven browser tools (same a11y vocabulary as the tests). Whenever you change viewer UX (labels,
layout, routes, tokens, a11y names, or E2E flows), update artifacts in this order: (1) the
surface map, (2) e2e/*.spec.ts / helpers.ts / fixtures.ts and run make test-ui-e2e,
(3) VIEWER_IA.md if shell IA changed, then UXS-001 and/or the relevant feature UXS if the visual / experience contract changed.
Full checklist: E2E Testing Guide — When you change viewer UX
(GitHub #509). Agent-browser loop:
Agent-Browser Closed Loop Guide.
How it fits next to pytest¶
| Concern | Tool | Location |
|---|---|---|
| Viewer TS utility logic (parsing, merge, metrics, formatting) | Vitest | web/gi-kg-viewer/src/utils/*.test.ts |
Viewer HTTP API (/api/*) -- pure logic helpers |
pytest unit tests (no FastAPI needed) | tests/unit/podcast_scraper/server/ (catalog, index staleness, pathutil) |
| Viewer HTTP API — wired app + real files | pytest integration | tests/integration/server/ (test_server_api.py, test_viewer_corpus_library.py, test_index_rebuild.py, …) |
| Viewer UI (render, click, keyboard, graph container) | Playwright | web/gi-kg-viewer/e2e/*.spec.ts |
| Full pipeline / CLI / providers | pytest E2E | tests/e2e/ |
Commands¶
make test-ui # Vitest unit tests (fast, no browser)
make test-ui-e2e # Playwright browser E2E (needs Firefox)
make test-ui runs npm run test:unit (Vitest) in web/gi-kg-viewer. Tests cover pure
TypeScript logic: artifact parsing, GI+KG merge, bridge-aware dedupe where implemented, metrics,
formatting, colors, visual groups,
and search-focus mapping. No browser or DOM required — runs in ~150 ms.
make test-ui-e2e runs npm install in web/gi-kg-viewer, installs the Firefox browser
for Playwright, and runs npm run test:e2e. Playwright’s webServer starts Vite on
127.0.0.1:5174 so it does not collide with npm run dev on 5173.
What CI proves vs full stack: That setup exercises the Vue UI in a real browser with
Vite; many specs mock fetch or rely on offline fixtures so the job stays fast.
It does not prove python -m podcast_scraper.cli serve (FastAPI + mounted dist/ +
live /api/* on corpus files). Use serve / make serve for manual smoke of the
combined server, and tests/integration/server/ (e.g. test_server_api.py) for pytest
coverage of a wired create_app and temp corpus.
For interactive debugging: cd web/gi-kg-viewer && npx playwright test --ui (see
viewer README).
CI¶
GitHub Actions jobs:
viewer-unit— runsnpm run test:unit(Vitest, fast).viewer-e2e— runsnpm run test:e2e(Playwright + Firefox) after the pytest E2E job that applies to the event (test-e2e-faston PRs,test-e2eon push tomain/release/*).coverage-unifiedwaits onviewer-e2eso the merge report runs only after browser E2E has passed.
Both viewer jobs are in .github/workflows/python-app.yml. Touching web/gi-kg-viewer/ in a
PR should include green runs for both (see
CONTRIBUTING.md).
GIL, KG, CIL, and semantic search validation¶
Use this table when you change GIL, KG, bridge.json, CIL HTTP, FAISS
indexing, or search response shape.
| Change area | Suggested checks |
|---|---|
GIL pipeline, gi.json, gi CLI |
make test-unit -k gi (or scoped paths), tests/e2e/test_gi_kg_cli_subprocess_e2e.py (gi validate smoke); see Testing Strategy — GIL Testing |
KG pipeline, kg CLI |
tests/unit/kg/, tests/e2e/test_gi_kg_cli_subprocess_e2e.py (kg validate / kg inspect) |
bridge.json builder |
tests/unit/builders/test_bridge_builder.py, tests/integration/test_bridge_integration.py |
| CIL query helpers / HTTP | tests/unit/podcast_scraper/server/test_cil_queries.py, tests/integration/server/test_cil_api.py |
| Transcript search + lift + offset math | tests/unit/podcast_scraper/search/test_transcript_chunk_lift.py, test_gil_chunk_offset_verify.py, tests/integration/server/test_viewer_search.py |
| Viewer merge / bridge TS | make test-ui; UX changes also make test-ui-e2e + E2E surface map |
Corpus-level gate (indexed run only): make verify-gil-offsets-strict runs
verify-gil-chunk-offsets with --strict and --min-overlap-rate 0.95 (override
GIL_OFFSET_VERIFY_DIR). Use after acceptance or nightly jobs that produce
search/metadata.json + GI quotes, not as a substitute for pytest on every PR without
such a tree. Rationale: GIL / KG / CIL cross-layer.
Single overview: GIL / KG / CIL cross-layer guide.
Nightly (.github/workflows/nightly.yml): nightly-viewer-unit and
nightly-viewer-e2e run the same Vitest / Playwright commands on every scheduled or
workflow_dispatch run (no path filters). Vitest sits in the post–lint+build segment with
nightly-test-unit; Playwright runs after nightly-test-e2e completes successfully.
Writing and extending tests¶
- Vitest specs:
web/gi-kg-viewer/src/utils/*.test.ts— co-located with source. Add a.test.tsfile next to any new utility. Config:vite.config.tstestblock. - Playwright specs:
web/gi-kg-viewer/e2e/*.spec.ts(e.g. offline graph, search mocks, dashboard, theme). - Playwright config:
web/gi-kg-viewer/playwright.config.ts(testDir: ./e2e, Desktop Firefox,baseURL/webServeron 5174). - Shared helpers:
e2e/fixtures.ts,e2e/helpers.ts.
More detail: E2E Testing Guide — Browser E2E (Playwright), web/gi-kg-viewer/README.md (section Browser E2E (M7)).
Fast Variants (Critical Path Only)¶
make test-fast # Unit + critical path integration + e2e
make test-integration-fast # Critical path integration tests
make test-e2e-fast # Critical path E2E tests
Sequential (For Debugging)¶
For debugging test failures, run tests sequentially using pytest directly:
# Run all tests sequentially
pytest tests/ -n 0
# Run unit tests sequentially
pytest tests/unit/ -n 0
# Run integration tests sequentially
pytest tests/integration/ -n 0
# Run E2E tests sequentially
pytest tests/e2e/ -n 0
Specific Tests¶
# Run specific test file
pytest tests/unit/podcast_scraper/test_config.py -v
# Run with marker
pytest tests/integration/ -m "integration and critical_path" -v
# Run with coverage
pytest tests/unit/ --cov=podcast_scraper --cov-report=term-missing
Test Markers¶
| Marker | Purpose |
|---|---|
@pytest.mark.unit |
Unit tests |
@pytest.mark.integration |
Integration tests |
@pytest.mark.e2e |
End-to-end tests |
@pytest.mark.critical_path |
Critical path tests (run in fast suite) |
@pytest.mark.nightly |
Nightly-only tests (excluded from regular CI) |
@pytest.mark.flaky |
May fail intermittently (gets reruns) |
@pytest.mark.serial |
Must run sequentially (rarely needed) |
@pytest.mark.ml_models |
Requires ML dependencies |
@pytest.mark.slow |
Slow-running tests |
@pytest.mark.network |
Hits external network |
@pytest.mark.llm |
Uses LLM APIs (excluded from nightly to avoid costs) |
@pytest.mark.openai |
Uses OpenAI specifically (subset of llm) |
Network Isolation¶
Tests use network isolation appropriate to their layer:
- Unit tests: Full socket blocking (
--disable-socket --allow-hosts=127.0.0.1,localhost) - Integration/E2E: Host allowlist only (
--allow-hosts=127.0.0.1,localhost)
Parallel Execution¶
Tests run in parallel by default using pytest-xdist:
All tests run in parallel by default with memory-aware worker calculation. The
@pytest.mark.serial marker is registered but currently unused.
The Makefile automatically calculates the optimal number of workers based on:
- Available system memory
- CPU core count
- Test type (unit/integration/e2e have different memory requirements)
- Platform (more conservative on macOS)
See TROUBLESHOOTING.md for details on memory-aware worker calculation.
Note: The
@pytest.mark.serialmarker is rarely needed now. Global state cleanup fixtures inconftest.pyreset shared state between tests, allowing most tests to run in parallel safely. Only useserialfor tests with genuine resource conflicts.
Memory Cleanup Best Practices¶
Automatic Cleanup:
The test suite includes an automatic cleanup fixture (cleanup_ml_resources_after_test) that:
- Limits PyTorch thread pools to prevent excessive thread spawning
- Cleans up the global preloaded ML provider
- Finds and cleans up ALL SummaryModel and provider instances created during tests (Issue #351)
- Forces garbage collection after integration/E2E tests
Explicit Cleanup (Recommended):
While automatic cleanup handles most cases, explicit cleanup is recommended for clarity and immediate memory release:
from tests.conftest import cleanup_model, cleanup_provider
def test_something():
# Create model directly
model = summarizer.SummaryModel(...)
try:
# test code
finally:
cleanup_model(model) # Explicit cleanup
def test_with_provider():
# Create provider directly
provider = create_summarization_provider(cfg)
try:
# test code
finally:
cleanup_provider(provider) # Explicit cleanup
Why Explicit Cleanup?
- Immediate memory release - Models are unloaded as soon as the test completes
- Clarity - Makes it obvious that cleanup is happening
- Defensive - Works even if automatic cleanup has issues
- Best practice - Matches the pattern of resource management (try/finally)
Helper Functions:
cleanup_model(model)- Unloads a SummaryModel instancecleanup_provider(provider)- Cleans up a provider instance (MLProvider, etc.)
Both functions are idempotent (safe to call multiple times) and handle None gracefully.
Warning: -s Flag and Parallel Execution¶
Do not use -s (no capture) with parallel tests — it causes hangs due to tqdm
progress bars competing for terminal access.
# DON'T DO THIS (hangs)
pytest tests/e2e/ -s -n auto
# DO THIS INSTEAD
pytest tests/e2e/ -v -n auto # Use -v for verbose output
pytest tests/e2e/ -s -n 0 # Or disable parallelism
make test-e2e-sequential # Or use sequential target
See Issue #176 for details.
Flaky Test Reruns¶
Integration and E2E tests use reruns:
pytest --reruns 3 --reruns-delay 1
Flaky Test Markers¶
Some tests are marked with @pytest.mark.flaky to indicate they may fail intermittently
due to inherent non-determinism. These tests get automatic reruns.
Why Some Tests Are Flaky¶
| Category | Tests | Root Cause |
|---|---|---|
| Whisper Transcription | 4 | ML inference variability - audio transcription has natural variation |
| Full Pipeline + Whisper | 2 | Whisper timing + audio file I/O |
| OpenAI Mock Integration | 2 | Mock response parsing timing |
| Full Pipeline + OpenAI | 7 | Complex multi-component timing |
Current Flaky Test Count: 15¶
| File | Count | Category |
|---|---|---|
test_basic_e2e.py |
7 | Full pipeline with OpenAI mocks |
test_whisper_e2e.py |
4 | Whisper inference variability |
test_full_pipeline_e2e.py |
2 | Whisper transcription |
test_openai_provider_e2e.py |
2 | OpenAI mock responses |
Tests That Are NOT Flaky¶
The following categories are now stable and don't need flaky markers:
- Episode selection (GitHub #521) -
tests/e2e/test_episode_selection_e2e.py(mock feedpodcast1_episode_selection, Path 1 transcripts; one test iscritical_pathfor fast E2E) - Transformers/spaCy model loading - Uses offline mode (
HF_HUB_OFFLINE=1) - ML model tests - Explicit
summary_reduce_modelprevents cache misses - HTTP integration tests (
tests/integration/rss/test_http_integration.py, markerintegration_http) — Localhttp.serveronly; autouse fixture resetsconfigure_http_policy/configure_downloaderand closes downloader sessions so urllib3 retry defaults do not stall 5xx tests. See Integration Testing Guide — Real HTTP client integration. - Parallel execution - Global state cleanup prevents race conditions
Reducing Flakiness¶
If you encounter a flaky test, check these common causes:
- Network access - Should be blocked via pytest-socket
- Model cache - Run
make preload-ml-modelsfirst - Global state - Ensure cleanup fixtures reset shared state
- Progress bars -
TQDM_DISABLE=1is set automatically in tests
See Issue #177 for investigation details.
E2E Test Modes¶
Set via E2E_TEST_MODE environment variable:
| Mode | Episodes | Use Case |
|---|---|---|
fast |
1 per test | Quick feedback |
multi_episode |
5 per test | Full validation |
nightly |
15 (p01–p05) | Nightly only |
ML Model Preloading¶
Tests require models to be pre-cached:
make preload-ml-models
See E2E Testing Guide for model defaults.
E2E Acceptance Tests¶
Acceptance tests allow you to run multiple configuration files sequentially, collect structured data (logs, outputs, timing, resource usage), and compare results against baselines. This is useful for:
- Running the same configs across different code versions to detect regressions
- Testing multiple configs with different RSS feeds or settings
- Validating system acceptance of different provider/model configurations
- Comparing performance metrics across runs
Setting up acceptance configs¶
Optional full-pipeline YAML presets may live under config/acceptance/ beside the tracked matrix. The repo tracks config/acceptance/README.md, config/acceptance/MAIN_ACCEPTANCE_CONFIG.yaml, and config/acceptance/fragments/*.yaml.
- Create the folder:
mkdir -p config/acceptance(at project root). - Copy example configs: Use
config/examples/config.example.yaml(or any example) as a template:cp config/examples/config.example.yaml config/acceptance/config.my.myshow.yaml(or a name that fits your feeds). - Adjust for your definition of acceptance: Edit the copied file(s)—RSS feed URLs, providers, model names, output paths, etc.—so they match what you consider “acceptance” for your use case. You can add multiple configs (e.g. one per show or per provider) and run them all with a pattern like
config/acceptance/*.yaml.
Multi-feed (GitHub #440): Use feeds: / rss_urls: in operator YAML or --feeds-spec with a feeds document (RFC-077 shape; see config/examples/feeds.spec.example.yaml). For manual CLI runs, combine --feeds-spec with --config and optional --profile (same as CLI.md — Quick Start). With USE_FIXTURES=1, the acceptance runner replaces each external feed URL with a distinct local E2E fixture feed so the run stays offline.
Append / resume (GitHub #444): Copy that preset, set append: true, and re-run (stable run_append_* per feed). Pytest coverage: tests/e2e/test_append_resume_e2e.py (two CLI invocations, stable run_append_*, index.json 1.1.0). See CONFIGURATION.md — Append / resume.
Episode selection (GitHub #521): Pytest E2E regression for --episode-order, --since / --until, --episode-offset, and config overrides lives in tests/e2e/test_episode_selection_e2e.py (mock server feed podcast1_episode_selection, fixture tests/fixtures/rss/p01_episode_selection.xml). One test is marked critical_path so it runs under make test-e2e-fast / the E2E slice of make test-ci-fast. Integration coverage includes tests/integration/workflow/test_workflow_stages_integration.py (prepare_episodes_from_feed) and tests/integration/infrastructure/test_e2e_server.py (TestE2EEpisodeSelectionFeed). See CONFIGURATION.md — Episode selection and E2E Testing Guide — E2E Feeds (RSS).
Corpus resolution + CLI (post–#505 / inspect hardening): Unit tests include tests/unit/podcast_scraper/utils/test_corpus_episode_paths.py (YAML metadata, rglob fallback, parent search hint), tests/unit/podcast_scraper/utils/test_corpus_lock.py, TestKgSubcommandMultiFeed and TestGiSubcommand multi-feed gi inspect / kg inspect paths in tests/unit/podcast_scraper/test_cli.py, and viewer web/gi-kg-viewer/src/stores/shell.hints.test.ts for GET /api/artifacts hints. Playwright web/gi-kg-viewer/e2e/corpus-hints.spec.ts mirrors the hint banner (requires npx playwright install firefox locally / in CI).
Full fast matrix with fixtures (smoke all acceptance presets offline): make test-acceptance-fixtures-fast materializes each enabled row from config/acceptance/MAIN_ACCEPTANCE_CONFIG.yaml into sessions/session_*/materialized/{id}.yaml and runs those configs. Uses USE_FIXTURES=1, disables auto analyze/benchmark, 900s subprocess TIMEOUT, and 600s post-run wall budget (PER_RUN_WALL_SECONDS / ACCEPTANCE_PER_RUN_WALL_SECONDS; session.json records wall_clock_seconds per run and walltime_vs_baseline when COMPARE_BASELINE=…). Same target runs on main / release pushes in CI (test-acceptance-fixtures job in python-app.yml).
Optional: use config/playground/ for ad-hoc or one-off configs; run them with e.g. make test-acceptance CONFIGS="config/playground/config.my.*.yaml".
Running Acceptance Tests¶
# Run a single config file
make test-acceptance CONFIGS="config/examples/config.example.yaml"
# Run multiple configs (using glob patterns)
make test-acceptance CONFIGS="config/acceptance/*.yaml"
# Fast matrix from MAIN_ACCEPTANCE_CONFIG.yaml (materialized YAMLs per row) + fixtures
make test-acceptance FROM_FAST_STEMS=1 USE_FIXTURES=1
# Same as above with CI-style defaults (no auto analyze/benchmark; TIMEOUT=900; PER_RUN_WALL_SECONDS=600)
make test-acceptance-fixtures-fast
# Save current runs as a baseline for future comparison
make test-acceptance CONFIGS="config/examples/config.example.yaml" SAVE_AS_BASELINE=baseline_v1
# Compare against an existing baseline
make test-acceptance CONFIGS="config/examples/config.example.yaml" COMPARE_BASELINE=baseline_v1
# Use fixture feeds (mock data) instead of real RSS feeds
make test-acceptance CONFIGS="config/examples/config.example.yaml" USE_FIXTURES=1
# Disable real-time log streaming (only save to files)
make test-acceptance CONFIGS="config/examples/config.example.yaml" NO_SHOW_LOGS=1
# Disable automatic analysis and benchmark reports
make test-acceptance CONFIGS="config/examples/config.example.yaml" NO_AUTO_ANALYZE=1 NO_AUTO_BENCHMARK=1
Understanding Sessions vs Runs¶
Session = One execution of the acceptance test tool
- Triggered by a single command invocation
- Can process multiple config files sequentially
- Has a unique
session_id(timestamp-based) - Contains a summary of all runs in that session
Run = One execution of a single config file within a session
- Each config file you pass creates one run
- Has its own
run_id, timing, exit code, logs, and outputs - Runs execute sequentially within the session
Example:
If you run:
make test-acceptance CONFIGS="config1.yaml config2.yaml config3.yaml"
You get:
- 1 Session (with
session_id = 20260208_101601) - 3 Runs (one for each config file)
run_20260208_101601_123(config1.yaml)run_20260208_101601_456(config2.yaml)run_20260208_101601_789(config3.yaml)
Output Structure¶
Results are saved to .test_outputs/acceptance/ by default:
.test_outputs/acceptance/
├── sessions/
│ └── session_20260208_101601/ ← ONE SESSION
│ ├── session.json ← Summary of all runs
│ └── runs/
│ ├── run_20260208_101601_123/ ← RUN #1 (config1)
│ │ ├── config.original.yaml ← Original config for this run
│ │ ├── config.yaml ← Modified config used for execution
│ │ ├── run_data.json
│ │ ├── stdout.log
│ │ ├── stderr.log
│ │ └── ... (service outputs)
│ ├── run_20260208_101601_456/ ← RUN #2 (config2)
│ │ ├── config.original.yaml ← Original config for this run
│ │ ├── config.yaml
│ │ └── ...
│ └── run_20260208_101601_789/ ← RUN #3 (config3)
│ ├── config.original.yaml ← Original config for this run
│ ├── config.yaml
│ └── ...
└── baselines/
└── baseline_v1/ ← Saved baselines
├── baseline.json
└── run_20260208_101601_123/ ← Copied run data
Analyzing Results¶
Use the analysis script to generate reports:
# Basic analysis
make analyze-acceptance SESSION_ID=20260208_101601
# Comprehensive analysis with baseline comparison
make analyze-acceptance SESSION_ID=20260208_101601 MODE=comprehensive COMPARE_BASELINE=baseline_v1
# Or use the script directly
python scripts/acceptance/analyze_bulk_runs.py \
--session-id 20260208_101601 \
--output-dir .test_outputs/acceptance \
--mode comprehensive \
--compare-baseline baseline_v1
Performance Benchmarking¶
Generate performance benchmarking reports that group runs by provider/model configuration:
# Generate benchmark report
make benchmark-acceptance SESSION_ID=20260208_101601
# Generate benchmark report with baseline comparison
make benchmark-acceptance SESSION_ID=20260208_101601 COMPARE_BASELINE=baseline_v1
# Or use the script directly
python scripts/acceptance/generate_performance_benchmark.py \
--session-id 20260208_101601 \
--output-dir .test_outputs/acceptance \
--compare-baseline baseline_v1
The benchmark report includes:
- Summary table comparing all provider/model configurations
- Performance metrics per configuration (time per episode, throughput, memory)
- Detailed analysis for each configuration
- Performance comparison (fastest vs. slowest, memory usage)
- Baseline comparison (if
--compare-baselineis provided): - Performance changes vs. baseline (time, throughput, memory)
- Regression detection (20% slower, 100MB more memory)
- Improvement detection (10% faster, 50MB less memory)
- Detailed per-configuration comparison
Baseline Comparison Features:
- Compares provider/model configurations between current run and baseline
- Detects regressions (performance degradation)
- Detects improvements (performance gains)
- Shows percentage changes for all metrics
- Groups comparisons by provider/model (not just config name)
Reports are generated in both Markdown and JSON formats for easy review and programmatic analysis.
Test Organization¶
Unit tests mirror the source tree (find the test for any source file mechanically). Integration tests are organized by domain subsystem (providers, workflow, GI/KG, etc.). E2E tests are flat by user scenario.
tests/
├── unit/ # Unit tests (fast, isolated)
│ ├── conftest.py # Network/filesystem isolation
│ └── podcast_scraper/ # Per-module tests — mirrors src/ tree
├── integration/ # Integration tests — by domain subsystem
│ ├── conftest.py # Shared fixtures
│ ├── providers/ # Provider factories, protocols, per-provider
│ │ ├── llm/ # LLM provider integration
│ │ ├── ml/ # ML model loading, embedding, QA, NLI
│ │ └── ollama/ # Ollama model-specific tests
│ ├── workflow/ # Orchestration, stages, metadata, resume
│ ├── gi/ # GI/KG artifacts, evidence stack
│ ├── server/ # FastAPI app, viewer API
│ ├── search/ # FAISS indexing, corpus search
│ ├── rss/ # RSS parsing, HTTP fetching
│ ├── eval/ # Evaluation framework
│ ├── infrastructure/ # Fixture mapping, infra
│ ├── tools/ # CLI tools
│ └── test_*.py # Cross-cutting (filesystem, retry, cache)
├── e2e/ # E2E tests — by user scenario
│ ├── fixtures/ # E2E server, HTTP server
│ └── test_*.py # Complete workflow tests
└── conftest.py # Shared fixtures, ML cleanup
web/gi-kg-viewer/ # Browser UI E2E (Playwright — not pytest)
├── e2e/ # *.spec.ts
├── e2e/fixtures.ts # Shared test fixtures
├── playwright.config.ts # webServer (Vite :5174), Firefox
└── package.json # test:e2e and other frontend scripts
See Integration Testing Guide for the domain-based layout rationale and per-folder contents.
Coverage Thresholds¶
Per-tier thresholds enforced in CI (prevents regression):
| Tier | Threshold | Current |
|---|---|---|
| Unit | 70% | ~74% |
| Integration | 40% | ~42% |
| E2E | 40% | Full podcast_scraper tree in denominator; add pytest E2E until this gate passes |
| Combined | 70% | ~71%+ |
Pytest E2E coverage (full package, no subtree omit)¶
pytest E2E jobs use the same pyproject.toml [tool.coverage.run] settings as unit and
integration (COVERAGE_THRESHOLD_E2E in the Makefile; --cov-fail-under in CI). There is
no separate coverage-e2e.ini that removes server/, search/, gi/, or other subtrees from
the fraction. The reported percentage is: lines hit by tests/e2e/ divided by all
measurable lines under podcast_scraper.
Why the E2E percentage is lower than unit/integration: many modules (FastAPI app, FAISS indexer, large GI helpers, eval harness) are not exercised on every pytest E2E run. That is visible in the number: it is a signal to add pytest E2E (or subprocess workflows) for key capabilities, not a reason to shrink the denominator.
pytest E2E vs HTTP integration vs Playwright (short):
- pytest E2E — Python workflows from CLI / pipeline / E2E server client; this is the tier this row’s threshold applies to.
- HTTP integration — FastAPI and routes via TestClient (and similar); fast boundary tests; does not replace maintainer pytest E2E obligations for capabilities that must be proven from the real user entry point you care about.
- Playwright — Browser UI only; additive; not an excuse to lower pytest E2E coverage or to skip pytest E2E for Python-only surfaces.
Authoritative narrative: Testing Strategy — Pytest E2E vs HTTP integration vs browser E2E.
Note: Local make targets now run with coverage:
make test-unit # includes --cov
make test-integration # includes --cov
make test-e2e # includes --cov (E2E config)
Test Count Targets¶
- Unit tests: 200+
- Integration tests: 50+
- E2E tests: 100+
Layer-Specific Guides¶
For detailed implementation patterns:
- Unit Testing Guide - What to mock, isolation enforcement, test structure
- Integration Testing Guide - Mock vs real decisions, ML model usage
- E2E Testing Guide - E2E server, OpenAI mocking, network guard
What to Test¶
- Critical Path Testing Guide - What to test based on the critical path,
test prioritization, fast vs slow tests,
@pytest.mark.critical_pathmarker
Domain-Specific Testing¶
- Provider Implementation Guide - Provider testing across all tiers (unit, integration, E2E), E2E server mock endpoints, testing checklist
References¶
- Testing Strategy - Overall testing philosophy (incl. Playwright layer)
- ADR-066: Playwright for UI E2E - Why Playwright for viewer v2
- Critical Path Testing Guide - Prioritization
- CI/CD Documentation - GitHub Actions workflows
- Architecture - Testing Notes section