# RFC-054: Flexible E2E Mock Response Strategy

- Status: Draft
- Authors:
- Stakeholders: Maintainers, developers writing E2E tests, CI/CD pipeline maintainers
- Related Issues:
  - #135: Make e2e LLM API mocks realistic and comprehensive
  - #399: Provider-level concerns to harden
  - #401: DevEx / Ops Improvements
- Related RFCs:
  - `docs/rfc/RFC-019-e2e-test-improvements.md` (E2E test infrastructure - foundation)
  - `docs/rfc/RFC-020-integration-test-improvements.md` (integration test improvements)
## Abstract
This RFC defines a flexible strategy for E2E mock responses that supports both normal functional testing and advanced error handling scenarios. The design separates non-functional concerns (retries, timeouts, rate limits) from functional responses (API response structures), allowing tests to mix and match different scenarios without duplication.
Key Principles:
- Default Normal Responses: Most tests use realistic but simple responses for happy-path testing
- Variety of Response Scenarios: Advanced tests can select specific response profiles (errors, edge cases, special behaviors)
- Centralized Non-Functional Logic: Retry behavior, timeouts, rate limits handled in one place
- Composable Design: Tests can combine different functional responses with different non-functional behaviors
- No Test Duplication: Common patterns are reusable, not repeated in every test
## Problem Statement

Current Issues:

- All Tests See the Same Responses
  - Every test encounters identical mock responses
  - Cannot test error scenarios, edge cases, or advanced features
  - Hard to test retry logic, rate limiting, or timeout handling
- No Separation of Concerns
  - Functional responses (API response structure) are mixed with non-functional behavior (retries, delays)
  - Hard to test "normal response with retry" vs "error response with retry"
  - Changes to retry logic require updating every mock
- Limited Scenario Coverage
  - Only happy-path responses are mocked
  - Cannot test 429 rate limits, 5xx errors, or connection failures
  - Cannot test provider-specific behaviors or edge cases
- Duplication and Maintenance Burden
  - Each test that needs a different scenario must duplicate mock setup
  - Common patterns (retry behavior, error responses) are repeated across tests
  - Hard to maintain consistency across the test suite
## Goals

- Normal Responses by Default
  - Most tests get realistic, simple responses automatically
  - No configuration needed for basic happy-path testing
  - Fast test execution with minimal setup
- Variety of Response Scenarios
  - Tests can select specific response profiles (errors, edge cases, special behaviors)
  - Support for all error types: 429, 5xx, connection failures, timeouts
  - Support for provider-specific behaviors and edge cases
- Centralized Non-Functional Logic
  - Retry behavior, timeouts, and rate limits configured in one place
  - Tests can override non-functional behavior without duplicating functional responses
  - Consistent retry logging and metrics across all tests
- Composable Design
  - Tests can mix functional responses with non-functional behaviors
  - Example: "normal summarization response" + "retry on first attempt" + "rate limit after 2 retries"
  - No need to create separate mocks for every combination
## Design

### Architecture Overview
```text
┌─────────────────────────────────────────────────────────────┐
│                        Test Request                         │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Mock Response Router                     │
│  - Determines response profile (normal/error/edge case)     │
│  - Applies non-functional behavior (retries/timeouts)       │
└──────────────────────┬──────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        ▼              ▼              ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Response   │ │   Response   │ │   Response   │
│   Profiles   │ │   Profiles   │ │   Profiles   │
│ (Functional) │ │ (Functional) │ │ (Functional) │
│              │ │              │ │              │
│ - Normal     │ │ - Error      │ │ - Edge Case  │
│ - Success    │ │ - 429        │ │ - Partial    │
│ - Standard   │ │ - 5xx        │ │ - Degraded   │
└──────────────┘ └──────────────┘ └──────────────┘
        │              │              │
        └──────────────┼──────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│               Non-Functional Behavior Layer                  │
│  - Retry logic (attempt tracking, exponential backoff)      │
│  - Timeout simulation (connect, read, overall)              │
│  - Rate limit handling (429 responses, retry-after headers) │
│  - Metrics tracking (retries, sleep time, tokens)           │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                        HTTP Response                        │
└─────────────────────────────────────────────────────────────┘
```
### Core Components

#### 1. Response Profiles (Functional)
Response profiles define the functional content of API responses (what data is returned). A minimal sketch of the profile classes follows the lists below.
Base Profile: `normal`

- Realistic API response structure matching provider documentation
- Standard success responses with proper fields
- Default for most tests (happy path)

Error Profiles:

- `rate_limit_429`: Returns 429 with a Retry-After header
- `server_error_500`: Returns 500 Internal Server Error
- `server_error_503`: Returns 503 Service Unavailable
- `connection_error`: Simulates a connection failure
- `timeout_error`: Simulates a timeout (no response)

Edge Case Profiles:

- `empty_response`: Empty content (tests graceful handling)
- `malformed_json`: Invalid JSON structure (tests parsing)
- `partial_response`: Incomplete data (tests degradation)
- `large_response`: Very large payload (tests memory/performance)

Provider-Specific Profiles:

- `openai_verbose_json`: OpenAI verbose JSON format with segments
- `gemini_multimodal`: Gemini multimodal response structure
- `anthropic_streaming`: Anthropic streaming response format
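
To make this concrete, a profile can be modeled as a small immutable record of status, headers, and body. The following is a minimal sketch; the `ResponseProfile` and `PROFILES` names and the payload shapes are illustrative assumptions, not a finalized API:

```python
# Illustrative sketch of the profile package (names and payload
# shapes are assumptions): each profile is just status + headers + body.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResponseProfile:
    """Functional content of one mocked API response."""
    name: str
    status_code: int = 200
    headers: dict = field(default_factory=dict)
    body: dict = field(default_factory=dict)


PROFILES = {
    "normal": ResponseProfile(
        name="normal",
        body={"choices": [{"message": {"role": "assistant", "content": "..."}}]},
    ),
    "rate_limit_429": ResponseProfile(
        name="rate_limit_429",
        status_code=429,
        headers={"Retry-After": "1"},
        body={"error": {"type": "rate_limit_exceeded", "message": "Rate limit reached"}},
    ),
    "server_error_500": ResponseProfile(
        name="server_error_500",
        status_code=500,
        body={"error": {"type": "server_error", "message": "Internal server error"}},
    ),
}
```

Whatever serves the profile (router or HTTP server) can then serialize `body` as JSON with the profile's status code and headers, regardless of which scenario a test selected.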
#### 2. Non-Functional Behavior (Centralized)
Non-functional behavior controls how responses are delivered (retries, delays, timeouts); a sketch of the behavior classes follows the lists below.
Retry Behavior:

- `no_retries`: Always succeeds on the first attempt
- `retry_once`: Fails the first attempt, succeeds on retry
- `retry_twice`: Fails the first two attempts, succeeds on the third
- `retry_exhausted`: Always fails (tests retry exhaustion)
- `exponential_backoff`: Configurable backoff delays

Timeout Behavior:

- `no_timeout`: Immediate response
- `connect_timeout`: Simulates a connection timeout
- `read_timeout`: Simulates a read timeout
- `overall_timeout`: Simulates an overall operation timeout

Rate Limit Behavior:

- `no_rate_limit`: No rate limiting
- `rate_limit_after_n`: Rate limit after N requests
- `rate_limit_retry_after`: Configurable Retry-After header

Metrics Tracking:

- Automatic tracking of retries, sleep time, tokens
- Consistent with production metrics format
- Supports validation in tests
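
As an illustration of how this layer could stay centralized, each named behavior can be a small stateful object that counts attempts and decides whether the next request fails. A minimal sketch, again with illustrative class names:

```python
# Illustrative sketch of mock_responses/behavior.py (names are
# assumptions): behaviors are small stateful objects that decide,
# per attempt, whether the next request fails.
from dataclasses import dataclass, field


@dataclass
class RetryBehavior:
    """Fail the first `failures` attempts on an endpoint, then succeed."""
    failures: int
    _attempts: dict = field(default_factory=dict)  # endpoint -> attempts seen

    def should_fail(self, endpoint: str) -> bool:
        attempt = self._attempts.get(endpoint, 0)
        self._attempts[endpoint] = attempt + 1
        return attempt < self.failures


@dataclass
class RateLimitAfterN:
    """Serve 429s once more than `n` requests have been seen."""
    n: int
    _requests: int = 0

    def should_rate_limit(self) -> bool:
        self._requests += 1
        return self._requests > self.n


# Factories for the named behaviors referenced throughout this RFC
RETRY_BEHAVIORS = {
    "no_retries": lambda: RetryBehavior(failures=0),
    "retry_once": lambda: RetryBehavior(failures=1),
    "retry_twice": lambda: RetryBehavior(failures=2),
    "retry_exhausted": lambda: RetryBehavior(failures=10**9),  # effectively always fails
}
```

Exposing factories rather than shared instances keeps attempt counters isolated between tests.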
#### 3. Response Router
The response router combines functional profiles with non-functional behavior; a sketch of its dispatch logic follows the configuration examples below.
Configuration Methods:

```python
# Test-level configuration (via fixture or decorator) - choose one profile
@pytest.mark.mock_profile("normal")          # default
@pytest.mark.mock_profile("rate_limit_429")
@pytest.mark.mock_profile("server_error_500")

# Non-functional behavior
@pytest.mark.mock_retry("retry_once")
@pytest.mark.mock_timeout("no_timeout")
@pytest.mark.mock_rate_limit("no_rate_limit")

# Composable: mix and match
@pytest.mark.mock_profile("normal")
@pytest.mark.mock_retry("retry_twice")
@pytest.mark.mock_rate_limit("rate_limit_after_2")
```
Programmatic Configuration:

```python
def test_with_custom_behavior(e2e_server):
    # Configure a specific endpoint
    e2e_server.set_response_profile(
        endpoint="/v1/chat/completions",
        profile="normal",
        retry="retry_once",
        rate_limit="no_rate_limit",
    )
    # Run test
    result = run_pipeline(...)
    assert result.success
```
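
Internally, the router only needs to look up the configured (profile, behavior) pair for an endpoint and pick error or success content per attempt. A minimal sketch, reusing the illustrative names from the profile and behavior sketches above:

```python
# Illustrative sketch of mock_responses/router.py: composes one
# functional profile with one retry behavior per endpoint.
class ResponseRouter:
    def __init__(self, profiles, retry_behaviors):
        self.profiles = profiles                # name -> ResponseProfile
        self.retry_behaviors = retry_behaviors  # name -> behavior factory
        self.config = {}                        # endpoint -> (profile name, behavior)

    def configure(self, endpoint, profile="normal", retry="no_retries"):
        # A fresh behavior instance per configuration keeps tests isolated
        self.config[endpoint] = (profile, self.retry_behaviors[retry]())

    def respond(self, endpoint):
        profile_name, behavior = self.config.get(
            endpoint, ("normal", self.retry_behaviors["no_retries"]())
        )
        profile = self.profiles[profile_name]
        if behavior.should_fail(endpoint):
            # While the failure budget lasts: serve the configured error
            # profile, or a generic retryable error for success profiles.
            return profile if profile.status_code >= 400 else self.profiles["server_error_500"]
        # Budget spent: serve success content (fall back to "normal"
        # when an error profile was configured).
        return profile if profile.status_code < 400 else self.profiles["normal"]


# Example: first two attempts get 429, the third gets the normal response
router = ResponseRouter(PROFILES, RETRY_BEHAVIORS)
router.configure("/v1/chat/completions", profile="rate_limit_429", retry="retry_twice")
```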
### Implementation Structure

```text
tests/e2e/fixtures/
├── mock_responses/
│   ├── __init__.py
│   ├── profiles/                  # Response profile definitions
│   │   ├── __init__.py            # Base profile classes
│   │   ├── normal.py              # Normal/success responses
│   │   ├── errors.py              # Error responses (429, 5xx, etc.)
│   │   ├── edge_cases.py          # Edge case responses
│   │   └── provider_specific.py   # Provider-specific formats
│   ├── behavior.py                # Non-functional behavior (retries, timeouts)
│   └── router.py                  # Response router (combines profiles + behavior)
├── e2e_http_server.py             # Updated to use router
└── conftest.py                    # Test fixtures and markers
```
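
How the pytest markers reach the server is a `conftest.py` concern. A hypothetical sketch of that wiring (the `_mock_server` fixture and the `router.reset()` hook are assumptions, not part of the current codebase):

```python
# Hypothetical sketch of tests/e2e/fixtures/conftest.py: reads the
# mock_* markers off each test and applies them to the shared server.
import pytest


def pytest_configure(config):
    # Markers must be registered to avoid PytestUnknownMarkWarning
    config.addinivalue_line("markers", "mock_profile(name): select a response profile")
    config.addinivalue_line("markers", "mock_retry(name): select a retry behavior")


@pytest.fixture
def e2e_server(request, _mock_server):  # _mock_server: assumed session-scoped fixture
    profile = request.node.get_closest_marker("mock_profile")
    retry = request.node.get_closest_marker("mock_retry")
    _mock_server.router.configure(
        endpoint="*",  # assumed wildcard meaning "all endpoints"
        profile=profile.args[0] if profile else "normal",
        retry=retry.args[0] if retry else "no_retries",
    )
    yield _mock_server
    _mock_server.router.reset()  # assumed hook restoring defaults between tests
```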
## Example Usage

### Example 1: Normal Response (Default)

```python
def test_basic_summarization(e2e_server):
    # No configuration needed - uses the default "normal" profile
    result = run_pipeline(...)
    assert result.success
    assert result.summary is not None
```
### Example 2: Rate Limit with Retry

```python
@pytest.mark.mock_profile("rate_limit_429")
@pytest.mark.mock_retry("retry_twice")
def test_rate_limit_handling(e2e_server, caplog):
    # First two attempts get 429, third succeeds
    result = run_pipeline(...)
    assert result.success
    # Verify retry metrics were logged
    assert "provider_retry: provider=openai attempt=2" in caplog.text
```
### Example 3: Server Error with Graceful Degradation

```python
@pytest.mark.mock_profile("server_error_500")
@pytest.mark.mock_retry("retry_exhausted")
def test_graceful_degradation(e2e_server):
    # All retries fail, but the transcript should still be saved
    result = run_pipeline(...)
    assert result.transcript is not None
    assert result.summary is None  # Summarization failed
    assert result.status == "degraded"
```
### Example 4: Provider-Specific Format

```python
@pytest.mark.mock_profile("openai_verbose_json")
def test_openai_segments(e2e_server):
    # Tests the OpenAI-specific verbose JSON format with segments
    result = run_pipeline(...)
    assert result.segments is not None
    assert len(result.segments) > 0
```
### Example 5: Programmatic Configuration

```python
def test_dynamic_scenario(e2e_server):
    # Configure different behavior for different endpoints
    e2e_server.set_response_profile(
        endpoint="/v1/chat/completions",
        profile="normal",
        retry="retry_once",
    )
    e2e_server.set_response_profile(
        endpoint="/v1/audio/transcriptions",
        profile="normal",
        retry="no_retries",
    )
    result = run_pipeline(...)
    assert result.success
```
## Test Structure and Organization

### Current Test Organization

Existing Tests (Backward Compatible):

- All existing E2E tests continue to work unchanged
- They automatically use the default "normal" profile
- No changes needed to existing test code
- ~25-30 existing E2E test files continue as-is

Example existing test (unchanged):

```python
# tests/e2e/test_basic_e2e.py
def test_basic_transcript_download(e2e_server):
    # Uses the default "normal" profile automatically
    result = run_pipeline(...)
    assert result.success
```
### New Test Organization

New Test Files for Advanced Scenarios:

We'll add five new test files focused on error handling and advanced scenarios:

1. `test_provider_error_handling_e2e.py` (~10-15 tests)
   - Rate limit scenarios (429 errors)
   - Server error scenarios (5xx errors)
   - Connection failures
   - Timeout scenarios
2. `test_provider_retry_behavior_e2e.py` (~8-12 tests)
   - Retry on rate limits
   - Retry on transient errors
   - Retry exhaustion
   - Exponential backoff verification
3. `test_provider_edge_cases_e2e.py` (~6-10 tests)
   - Empty responses
   - Malformed JSON
   - Partial responses
   - Large responses
4. `test_provider_graceful_degradation_e2e.py` (~5-8 tests)
   - Summarization failure (transcript still saved)
   - Entity extraction failure (summary still produced)
   - Provider fallback scenarios
5. `test_provider_metrics_tracking_e2e.py` (~5-8 tests)
   - Token usage tracking
   - Cost calculation
   - Retry metrics
   - Rate limit sleep time

Total New Tests: ~34-53 tests across 5 files
### Test Structure Examples

#### Example 1: Error Handling Test File

```python
# tests/e2e/test_provider_error_handling_e2e.py
@pytest.mark.e2e
@pytest.mark.provider_errors
class TestProviderErrorHandling:
    """Test provider error handling scenarios."""

    @pytest.mark.mock_profile("rate_limit_429")
    @pytest.mark.mock_retry("retry_twice")
    def test_rate_limit_with_retry_succeeds(self, e2e_server, caplog):
        """Test that rate limit errors are retried and eventually succeed."""
        result = run_pipeline(...)
        assert result.success
        # Verify the retry was logged
        assert "provider_retry: provider=openai attempt=2" in caplog.text

    @pytest.mark.mock_profile("rate_limit_429")
    @pytest.mark.mock_retry("retry_exhausted")
    def test_rate_limit_retry_exhausted(self, e2e_server):
        """Test that exhausted retries fail gracefully."""
        result = run_pipeline(...)
        assert not result.success
        assert "rate limit" in result.error_message.lower()

    @pytest.mark.mock_profile("server_error_500")
    @pytest.mark.mock_retry("retry_once")
    def test_server_error_with_retry(self, e2e_server):
        """Test that 500 errors are retried."""
        result = run_pipeline(...)
        assert result.success  # Succeeds on retry

    @pytest.mark.mock_profile("server_error_503")
    @pytest.mark.mock_retry("retry_exhausted")
    def test_service_unavailable_handling(self, e2e_server):
        """Test 503 Service Unavailable handling."""
        result = run_pipeline(...)
        assert not result.success
        assert result.status == "degraded"

    @pytest.mark.mock_profile("connection_error")
    @pytest.mark.mock_retry("retry_twice")
    def test_connection_error_handling(self, e2e_server):
        """Test connection error handling."""
        result = run_pipeline(...)
        # May succeed on retry or fail depending on the scenario
        assert result.success or result.status == "degraded"
```
#### Example 2: Retry Behavior Test File

```python
# tests/e2e/test_provider_retry_behavior_e2e.py
@pytest.mark.e2e
@pytest.mark.provider_retries
class TestProviderRetryBehavior:
    """Test provider retry behavior and metrics."""

    @pytest.mark.mock_profile("normal")
    @pytest.mark.mock_retry("no_retries")
    def test_no_retries_needed(self, e2e_server):
        """Test normal flow with no retries."""
        result = run_pipeline(...)
        assert result.success
        # Verify no retry metrics
        assert result.metrics.retries == 0

    @pytest.mark.mock_profile("rate_limit_429")
    @pytest.mark.mock_retry("retry_once")
    def test_single_retry_succeeds(self, e2e_server):
        """Test that a single retry succeeds."""
        result = run_pipeline(...)
        assert result.success
        assert result.metrics.retries == 1
        assert result.metrics.rate_limit_sleep_sec > 0

    @pytest.mark.mock_profile("rate_limit_429")
    @pytest.mark.mock_retry("retry_twice")
    def test_multiple_retries_succeed(self, e2e_server):
        """Test that multiple retries eventually succeed."""
        result = run_pipeline(...)
        assert result.success
        assert result.metrics.retries == 2

    @pytest.mark.mock_profile("rate_limit_429")
    @pytest.mark.mock_retry("exponential_backoff")
    def test_exponential_backoff(self, e2e_server, caplog):
        """Test exponential backoff delays."""
        result = run_pipeline(...)
        assert result.success
        # Verify backoff delays increase
        assert result.metrics.rate_limit_sleep_sec > 0
        # Verify retry logs show increasing delays
        assert "sleep=1.0" in caplog.text
        assert "sleep=2.0" in caplog.text
```
#### Example 3: Graceful Degradation Test File

```python
# tests/e2e/test_provider_graceful_degradation_e2e.py
import os


@pytest.mark.e2e
@pytest.mark.graceful_degradation
class TestGracefulDegradation:
    """Test graceful degradation when components fail."""

    @pytest.mark.mock_profile("server_error_500")
    @pytest.mark.mock_retry("retry_exhausted")
    def test_summarization_failure_saves_transcript(self, e2e_server):
        """Test that the transcript is saved even if summarization fails."""
        result = run_pipeline(...)
        # Transcript should be saved
        assert result.transcript is not None
        assert os.path.exists(result.transcript_path)
        # Summary should be None
        assert result.summary is None
        # Status should indicate degradation
        assert result.status == "degraded"

    @pytest.mark.mock_profile("normal")
    @pytest.mark.mock_profile("server_error_500", endpoint="/v1/chat/completions")
    def test_partial_failure_handling(self, e2e_server):
        """Test handling when one provider fails but others succeed."""
        # Transcription succeeds, summarization fails
        result = run_pipeline(...)
        assert result.transcript is not None
        assert result.summary is None
        assert result.status == "partial"
```
### Test Execution Patterns

Default Execution (Most Tests):

```bash
# All existing tests + new normal-scenario tests
pytest tests/e2e/ -m e2e
# ~25-30 existing test files (unchanged)
# ~5 new test files with normal scenarios
# Total: ~30-35 test files, ~100-150 tests
```

Error Handling Tests (New):

```bash
# Run only error handling tests
pytest tests/e2e/ -m "e2e and provider_errors"
# ~10-15 tests in test_provider_error_handling_e2e.py
```

Retry Behavior Tests (New):

```bash
# Run only retry behavior tests
pytest tests/e2e/ -m "e2e and provider_retries"
# ~8-12 tests in test_provider_retry_behavior_e2e.py
```

All Advanced Scenarios:

```bash
# Run all advanced scenario tests
pytest tests/e2e/ -m "e2e and (provider_errors or provider_retries or edge_cases or graceful_degradation)"
# ~34-53 new tests across 5 files
```

CI/CD Execution:

```bash
# Fast tests (normal scenarios only)
pytest tests/e2e/ -m "e2e and not (provider_errors or provider_retries or edge_cases)"

# Slow tests (advanced scenarios)
pytest tests/e2e/ -m "e2e and (provider_errors or provider_retries or edge_cases)"
```
### Test Count Summary
| Category | Test Files | Tests | Notes |
|---|---|---|---|
| Existing Tests | ~25-30 | ~100-120 | Unchanged, use default "normal" profile |
| Error Handling | 1 | ~10-15 | New: rate limits, server errors, timeouts |
| Retry Behavior | 1 | ~8-12 | New: retry logic, backoff, metrics |
| Edge Cases | 1 | ~6-10 | New: empty, malformed, partial responses |
| Graceful Degradation | 1 | ~5-8 | New: partial failures, fallbacks |
| Metrics Tracking | 1 | ~5-8 | New: tokens, costs, retry metrics |
| Total | ~30-35 | ~134-173 | Mix of existing (unchanged) + new |
### Test Consumption Pattern

Pattern 1: Default (No Configuration)

```python
# ~90% of tests - just work with defaults
def test_basic_functionality(e2e_server):
    result = run_pipeline(...)
    assert result.success
```

Pattern 2: Decorator-Based (Simple Scenarios)

```python
# ~5-10% of tests - use decorators for specific scenarios
@pytest.mark.mock_profile("rate_limit_429")
@pytest.mark.mock_retry("retry_twice")
def test_rate_limit_handling(e2e_server):
    result = run_pipeline(...)
    assert result.success
```

Pattern 3: Programmatic (Complex Scenarios)

```python
# ~1-2% of tests - programmatic configuration for complex scenarios
def test_custom_scenario(e2e_server):
    e2e_server.set_response_profile(
        endpoint="/v1/chat/completions",
        profile="normal",
        retry="retry_once",
    )
    e2e_server.set_response_profile(
        endpoint="/v1/audio/transcriptions",
        profile="rate_limit_429",
        retry="retry_twice",
    )
    result = run_pipeline(...)
    assert result.success
```
## Implementation Plan

### Phase 1: Response Profile System

Goal: Create the base response profile system with normal responses.

- [ ] Create the `mock_responses/profiles/` package with base profile classes
- [ ] Implement the `normal` profile for all providers (OpenAI, Gemini, Anthropic, etc.)
- [ ] Update `e2e_http_server.py` to use the profile system
- [ ] Add default profile selection (`normal` for all endpoints)
- [ ] Update existing tests to use the profile system (backward compatible)

Deliverables:

- Response profile base classes
- Normal response implementations for all providers
- Integration with the existing e2e server
### Phase 2: Non-Functional Behavior Layer

Goal: Centralize retry, timeout, and rate limit behavior.

- [ ] Create `mock_responses/behavior.py` with behavior classes
- [ ] Implement retry behavior (`no_retries`, `retry_once`, `retry_twice`, etc.)
- [ ] Implement timeout behavior (`no_timeout`, `connect_timeout`, `read_timeout`)
- [ ] Implement rate limit behavior (`no_rate_limit`, `rate_limit_after_n`)
- [ ] Add metrics tracking (retries, sleep time, tokens)
- [ ] Integrate with the response router

Deliverables:

- Non-functional behavior classes
- Metrics tracking integration
- Retry logging (matches production format)
### Phase 3: Response Router

Goal: Combine profiles and behavior into a unified router.

- [ ] Create `mock_responses/router.py` with the router class
- [ ] Implement profile + behavior composition
- [ ] Add test-level configuration (pytest markers)
- [ ] Add programmatic configuration (`e2e_server` methods)
- [ ] Update `e2e_http_server` to use the router

Deliverables:

- Response router implementation
- Pytest markers for test configuration
- Programmatic configuration API
### Phase 4: Error Profiles

Goal: Add error response profiles for comprehensive error testing.

- [ ] Implement error profiles (429, 5xx, connection, timeout)
- [ ] Add proper HTTP headers (Retry-After, etc.)
- [ ] Add error response formats matching real APIs
- [ ] Create tests for error scenarios

Deliverables:

- Error response profiles
- Error scenario tests
- Documentation
### Phase 5: Edge Case Profiles

Goal: Add edge case profiles for advanced testing.

- [ ] Implement edge case profiles (empty, malformed, partial, large)
- [ ] Add provider-specific profiles (OpenAI verbose JSON, Gemini multimodal, etc.)
- [ ] Create tests for edge cases
- [ ] Document edge case scenarios

Deliverables:

- Edge case response profiles
- Provider-specific profiles
- Edge case tests
### Phase 6: Integration and Documentation

Goal: Complete integration and comprehensive documentation.

- [ ] Update all existing tests to use the new system (backward compatible)
- [ ] Create comprehensive test examples
- [ ] Document all profiles and behaviors
- [ ] Add a migration guide for existing tests
- [ ] Update issue #135 with completion status

Deliverables:

- Updated test suite
- Comprehensive documentation
- Migration guide
- Test examples
## Benefits
- Flexibility: Tests can easily select different response scenarios
- No Duplication: Common patterns centralized, not repeated
- Maintainability: Changes to retry logic in one place
- Composability: Mix and match functional and non-functional behaviors
- Realistic Testing: Error scenarios match real API behavior
- Comprehensive Coverage: Can test all error types and edge cases
- Backward Compatible: Existing tests continue to work (default to normal)
## Risks and Mitigations

- Risk: Complexity increase in test infrastructure
  - Mitigation: Clear separation of concerns, well-documented API, backward compatibility
- Risk: Test execution time increase
  - Mitigation: Default behavior is fast (normal responses, no retries); advanced scenarios are opt-in
- Risk: Maintenance burden of response profiles
  - Mitigation: Profiles match real API documentation and are centralized in one place
- Risk: Test flakiness from complex scenarios
  - Mitigation: Deterministic behavior, clear configuration, comprehensive documentation
## Alternatives Considered

### Alternative 1: Per-Test Mock Setup

- Pros: Simple, explicit
- Cons: Duplication, hard to maintain, no centralized logic

### Alternative 2: Configuration Files

- Pros: External configuration, easy to modify
- Cons: Less flexible, harder to configure programmatically

### Alternative 3: Separate Mock Servers

- Pros: Complete isolation
- Cons: More complex, harder to maintain, duplication
## Open Questions

1. Should response profiles be provider-agnostic or provider-specific?
   - Proposal: Provider-specific (matches real APIs), but with common base classes
2. How should provider-specific error formats be handled?
   - Proposal: Provider-specific error profiles that match real error formats
3. Should metrics tracking be optional or always enabled?
   - Proposal: Always enabled (matches production), but it can be validated or ignored in tests
4. How should streaming responses be handled?
   - Proposal: Separate streaming profiles, handled in Phase 5
## Related Work
- Issue #135: Make e2e LLM API mocks realistic and comprehensive
- Issue #399: Provider-level concerns to harden (retry policy, timeouts)
- Issue #401: DevEx / Ops Improvements (structured logs, graceful degradation)
- RFC-019: E2E Test Infrastructure and Coverage Improvements
## Acceptance Criteria
- [ ] Response profile system implemented with normal responses
- [ ] Non-functional behavior layer implemented (retries, timeouts, rate limits)
- [ ] Response router implemented (combines profiles + behavior)
- [ ] Error profiles implemented (429, 5xx, connection, timeout)
- [ ] Edge case profiles implemented (empty, malformed, partial, large)
- [ ] Provider-specific profiles implemented (OpenAI, Gemini, Anthropic, etc.)
- [ ] Pytest markers for test configuration
- [ ] Programmatic configuration API
- [ ] All existing tests updated (backward compatible)
- [ ] Comprehensive documentation
- [ ] Test examples for all scenarios
- [ ] Issue #135 updated with completion status
## Timeline
- Phase 1: 1-2 weeks (Response Profile System)
- Phase 2: 1-2 weeks (Non-Functional Behavior Layer)
- Phase 3: 1 week (Response Router)
- Phase 4: 1-2 weeks (Error Profiles)
- Phase 5: 1-2 weeks (Edge Case Profiles)
- Phase 6: 1 week (Integration and Documentation)
Total: 6-10 weeks
Status: This RFC is in draft status and open for feedback. Implementation should begin after review and approval.