Development Guide¶
Maintenance Note: This document should be kept up-to-date as linting rules, Makefile targets, pre-commit hooks, CI/CD workflows, or development setup procedures evolve. When adding new checks, tools, workflows, or environment setup steps, update this document accordingly.
This guide provides detailed implementation instructions for developing the podcast scraper. For high-level architectural decisions and design principles, see Architecture.
Testing¶
For comprehensive testing information, see the dedicated testing documentation:
- Testing Strategy - Testing philosophy, test pyramid, decision criteria
- Testing Guide - Quick reference, test execution commands
- Experiment Guide — Complete guide: datasets, baselines, experiments, and evaluation
- Unit Testing Guide - Unit test mocking patterns and isolation
- Integration Testing Guide - Integration test guidelines
- E2E Testing Guide - E2E server, real ML models
- Critical Path Testing Guide - What to test, prioritization
Quick Reference¶
| Layer | Directory | Speed | Mocking |
|---|---|---|---|
| Unit | `tests/unit/` | < 100ms | All mocked |
| Integration | `tests/integration/` | < 5s | External mocked |
| E2E | `tests/e2e/` | < 60s | No mocks |
Running Tests¶
make check-unit-imports # Verify modules can import without ML dependencies
make deps-analyze # Analyze module dependencies (with report)
make deps-check # Check dependencies (exits on error)
make analyze-test-memory # Analyze test memory usage (default: test-unit)
make test-unit # Unit tests (parallel)
make test-integration # Integration tests (parallel, reruns)
make test-e2e # E2E tests (parallel, with reruns)
make test # All tests
make test-fast # Unit + critical path integration + critical path E2E
Fast Validation for Changed Files¶
When fixing a few files to stabilize a failing PR, use make validate-files to run only impacted tests. This is much faster than running the entire test suite.
Usage:
# Validate specific files (runs all test types by default)
make validate-files FILES="src/podcast_scraper/config.py src/podcast_scraper/workflow/orchestration.py"
# Unit tests only (fastest, < 1 minute typically)
make validate-files-unit FILES="src/podcast_scraper/config.py"
# Include integration/E2E tests
make validate-files FILES="..." TEST_TYPE=all
# Fast mode (critical_path tests only)
make validate-files-fast FILES="src/podcast_scraper/config.py"
What it does:
- Linting/formatting on changed files (black, isort, flake8, mypy)
- Discovery of impacted tests via module markers
- Execution of only those tests (unit/integration/e2e based on TEST_TYPE)
Performance:
- Unit tests only: < 1 minute for typical changes
- With integration: < 2 minutes
- Full suite (all types): < 5 minutes (still faster than `make ci-fast`, which takes 6-10 minutes)
How it works:
Tests are tagged with module markers (e.g., module_config, module_workflow) that map to source modules. When you specify changed files, the system:
- Maps files to modules (e.g., `config.py` → `module_config`)
- Finds tests tagged with those module markers
- Runs only those tests
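The file-to-marker mapping step can be sketched as follows. This is a simplified, hypothetical helper (not the project's actual implementation): real package-level modules such as `workflow/` map to a package marker like `module_workflow`, while this sketch only handles the simple file-stem case.

```python
from pathlib import Path


def file_to_marker(changed_file: str) -> str:
    """Map a changed source file to a test module marker.

    Simplified sketch: src/podcast_scraper/config.py -> module_config.
    """
    stem = Path(changed_file).stem
    return f"module_{stem}"


# pytest could then select the impacted tests with `-m module_config`.
marker = file_to_marker("src/podcast_scraper/config.py")
```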
Note: This is a development tool for fast iteration. For full validation before PR, still use make ci-fast or make ci.
ML Dependencies in Tests¶
Modules importing ML dependencies at module level will fail unit tests in CI.
Solutions:
- Mock before import (recommended):
from unittest.mock import MagicMock, patch
with patch.dict("sys.modules", {"spacy": MagicMock()}):
    from podcast_scraper import speaker_detection
- Use lazy imports: Import inside functions, not at module level
- Verify imports work without ML deps: Run `make check-unit-imports` before pushing
  - This verifies modules can be imported without ML dependencies installed
  - Runs automatically in CI before unit tests
  - Use when: adding new modules, refactoring imports, or debugging CI failures
- Run unit tests: Run `make test-unit` before pushing
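The lazy-import option can be sketched like this. It is illustrative only: `json` stands in for a heavy ML import such as `whisper`, and the cached-model pattern is an assumption, not the project's actual code.

```python
_model = None


def get_model():
    """Load the heavy dependency on first call, not at module import time.

    Because the import happens inside the function body, `import this_module`
    succeeds even when the ML extras are not installed.
    """
    global _model
    if _model is None:
        import json  # stand-in for e.g. `import whisper`
        _model = json  # real code might call whisper.load_model("tiny.en")
    return _model
```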
Module Dependency Analysis¶
Analyze module dependencies to detect architectural issues like circular imports and excessive coupling.
When to use:
- After refactoring modules or moving code between modules
- When adding new imports or dependencies
- Before major refactoring to understand current architecture
- When debugging circular import errors
- Before committing if you changed module structure
Usage:
make deps-analyze # Full analysis with JSON report (reports/deps-analysis.json)
make deps-check # Quick check (exits with error if issues found, CI-friendly)
What it checks:
- Circular imports: Detects cycles in the import graph (should be 0)
- Import thresholds: Flags modules with >15 imports (suggests refactoring)
- Import patterns: Analyzes import structure across all modules
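The import-threshold check can be approximated with the standard library `ast` module. This is a simplified sketch of the idea, not the project's actual analyzer:

```python
import ast


def count_top_level_imports(source: str) -> int:
    """Count top-level import statements in a module's source.

    A module whose count exceeds the threshold (15 in this project)
    would be flagged as a refactoring candidate.
    """
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.Import, ast.ImportFrom)) for node in tree.body
    )
```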
Output:
- Console output with issues and summary
- JSON report (with `--report` flag) saved to `reports/deps-analysis.json`
- Visual dependency graphs (generated separately via `make deps-graph`)
Runs automatically in CI: In nightly workflow (nightly-deps-analysis job) with 90-day artifact retention for tracking architecture changes over time.
See also: Module Dependency Analysis for detailed documentation.
Test Memory Analysis¶
Analyze memory usage during test execution to identify memory leaks, excessive resource usage, and optimization opportunities.
When to use:
- Debugging memory issues (tests crash with OOM errors, system becomes unresponsive)
- Optimizing test performance (finding optimal worker count, understanding resource usage)
- Investigating memory leaks (memory growth over time, system memory decreases after tests)
- Capacity planning (determining required RAM for CI, understanding resource needs)
- Before major changes (after adding ML model tests, changing parallelism settings)
Usage:
# Analyze default test target (test-unit)
make analyze-test-memory
# Analyze specific test target
make analyze-test-memory TARGET=test-unit
make analyze-test-memory TARGET=test-integration
make analyze-test-memory TARGET=test-e2e
# Analyze with limited workers (to test memory impact)
make analyze-test-memory TARGET=test-integration WORKERS=4
What it monitors:
- Peak memory usage: Maximum memory consumed during test execution
- Average memory usage: Average memory over test duration
- Worker processes: Number of parallel test workers spawned
- Memory growth: Detects potential memory leaks (memory increasing over time)
- System resources: CPU cores, total/available memory (before/after)
Output:
- Memory usage statistics (peak, average, worker count)
- Memory usage over time (sample points every 2 seconds)
- Recommendations (warnings if thresholds exceeded)
- System resource changes (before/after comparison)
Recommendations provided:
- Warns if peak memory > 80% of total RAM
- Warns if worker count > CPU cores
- Warns if peak memory > 8 GB
- Suggests optimal worker count (CPU cores - 2)
- Detects memory growth (potential leaks)
Dependencies: Requires psutil package (pip install psutil)
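The kind of sampling the analyzer performs can be illustrated with the standard library's `tracemalloc` (the real tool uses `psutil` for process-level statistics; this sketch only traces Python-level allocations):

```python
import tracemalloc

tracemalloc.start()
buffers = [bytes(1_000) for _ in range(1_000)]  # allocate roughly 1 MB
current, peak = tracemalloc.get_traced_memory()  # bytes now held / peak bytes
tracemalloc.stop()
```

A monitoring loop in a real analyzer would record such samples every few seconds and compare peak usage against thresholds like the 80%-of-RAM warning above.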
See also: Troubleshooting Guide for memory issue debugging.
Quality Evaluation¶
Evaluation is handled automatically by the experiment runner. When you run an experiment with --baseline and/or --reference flags, the system automatically computes metrics and comparisons.
When to use:
- After modifying cleaning logic in `preprocessing.py`
- When testing new summarization models or chunking strategies
- Before major releases to ensure no regression in output quality
Usage:
# Run experiment with automatic evaluation
make experiment-run \
CONFIG=data/eval/configs/my_config.yaml \
BASELINE=baseline_prod_authority_v1 \
REFERENCE=silver_gpt52_v1
For details, see the Experiment Guide (Step 4: Evaluate Results).
Environment Setup¶
Virtual Environment¶
Quick setup:
bash scripts/setup_venv.sh
source .venv/bin/activate
Note: The setup_venv.sh script automatically installs the package in editable mode
(pip install -e .), which is required for:
- Running CLI commands: `python3 -m podcast_scraper.cli`
- Importing the package in Python: `from podcast_scraper import ...`
- Running tests that import the package
Manual setup (if not using setup_venv.sh):
If you create a virtual environment manually, you must install the package:
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,ml,llm]" # Editable mode with dev, ML, and LLM extras
Optional — spaCy wheels on disk: Run make download-spacy-wheels, then make init; if
wheels/spacy/*.whl exists, the Makefile sets PIP_FIND_LINKS for you (see
Dependencies Guide — Optional local wheel cache).
Why editable mode (-e)?
- Changes to source code are immediately available without reinstalling
- Required for development workflow
- Allows `python3 -m podcast_scraper.cli` to work
Updating Virtual Environment Dependencies¶
⚠️ CRITICAL: Update venv when dependency ranges change
When pyproject.toml dependency version ranges are modified (e.g., black>=23.0.0,<27.0.0), you must update your local virtual environment to match what CI installs.
Why this matters:
- CI installs fresh dependencies each run, getting the latest version in the range
- Your local venv may have an older version installed when the range was smaller
- Pip doesn't auto-upgrade packages that still satisfy the constraint
- This causes version mismatches between local and CI
When to update:
- After modifying dependency version ranges in `pyproject.toml`
- After pulling changes that modify `pyproject.toml` dependency ranges
- When CI fails with formatting/linting errors but local passes
- When you see "File would be reformatted" in CI but not locally
How to update:
# Update all dev dependencies to latest in their ranges
pip install --upgrade -e .[dev]
# Or update specific tool (e.g., black)
pip install --upgrade "black>=23.0.0,<27.0.0"
# Verify version matches CI
python -m black --version # Should show latest in range (e.g., 26.1.0)
Common symptoms of stale venv:
- ✅ Local: `make format-check` passes
- ❌ CI: `make format-check` fails with "would reformat"
- ✅ Local: `make lint` passes
- ❌ CI: `make lint` fails with different errors
- Tool versions differ: `python -m black --version` shows an older version than CI logs
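To compare a locally installed tool's version against CI logs from Python, the standard library's `importlib.metadata` is enough (an illustrative helper, not a project utility):

```python
from importlib import metadata


def installed_version(package: str) -> str | None:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# e.g. installed_version("black") -> "26.1.0" if black is installed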
Prevention:
After modifying pyproject.toml dependency ranges, always run:
pip install --upgrade -e .[dev]
Environment Variables¶
Note: Setting up a .env file is optional but recommended, especially if you plan to
use OpenAI providers or want to customize logging, paths, or performance settings.
- Copy the example `.env` file:
cp config/examples/.env.example .env
- Edit `.env` and add your settings:
# OpenAI API key (required for OpenAI providers)
OPENAI_API_KEY=sk-your-actual-key-here
# Logging
LOG_LEVEL=DEBUG
# Paths
OUTPUT_DIR=/data/transcripts
LOG_FILE=/var/log/podcast_scraper.log
CACHE_DIR=/cache/models
# Performance tuning
WORKERS=4
TRANSCRIPTION_PARALLELISM=3
PROCESSING_PARALLELISM=4
SUMMARY_BATCH_SIZE=2
SUMMARY_CHUNK_PARALLELISM=2
TIMEOUT=60
SUMMARY_DEVICE=cpu
- The `.env` file is automatically loaded via `python-dotenv` when the `podcast_scraper.config` module is imported.
Security notes:
- ✅ `.env` is in `.gitignore` (never committed)
- ✅ `config/examples/.env.example` is safe to commit (template only)
- ✅ API keys are never logged or exposed
- ✅ Environment variables take precedence over the `.env` file
- ✅ HuggingFace model loading uses `trust_remote_code=False`; only enable `trust_remote_code=True` if a model's documentation explicitly requires it and the source is trusted (Issue #429)
Priority order (for each configuration field):
- Config file field (highest priority): used if the field is set in the config file and not null/empty
- Environment variable: only used if the config file field is null, not set, or empty
- Default value: used if neither config file nor environment variable is set
Exception: LOG_LEVEL environment variable takes precedence over config file (allows easy runtime log level control).
Note: You can define the same field in both the config file and as an environment variable. The config file value will be used if it's set. This allows config files for project defaults and environment variables for deployment-specific overrides.
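The precedence rules can be sketched as a small resolution function. This is a hypothetical helper for illustration; the real logic lives in `podcast_scraper.config`:

```python
import os


def resolve(name: str, config_value, default=None):
    """Resolve one field: config file > environment variable > default.

    Exception: for LOG_LEVEL the environment variable wins, so the log
    level can be changed at runtime without editing the config file.
    """
    env_value = os.environ.get(name)
    if name == "LOG_LEVEL" and env_value:
        return env_value  # documented exception: env var overrides config
    if config_value not in (None, ""):
        return config_value  # config file field wins when set
    if env_value is not None:
        return env_value
    return default
```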
See also:
- `docs/api/CONFIGURATION.md` - Configuration API reference (includes environment variables)
- `docs/rfc/RFC-013-openai-provider-implementation.md` - API key management details
- `docs/prd/PRD-006-openai-provider-integration.md` - OpenAI provider requirements
ML Model Cache Management¶
The project uses a local .cache/ directory for ML models (Whisper, HuggingFace Transformers,
spaCy). This cache can grow large (several GB) with both dev/test and production models.
Preloading Models¶
To download and cache all required ML models:
# Preload test models (small, fast models for local dev/testing)
make preload-ml-models
# Preload production models (large, quality models)
make preload-ml-models-production
Cache locations:
- Whisper: `.cache/whisper/` (e.g., `tiny.en.pt`, `base.en.pt`)
- HuggingFace: `.cache/huggingface/hub/` (e.g., `facebook/bart-base`, `allenai/led-base-16384`)
- spaCy: `.cache/spacy/` (if using local cache)
See also: .cache/README.md for detailed cache structure and usage.
Backup and Restore¶
The cache directory can be backed up and restored for easy management:
Backup:
# Create backup (saves to ~/podcast_scraper_cache_backups/)
make backup-cache
# Dry run to preview
make backup-cache-dry-run
# List existing backups
make backup-cache-list
# Clean up old backups (keep 5 most recent)
make backup-cache-cleanup
Restore:
# Interactive restore (lists backups, prompts for selection)
make restore-cache
# Restore specific backup
python scripts/cache/restore_cache.py --backup cache_backup_20250108-120000.tar.gz
# Force overwrite existing .cache
python scripts/cache/restore_cache.py --backup 20250108 --force
What gets backed up:
- All model files (Whisper, HuggingFace, spaCy)
- Cache directory structure
- Excludes: `.lock` files, `.incomplete` downloads, temporary files
See also:
- `scripts/cache/backup_cache.py` - Backup script documentation
- `scripts/cache/restore_cache.py` - Restore script documentation
- `.cache/README.md` - Cache directory documentation
Cleaning Cache¶
To remove cached models (useful for testing or freeing disk space):
# Clean all ML model caches (user cache locations)
make clean-cache
# Clean build artifacts and caches
make clean-all
Note: make clean-cache removes models from ~/.cache/ locations, not the project-local
.cache/ directory. To remove the project-local cache, manually delete .cache/ or use the
restore script to replace it.
Semantic corpus search (RFC-061)¶
Optional vector index under <output_dir>/search/ for meaning-based retrieval over
GIL, summaries, and transcripts. Enable with vector_search: true in config (YAML
keys mirror Config: vector_index_path, vector_embedding_model,
vector_chunk_size_tokens, vector_chunk_overlap_tokens). The pipeline runs
embed-and-index after finalize when enabled; you can also run podcast index /
podcast search and get semantic gi explore --topic when the index exists.
Full guide: Semantic Search Guide.
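A minimal config fragment enabling the feature might look like this (the key names come from the text above; the tuning values are illustrative assumptions, not verified defaults):

```yaml
vector_search: true                 # run embed-and-index after finalize
# Optional tuning (names mirror Config; values here are only examples):
vector_chunk_size_tokens: 512
vector_chunk_overlap_tokens: 64
```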
GI / KG browser viewer (local prototype)¶
Optional static pages for inspecting Grounded Insight (*.gi.json) and Knowledge
Graph (*.kg.json) artifacts in the browser — no backend
(GitHub issue #445).
Run from repository root:
make serve-gi-kg-viz
Open http://127.0.0.1:8765/ and pick a page from the hub:
- vis-network or Cytoscape.js — interactive graph (pan/zoom/drag), filters (node types; for GIL, hide ungrounded insights), Chart.js node-type bars, CLI cheatsheet.
- JSON only — metrics + chart + raw JSON without graph libraries (fewer CDN deps).
Typical workflow: Produce artifacts with the main pipeline (generate_gi / generate_kg)
or with gi export / kg export → in the viewer, use Load artifacts and select one or
more .gi.json / .kg.json files → adjust filters and layout controls as needed.
Deep link (same server): make serve-gi-kg-viz runs scripts/gi_kg_viz_server.py, which
exposes repo files. You can open e.g.
graph-vis.html?data=.test_outputs/.../metadata&layer=both&merged=1 to load every
*.gi.json / *.kg.json under that repo-relative directory automatically (see
web/gi-kg-viz/README.md).
Implementation details, offline/CDN notes:
web/gi-kg-viz/README.md.
See also: Semantic Search Guide · Grounded Insights
Guide · Knowledge Graph Guide ·
CLI API (gi / kg / search / index subcommands).
Markdown Linting¶
For detailed information about markdown linting, including automated fixing, table formatting solutions, pre-commit hooks, and CI/CD integration, see the Markdown Linting Guide.
Quick reference:
- Before committing: Run
make fix-mdto auto-fix common issues - Format on save: Prettier is configured to format markdown files automatically
- Pre-commit hook: Automatically checks markdown files before commits
- CI/CD: All markdown files are linted in CI - errors will fail the build
Lessons learned: See the Lessons Learned section in the Markdown Linting Guide for best practices from our large-scale cleanup effort (fixed ~1,016 errors across 91 files).
AI Coding Guidelines¶
This project includes comprehensive AI coding guidelines to ensure consistent code quality and workflow when using AI assistants.
Overview¶
Primary reference: .ai-coding-guidelines.md - This is the PRIMARY source of truth for all AI actions in this project.
Purpose:
- Provides project-specific context and patterns for AI assistants
- Ensures consistent code quality and workflow
- Prevents common mistakes (auto-committing, skipping CI, etc.)
Entry Points by AI Tool¶
Different AI assistants load guidelines from different locations:
| Tool | Entry Point | Auto-Loaded |
|---|---|---|
| Cursor | `.cursor/rules/ai-guidelines.mdc` | ✅ Yes (modern format) |
| Claude Desktop | `CLAUDE.md` (root directory) | ✅ Yes |
| GitHub Copilot | `.github/copilot-instructions.md` | ✅ Yes |
All entry points reference .ai-coding-guidelines.md as the primary source.
Critical Workflow Rules¶
🚨 BRANCH CREATION CHECKLIST - MANDATORY BEFORE CREATING ANY BRANCH:
CRITICAL: Always check for uncommitted changes before creating a new branch.
Step 1: Check Current State
git status
What to look for:
- ❌ If you see "Changes not staged for commit" → You have uncommitted changes
- ❌ If you see "Untracked files" → You have new files
- ✅ If you see "nothing to commit, working tree clean" → You're good to go!
Step 2: Handle Uncommitted Changes (if any)
Option A: Commit to Current Branch (if changes belong to current work)
git add .
git commit -m "your message"
Option B: Stash for Later (if you want to save but not commit)
git stash
# Later: git stash pop
Option C: Discard Changes (if not needed)
git checkout .
# Or for specific files:
git checkout -- path/to/file
Quick One-Liner Check:
git status --porcelain
If you see any output, handle it first!
What happens if you don't follow this:
- ❌ Uncommitted changes from previous work get included in your new branch
- ❌ Your commit will show more files than you actually changed
- ❌ PR will show confusing diffs with unrelated changes
- ❌ Harder to review and understand what actually changed
Example: Clean Branch Creation
# 1. Check status
$ git status
On branch main
Your branch is up to date with 'origin/main'.
nothing to commit, working tree clean ✅
# 2. Pull latest
$ git pull origin main
Already up to date.
# 3. Create branch
$ git checkout -b issue-117-output-organization
Switched to a new branch 'issue-117-output-organization'
# 4. Verify clean state
$ git status
On branch issue-117-output-organization
nothing to commit, working tree clean ✅
NEVER commit without:
- Showing the user what files changed (`git status`)
- Showing the user the actual changes (`git diff`)
- Getting explicit user approval
- User deciding commit message
NEVER push to PR without:
- Running `make ci` locally first (full validation)
- Ensuring `make ci` passes completely
- Fixing all failures before pushing
Note: Use make ci-fast for quick feedback during development, but always run
make ci before pushing to ensure full validation.
What's in .ai-coding-guidelines.md¶
Sections include:
- Git Workflow - Commit approval, PR workflow, branch naming
- Code Organization - Module boundaries, when to create new files
- Testing Requirements - Mocking patterns, test structure
- Documentation Standards - PRDs, RFCs, docstrings
- Common Patterns - Configuration, error handling, logging
- Decision Trees - When to create modules, PRDs, RFCs
- When to Ask - When AI should ask vs. act autonomously
For Developers¶
If you're using Cursor AI:
- The guidelines are automatically loaded (no setup needed)
- AI assistants will follow project patterns and workflows
- Guidelines ensure consistent code quality
- See also: `docs/guides/CURSOR_AI_BEST_PRACTICES_GUIDE.md` - Best practices for using Cursor AI effectively, including model selection, workflow optimization, prompt templates, and project-specific recommendations
If you're using other AI assistants:
- The guidelines are automatically loaded (no setup needed)
- AI assistants will follow project patterns and workflows
- Guidelines ensure consistent code quality
If you're not using an AI assistant:
- You don't need to read these files
- They're for AI tools, not human developers
- Human contributors should follow CONTRIBUTING.md
Maintenance¶
When to update .ai-coding-guidelines.md:
- New patterns or conventions are established
- Workflow changes (e.g., new CI checks)
- Architecture decisions that affect code organization
- New tools or processes are added
Keep entry points in sync:
- When updating `.ai-coding-guidelines.md`, ensure entry points (`CLAUDE.md`, `.github/copilot-instructions.md`, `.cursor/rules/ai-guidelines.mdc`) still reference it correctly
See: .ai-coding-guidelines.md for complete guidelines.
Code Style Guidelines¶
Formatting Tools¶
The project uses automated formatting and quality tools:
- Black: Code formatting (line length: 100 characters)
- isort: Import statement organization
- flake8: Linting and style enforcement
- mypy: Static type checking
- radon: Cyclomatic complexity analysis
- vulture: Dead code detection
- interrogate: Docstring coverage
- codespell: Spell checking
Apply formatting automatically:
make format
Run all quality checks:
make quality # complexity, deadcode, docstrings, spelling
Naming Conventions¶
Functions and Variables: Use snake_case with descriptive names.
# Good
def fetch_rss_feed(url: str) -> RssFeed:
    episode_count = len(feed.episodes)

# Bad
def fetchRSSFeed(url: str):  # camelCase
    x = len(feed.episodes)  # non-descriptive name
Classes: Use PascalCase with descriptive nouns.
# Good
class RssFeed:
pass
# Bad
class rss_feed: # snake_case
pass
Constants: Use UPPER_SNAKE_CASE.
DEFAULT_TIMEOUT = 20
MAX_RETRIES = 3
Private Members: Prefix with underscore.
class SummaryModel:
    def __init__(self):
        self._device = "cpu"  # Internal attribute

    def _load_model(self):  # Internal method
        pass
Type Hints¶
All functions should have type hints:
def sanitize_filename(filename: str, max_length: int = 255) -> str:
    """Sanitize filename for safe filesystem use."""
    pass
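As a concrete illustration, a minimal body for such a helper might look like this. It is a sketch under assumptions (the replacement character and the set of unsafe characters are choices made here), not the project's actual implementation:

```python
import re


def sanitize_filename(filename: str, max_length: int = 255) -> str:
    """Replace characters unsafe on common filesystems and cap the length."""
    # Replace path separators, reserved punctuation, and control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", filename)
    # Trim to the limit and strip trailing dots/spaces (invalid on Windows).
    return cleaned[:max_length].rstrip(" .") or "untitled"
```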
Docstrings¶
Use Google-style docstrings:
def run_pipeline(cfg: Config) -> None:
    """Run the complete podcast scraping pipeline.

    Args:
        cfg: Configuration object containing RSS URL and processing options.

    Raises:
        ValueError: If configuration is invalid.
        HTTPError: If RSS feed cannot be fetched.
    """
    pass
Import Order¶
Follow this order (enforced by isort):
- Standard library imports
- Third-party imports
- Local application imports
# Standard library
import os
import sys
from pathlib import Path
# Third-party
import requests
from pydantic import BaseModel
# Local
from podcast_scraper import config
from podcast_scraper.models import Episode
Every New Function Needs¶
✅ Unit test with mocks for external dependencies:
@patch("podcast_scraper.rss.downloader.requests.Session")
def test_fetch_url_with_retry(self, mock_session):
    """Test that fetch_url retries on network failure."""
    mock_session.get.side_effect = [
        requests.ConnectionError("Network error"),
        MockHTTPResponse(content="Success", status_code=200),
    ]
    result = fetch_url("https://example.com/feed.xml")
    self.assertEqual(result, "Success")
✅ Descriptive test names:
# Good
def test_sanitize_filename_removes_invalid_characters(self):
pass
def test_whisper_model_selection_prefers_en_variant_for_english(self):
pass
# Bad
def test_config(self):
pass
def test_whisper(self):
pass
Also consider:
- Integration test (marked `@pytest.mark.integration`)
- Documentation update (README, API docs, or relevant guide)
- Examples if user-facing
Mock External Dependencies¶
Always mock external dependencies in tests:
- HTTP requests: Mock the `requests` module (unit/integration tests); use the E2E server for E2E tests
- Whisper models:
  - Unit Tests: Mock `whisper.load_model()` and `whisper.transcribe()` (all dependencies mocked)
  - Integration Tests: Mock Whisper for speed (focus on component integration)
  - E2E Tests: Use real Whisper models (NO mocks - complete workflow validation)
- File I/O: Use `tempfile.TemporaryDirectory` for isolated tests
- spaCy models:
  - Unit Tests: Mock NER extraction (all dependencies mocked)
  - Integration Tests: Mock spaCy for speed (focus on component integration)
  - E2E Tests: Use real spaCy models (NO mocks - complete workflow validation)
- API providers: Mock API clients (unit/integration tests), use E2E server mock endpoints (E2E tests)
Provider Testing Patterns:
- Unit Tests: Mock all provider dependencies (API clients, ML models)
- Integration Tests: Use real provider implementations with mocked external services (HTTP APIs) and mocked ML models (Whisper, spaCy, Transformers)
- E2E Tests: Use real providers with E2E server mock endpoints (for API providers) or real implementations (for local providers). ML models are REAL - no mocks allowed.
import shutil
import tempfile
import unittest
from unittest.mock import Mock, patch

class TestEpisodeProcessor(unittest.TestCase):
    def setUp(self):
        self.temp_dir = tempfile.mkdtemp()

    def tearDown(self):
        shutil.rmtree(self.temp_dir, ignore_errors=True)

    @patch("podcast_scraper.providers.ml.whisper_utils.whisper")
    def test_transcription(self, mock_whisper):
        mock_whisper.load_model.return_value = Mock()
        mock_whisper.transcribe.return_value = {"text": "Test transcript"}
        # ... test code ...
For detailed mocking patterns, see:
- Unit Testing Guide - Unit test mocking patterns
- Integration Testing Guide - Integration test guidelines
- E2E Testing Guide - E2E testing with real ML
Network isolation: All tests use --disable-socket --allow-hosts=127.0.0.1,localhost.
E2E test modes (E2E_TEST_MODE env var):
- `fast`: 1 episode (quick)
- `multi_episode`: 5 episodes (full validation)
- `data_quality`: All mock data (nightly)
Flaky Test Reruns (integration/E2E only):
# Automatic in make targets, or manually:
pytest --reruns 2 --reruns-delay 1
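Conceptually, the rerun behavior is equivalent to this retry wrapper (an illustrative sketch only; in practice use the pytest flags above, which the `pytest-rerunfailures` plugin provides):

```python
import time


def with_reruns(reruns: int = 2, delay: float = 0.0):
    """Re-invoke a test function up to `reruns` extra times on failure."""
    def decorator(test_fn):
        def wrapper(*args, **kwargs):
            for attempt in range(reruns + 1):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError:
                    if attempt == reruns:
                        raise  # exhausted reruns: surface the failure
                    time.sleep(delay)  # mirrors --reruns-delay
        return wrapper
    return decorator
```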
When to Create PRD (Product Requirements Document)¶
Create a PRD for:
- New user-facing features
- Changes that affect user workflows
Template: docs/prd/PRD-XXX-feature-name.md
Examples:
- PRD-004: Metadata Generation
- PRD-005: Episode Summarization
When to Create RFC (Request for Comments)¶
Create an RFC for:
- Architectural changes
- Breaking API changes
- Design decisions that need discussion
- Technical implementation approaches
Template: docs/rfc/RFC-XXX-feature-name.md
Examples:
- RFC-010: Speaker Name Detection
- RFC-012: Episode Summarization
When to Skip PRD/RFC¶
You can proceed without PRD/RFC for:
- Bug fixes
- Small enhancements (< 100 lines of code)
- Internal refactoring that doesn't affect API
- Documentation-only updates
- Test improvements
Always Update¶
README if:
- CLI flags change
- New features are user-facing
- Installation requirements change
- Usage examples need updates
docs/architecture/ARCHITECTURE.md if:
- Module responsibilities change
- New modules are added
- Data flow changes
- Design decisions are made
docs/architecture/TESTING_STRATEGY.md if:
- Testing approach changes
- New test categories are added
- Test infrastructure is updated
API docs if:
- Public API changes (functions, classes, parameters)
- New public modules are added
- API contracts change
⚠️ Before Pushing Documentation Changes¶
Always check mkdocs.yml and verify all links when adding, moving, or deleting documentation files:
- [ ] New files added? → Add to `nav` configuration in `mkdocs.yml`
- [ ] Files moved? → Update path in `nav` configuration
- [ ] Files deleted? → Remove from `nav` configuration
- [ ] Links updated? → Use relative paths (e.g., `rfc/RFC-019.md` not `docs/rfc/RFC-019.md`)
- [ ] All links verified? → Check that all internal links point to existing files
- [ ] No broken links? → Run `make docs` to catch broken links before CI
- [ ] Test locally? → Run `make docs` to verify the build succeeds
Common issues:
- Missing files in `nav` → Build will warn about pages not in nav
- Broken links → Build will fail if links point to non-existent files
- Wrong path format → Use relative paths from the `docs/` directory
Why this matters:
- Broken links waste CI build time (~3-5 min per failed build)
- Fixing locally with `make docs` takes seconds vs. waiting for CI
- Prevents unnecessary CI failures and re-runs
Example: When adding a new RFC:
# mkdocs.yml
nav:
  - RFCs:
      - RFC-023 README Acceptance Tests: rfc/RFC-023-readme-acceptance-tests.md
CI/CD Integration¶
See also: CI/CD Documentation for complete CI/CD pipeline documentation with visualizations.
What Runs in CI¶
The GitHub Actions workflows use intelligent path-based filtering to run only when necessary. This means:
- Documentation-only changes: Only the docs workflow runs (~3-5 min)
- Python code changes: All workflows run for full validation (~15-20 min)
- README changes: Only the docs workflow runs (~3-5 min)
Python Application Workflow (4 parallel jobs) - Runs only when Python/config files change:
- Lint Job (2-3 min, no ML deps):
  - Black/isort formatting checks
  - Flake8 linting
  - Markdownlint for docs
  - Mypy type checking
  - Bandit + pip-audit security scanning
  - Code quality analysis (complexity, dead code, docstrings, spelling)
- Test Job (10-15 min, full ML stack):
  - Full pytest suite with coverage
  - Integration tests (mocked)
- Docs Job (3-5 min):
  - MkDocs build (strict mode)
  - API documentation generation
- Build Job (2-3 min):
  - Build source distribution
  - Build wheel distribution
Documentation Deployment (sequential) - Runs when docs or Python files change:
- Build MkDocs site
- Deploy to GitHub Pages (on push to main)
CodeQL Security (parallel language analysis) - Runs only when code/workflow files change:
- Python security scanning
- GitHub Actions security scanning
Path-Based CI Optimization¶
Workflows are configured to skip when irrelevant files change:
| Files Changed | Python App | Docs | CodeQL | Time Savings |
|---|---|---|---|---|
| Only `docs/` | ❌ Skip | ✅ Run | ❌ Skip | ~18 minutes |
| Only `.py` | ✅ Run | ✅ Run | ✅ Run | - |
| Only `README.md` | ❌ Skip | ✅ Run | ❌ Skip | ~18 minutes |
| `pyproject.toml` | ✅ Run | ❌ Skip | ❌ Skip | ~5 minutes |
| `Dockerfile` | ✅ Run | ❌ Skip | ❌ Skip | ~5 minutes |
This optimization provides fast feedback for documentation updates while maintaining full validation for code changes.
CI Failure Response¶
If CI fails on your PR:
- Check the CI logs to identify the failure
- Reproduce locally: Run `make ci` to see the same failure
- Fix the issue and test locally
- Push the fix - CI will re-run automatically
CI Command Differences:
- `make ci`: Full CI suite
  - Runs `test` (unit + integration + e2e tests, excludes slow/ml_models)
  - Full validation matching GitHub Actions
  - Use before commits/PRs
- `make ci-fast`: Fast CI checks
  - Runs `test-fast` (unit + critical path integration + critical path e2e, no coverage)
  - Skips `coverage-enforce` (the main difference from `make ci`)
  - Quick feedback during development
  - Use for rapid iteration, but always run `make ci` before pushing
- `make ci-clean`: Complete CI suite (clean start)
  - Runs `clean-all format-check lint lint-markdown type security preload-ml-models test docs build`
  - Starts fresh by removing build artifacts + ML caches, then runs the full validation pipeline
  - Use before releases or when you need full test coverage from a clean state
Common failures:
| Issue | Solution |
|---|---|
| Formatting issues | Run `make format` to auto-fix |
| Linting errors | Fix code style issues or run `make format` |
| Type errors | Add missing type hints |
| Test failures | Fix or update tests |
| Coverage drop | Add tests for new code |
| Markdown linting | Run `make fix-md` to auto-fix markdown issues |
Prevent failures with pre-commit hooks:
```shell
# Install once
make install-hooks

# Now linting failures are caught before commit!
```
Release checklist¶
Use this checklist before tagging a release (e.g. `v2.6.0`). Until `make pre-release` exists (see ADR-031), follow these steps manually.
1. Pre-flight¶
- Branch & tree: Work from `main` (or your release branch). Ensure a clean working tree: `git status --porcelain` should be empty, or only include files you intend to commit for the release.
- Version: Decide the release version using Semantic Versioning (see Releases index): major (breaking), minor (new features), patch (fixes).
2. Version bump¶
- `pyproject.toml`: Set `version = "X.Y.Z"` in the `[project]` section.
- `src/podcast_scraper/__init__.py`: Set `__version__ = "X.Y.Z"` so the package and CLI report the same version. Keep both in sync.
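The two version locations above are easy to let drift. A quick sanity check can be sketched as follows (a hypothetical helper, not a script that ships with the repo):

```python
# Hypothetical pre-tag check: confirm pyproject.toml and __init__.py
# report the same version string before creating the release tag.
import re

def versions_match(pyproject_text: str, init_text: str) -> bool:
    """Return True when both files declare the same version."""
    pyproject_v = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.M)
    init_v = re.search(r'^__version__\s*=\s*"([^"]+)"', init_text, re.M)
    return bool(pyproject_v and init_v and pyproject_v.group(1) == init_v.group(1))

# Example with inline file contents:
print(versions_match('[project]\nversion = "2.6.0"\n',
                     '__version__ = "2.6.0"\n'))  # → True
```

In practice you would read the two real files from disk instead of passing inline strings.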
3. Release docs prep¶
Why this matters: Architecture diagrams are not generated in CI. The docs site and all CI jobs use the committed `docs/architecture/diagrams/*.svg` files. If you release without updating them, the published docs will show outdated architecture, and subsequent PRs may fail checks until you run `make visualize` and commit the updated SVGs. Running `make release-docs-prep` before every release keeps diagrams in sync with the code you are releasing.
- Run `make release-docs-prep`. This:
  - Regenerates architecture diagrams (`docs/architecture/diagrams/*.svg`).
  - Creates a draft `docs/releases/RELEASE_vX.Y.Z.md` for the current version (from `pyproject.toml`) if it does not exist.
- Review and commit:
  - `git add docs/architecture/diagrams/*.svg docs/releases/RELEASE_*.md`
  - `git commit -m "docs: release docs prep (visualizations and release notes)"`
4. Release notes¶
- Edit `docs/releases/RELEASE_vX.Y.Z.md`: fill in Summary, Key Features, Upgrade Notes (if any), and the Full Changelog link (e.g. `https://github.com/chipi/podcast_scraper/compare/vPREVIOUS...vX.Y.Z`).
- Update `docs/releases/index.md`: add the new version to the table and update the "Latest Release" section (remove any "upcoming" wording once the version is tagged and published).
5. Quality and validation¶
Run all of the following in each release cycle so the codebase meets project standards before you release.
- Format & lint: `make format`, then `make lint` and `make type`. Fix any issues.
- Markdown: `make fix-md` (or `make lint-markdown`) so docs and markdown pass.
- Docs build: `make docs` (MkDocs build must succeed).
- Code hygiene: Run `make quality` (complexity, dead code, docstrings, spelling). Resolve or document any findings so that:
  - Complexity (radon) and maintainability index are acceptable or exceptions documented.
  - Docstring coverage meets the configured `fail-under` (see `[tool.interrogate]` in `pyproject.toml`).
  - Dead code (vulture) and spelling (codespell) findings are triaged (fixed or whitelisted/ignored).
  - Test coverage meets the combined threshold (see Issue #432 for background and targets).
- Tests: Run the full CI gate: `make ci` (format-check, lint, type, security, complexity, docstrings, spelling, tests, coverage-enforce, docs, build). For maximum confidence (e.g. a major release), run `make ci-clean`, or run `make test` then `make coverage-enforce`, `make docs`, `make build`.
- Diagrams (required for release): If diagrams are stale, run `make visualize` and commit `docs/architecture/diagrams/*.svg`. Before release, `make release-docs-prep` regenerates diagrams and drafts release notes; do not skip it or the published docs site will ship with outdated architecture.
- Build: Ensure `make build` succeeds (sdist/wheel in `.build/dist/` or `dist/`).
6. Commit and push¶
- Commit all release changes (version bumps, release notes, index, diagram updates) with a clear message, e.g. `chore: release vX.Y.Z`.
- Push the branch: `git push origin <branch>` (never push to `main` without a reviewed PR unless your workflow allows it).
7. Tag and GitHub release¶
- Create an annotated tag: `git tag -a vX.Y.Z -m "Release vX.Y.Z"` (use the same version as in `pyproject.toml` and `__init__.py`).
- Push the tag: `git push origin vX.Y.Z`.
- On GitHub: open Releases → Draft a new release, choose tag `vX.Y.Z`, paste the contents of `docs/releases/RELEASE_vX.Y.Z.md` as the release description, and publish.
8. Post-release (optional)¶
- If you use a "next dev" version, bump to it (e.g. `X.Y.(Z+1)` or `X.Y.Z-dev`) in `pyproject.toml` and `__init__.py` and commit so the next build is not stuck on the release version.
See also: ADR-031: Mandatory Pre-Release Validation, Architecture visualizations, Releases index.
Modularity¶
- Single Responsibility: Each module should have one clear purpose
- Loose Coupling: Modules should depend on abstractions, not concrete implementations
- High Cohesion: Related functionality should be grouped together
Configuration¶
All runtime options flow through the Config model:
```python
from podcast_scraper import Config

# Good - centralized configuration
cfg = Config(
    rss="https://example.com/feed.xml",
    output_dir="./output",
    transcribe_missing=True,
)
run_pipeline(cfg)

# Bad - scattered configuration
fetch_rss(url, timeout=30)
download_transcripts(episodes, workers=8)
transcribe_missing(jobs, model="base")
```
Adding new configuration options:
- Add to `Config` model in `config.py`
- Add CLI argument in `cli.py`
- Document in README options section
- Update config examples in `config/examples/`
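Step 1 can be sketched as follows, using a plain dataclass as a stand-in for the real `Config` model in `config.py`; `summary_max_chars` is a hypothetical option invented purely for illustration:

```python
# Hedged sketch: a dataclass stand-in for the project's Config model.
# "summary_max_chars" is a made-up new option, not a real project setting.
from dataclasses import dataclass

@dataclass
class Config:
    rss: str
    output_dir: str = "./output"
    transcribe_missing: bool = False
    summary_max_chars: int = 4000  # hypothetical new field with a sensible default

    def __post_init__(self) -> None:
        # Validate eagerly, mirroring the ValueError pattern shown below.
        if self.summary_max_chars < 1:
            raise ValueError(
                f"summary_max_chars must be >= 1, got: {self.summary_max_chars}"
            )

cfg = Config(rss="https://example.com/feed.xml", summary_max_chars=2000)
print(cfg.summary_max_chars)  # → 2000
```

Giving every new option a default keeps existing configs and CLI invocations working unchanged.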
Error Handling¶
Follow these patterns:
```python
# Recoverable errors - log warnings, continue
try:
    transcript = download_transcript(url)
except requests.RequestException as e:
    logger.warning(f"Failed to download transcript: {e}")
    return None

# Unrecoverable errors - raise specific exceptions
if not cfg.rss:
    raise ValueError("RSS URL is required")

# Validation errors - use ValueError with clear message
if cfg.workers < 1:
    raise ValueError(f"Workers must be >= 1, got: {cfg.workers}")

# Graceful degradation for optional features
try:
    import whisper
    WHISPER_AVAILABLE = True
except ImportError:
    WHISPER_AVAILABLE = False
    logger.warning("Whisper not available, transcription disabled")
```
CLI exit codes (Issue #429)¶
The main pipeline command uses the following exit code policy:
- 0 – Run completed. The pipeline ran to the end (config valid, no run-level exception). Some episodes may have failed; partial results and the run index still reflect failures.
- 1 – Run-level failure. Configuration error, missing dependency (e.g. `ffmpeg`), or an unhandled exception during the run.

So exit code 0 means "the run finished", not "every episode succeeded". Use the run index (`index.json`) or `run.json` to see per-episode status. The flags `--fail-fast` and `--max-failures` stop processing after the first or after N episode failures, but still exit 0 if the run completed without a run-level error.
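A consumer of this policy might check per-episode status along these lines; the `episodes` and `status` field names are assumptions about the run index schema, not confirmed by this guide:

```python
# Hedged sketch: exit code 0 only means the run finished, so inspect the
# run index for per-episode failures. Field names here are assumptions.
import json

raw = '{"episodes": [{"title": "Ep 1", "status": "ok"}, {"title": "Ep 2", "status": "failed"}]}'
index = json.loads(raw)
failed = [e["title"] for e in index["episodes"] if e["status"] != "ok"]
print(failed)  # → ['Ep 2']
```

In a real script you would load `index.json` from the output directory instead of an inline string.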
CLI subcommands and startup (Issue #429)¶
- Subcommands: The first argument can be `doctor` or `cache`. When you run `python -m podcast_scraper.cli doctor` (or `podcast-scraper doctor`), the rest of the arguments are passed to that subcommand. If you omit arguments, the CLI uses `sys.argv[1:]` so subcommands work when invoked from the shell.
- Startup validation: Before the main pipeline runs, the CLI checks the Python version (3.10+) and that `ffmpeg` is on PATH. These checks are skipped for `doctor` and `cache` so you can run doctor even if ffmpeg is missing.
- Doctor (`podcast-scraper doctor`): Runs environment checks (Python, ffmpeg, write permissions, model cache, ML imports). Use `--check-network` to test connectivity and `--check-models` to load the default Whisper and summarizer models once (slow). See Troubleshooting - Doctor command.
- Cache (`podcast-scraper cache --status` / `--clean`): Manages ML model caches. No Python/ffmpeg validation.
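The dispatch described above can be sketched roughly as follows; this is illustrative only, and the real argument handling lives in `cli.py`:

```python
# Minimal sketch of subcommand dispatch (names and structure assumed,
# not copied from cli.py).
import sys

def dispatch(argv=None):
    """Route to a subcommand or the main pipeline."""
    argv = sys.argv[1:] if argv is None else argv
    if argv and argv[0] in ("doctor", "cache"):
        # Subcommands skip the Python/ffmpeg startup validation and
        # receive the remaining arguments.
        return argv[0], argv[1:]
    # Main pipeline path: startup validation (Python 3.10+, ffmpeg) runs here.
    return "run", argv

print(dispatch(["doctor", "--check-network"]))  # → ('doctor', ['--check-network'])
```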
Log Level Guidelines¶
Use logger.info() for:
- High-level operations that users care about
- Important state changes and milestones
- User-facing progress updates
- Important results (e.g., "Summary generated", "saved transcript")
- Episode processing start/completion
- Major pipeline stages (e.g., "Starting Whisper transcription", "Processing summarization")
Use logger.debug() for:
- Detailed internal operations
- Model loading/unloading details
- Configuration details and parameter values
- Per-item processing details
- Technical implementation details
- Validation metrics and statistics
- Chunking, mapping, and reduction details
- File handle management and cleanup
- Fallback attempts and retries
Use logger.warning() for:
- Recoverable errors
- Degraded functionality
- Missing optional dependencies
- Non-critical failures
Use logger.error() for:
- Unrecoverable errors
- Critical failures
- Validation failures
Examples¶
```python
# Good - INFO for high-level operation
logger.info("Processing summarization for %d episodes in parallel", len(episodes))

# Good - DEBUG for detailed technical info
logger.debug("Pre-loading %d model instances for thread safety", max_workers)

# Good - INFO for important results
logger.info("Summary generated in %.1fs (length: %d chars)", elapsed, len(summary))

# Bad - INFO for technical details (should be DEBUG)
logger.info("Loading summarization model: %s on %s", model_name, device)
```
Module-Specific Guidelines:
- Workflow: INFO for episode counts, major stages; DEBUG for cleanup
- Summarization: INFO for generation start/completion; DEBUG for model loading
- Whisper: INFO for "transcribing with Whisper"; DEBUG for model loading
- Episode Processing: INFO for file saves; DEBUG for download details
- Speaker Detection: INFO for results; DEBUG for model download
Rationale¶
This approach ensures:
- Service/daemon logs remain focused and readable
- Production monitoring shows high-level progress without noise
- Debugging still has access to detailed information when needed
- Log file sizes stay manageable during long runs
When in doubt, prefer DEBUG over INFO - it's easier to promote a log level than to demote it.
Progress Reporting¶
Use the progress.py abstraction:
```python
from podcast_scraper.utils.progress import progress_context

# Good - uses progress abstraction
with progress_context(
    total=len(episodes),
    description="Downloading transcripts",
) as reporter:
    for episode in episodes:
        process_episode(episode)
        reporter.update(1)

# Bad - direct tqdm usage
from tqdm import tqdm

for episode in tqdm(episodes):
    process_episode(episode)
```
Lazy Loading Pattern¶
For optional dependencies:
```python
# At module level
_whisper = None

def load_whisper():
    """Lazy load the Whisper library."""
    global _whisper
    if _whisper is None:
        try:
            import whisper
            _whisper = whisper
        except ImportError:
            raise ImportError(
                "Whisper not installed. "
                "Install with: pip install openai-whisper"
            )
    return _whisper
```
Module Responsibilities¶
- `cli.py`: CLI only, no business logic
- `service.py`: Service API, structured results for daemon use
- `workflow.orchestration`: Orchestration only, no HTTP/IO details
- `config.py`: Configuration models and validation
- `rss.downloader`: HTTP operations only
- `utils.filesystem`: File system utilities only
- `rss.parser`: RSS parsing, episode creation
- `workflow.episode_processor`: Episode-level processing logic
- `providers.ml.whisper_utils`: Whisper transcription utilities
- `providers.ml.speaker_detection`: NER-based speaker extraction
- `providers.ml.summarizer`: Transcript summarization
- `workflow.metadata_generation`: Metadata document generation
- `utils.progress`: Progress reporting abstraction
- `models/`: Shared data models (RssFeed, Episode, TranscriptionJob in `entities.py`)
Keep concerns separated - don't mix HTTP calls in CLI, don't put business logic in config, etc.
When to Create New Files¶
Create new modules when:
- Implementing a new major feature (e.g., new provider implementation)
- A module has distinct responsibility following Single Responsibility Principle
- An existing module exceeds ~1000 lines and can be logically split
Modify existing files when:
- Fixing bugs
- Enhancing existing functionality
- Refactoring within the same module
Provider Implementation Patterns¶
The project uses a protocol-based provider system for transcription, speaker detection, and summarization. When implementing new providers:
- Understand the Protocol: Read the protocol definition in `{capability}/base.py`
- Implement Provider Class: Create `{capability}/{provider}_provider.py`
- Register in Factory: Update `{capability}/factory.py` to include the new provider
- Add Configuration: Update `config.py` to support provider selection
- Add CLI Support: Update `cli.py` with provider arguments (if needed)
- Add E2E Server Mocking: For API providers, add mock endpoints
- Write Tests: Create unit, integration, and E2E tests
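The first three steps can be sketched as follows; all names here are illustrative, not the project's actual API:

```python
# Hedged sketch of a protocol-based provider system: a capability
# protocol, a toy provider, and a factory lookup. Real protocols live in
# {capability}/base.py and may differ.
from typing import Protocol

class Summarizer(Protocol):
    def summarize(self, text: str) -> str: ...

class EchoSummarizer:
    """Toy provider used only for illustration."""
    def summarize(self, text: str) -> str:
        return text[:40]

# The factory maps a config-selected name to a provider class.
_PROVIDERS = {"echo": EchoSummarizer}

def create_summarizer(name: str) -> Summarizer:
    try:
        return _PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"Unknown summarizer provider: {name}") from None

print(create_summarizer("echo").summarize("Hello world"))  # → Hello world
```

Because callers depend only on the protocol, a new provider needs no changes to orchestration code, just a factory entry and configuration support.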
For complete implementation guide, see Provider Implementation Guide.
Third-Party Dependencies¶
For detailed information about third-party dependencies, see the Dependencies Guide.
Summarization Implementation¶
For detailed information about the summarization system, see the ML Provider Reference.