Quick Reference¶
One-page cheat sheet for common podcast_scraper commands.
Setup¶
# Initial setup
git clone https://github.com/chipi/podcast_scraper.git && cd podcast_scraper
python3 -m venv .venv && source .venv/bin/activate
make init # Install all dependencies (auto-uses wheels/spacy if *.whl present)
# Slow ML downloads once: make download-spacy-wheels → then make init picks it up
# ML model preloading (for tests)
make preload-ml-models # Download Whisper, spaCy, transformers models
# Cache management
make backup-cache # Backup .cache directory (saves to ~/podcast_scraper_cache_backups/)
make restore-cache # Restore cache from backup (interactive)
make backup-cache-list # List available backups
Polyglot repo: Python and Makefile at the root; GI/KG viewer UI under web/gi-kg-viewer/
(npm). See Polyglot repository guide for env file locations and viewer
commands.
Twelve-factor style config: Prefer environment variables for secrets and deploy-specific values; use YAML/JSON for shared defaults. See Configuration API — Twelve-factor alignment.
Pipeline run (profile + operator + feeds)¶
python -m podcast_scraper.cli \
--profile cloud_balanced \
--config config/manual/operator_defaults.yaml \
--feeds-spec config/manual/feeds.spec.registry_10.yaml
Details: README.md, CLI.md — Quick Start, CONFIGURATION.md — Multi-feed compose.
Diagnostic Commands (Issue #379)¶
# Run diagnostic checks
python -m podcast_scraper.cli doctor
# Include network connectivity check
python -m podcast_scraper.cli doctor --verbose
# LLM pricing YAML: show path, metadata, staleness (see docs/api/CLI.md)
python -m podcast_scraper.cli pricing-assumptions
make check-pricing-assumptions
Daily Development¶
# Before coding
source .venv/bin/activate
# After making changes
make format # Auto-format code (black + isort)
make lint # Check style (flake8)
make type # Type check (mypy)
make quality # Code quality (complexity, docstrings, dead code, spelling)
# Before committing
make ci-fast # Fast CI gate (~6-10 min, default before push)
make ci # Full CI suite (+ Playwright, coverage enforce)
Testing¶
| Command | What it does | Time |
|---|---|---|
make test-unit |
Unit tests (parallel, network blocked) | ~30s |
make test-integration |
Integration tests (serial) | ~2min |
make test-e2e |
E2E tests (serial, real ML) | ~5min |
make test-fast |
Critical path only | ~1min |
make test |
All tests | ~8min |
make test-nightly |
Nightly tests (production models) | ~4hrs |
make test-ui |
Vitest unit tests for web/gi-kg-viewer TS utils (no browser) |
~1s |
make test-ui-e2e |
Playwright E2E for web/gi-kg-viewer (Firefox; installs browsers) |
~1–3 min |
make test-acceptance-fixtures-fast |
Full-pipeline acceptance fast matrix + E2E fixtures (materializes MAIN_ACCEPTANCE_CONFIG.yaml rows) |
slow (CI-scale) |
Full-pipeline acceptance presets: make test-acceptance CONFIGS="…" or FROM_FAST_STEMS=1 USE_FIXTURES=1 — see Testing Guide — E2E Acceptance Tests and scripts/acceptance/README.md.
# Run specific test file
pytest tests/unit/test_config.py -v
# Run specific test
pytest tests/unit/test_config.py::test_config_validation -v
# Run with output visible
pytest tests/ -v --no-header
# Debug failing test
pytest tests/path/to/test.py -x -v --tb=short
Documentation¶
make docs # Build docs site
make fix-md # Auto-fix markdown issues
make lint-markdown # Check markdown style
# Local preview
mkdocs serve # http://localhost:8000 (docs site; same default port as `podcast serve` — use `-a 127.0.0.1:8001` for one of them if both run)
Hosting and always-on VPS¶
| Doc | Use when |
|---|---|
| Hosting and infrastructure | Big picture: Tailscale, OpenTofu, GitHub Actions, Compose on prod, CI gates vs deploy |
| Prod runbook | Operator commands for prod |
| DR drill runbook | Drill-only GitHub workflows |
| WORKFLOWS | Workflow file names and triggers |
GI / KG Viewer (v2, RFC-062)¶
| Command | What it does |
|---|---|
make serve SERVE_OUTPUT_DIR=… |
FastAPI + Vite dev (API 8000, UI 5173) |
make serve-api SERVE_OUTPUT_DIR=… |
API only on 8000 |
make serve-ui |
Vite only on 5173 (proxies /api → 8000) |
make test-ui |
Vitest unit tests for TS utils (fast, no browser) |
make test-ui-e2e |
Playwright tests (Firefox; Vite on 5174 inside config) |
Prerequisites: pip install -e ".[dev]"; once per clone, cd web/gi-kg-viewer && npm install && npm run build to serve the built SPA from serve.
Docs:
Server Guide (/api/*, /docs)
· web/gi-kg-viewer/README.md
· Development Guide
Provider Selection¶
| Need | Guide |
|---|---|
| Compare providers (cost, quality, speed, privacy) | AI Provider Comparison |
| Per-provider specs, benchmarks, magic quadrant | Provider Deep Dives |
| Implement a new provider | Provider Implementation |
| Quick provider config | Provider Configuration |
| Ollama setup | Ollama Provider Guide |
Evaluation and Profiles¶
# Run an experiment against a baseline
make experiment-run \
CONFIG=data/eval/configs/my_config.yaml \
BASELINE=baseline_prod_authority_v1
# Promote a successful run to baseline
make run-promote RUN_ID=run_xxx \
--as baseline PROMOTED_ID=baseline_v2 \
REASON="New production baseline"
# Capture a performance profile
make profile-freeze VERSION=v2.6-openai \
PIPELINE_CONFIG=config/profiles/freeze/openai.yaml
# Compare two profiles
make profile-diff FROM=v2.6-wip-openai TO=v2.6-wip-gemini
Docs:
Experiment Guide
· Performance Profile Guide
· data/eval/README.md
· data/profiles/README.md
CLI Usage¶
# Basic transcript download
python3 -m podcast_scraper.cli https://example.com/feed.xml
# With Whisper transcription
python3 -m podcast_scraper.cli https://example.com/feed.xml \
--transcribe-missing --whisper-model base
# Full processing
python3 -m podcast_scraper.cli https://example.com/feed.xml \
--transcribe-missing \
--generate-metadata \
--generate-summaries \
--output-dir ./output
# From config file
python3 -m podcast_scraper.cli --config config.yaml
Git Workflow¶
# Create feature branch
git checkout -b feature/my-feature
# Make changes, then:
make format && make ci-fast # Format and verify (ci-fast is the default gate)
# Commit
git add <specific-files>
git commit -m "feat: add my feature"
# Push and create PR
git push -u origin feature/my-feature
Debugging¶
# Enable debug logging
export LOG_LEVEL=DEBUG
# Check test coverage
make test-unit
open htmlcov/index.html
# Check for linting issues
make lint 2>&1 | head -50
# Validate config file
python -c "import yaml; yaml.safe_load(open('config.yaml'))"
Grounded Insights¶
Grounded insights are key takeaways linked to verbatim quotes (evidence). When the GIL pipeline is enabled:
- Config:
generate_gi: true; optional evidence stack:embedding_model,extractive_qa_model,nli_model(and*_device). - Output: One
gi.jsonper episode (co-located with transcript/summary). - CLI:
gi inspect,gi show-insight,gi explore(see Grounded Insights Guide). - Browser viewer (v2):
python -m podcast_scraper.cli serve --output-dir …with builtweb/gi-kg-viewer/dist/— ormake servefor dev (Development Guide).
See the Grounded Insights Guide and GIL Ontology.
Key Files¶
| File | Purpose |
|---|---|
pyproject.toml |
Dependencies, pytest config |
Makefile |
Development commands |
.pre-commit-config.yaml |
Pre-commit hooks |
tests/conftest.py |
Shared test fixtures |
.markdownlint.json |
Markdown linting rules |
Key Directories¶
| Directory | Purpose |
|---|---|
src/podcast_scraper/ |
Main source code |
tests/unit/ |
Unit tests |
tests/integration/ |
Integration tests |
tests/e2e/ |
End-to-end tests |
docs/ |
Documentation |
config/examples/ |
Config examples |
Common Issues¶
| Issue | Fix |
|---|---|
| spaCy wheels re-download often | make download-spacy-wheels then export PIP_FIND_LINKS="$(pwd)/wheels/spacy" (see Dependencies Guide) |
| Tests skip "model not cached" | make preload-ml-models |
| Import errors | pip install -e ".[dev,ml,llm]" |
| Whisper fails | brew install ffmpeg |
make visualize / dependency graphs fail |
brew install graphviz (macOS) or apt install graphviz (Linux) |
| CI fails locally | make ci |
Docker¶
# Build variants
make docker-build-llm # LLM-only variant (~200MB)
make docker-build # ML-enabled variant (~1-3GB)
make docker-build-fast # ML variant, no model preloading
# Test both variants
make docker-test
# Run container
docker run -v ./config.yaml:/app/config.yaml \
-v ./output:/app/output \
podcast-scraper:latest
# Docker Compose (recommended end-to-end stack: viewer + API + on-demand pipeline)
make stack-test-build # build api + viewer + pipeline (ml) images
make stack-test-up # bring up the stack on http://127.0.0.1:8090
make stack-test-seed # seed the corpus volume (default ml variant)
make stack-test-down # tear down (STACK_TEST_DOWN_VOLUMES=1 also drops the corpus)
On main, GitHub Actions runs stack-test.yml against the same topology (ADR-085). The always-on VPS uses the same compose stack with prod overlays; see Hosting and infrastructure.
See: Docker Compose Guide (recommended), Docker Service Guide (single-container), Docker Variants Guide (image tiers)
Links¶
- Development Guide - Full development workflow
- Testing Guide - Detailed test information
- AI Provider Comparison - Provider decision guide
- Provider Deep Dives - Per-provider benchmarks
- Experiment Guide - Eval datasets and baselines
- Performance Profile Guide - Release timing snapshots
- Docker Service Guide - Docker usage and deployment
- Hosting and infrastructure - Always-on VPS, CI, Tailscale, OpenTofu narrative
- Docker Variants Guide - LLM-only vs ML-enabled
- CLI Reference - All CLI options
- Configuration - Config file options