Guides

Practical guides for using and developing Podcast Scraper.

Quick Start

Guide	Description
Installation Guide	Install paths (see README), first `--profile` + `--config` + `--feeds-spec` run
Quick Reference	Common commands cheat sheet
Troubleshooting	Common issues and solutions
Glossary	Key terms and concepts

Reference	Description
Configuration API	`Config`, env vars, YAML — Twelve-factor (config), Download resilience (retries, Issue #522, presets, CLI parity), `failure_summary` in `run.json`
CLI	Flags; Quick Start (`--profile`, `--config`, `--feeds-spec`); `--http-retry-total`, `--episode-retry-max`, etc.

Guide	Description
Development Guide	Development environment setup, workflow, and GI/KG viewer — `make serve` / `serve-api` / `serve-ui`, `make test-ui-e2e`
Release Playbook	Standing plan before a public tag: eval/profiles policy (major vs minor), docs gates, release notes pattern, alignment with `vX.Y.Z` tags
Prod operator cheat sheet	Deploy, health, incidents, rollback, credentials; `PODCAST_CORPUS_HOST_PATH` validation and manual topic clusters
VPS multi-app onboarding	Add other Docker Compose apps on the same Tailscale VPS without new IaC; isolation, GitOps, ports
Polyglot repository guide	Python root vs `web/gi-kg-viewer/`, env files, Makefile targets for the viewer
Server Guide	FastAPI: `/api/` (artifacts, CIL, search with optional `lifted`*, explore, Corpus Library, index rebuild), OpenAPI `/docs`, static SPA, tests under `tests/integration/server/`
Pipeline and Workflow Guide	Pipeline flow, module roles, quirks, run tracking
Git Worktree Guide	Git worktree-based development workflow
Dependencies Guide	Third-party dependencies and rationale
Markdown Linting	Markdown style and linting practices

Guide	Description
SRE book infra critique	Reliability rubric (SRE themes): SLIs/SLOs, error budget, toil, alerting, change risk, incidents — for reviewing runbooks, workflows, and ops design
Hosting and infrastructure	Narrative: Tailscale, OpenTofu, GitHub Actions, Compose on the VPS, how CI and prod align; ADR spine (079–085, 082, 093)
Stack contract	Cross-surface audit table, steady vs recovery playbooks (ADR-093)
Prod runbook	Always-on Hetzner VPS: bootstrap, deploy, backups, observability, DR
Prod operator cheat sheet	Short daily ops: `gh` deploy/backup, health curls, incident triage
DR drill runbook	Drill-only GitHub workflows, typed confirms, orchestrator vs piecemeal paths
Corpus snapshot manifest and restore	Single hub: local `make` vs GitHub Actions (prod, pre-prod, DR) for `snapshot.manifest.json` — RFC-084 / ADR-092

Guide	Description
Testing Strategy	Pyramid, pytest layers, and Playwright as additive browser UI E2E
Testing Guide	Commands, markers, and Browser E2E (`make test-ui-e2e`)
Unit Testing Guide	Unit test patterns and mocking
Integration Testing Guide	Integration test guidelines; FastAPI / CIL / bridge / lift
E2E Testing Guide	pytest E2E server/ML; Playwright for the viewer
Critical Path Testing Guide	Test prioritization

Guide	Description
AI Provider Comparison	Compare all 9 providers: cost, quality, speed, privacy
Provider Deep Dives	Per-provider reference cards, benchmarks, and magic quadrant
ML Model Comparison	Compare ML models: Whisper, spaCy, Transformers (BART/LED)
Provider Configuration	Quick provider configuration reference
Ollama Provider Guide	Ollama installation, setup, troubleshooting, and testing
Provider Implementation	Implementing new providers
ML Provider Reference	Technical reference for local ML models
Protocol Extension	Extending protocols

Guide	Description
GIL / KG / CIL cross-layer	RFC-072 map: `bridge.json`, CIL HTTP routes, semantic lift, offset verification, CLI/Make, and test entry points
RSS and feed ingestion	How RSS URLs become `RssFeed` and `Episode` objects: HTTP, caches, parsing, selection, multi-feed; entry point for future non-RSS ingestion docs
Semantic Search	RFC-061 corpus vector index; `GET /api/search`; chunk-to-Insight lift and `verify-gil-chunk-offsets` when GIL + index share transcript space
Grounded Insights	GIL: `gi.json`, quotes, schema, CLI; `bridge.json` sibling for canonical ids; optional browser viewer
Knowledge Graph	KG: `kg.json`, entities/topics/relationships; bridge aligns KG with GIL for APIs; same browser viewer
Preprocessing Profiles	Preprocessing profiles (`cleaning_v4`, `cleaning_hybrid_after_pattern`, …) for transcript cleaning and hybrid_ml MAP input (RFC-042 / Issue #419)
Docker Compose Guide	Recommended end-to-end stack (viewer + API + on-demand pipeline jobs); same compose shape on the prod VPS (see Hosting and infrastructure)
Docker Service Guide	Running podcast_scraper as a single-container service (supervisor / systemd / scheduler-driven)
Docker Variants Guide	LLM-only vs ML-enabled pipeline image tiers

Guide	Description
Chip Huyen ML / AI critique	ML/AI rubric (seven themes); experiments vs production lenses and optional short output tables — inspired by Designing Machine Learning Systems and AI Engineering
Experiment Guide	Datasets, baselines, experiments, promotion, metrics, and quality evaluation (RFC-041)
Evaluation Reports	Quality sweeps: ROUGE, embeddings, report library
Performance Guide	Performance considerations, optimization, and troubleshooting
Performance Profile Guide	Frozen release profiles: RSS, CPU%, wall time per stage (RFC-064)
Optimization Workflow	Data-driven process for investigating and solving performance/cost problems
Live Pipeline Monitor	Dev tooling: `--monitor`, RSS/CPU/stage dashboard or `.monitor.log`, `.pipeline_status.json`; optional `.[monitor]` memray + py-spy (RFC-065, #512)
Performance Reports	Published profile snapshots (tables, caveats)

Guide	Description
Cursor AI Best Practices	AI-assisted development
Agent-Browser Closed Loop	Browser loops: automated E2E (`make test-ui-e2e`) plus live co-development (Chrome DevTools MCP). For user-reported UI bugs, keep symmetry: re-validate in the same channel used to reproduce the bug (an MCP check is not replaced by pytest alone), then add tests as obligatory validation
Agent-Pipeline Feedback Loop	Python pipeline loops: CI diagnosis, acceptance testing, `--monitor` real-time feedback, `metrics.json` post-mortem
Documentation Agent Guide	Documentation workflows