Guides

Practical guides for using and developing Podcast Scraper.

Quick Start

| Guide | Description |
| --- | --- |
| Installation Guide | Install paths (see README), first --profile + --config + --feeds-spec run |
| Quick Reference | Common commands cheat sheet |
| Troubleshooting | Common issues and solutions |
| Glossary | Key terms and concepts |

Configuration

| Reference | Description |
| --- | --- |
| Configuration API | Config, env vars, YAML; Twelve-factor (config), Download resilience (retries, Issue #522, presets, CLI parity), failure_summary in run.json |
| CLI | Flags; Quick Start (--profile, --config, --feeds-spec); --http-retry-total, --episode-retry-max, etc. |

Development

| Guide | Description |
| --- | --- |
| Development Guide | Development environment setup, workflow, and GI/KG viewer: make serve / serve-api / serve-ui, make test-ui-e2e |
| Release Playbook | Standing plan before a public tag: eval/profiles policy (major vs minor), docs gates, release notes pattern, alignment with vX.Y.Z tags |
| Prod operator cheat sheet | Deploy, health, incidents, rollback, credentials; PODCAST_CORPUS_HOST_PATH validation and manual topic clusters |
| VPS multi-app onboarding | Add other Docker Compose apps on the same Tailscale VPS without new IaaC; isolation, GitOps, ports |
| Polyglot repository guide | Python root vs web/gi-kg-viewer/, env files, Makefile targets for the viewer |
| Server Guide | FastAPI: /api/* (artifacts, CIL, search with optional lifted, explore, Corpus Library, index rebuild), OpenAPI /docs, static SPA, tests under tests/integration/server/ |
| Pipeline and Workflow Guide | Pipeline flow, module roles, quirks, run tracking |
| Git Worktree Guide | Git worktree-based development workflow |
| Dependencies Guide | Third-party dependencies and rationale |
| Markdown Linting | Markdown style and linting practices |

Hosting and production

| Guide | Description |
| --- | --- |
| SRE book infra critique | Reliability rubric (SRE themes): SLIs/SLOs, error budget, toil, alerting, change risk, incidents; for reviewing runbooks, workflows, and ops design |
| Hosting and infrastructure | Narrative: Tailscale, OpenTofu, GitHub Actions, Compose on the VPS, how CI and prod align; ADR spine (079–085, 082, 093) |
| Stack contract | Cross-surface audit table, steady vs recovery playbooks (ADR-093) |
| Prod runbook | Always-on Hetzner VPS: bootstrap, deploy, backups, observability, DR |
| Prod operator cheat sheet | Short daily ops: gh deploy/backup, health curls, incident triage |
| DR drill runbook | Drill-only GitHub workflows, typed confirms, orchestrator vs piecemeal paths |
| Corpus snapshot manifest and restore | Single hub: local make vs GitHub Actions (prod, pre-prod, DR) for snapshot.manifest.json (RFC-084 / ADR-092) |

Testing

| Guide | Description |
| --- | --- |
| Testing Strategy | Pyramid, pytest layers, and Playwright as additive browser UI E2E |
| Testing Guide | Commands, markers, and Browser E2E (make test-ui-e2e) |
| Unit Testing Guide | Unit test patterns and mocking |
| Integration Testing Guide | Integration test guidelines; FastAPI / CIL / bridge / lift |
| E2E Testing Guide | pytest E2E server/ML; Playwright for the viewer |
| Critical Path Testing Guide | Test prioritization |

Provider System

| Guide | Description |
| --- | --- |
| AI Provider Comparison | Compare all 9 providers: cost, quality, speed, privacy |
| Provider Deep Dives | Per-provider reference cards, benchmarks, and magic quadrant |
| ML Model Comparison | Compare ML models: Whisper, spaCy, Transformers (BART/LED) |
| Provider Configuration | Quick provider configuration reference |
| Ollama Provider Guide | Ollama installation, setup, troubleshooting, and testing |
| Provider Implementation | Implementing new providers |
| ML Provider Reference | Technical reference for local ML models |
| Protocol Extension | Extending protocols |

Features

| Guide | Description |
| --- | --- |
| GIL / KG / CIL cross-layer | RFC-072 map: bridge.json, CIL HTTP routes, semantic lift, offset verification, CLI/Make, and test entry points |
| RSS and feed ingestion | How RSS URLs become RssFeed and Episode objects: HTTP, caches, parsing, selection, multi-feed; entry point for future non-RSS ingestion docs |
| Semantic Search | RFC-061 corpus vector index; GET /api/search; chunk-to-Insight lift and verify-gil-chunk-offsets when GIL + index share transcript space |
| Grounded Insights | GIL: gi.json, quotes, schema, CLI; bridge.json sibling for canonical ids; optional browser viewer |
| Knowledge Graph | KG: kg.json, entities/topics/relationships; bridge aligns KG with GIL for APIs; same browser viewer |
| Preprocessing Profiles | Preprocessing profiles (cleaning_v4, cleaning_hybrid_after_pattern, …) for transcript cleaning and hybrid_ml MAP input (RFC-042 / Issue #419) |
| Docker Compose Guide | Recommended end-to-end stack (viewer + API + on-demand pipeline jobs); same compose shape on prod VPS (see Hosting and infrastructure) |
| Docker Service Guide | Running podcast_scraper as a single-container service (supervisor / systemd / scheduler-driven) |
| Docker Variants Guide | LLM-only vs ML-enabled pipeline image tiers |

Evaluation and baselines

| Guide | Description |
| --- | --- |
| Chip Huyen ML / AI critique | ML/AI rubric (seven themes); experiments vs production lenses and optional short output tables; inspired by Designing Machine Learning Systems and AI Engineering |
| Experiment Guide | Datasets, baselines, experiments, promotion, metrics, and quality evaluation (RFC-041) |
| Evaluation Reports | Quality sweeps: ROUGE, embeddings, report library |
| Performance Guide | Performance considerations, optimization, and troubleshooting |
| Performance Profile Guide | Frozen release profiles: RSS, CPU%, wall time per stage (RFC-064) |
| Optimization Workflow | Data-driven process for investigating and solving performance/cost problems |
| Live Pipeline Monitor | Dev tooling: --monitor, RSS/CPU/stage dashboard or .monitor.log, .pipeline_status.json; optional .[monitor] memray + py-spy (RFC-065, #512) |
| Performance Reports | Published profile snapshots (tables, caveats) |

AI Coding

| Guide | Description |
| --- | --- |
| Cursor AI Best Practices | AI-assisted development |
| Agent-Browser Closed Loop | Browser loops: automated E2E (make test-ui-e2e) + live co-development (Chrome DevTools MCP); for user-reported UI bugs, re-validate in the same channel you used to reproduce (MCP is not replaced by pytest alone), plus tests as obligatory validation |
| Agent-Pipeline Feedback Loop | Python pipeline loops: CI diagnosis, acceptance testing, --monitor real-time feedback, metrics.json post-mortem |
| Documentation Agent Guide | Documentation workflows |