Guides
Practical guides for using and developing Podcast Scraper.
Quick Start
Configuration
| Reference | Description |
| --- | --- |
| Configuration API | Config, env vars, YAML; Twelve-factor (config), Download resilience (retries, Issue #522, presets, CLI parity), failure_summary in run.json |
| CLI | Flags; Quick Start (--profile, --config, --feeds-spec); --http-retry-total, --episode-retry-max, etc. |
Development
| Guide | Description |
| --- | --- |
| Development Guide | Development environment setup, workflow, and GI/KG viewer (make serve / serve-api / serve-ui, make test-ui-e2e) |
| Release Playbook | Standing plan before a public tag: eval/profiles policy (major vs minor), docs gates, release notes pattern, alignment with vX.Y.Z tags |
| Prod operator cheat sheet | Deploy, health, incidents, rollback, credentials; PODCAST_CORPUS_HOST_PATH validation and manual topic clusters |
| VPS multi-app onboarding | Add other Docker Compose apps on the same Tailscale VPS without new IaaC; isolation, GitOps, ports |
| Polyglot repository guide | Python root vs web/gi-kg-viewer/, env files, Makefile targets for the viewer |
| Server Guide | FastAPI: /api/* (artifacts, CIL, search with optional lifted, explore, Corpus Library, index rebuild), OpenAPI /docs, static SPA, tests under tests/integration/server/ |
| Pipeline and Workflow Guide | Pipeline flow, module roles, quirks, run tracking |
| Git Worktree Guide | Git worktree-based development workflow |
| Dependencies Guide | Third-party dependencies and rationale |
| Markdown Linting | Markdown style and linting practices |
Hosting and production
| Guide | Description |
| --- | --- |
| SRE book infra critique | Reliability rubric (SRE themes): SLIs/SLOs, error budget, toil, alerting, change risk, incidents; for reviewing runbooks, workflows, and ops design |
| Hosting and infrastructure | Narrative: Tailscale, OpenTofu, GitHub Actions, Compose on the VPS, how CI and prod align; ADR spine (079–085, 082, 093) |
| Stack contract | Cross-surface audit table, steady vs recovery playbooks (ADR-093) |
| Prod runbook | Always-on Hetzner VPS: bootstrap, deploy, backups, observability, DR |
| Prod operator cheat sheet | Short daily ops: gh deploy/backup, health curls, incident triage |
| DR drill runbook | Drill-only GitHub workflows, typed confirms, orchestrator vs piecemeal paths |
| Corpus snapshot manifest and restore | Single hub: local make vs GitHub Actions (prod, pre-prod, DR) for snapshot.manifest.json (RFC-084 / ADR-092) |
Testing
Provider System
| Guide | Description |
| --- | --- |
| AI Provider Comparison | Compare all 9 providers: cost, quality, speed, privacy |
| Provider Deep Dives | Per-provider reference cards, benchmarks, and magic quadrant |
| ML Model Comparison | Compare ML models: Whisper, spaCy, Transformers (BART/LED) |
| Provider Configuration | Quick provider configuration reference |
| Ollama Provider Guide | Ollama installation, setup, troubleshooting, and testing |
| Provider Implementation | Implementing new providers |
| ML Provider Reference | Technical reference for local ML models |
| Protocol Extension | Extending protocols |
Features
| Guide | Description |
| --- | --- |
| GIL / KG / CIL cross-layer | RFC-072 map: bridge.json, CIL HTTP routes, semantic lift, offset verification, CLI/Make, and test entry points |
| RSS and feed ingestion | How RSS URLs become RssFeed and Episode objects: HTTP, caches, parsing, selection, multi-feed; entry point for future non-RSS ingestion docs |
| Semantic Search | RFC-061 corpus vector index; GET /api/search; chunk-to-Insight lift and verify-gil-chunk-offsets when GIL + index share transcript space |
| Grounded Insights | GIL: gi.json, quotes, schema, CLI; bridge.json sibling for canonical ids; optional browser viewer |
| Knowledge Graph | KG: kg.json, entities/topics/relationships; bridge aligns KG with GIL for APIs; same browser viewer |
| Preprocessing Profiles | Preprocessing profiles (cleaning_v4, cleaning_hybrid_after_pattern, …) for transcript cleaning and hybrid_ml MAP input (RFC-042 / Issue #419) |
| Docker Compose Guide | Recommended end-to-end stack (viewer + API + on-demand pipeline jobs); same compose shape on prod VPS (see Hosting and infrastructure) |
| Docker Service Guide | Running podcast_scraper as a single-container service (supervisor / systemd / scheduler-driven) |
| Docker Variants Guide | LLM-only vs ML-enabled pipeline image tiers |
Evaluation and baselines
| Guide | Description |
| --- | --- |
| Chip Huyen ML / AI critique | ML/AI rubric (seven themes); experiments vs production lenses and optional short output tables; inspired by Designing Machine Learning Systems and AI Engineering |
| Experiment Guide | Datasets, baselines, experiments, promotion, metrics, and quality evaluation (RFC-041) |
| Evaluation Reports | Quality sweeps: ROUGE, embeddings, report library |
| Performance Guide | Performance considerations, optimization, and troubleshooting |
| Performance Profile Guide | Frozen release profiles: RSS, CPU%, wall time per stage (RFC-064) |
| Optimization Workflow | Data-driven process for investigating and solving performance/cost problems |
| Live Pipeline Monitor | Dev tooling: --monitor, RSS/CPU/stage dashboard or .monitor.log, .pipeline_status.json; optional .[monitor] memray + py-spy (RFC-065, #512) |
| Performance Reports | Published profile snapshots (tables, caveats) |
AI Coding