Prod operator cheat sheet¶
Fast companion to the full Prod runbook. Use this for day-to-day operations.
Scope¶
- Prod host: Hetzner VPS
- Access path: tailnet only (prod-podcast.<tailnet>.ts.net)
- Deploy model: main is the source of truth; Stack test on main gates the Dockerized path; GHCR publishes api / viewer / pipeline-llm tags including sha-<short>; deploy-prod.yml applies images on the VPS (today often workflow_dispatch until workflow_run: Stack test is merged; see ADR-082). Infra (infra/**) applies only via manual infra-apply.yml after PR plan review.
- Full procedures and edge cases: see Prod runbook
- Architecture narrative (diagrams): Hosting and infrastructure
Golden rules¶
- Never expose prod publicly by opening Hetzner TCP 80/443.
- Prefer tailnet URL or in-container health checks.
- Keep infra changes in infra/ and app changes in normal PR flow.
- Always keep /srv/podcast-scraper/.env names exact (for example OPENAI_API_KEY, not OPENAI_KEY).
Core endpoints and paths¶
- Prod URL: https://prod-podcast.<tailnet>.ts.net
- Health: https://prod-podcast.<tailnet>.ts.net/api/health
- Repo on host: /srv/podcast-scraper
- Runtime env: /srv/podcast-scraper/.env
- Viewer basic auth file: /srv/podcast-scraper/.htpasswd
- Host corpus directory: take from PODCAST_CORPUS_HOST_PATH in /srv/podcast-scraper/.env (do not assume a path until you check). Typical value on a standard deploy: /srv/podcast-scraper/corpus.
grep '^PODCAST_CORPUS_HOST_PATH=' /srv/podcast-scraper/.env
- Same data on the host vs in containers: SSH and host Python use $PODCAST_CORPUS_HOST_PATH. The api (and pipeline) containers mount that tree at /app/output (see docker/api/Dockerfile serve --output-dir).
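The lookup above can be wrapped so that scripts stop early when the variable is missing instead of continuing with an empty path. This is a minimal sketch: the function name corpus_path_from_env is hypothetical, and it assumes the .env file uses plain KEY=value lines without quoting.

```shell
# Print the value of PODCAST_CORPUS_HOST_PATH from an env file.
# Returns non-zero when the key is absent or its value is empty.
# Hypothetical helper; assumes unquoted KEY=value lines.
corpus_path_from_env() {
  local env_file="${1:-/srv/podcast-scraper/.env}"
  local value
  value=$(grep '^PODCAST_CORPUS_HOST_PATH=' "$env_file" | cut -d= -f2-)
  if [ -z "$value" ]; then
    echo "PODCAST_CORPUS_HOST_PATH is not set in $env_file" >&2
    return 1
  fi
  printf '%s\n' "$value"
}
```

Typical use on the host would be HOST_CORPUS=$(corpus_path_from_env) || exit 1 before any command that reads or deletes under "$HOST_CORPUS".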
Topic clusters (manual maintenance)¶
A successful Configuration → Run pipeline job runs full_incremental_pipeline
only. It may rebuild the vector index when vector_search is true in the
profile, but it does not run topic-clusters. The viewer/API read
search/topic_clusters.json if present; a missing file means clustering was never
built for that corpus. See RFC-075
and Prod runbook — FAQ.
Prerequisite: search/vectors.faiss under the corpus (needs a profile with
vector_search: true for indexing, e.g. cloud_balanced).
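Checking the prerequisite before running topic-clusters avoids a failed run on a corpus that was never indexed. A small sketch, assuming only the file location stated above; the function name has_vector_index is hypothetical.

```shell
# Return 0 when the corpus has a non-empty FAISS index under search/.
# Hypothetical helper; the path search/vectors.faiss comes from this page.
has_vector_index() {
  local corpus="$1"
  # -s: file exists and has a size greater than zero.
  [ -s "$corpus/search/vectors.faiss" ]
}
```

For example: has_vector_index "$HOST_CORPUS" || echo "no vector index; run the pipeline with vector_search: true first" >&2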
Option A — host venv (no container for Python):
ssh deploy@prod-podcast.<tailnet>.ts.net
cd /srv/podcast-scraper
HOST_CORPUS=$(grep '^PODCAST_CORPUS_HOST_PATH=' .env | cut -d= -f2-)
# stock Debian images often need: sudo apt install python3.12-venv
python3 -m venv .venv
.venv/bin/pip install -U pip
.venv/bin/pip install -e ".[search]"
.venv/bin/python -m podcast_scraper.cli topic-clusters --output-dir "$HOST_CORPUS"
ls -la "$HOST_CORPUS/search/topic_clusters.json"
Option B — use Python inside the already-running api container (no
compose run; replace container name if docker ps shows a different one):
ssh deploy@prod-podcast.<tailnet>.ts.net
docker exec compose-api-1 python -m podcast_scraper.cli topic-clusters \
--output-dir /app/output
Verify API (optional): GET /api/corpus/topic-clusters?path=<host corpus path>
If the JSON file is owned by root after an in-container write, fix ownership so
deploy can manage backups: sudo chown deploy:docker …/search/topic_clusters.json
(adjust group to match your layout).
Reprocess from existing transcripts (no re-transcription)¶
Use this when pipeline logic changed and you want fresh metadata / GI / KG / index without paying transcription cost again.
Keep: corpus transcripts/ files.
Rebuild: derived artifacts (metadata/, search/, and run outputs that depend
on old prompts/parsers).
Host path (recommended, no compose run):
ssh deploy@prod-podcast.<tailnet>.ts.net
cd /srv/podcast-scraper
HOST_CORPUS=$(grep '^PODCAST_CORPUS_HOST_PATH=' .env | cut -d= -f2-)
# Optional: keep a rollback copy of derived artifacts only.
STAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p "$HOST_CORPUS/.reprocess-backup-$STAMP"
cp -a "$HOST_CORPUS/search" "$HOST_CORPUS/.reprocess-backup-$STAMP/" 2>/dev/null || true
cp -a "$HOST_CORPUS/run_"* "$HOST_CORPUS/.reprocess-backup-$STAMP/" 2>/dev/null || true
# Remove only derived layers; keep transcripts/.
rm -rf "$HOST_CORPUS/search"
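# Skip the rollback copy so metadata/ directories inside it are not deleted.
find "$HOST_CORPUS" -path "$HOST_CORPUS/.reprocess-backup-*" -prune -o -type d -name metadata -prune -exec rm -rf {} +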
# Re-run from existing transcripts; never transcribe again.
.venv/bin/python -m podcast_scraper.cli \
--config "$HOST_CORPUS/viewer_operator.yaml" \
--feeds-spec "$HOST_CORPUS/feeds.spec.yaml" \
--output-dir "$HOST_CORPUS" \
--skip-existing \
--no-transcribe-missing
# Optional post-step when vectors exist
.venv/bin/python -m podcast_scraper.cli topic-clusters --output-dir "$HOST_CORPUS"
If this run creates root-owned files (for example after mixed host/container history), fix ownership before backup jobs:
sudo chown -R deploy:docker "$HOST_CORPUS"
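Since the procedure depends on transcripts/ staying untouched, it can help to record a manifest before the cleanup and compare it afterwards. The sketch below assumes GNU find (for -printf), as found on a Debian host, and that transcripts live in directories literally named transcripts; the function name transcript_manifest is hypothetical.

```shell
# List every file under a transcripts/ directory as "<size-bytes> <relative path>",
# sorted so that two listings can be compared with diff.
# Rollback copies (.reprocess-backup-*) are skipped.
# Hypothetical helper; requires GNU find for -printf.
transcript_manifest() {
  local corpus="$1"
  find "$corpus" -path '*/.reprocess-backup-*' -prune -o \
    -type f -path '*/transcripts/*' -printf '%s %P\n' | LC_ALL=C sort
}
```

Save the output of transcript_manifest "$HOST_CORPUS" to a file before removing derived layers, then diff it against a fresh listing after the re-run; any difference means a transcript was added, removed, or resized.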
Daily commands¶
# Confirm deploy history
gh run list --workflow deploy-prod.yml --repo chipi/podcast_scraper --limit 5
# Manual deploy (latest main image tags)
gh workflow run deploy-prod.yml --repo chipi/podcast_scraper
# Confirm backup freshness
gh run list --workflow backup-corpus-prod.yml --repo chipi/podcast_scraper --limit 5
gh release list --repo chipi/podcast_scraper-backup --limit 10 | rg snapshot-prod-
Health checks by context¶
# Laptop or CI (authoritative external check)
curl -fsS https://prod-podcast.<tailnet>.ts.net/api/health | jq .
# VPS host (run check inside api container)
ssh deploy@prod-podcast.<tailnet>.ts.net \
'cd /srv/podcast-scraper && docker compose --env-file /srv/podcast-scraper/.env -f compose/docker-compose.stack.yml -f compose/docker-compose.prod.yml -f compose/docker-compose.vps-prod.yml exec -T api curl -fsS http://127.0.0.1:8000/api/health'
# Viewer path on host loopback
ssh deploy@prod-podcast.<tailnet>.ts.net \
'curl -fsS http://127.0.0.1:${VIEWER_PORT:-8080}/api/health'
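Right after a deploy or container restart the health endpoint may need a few seconds, so a single curl can give a false alarm. The following is a generic retry wrapper, offered as a sketch; the name retry and its argument order are hypothetical and not part of the repo.

```shell
# Run a command until it succeeds, at most <attempts> times, sleeping <delay>
# seconds between tries. Returns 0 on the first success, otherwise the exit
# status of the last failed attempt.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" n=1 status=1
  shift 2
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    else
      status=$?
    fi
    # No need to sleep after the final attempt.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return "$status"
}
```

For example, retry 5 3 curl -fsS https://prod-podcast.<tailnet>.ts.net/api/health makes up to five attempts, three seconds apart.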
Fast incident playbook¶
Deploy red¶
- Open latest deploy-prod.yml run logs.
- If pull failed, rerun workflow.
- If container start failed, pin previous sha-<short> image tag and up -d.
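To pin a previous image you need its tag. The sketch below converts a full commit SHA into the sha-<short> form. It assumes that <short> is the first 7 characters of the commit SHA, which this page does not state; compare against an actual GHCR tag before relying on it. The function name image_tag_for_sha is hypothetical.

```shell
# Turn a full commit SHA into the image tag form sha-<short>.
# ASSUMPTION: <short> is the first 7 characters of the SHA. Verify against a
# published GHCR tag, since the tag length is not documented here.
image_tag_for_sha() {
  local sha="$1"
  printf 'sha-%s\n' "$(printf '%s' "$sha" | cut -c1-7)"
}
```

One way to find candidate SHAs is gh run list --workflow deploy-prod.yml --repo chipi/podcast_scraper --status success --limit 5 --json headSha,createdAt. Note that headSha is the commit the workflow ran on, which may not match the image that was actually deployed.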
Viewer up but pipeline failing¶
- Check latest job logs under corpus jobs.
- Verify key env names in /srv/podcast-scraper/.env.
- Verify api container has non-empty provider env vars.
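The env-name check can be scripted so that no secret value is printed. This sketch reports names that are missing or empty in an env file. The function name missing_env_names is hypothetical, and it assumes plain KEY=value lines, so a quoted empty value such as KEY="" would not be caught.

```shell
# Print each given name that is missing or has an empty value in an env file.
# Returns non-zero if at least one name was reported. Values are never printed.
# Hypothetical helper; assumes unquoted KEY=value lines.
missing_env_names() {
  local env_file="$1"
  shift
  local name status=0
  for name in "$@"; do
    # Require NAME= followed by at least one character.
    if ! grep -q "^${name}=." "$env_file"; then
      printf '%s\n' "$name"
      status=1
    fi
  done
  return "$status"
}
```

For example, missing_env_names /srv/podcast-scraper/.env OPENAI_API_KEY prints nothing and returns 0 when the key is present with a value. It does not check what the running api container sees; that needs a separate check inside the container.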
Host unreachable over tailnet¶
- Check tailscale status on your laptop.
- Try hostname with numeric suffix (prod-podcast-1, prod-podcast-2).
- Use Hetzner console for recovery if needed.
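The suffix check can be turned into a loop by generating the candidate hostnames first. A minimal sketch: the function name candidate_hosts is hypothetical, and the tailnet name must be supplied by the operator.

```shell
# Print the base hostname followed by its numeric-suffix variants
# (<base>-1 .. <base>-<max>), each as a full <host>.<tailnet>.ts.net name.
# Hypothetical helper; <tailnet> is whatever your tailnet is called.
candidate_hosts() {
  local base="$1" tailnet="$2" max="$3" i=1
  printf '%s.%s.ts.net\n' "$base" "$tailnet"
  while [ "$i" -le "$max" ]; do
    printf '%s-%s.%s.ts.net\n' "$base" "$i" "$tailnet"
    i=$((i + 1))
  done
}
```

Each candidate can then be probed, for example with curl -fsS --max-time 5 "https://$host/api/health", stopping at the first one that answers.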
Most important secrets¶
- HCLOUD_TOKEN: Hetzner API (infra apply).
- OPERATOR_SSH_PUBLIC_KEY: operator laptop pubkey for OpenTofu (repo secret so CI logs mask it).
- TS_AUTHKEY: tailnet join auth for workflows and machine registration.
- TS_API_KEY: tailnet management API for Terraform provider.
- TFSTATE_AGE_KEY: decrypts encrypted OpenTofu state.
- Host .env: provider API keys, Grafana credentials, Sentry DSNs.
Rotation rhythm¶
- Tailscale keys on Free plan: rotate before 90-day expiry.
- Hetzner token: rotate after personnel/device changes or suspicious activity.
- Age key: rotate carefully with state re-encryption and backup first.
- Provider keys and auth passwords: rotate on incident or access changes.
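For the 90-day Tailscale window, a quick date calculation shows how much time is left. This sketch assumes GNU date (the -d option), as on a Debian host, and that you recorded the key's creation date yourself; the function name days_until_expiry is hypothetical.

```shell
# Print the number of days left before a key created on <created> (YYYY-MM-DD)
# expires after <lifetime_days>. A negative result means it already expired.
# <today> is optional and defaults to the current UTC date.
# Hypothetical helper; requires GNU date for -d.
days_until_expiry() {
  local created="$1" lifetime_days="$2" today="${3:-$(date -u +%Y-%m-%d)}"
  local created_s today_s
  created_s=$(date -u -d "$created" +%s) || return 1
  today_s=$(date -u -d "$today" +%s) || return 1
  echo $(( $lifetime_days - ($today_s - $created_s) / 86400 ))
}
```

For example, days_until_expiry 2024-01-01 90 2024-01-31 prints 60, since 30 of the 90 days have passed.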
Rollback and DR shortcuts¶
# Roll back app images by known-good short sha
ssh deploy@prod-podcast.<tailnet>.ts.net
cd /srv/podcast-scraper
PODCAST_IMAGE_TAG=sha-<previous-good-short-sha> \
docker compose -f compose/docker-compose.stack.yml \
-f compose/docker-compose.prod.yml \
-f compose/docker-compose.vps-prod.yml \
up -d --remove-orphans
# Restore corpus from backup release (prod layout)
# Preferred: prod-restore-corpus.yml in GitHub Actions (confirm PROD_RESTORE).
# On-host Make only: make restore-corpus-prod, then recycle api + viewer (see PROD_RUNBOOK).
make restore-corpus-prod
When to use full runbook¶
- First bootstrap of prod host.
- Credential rotation procedures.
- Grafana/Sentry setup and validation.
- Tailscale suffix drift and ACL edits.
- Disaster recovery drill and full re-provision.