Prod runbook (always-on Hetzner VPS)¶
Operator-facing runbook for the production deploy defined in RFC-082. The RFC describes what we decided; this runbook describes what to do today.
Need the short version for daily ops? Use Prod operator cheat sheet. Other Docker Compose apps on the same VPS: see VPS multi-app onboarding.
How hosting fits together (diagrams + planes): Hosting and infrastructure. Immutable decisions: ADR-079–ADR-083 (OpenTofu, state, drill workspace, app GitOps contract, tailnet ingress), ADR-082 (stack-test gate and deploy nuance), ADR-084–ADR-085 (Compose + CI stack-test), ADR-093 (stack contract vs adapters). Surface audit table: STACK_CONTRACT.md. CI workflow names: WORKFLOWS.md.
Steady-state playbook (routine prod)¶
For day-to-day prod (not DR drill, not manual corpus restore): preflight (secrets and
PROD_TAILNET_FQDN) → deploy (deploy-prod.yml → infra/deploy/deploy.sh) → health
(in-container /api/health on the api service — API health checks by context)
→ behavioral gate on main (stack-test.yml before GHCR publish). Restoring corpus from
snapshot.tgz is not part of this path; see Disaster recovery and
STACK_CONTRACT.md.
For the prerequisites checklist, see #714. It covers the Hetzner account, Tailscale credentials (auth key + API access token on the Free plan; "Tailscale credentials" below explains why), sops/age, and GHA secrets. All commands below assume those are done.
Sections¶
- First-time bootstrap — includes API health checks by context (GH-745)
- Daily operations
- VPS access control (no HTTP Basic Auth)
- Corpus migration from pre-prod (Codespace) to prod (VPS)
- Rollback (deploy went red mid-way)
- Disaster recovery (VPS gone)
- Credential rotation
- Environment variable reference
- Observability setup walkthrough — includes Sentry Slack routing (GH-725) and Grafana env filter (GH-726)
- Tailscale operations
- Hetzner operations
- Operator hot-fix workflow
- FAQ / Troubleshooting: includes corpus path (host vs /app/output), topic clusters, reprocess from transcripts, and Cursor or automation cannot ssh deploy@prod
- Constraints to know
First-time bootstrap¶
One-shot setup that takes prod from "nothing" to "viewer reachable on the tailnet". ~30–45 min wall.
Pre-bootstrap (one-time, on operator's laptop)¶
# 1. Tools
brew install opentofu sops age actionlint shellcheck
gh auth login
# 2. age keypair for sops state encryption
age-keygen -o ~/.config/sops/age/keys.txt
# Copy the public key (the `# public key:` line) into infra/.sops.yaml,
# replacing the `age1PLACEHOLDER…` value. Save the PRIVATE key contents to
# your password manager as `sops/podcast-scraper/tofu-state-age-key`.
# 3. Stage GHA secrets (from #714) — Free-plan auth: see "Tailscale credentials"
# section below for why TS_AUTHKEY + TS_API_KEY (two separate creds) instead
# of OAuth client ID/secret (which would be on Premium+ plans).
gh secret set HCLOUD_TOKEN --repo chipi/podcast_scraper --app actions --body '<token>'
gh secret set TS_AUTHKEY --repo chipi/podcast_scraper --app actions --body '<tskey-auth-...>'
gh secret set TS_API_KEY --repo chipi/podcast_scraper --app actions --body '<tskey-api-...>'
gh secret set TFSTATE_AGE_KEY --repo chipi/podcast_scraper --app actions --body "$(cat ~/.config/sops/age/keys.txt)"
gh secret set BACKUP_REPO_TOKEN --repo chipi/podcast_scraper --app actions --body '<backup-repo-pat>'
# PROD_SSH_PRIVATE_KEY — see [GitHub Actions SSH to prod](#github-actions-ssh-to-prod-prod_ssh_private_key)
# 4. Stage GHA repo secrets and variables
gh secret set OPERATOR_SSH_PUBLIC_KEY --repo chipi/podcast_scraper --body "$(cat ~/.ssh/id_ed25519.pub)"
gh variable set TAILNET_NAME --repo chipi/podcast_scraper --body 'tail-xxxxx.ts.net'
# If OPERATOR_SSH_PUBLIC_KEY was previously a repo variable, add the same value as the secret above,
# then delete the variable so infra workflows use only the secret (GitHub masks secrets in logs).
# PROD_TAILNET_FQDN is set after first apply (it depends on the assigned hostname);
# default value is "prod-podcast.<TAILNET_NAME>".
GitHub Actions SSH to prod (PROD_SSH_PRIVATE_KEY)¶
deploy-prod.yml, backup-corpus-prod.yml, and prod-restore-corpus.yml run OpenSSH as deploy@<prod> after
the runner joins the tailnet. The Tailscale ACL allows reachability to port 22;
OpenSSH still requires a private key whose public half is listed in
~deploy/.ssh/authorized_keys on the VPS (Tailscale SSH is intentionally not used;
see tailscale/policy.hujson comments).
One-time setup
- On a trusted machine, generate a CI-only Ed25519 key (empty passphrase):
ssh-keygen -t ed25519 -f ./gha-prod-deploy -N "" -C "github-actions-prod-deploy"
- SSH to prod using your operator key (the same public key OpenTofu passed as TF_VAR_ssh_public_key / secrets.OPERATOR_SSH_PUBLIC_KEY). As deploy, append exactly one line (the contents of gha-prod-deploy.pub) to ~/.ssh/authorized_keys. Directory ~/.ssh should be mode 700 and authorized_keys mode 600.
- Verify key auth before touching GitHub:
ssh -i ./gha-prod-deploy -o BatchMode=yes deploy@<prod-tailnet-host> 'echo ok'
Use the same hostname the workflows resolve (see suffix drift if unsure).
- Store the private PEM in a repo Actions secret (entire file, including the BEGIN OPENSSH PRIVATE KEY / END lines):
gh secret set PROD_SSH_PRIVATE_KEY --repo chipi/podcast_scraper --app actions < ./gha-prod-deploy
- Remove or securely archive gha-prod-deploy on disk; do not commit it.
Workflows load this via the composite action .github/actions/prod-ssh-key, which exports
SSH_PROD_IDENTITY for ssh -i "$SSH_PROD_IDENTITY" -o IdentitiesOnly=yes …. Any future workflow
that SSHes to prod as deploy@ should reuse secrets.PROD_SSH_PRIVATE_KEY and the same flags so
GitHub never relies on keys baked into the runner image.
Rotating the CI key: generate a new keypair, append the new public key to authorized_keys (keep
the old line until one green workflow run), update PROD_SSH_PRIVATE_KEY, re-run deploy-prod.yml or
backup-corpus-prod.yml, then delete the superseded public key line on the VPS.
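The append-then-prune rotation above can be sketched as two small helpers, meant to run as deploy on the VPS. Both function names are hypothetical and nothing like them is known to exist in the repo; matching on the exact public key line is an assumption made to keep the sketch simple.

```shell
# Hypothetical helpers for managing CI keys in an authorized_keys file.
# Illustrative only; the repo does not ship these functions.

# Append a public key line only if that exact line is not already present,
# then enforce the 700/600 modes OpenSSH expects.
add_authorized_key() {
  local keys_file="$1" pubkey_line="$2" ssh_dir
  ssh_dir="$(dirname "$keys_file")"
  mkdir -p "$ssh_dir"
  touch "$keys_file"
  if ! grep -qxF -- "$pubkey_line" "$keys_file"; then
    printf '%s\n' "$pubkey_line" >> "$keys_file"
  fi
  chmod 700 "$ssh_dir"
  chmod 600 "$keys_file"
}

# Remove one exact public key line (the superseded key) and keep the rest.
remove_authorized_key() {
  local keys_file="$1" pubkey_line="$2" tmp
  tmp="$(mktemp)"
  # grep exits 1 when no lines remain; that is not an error here.
  grep -vxF -- "$pubkey_line" "$keys_file" > "$tmp" || true
  cat "$tmp" > "$keys_file"   # overwrite in place so the file mode is kept
  rm -f "$tmp"
}
```

Call add_authorized_key with the new key first, wait for one green workflow run, and only then call remove_authorized_key with the old line.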
GitHub Actions deploy to DR drill (DRILL_DEPLOY_SSH_PRIVATE_KEY)¶
Drill workflow matrix, typed confirms, orchestrator, restore/destroy, and drill-only host checks:
DR_DRILL_RUNBOOK.md. This section stays focused on drill-deploy and
DRILL_DEPLOY_SSH_PRIVATE_KEY setup shared with prod-style deploys.
RFC-082 / #752: drill-deploy.yml mirrors deploy-prod.yml but targets the drill Hetzner
stack. After tailscale/github-action joins as tag:gha-deployer, the job SSHes
deploy@<resolved-drill-fqdn>, appends PODCAST_RELEASE=sha-<short> to
/srv/podcast-scraper/.env, runs /srv/podcast-scraper/infra/deploy/deploy.sh, then curls
https://<resolved>/api/health. Resolver: scripts/ops/resolve_drill_tailnet_host.sh with
vars.DRILL_TAILNET_FQDN. GitHub Environment: drill.
The composite .github/actions/prod-ssh-key is invoked with identity_env_name: SSH_DRILL_IDENTITY
so the workflow uses $SSH_DRILL_IDENTITY (prod workflows keep the default SSH_PROD_IDENTITY).
One-time drill host setup (same authorized_keys story as prod above):
- SSH to the drill VPS with your operator key (only that key is present from cloud-init until you extend authorized_keys).
- Append exactly one line (contents of gha-prod-deploy.pub, or a drill-only CI public key) to deploy@ ~/.ssh/authorized_keys (modes 700 / 600).
- Stage /srv/podcast-scraper/.env, then sudo rm /srv/podcast-scraper/.bootstrap-needs-env so Docker Compose can start (see cloud-init final_message on first boot).
GitHub secret (can reuse the same PEM file as prod if the same public key is on drill deploy@):
gh secret set DRILL_DEPLOY_SSH_PRIVATE_KEY --repo chipi/podcast_scraper --app actions < ./gha-prod-deploy
Dispatch — the workflow file must exist on main (merge your branch first). There is no
make target; use gh:
gh workflow run drill-deploy.yml -R chipi/podcast_scraper
gh run watch -R chipi/podcast_scraper "$(gh run list -R chipi/podcast_scraper --workflow=drill-deploy.yml -L1 --json databaseId -q '.[0].databaseId')"
Local operator check (optional; use ssh-add if your key is not already in an agent — see
the section Cursor or automation cannot ssh deploy@prod in this runbook for ssh-add -t 30m):
ssh-add -t 30m ~/.ssh/id_ed25519
ssh -o IdentitiesOnly=yes deploy@<your-drill-magicdns-host> 'echo ok'
Related infra-only cleanup (local): make delete-drill-hetzner-orphans sources infra/.env.drill.local and deletes orphan Hetzner objects by name after a failed drill tofu apply; see make drill-env. Not used for normal deploys.
First tofu apply (operator's laptop)¶
cd infra
export HCLOUD_TOKEN=$(op read 'op://Personal/Hetzner Cloud/podcast-scraper-prod/api-token')
export TF_VAR_hcloud_token="$HCLOUD_TOKEN"
export TF_VAR_tailscale_api_key=$(op read 'op://Personal/Tailscale/podcast-scraper/api-key')
export TF_VAR_tailscale_tailnet="tail-xxxxx.ts.net" # your tailnet
export TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)"
./tofu init
./tofu plan
./tofu apply
# Outputs: server_id, ipv4_address, tailnet_url, ssh_target.
Expected resources (per #716):
1 server + 1 firewall + 1 network + 1 SSH key + 1 Tailscale auth key, plus an
optional Volume + attachment if volume_size_gb > 0.
After apply, set the GHA variable:
gh variable set PROD_TAILNET_FQDN --repo chipi/podcast_scraper \
--body "prod-podcast.tail-xxxxx.ts.net"
When the live hostname has a numeric suffix (-1, -2, …)¶
After a failed replace or a stale machine record, Tailscale may keep the
MagicDNS name prod-podcast.<tailnet> on an offline orphan while the
live VPS registers as prod-podcast-1.<tailnet> (or -2, and so on). That
breaks copy-paste SSH and curl until the name lines up again.
GitHub Actions: deploy-prod.yml, backup-corpus-prod.yml, and prod-restore-corpus.yml join the
tailnet, run scripts/ops/resolve_prod_tailnet_host.sh, and use the resolved
FQDN for SSH (and the /api/health probe where applicable). Workflows still require
vars.PROD_TAILNET_FQDN as the operator’s canonical intent; the resolver
falls back to prod-podcast-1.<tailnet>, … when the canonical name is not
online.
Local laptop (on the tailnet): print the live host the repo workflows would pick:
export PROD_TAILNET_FQDN='prod-podcast.tail-xxxxx.ts.net'
bash scripts/ops/resolve_prod_tailnet_host.sh
For tests without tailscaled, set TAILSCALE_STATUS_JSON_PATH to a saved
tailscale status --json file.
When the printed name differs from the repo variable, update the variable so logs and docs match reality, and remove stale machines in the Tailscale admin machines list if you want the unsuffixed name back. See GitHub issue 744.
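To make the fallback order concrete, here is a minimal sketch of "canonical name first, then -1, -2, …". It is not the code in scripts/ops/resolve_prod_tailnet_host.sh, which may behave differently. The function name pick_live_host, the simplified "name online-flag" input format, and the default upper bound of 5 suffixes are all assumptions made for illustration.

```shell
# Sketch of the fallback order only (hypothetical helper).
# stdin: one "<fqdn> <true|false>" pair per line (name, online flag).
# $1:    canonical FQDN, e.g. prod-podcast.tail-xxxxx.ts.net
# $2:    highest numeric suffix to try (assumed default: 5)
pick_live_host() {
  local canonical="$1" max_suffix="${2:-5}"
  local host="${canonical%%.*}"    # e.g. prod-podcast
  local domain="${canonical#*.}"   # e.g. tail-xxxxx.ts.net
  local online_names candidate i=1
  # Keep only the names whose online flag is "true".
  online_names="$(awk '$2 == "true" { print $1 }')"

  # Prefer the canonical name when it is online.
  if printf '%s\n' "$online_names" | grep -qxF -- "$canonical"; then
    printf '%s\n' "$canonical"
    return 0
  fi
  # Otherwise try <host>-1, <host>-2, ... in order.
  while [ "$i" -le "$max_suffix" ]; do
    candidate="${host}-${i}.${domain}"
    if printf '%s\n' "$online_names" | grep -qxF -- "$candidate"; then
      printf '%s\n' "$candidate"
      return 0
    fi
    i=$((i + 1))
  done
  echo "no online host found for ${canonical}" >&2
  return 1
}
```

The input pairs could be derived from a saved tailscale status --json file with jq, assuming its peer entries carry a DNS name and an online flag; that mapping is left out here because the exact fields the real resolver reads are not stated in this runbook.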
Stage the host-side .env (one-time, post-apply)¶
The .env holds runtime secrets (provider API keys, Grafana credentials,
Sentry DSN). Cloud-init drops a sentinel file /srv/podcast-scraper/.bootstrap-needs-env;
the systemd unit refuses to start while it exists.
See Environment variable reference below for the complete list with intent + format for each var. The heredoc here is the minimum to boot; anything missing from it must be added before the relevant subsystem (Sentry, Grafana, the LLM pipeline, etc.) will function. The exact variable names matter: typos like OPENAI_KEY instead of OPENAI_API_KEY cost an hour of debugging the first time around.
ssh deploy@prod-podcast.tail-xxxxx.ts.net # over Tailscale
# `deploy` user has no sudo (cloud-init: sudo: false) but owns
# /srv/podcast-scraper, so write directly — no `sudo install` needed.
install -m 600 /dev/stdin /srv/podcast-scraper/.env <<'ENV'
# === Required: ingress + paths ===
PODCAST_DOCKER_PROJECT_DIR=/srv/podcast-scraper
PODCAST_CORPUS_HOST_PATH=/srv/podcast-scraper/corpus
# Pre-fills the viewer status bar corpus path (container view of the bind mount).
PODCAST_DEFAULT_CORPUS_PATH=/app/output
PODCAST_ENV=prod
PODCAST_AVAILABLE_PROFILES=cloud_balanced,cloud_thin
PODCAST_DEFAULT_PROFILE=cloud_balanced
# === LLM provider keys (cloud_balanced uses openai + gemini) ===
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIza...
# Optional providers — set only if you intend to use them
ANTHROPIC_API_KEY=sk-ant-...
# === Grafana Cloud — separate Prom + Loki user IDs ===
# (Single GRAFANA_CLOUD_USER no longer works; see env var reference.)
GRAFANA_CLOUD_PROM_URL=https://prometheus-prod-NN-prod-REGION.grafana.net/api/prom/push
GRAFANA_CLOUD_LOKI_URL=https://logs-prod-NNN.grafana.net/loki/api/v1/push
GRAFANA_CLOUD_PROM_USER=NNNNNNN
GRAFANA_CLOUD_LOKI_USER=NNNNNNN
GRAFANA_CLOUD_API_KEY=glc_eyJ...
# === Sentry DSNs — one per project ===
PODCAST_SENTRY_DSN_API=https://...@o....ingest.de.sentry.io/...
PODCAST_SENTRY_DSN_PIPELINE=https://...@o....ingest.de.sentry.io/...
ENV
# Release the systemd gate.
sudo rm /srv/podcast-scraper/.bootstrap-needs-env
sudo systemctl restart podcast-scraper.service
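Because a misspelled name fails silently, a pre-flight check before releasing the systemd gate can save a debugging round. The helper below is a hypothetical sketch (missing_env_vars is not a repo script); it prints only variable names, never values.

```shell
# Hypothetical pre-flight check: report required variable names that are
# absent from, or empty in, an env file. Prints names only, never values.
missing_env_vars() {
  local env_file="$1" name status=0
  shift
  for name in "$@"; do
    # A var counts as set when a line reads NAME=<at least one character>.
    if ! grep -qE "^${name}=.+" "$env_file"; then
      printf '%s\n' "$name"
      status=1
    fi
  done
  return "$status"
}
```

For example, running missing_env_vars /srv/podcast-scraper/.env PODCAST_DOCKER_PROJECT_DIR PODCAST_CORPUS_HOST_PATH PODCAST_ENV OPENAI_API_KEY GEMINI_API_KEY before removing the sentinel prints nothing and returns 0 when every listed name has a value.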
API health checks by context (GH-745)¶
Use the right URL for each layer. Mixing them causes false alarms (for
example curl http://127.0.0.1:8000/api/health on the VPS host can fail
while the API container and the tailnet URL are healthy, because compose
does not publish api port 8000 to the host — only expose for the
Docker network).
| Context | Authoritative check | Notes |
|---|---|---|
| Compose / inside api container | curl -fsS http://127.0.0.1:8000/api/health | Same as compose/docker-compose.stack.yml healthcheck and infra/deploy/deploy.sh after #745. |
| On the VPS host over SSH | docker compose -f compose/docker-compose.stack.yml -f compose/docker-compose.prod.yml -f compose/docker-compose.vps-prod.yml exec -T api curl -fsS http://127.0.0.1:8000/api/health | Runs the check inside the api netns. From /srv/podcast-scraper, a short form is docker compose exec -T api curl -fsS http://127.0.0.1:8000/api/health if your shell already exports the same -f list as systemd. |
| Host loopback via viewer (nginx → api) | curl -fsS http://127.0.0.1:${VIEWER_PORT:-8080}/api/health | VIEWER_PORT defaults to 8080 (compose/docker-compose.stack.yml). Same nginx path as the SPA shell. |
| Laptop / CI on the tailnet | curl -fsS https://prod-podcast.<tailnet>/api/health | MagicDNS + tailscale serve; this is what deploy-prod.yml probes after deploy. |
Prefer tailnet or container-local checks when triaging production. Treat
host :8000 alone as invalid unless you have added an explicit ports: map
for api (not in the stock compose files).
Smoke validation¶
Use the health-check table above so each step hits the intended layer.
# 1. Tailnet reachability
curl -fsS https://prod-podcast.tail-xxxxx.ts.net/api/health | jq .
# Expected: {"status":"ok","feeds_api":true,...}
# 2. Viewer shell (no HTTP Basic Auth on the VPS; tailnet is the gate)
curl -fsS https://prod-podcast.tail-xxxxx.ts.net/ | head
curl -fsS https://prod-podcast.tail-xxxxx.ts.net/welcome | head
# 3. grafana-agent shipping
ssh deploy@prod-podcast.tail-xxxxx.ts.net 'docker logs compose-grafana-agent-1 --tail 20'
# 4. Sentry validation ping
ssh deploy@prod-podcast.tail-xxxxx.ts.net \
'docker exec compose-api-1 python -c "
from podcast_scraper.utils.sentry_init import init_sentry
import sentry_sdk; init_sentry(\"api\")
sentry_sdk.capture_message(\"prod bootstrap validation ping\", level=\"info\")
"'
# Check sentry.io within ~1 min for the event under environment=prod.
# 5. Grafana Cloud query (~30 s after agent's first scrape)
# https://<org>.grafana.net → Explore → Prometheus
# Query: up{component="api",env="prod"}
# Expected: 1 series, value=1
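A probe fired immediately after a restart can fail simply because the containers are still starting. As a hedged sketch (retry_probe is a hypothetical helper, not part of infra/deploy/deploy.sh), a small retry wrapper around any of the checks above could look like this:

```shell
# Hypothetical retry wrapper: run a probe command until it succeeds or
# the attempts run out.
# $1: number of attempts, $2: seconds to sleep between attempts,
# remaining arguments: the probe command and its arguments.
retry_probe() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  echo "probe failed after ${attempts} attempts: $*" >&2
  return 1
}
```

Usage would be along the lines of retry_probe 10 3 curl -fsS https://prod-podcast.tail-xxxxx.ts.net/api/health; the attempt count and delay are arbitrary illustrative values.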
Trigger the first deploy¶
gh workflow run deploy-prod.yml --repo chipi/podcast_scraper
gh run watch --repo chipi/podcast_scraper
Once a few green deploys have passed and you trust the loop, file the
follow-up PR that flips deploy-prod.yml to also auto-trigger on
workflow_run: ["Stack test"] (RFC-082 Decision 6 GitOps loop).
Daily operations¶
Where to look first¶
| Symptom | Where |
|---|---|
| Viewer slow / unreachable | gh run list --workflow deploy-prod.yml --limit 3 then Sentry → environment=prod |
| Pipeline run failing | Sentry → environment=prod, component=pipeline; viewer Library → Job logs |
| Deploy went red | GHA UI → Deploy to prod VPS → most recent run; api logs are dumped on health-check failure |
| "Did the alert fire because of X?" | Grafana Cloud → podcast-scraper folder → filter env=prod |
Manual deploy¶
gh workflow run deploy-prod.yml --repo chipi/podcast_scraper \
-f override_image_sha= # blank = deploy current main
# or pin to a specific image:
gh workflow run deploy-prod.yml --repo chipi/podcast_scraper \
-f override_image_sha=abc1234
Pipeline run via the viewer¶
Standard flow — open the viewer, hit Configuration → Run pipeline. Same
control plane as pre-prod. Profile dropdown is restricted to
cloud_balanced,cloud_thin (no ML profiles in prod, per RFC-082).
Backup status¶
gh run list --workflow backup-corpus-prod.yml --repo chipi/podcast_scraper --limit 5
gh release list --repo chipi/podcast_scraper-backup --limit 10 | grep snapshot-prod-
To download the latest matching snapshot-prod-* asset, print tarball
stats, and unpack under .tmp_backup_verify/ (gitignored):
./scripts/ops/verify_prod_backup_snapshot.sh
Use ./scripts/ops/verify_prod_backup_snapshot.sh --help for a specific tag
or --no-extract (list only, no unpack). Releases after RFC-084 also ship
snapshot.manifest.json; default restore picks the newest compatible
snapshot-prod-* (fail closed if none match — pin backup_tag / PODCAST_BACKUP_TAG).
See Corpus snapshot manifest and restore.
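Since the recovery point depends on how recent the last snapshot is, a quick freshness check can be useful when reading the release list. The sketch below works on epoch seconds so it stays independent of any date-parsing tool; backup_age_ok is a hypothetical name and the 24 h default is an assumption taken from the recovery window quoted under Disaster recovery.

```shell
# Hypothetical freshness check on epoch timestamps.
# $1: snapshot creation time (epoch seconds)
# $2: current time (epoch seconds)
# $3: maximum acceptable age in hours (assumed default: 24)
backup_age_ok() {
  local created_epoch="$1" now_epoch="$2" max_age_hours="${3:-24}"
  local age_seconds=$((now_epoch - created_epoch))
  # Succeeds only for a non-negative age within the allowed window.
  [ "$age_seconds" -ge 0 ] && [ "$age_seconds" -le $((max_age_hours * 3600)) ]
}
```

A caller would pass the newest snapshot-prod-* release's creation time and "$(date +%s)", and alert when the function returns non-zero.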
VPS access control (no HTTP Basic Auth)¶
The VPS nginx overlay (docker/viewer/nginx-prod.conf.template via
compose/docker-compose.vps-prod.yml) does not enable auth_basic.
Reachability is tailnet-only (Hetzner firewall has no public TCP 80/443).
If you ever open public ingress, add a separate edge layer (for example
OAuth at a reverse proxy), not only Basic Auth on this nginx template.
Corpus migration¶
One-time migration on cutover day. Use the newest compatible snapshot from the backup repo rather than streaming files between hosts (more reliable; matches RFC-082 Decision 4).
Preferred on prod: run prod-restore-corpus.yml in GitHub Actions (confirm
PROD_RESTORE). On-host rehearsal or migration-only SSH:
ssh deploy@prod-podcast.tail-xxxxx.ts.net
cd /srv/podcast-scraper
make restore-corpus-prod # newest-compatible snapshot-prod-* → corpus/
# Pin a specific backup release (DR drills / audits):
# PODCAST_BACKUP_TAG=<release-tag> make restore-corpus-prod
See Corpus snapshot manifest and restore.
# Verify
ls -la corpus/feeds/ | head
find corpus -name '*.gi.json' | wc -l
docker compose -f compose/docker-compose.stack.yml \
-f compose/docker-compose.prod.yml \
-f compose/docker-compose.vps-prod.yml \
restart api viewer
# In the viewer (over Tailscale): Library tab should now show all
# episodes from the snapshot.
After cutover, future backups come from the VPS via
backup-corpus-prod.yml.
Rollback¶
deploy-prod.yml runs infra/deploy/deploy.sh, which does
git pull && docker compose pull && docker compose up -d. Three failure modes:
Failure 1: image pull failed (network blip / GHCR auth issue)¶
Symptom: workflow's deploy step exits non-zero on compose pull.
Recovery: re-run the workflow. No state changed; old containers still
running with old images.
Failure 2: pull succeeded but a container won't start¶
Symptom: workflow exits non-zero on compose up. New images on disk; old
container gone (compose recreates by default).
Recovery: SSH in, manually pin the previous image:
ssh deploy@prod-podcast.tail-xxxxx.ts.net
cd /srv/podcast-scraper
PODCAST_IMAGE_TAG=sha-<previous-good-short-sha> \
docker compose -f compose/docker-compose.stack.yml \
-f compose/docker-compose.prod.yml \
-f compose/docker-compose.vps-prod.yml \
up -d --remove-orphans
The :sha-<short> tags from the publish job are the rollback target — they're
never garbage-collected.
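A mistyped tag during a rollback only adds to the outage, so it can help to validate the SHA before building the tag. This is a sketch with a hypothetical helper name (rollback_tag); it assumes that <short> means the first 7 lowercase hex characters of the commit SHA, which this runbook does not state explicitly.

```shell
# Hypothetical helper: turn a commit SHA into the sha-<short> tag form.
# Assumption: <short> is the first 7 lowercase hex characters.
rollback_tag() {
  local sha="$1"
  if ! printf '%s\n' "$sha" | grep -qE '^[0-9a-f]{7,40}$'; then
    echo "not a lowercase hex commit SHA: ${sha}" >&2
    return 1
  fi
  printf 'sha-%s\n' "$(printf '%s' "$sha" | cut -c1-7)"
}
```

The result can then be passed as PODCAST_IMAGE_TAG="$(rollback_tag <previous-good-sha>)" in front of the docker compose command shown above.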
Failure 3: stack up but functionally broken¶
Symptom: workflow green, but operator notices the bug from the viewer.
Recovery: same as Failure 2. Then file a bug + ship a fix-forward via the hotfix path.
Disaster recovery¶
If the Hetzner instance is irrecoverable (account issue, hardware failure,
accidental tofu destroy):
# 1. Re-provision (~5–10 min)
cd infra
./tofu apply # same hostname, same Tailscale registration
# (cloud-init re-runs the bootstrap)
# 2. Restore corpus (~3–5 min for ~20 MB snapshot)
# Preferred: prod-restore-corpus.yml (confirm PROD_RESTORE).
# On-host: ssh deploy@prod-podcast.tail-xxxxx.ts.net \
# 'cd /srv/podcast-scraper && make restore-corpus-prod'
# 3. Re-stage host-side `.env` (operator's responsibility)
# See "First-time bootstrap → Stage the host-side `.env`".
# 4. Verify (~5 min)
# See "Smoke validation".
Total wall time: ~15–20 min assuming the operator knows the runbook.
Corpus is recoverable to within ~24 h of pre-disaster state (last
backup-corpus-prod.yml run).
#724 tracks the end-to-end DR drill that calibrates these numbers against reality. Complete readiness (#751) and use DR drill runbook before scheduling that drill.
Credential rotation¶
Hetzner API token¶
- Hetzner console → Security → API Tokens → Generate new (Read+Write).
- gh secret set HCLOUD_TOKEN --repo chipi/podcast_scraper --app actions --body '<new>'
- Update your password manager entry Hetzner Cloud / podcast-scraper-prod / api-token.
- Revoke the old token in the Hetzner console.
- Trigger one workflow_dispatch run of infra-apply.yml to confirm the new token works (will be a no-op apply if no infra changes).
Tailscale credentials (Free-plan workaround)¶
OAuth clients are gated to Tailscale Premium+ tiers. On Personal Free we use two separate credentials instead, both rotated independently:
| Credential | Purpose | Where it's used | Where to generate |
|---|---|---|---|
| TS_AUTHKEY | Joins the GHA runner / VPS to the tailnet (device-level auth) | tailscale/github-action@v2 in deploy-prod.yml + backup-corpus-prod.yml; cloud-init's tailscale up | admin/settings/keys → Auth keys tab |
| TS_API_KEY | Authenticates terraform's tailscale provider to the management API (creates per-server auth keys, syncs ACL) | infra-ci.yml + infra-apply.yml's TF_VAR_tailscale_api_key | admin/settings/keys → API access tokens tab |
Both expire after at most 90 days on the Free plan; Tailscale doesn't allow non-expiring keys. Set a calendar reminder.
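The reminder arithmetic is simple enough to script. The following is a sketch with a hypothetical helper name (days_until_expiry); it assumes you recorded the key's creation time as epoch seconds and that the key was created with the full 90-day lifetime.

```shell
# Hypothetical helper: whole days left before a key expires.
# $1: key creation time (epoch seconds)
# $2: current time (epoch seconds)
# $3: key lifetime in days (assumed default: 90)
days_until_expiry() {
  local created_epoch="$1" now_epoch="$2" lifetime_days="${3:-90}"
  local expiry_epoch=$((created_epoch + lifetime_days * 86400))
  # Integer division drops partial days; a negative result means expired.
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}
```

A scheduled job could call this for both credentials and warn once the result drops below a threshold of your choosing.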
Rotating TS_AUTHKEY:
- admin/settings/keys → Auth keys → Generate new (Reusable, Ephemeral, Pre-approved, Tags: tag:gha-deployer).
- gh secret set TS_AUTHKEY --repo chipi/podcast_scraper --app actions --body '<tskey-auth-...>'
- Update your password manager (entry: Tailscale / podcast-scraper / authkey).
- Revoke the old auth key in the Tailscale admin.
- Trigger one workflow_dispatch run of deploy-prod.yml to confirm.
Rotating TS_API_KEY:
- admin/settings/keys → API access tokens → Generate new.
- gh secret set TS_API_KEY --repo chipi/podcast_scraper --app actions --body '<tskey-api-...>'
- Update your password manager (entry: Tailscale / podcast-scraper / api-key).
- Revoke the old token in the Tailscale admin.
- Trigger one workflow_dispatch run of infra-apply.yml to confirm (no-op apply if no infra changes).
age key (sops state)¶
Caution: the age key encrypts the OpenTofu state. Losing it without a backup means the encrypted state in infra/terraform/terraform.tfstate.enc is unrecoverable; you'd need to import all live Hetzner resources into a fresh state. Always back the new key up before retiring the old one.
- Generate new keypair:
age-keygen -o ~/.config/sops/age/keys.txt.new
- Decrypt current state with the OLD key, re-encrypt with the NEW key:
cd infra
./tofu state pull > /tmp/state.json # via the wrapper, OLD key
# Update infra/.sops.yaml to the NEW public key
sops -e /tmp/state.json > terraform/terraform.tfstate.enc # path is relative to infra/ after the cd
shred -u /tmp/state.json # overwrite, then delete the plaintext state
- gh secret set TFSTATE_AGE_KEY --repo chipi/podcast_scraper --app actions --body "$(cat ~/.config/sops/age/keys.txt.new)"
- Update your password manager (keep the OLD key for ~30 days as a safety net).
- After ~30 days of green CI, delete the OLD key.
Host .env secrets (provider API keys, Sentry DSN, etc.)¶
ssh deploy@prod-podcast.tail-xxxxx.ts.net
sudo sed -i 's|^OPENAI_API_KEY=.*|OPENAI_API_KEY=<new>|' /srv/podcast-scraper/.env
sudo systemctl restart podcast-scraper.service
For multiple keys, edit the .env directly: sudo -e /srv/podcast-scraper/.env.
Environment variable reference¶
The host-side /srv/podcast-scraper/.env file is the single source of
truth for runtime config on the VPS. Owned by deploy:deploy, mode
600. Loaded by:
- the systemd unit (EnvironmentFile=; runs on boot + on systemctl restart)
- the explicit --env-file flag we pass when running docker compose directly (see Operator hot-fix workflow)
- the api code's nested docker compose run pipeline-llm (auto-passes --env-file if the project root has a .env; see pipeline_docker_factory.py)
Required vars¶
| Var | Format | Purpose | Notes if missing |
|---|---|---|---|
| PODCAST_DOCKER_PROJECT_DIR | absolute path | Repo path inside the api container (must match host because of bind mount) | api refuses to start; volume interpolation fails |
| PODCAST_CORPUS_HOST_PATH | absolute path | Where the corpus lives on the host | api spawn fails: volumes.corpus_data.driver_opts.device: required variable PODCAST_CORPUS_HOST_PATH is missing a value |
| PODCAST_DEFAULT_CORPUS_PATH | path (optional) | Viewer-only: nginx injects /app/output so the SPA matches the api corpus mount | status bar empty on first load; catalog APIs not called until you paste /app/output |
| PODCAST_ENV | prod | Tags Sentry + Grafana labels with env=prod | events tagged env=preprod (default); wrong dashboard filtering |
| PODCAST_AVAILABLE_PROFILES | csv | Profile dropdown allowlist | dropdown shows ALL on-disk profiles, including ones whose images aren't published in prod (run will fail mid-job) |
| PODCAST_DEFAULT_PROFILE | string | Preselected profile in the viewer | dropdown opens unselected; api 400s if Run hit before Save |
LLM provider keys (Docker job mode)¶
These are read by the pipeline container, not the api. The api passes
them through via its environment: block in
compose/docker-compose.prod.yml so the nested docker compose run
inherits them.
| Var | Required when | Format |
|---|---|---|
| OPENAI_API_KEY | profile uses openai (cloud_balanced default) | sk-... |
| GEMINI_API_KEY | profile uses gemini (cloud_balanced default for diarization + summary) | AIza... |
| ANTHROPIC_API_KEY | profile uses anthropic | sk-ant-... |
| MISTRAL_API_KEY | profile uses mistral | provider-specific |
| DEEPSEEK_API_KEY | profile uses deepseek | provider-specific |
| GROK_API_KEY | profile uses grok | provider-specific |
Variable name pitfall: the code expects OPENAI_API_KEY and
GEMINI_API_KEY — with _API_ in the middle. Names like OPENAI_KEY
or GEMINI_KEY silently resolve to empty defaults at compose parse
time, then the pipeline crashes on first provider call with
OpenAI API key required for OpenAI providers. Use grep -E
'^(OPENAI|GEMINI)_API_KEY=' /srv/podcast-scraper/.env to confirm.
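The grep above confirms the correct names exist; the opposite check, flagging near-miss names that lack the _API_ infix, catches the typo directly. The helper below is a hypothetical sketch (find_misnamed_keys is not a repo script) and only covers the provider prefixes listed in the table above.

```shell
# Hypothetical lint: print provider key names that lack the _API_ infix,
# e.g. OPENAI_KEY or GEMINI_KEY. Correct names such as OPENAI_API_KEY
# do not match. Prints names only, never values.
find_misnamed_keys() {
  local env_file="$1"
  grep -oE '^(OPENAI|GEMINI|ANTHROPIC|MISTRAL|DEEPSEEK|GROK)_KEY=' "$env_file" \
    | sed 's/=$//'
}
```

Empty output from find_misnamed_keys /srv/podcast-scraper/.env means none of these near-miss names are present.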
Observability vars (Grafana Cloud + Sentry)¶
| Var | Where to get it | Notes |
|---|---|---|
| GRAFANA_CLOUD_PROM_URL | Grafana stack page → Prometheus → Send Metrics → "Remote Write Endpoint" | URL ends in /api/prom/push |
| GRAFANA_CLOUD_LOKI_URL | Grafana stack page → Loki → Send Logs → "Endpoint" | URL ends in /loki/api/v1/push |
| GRAFANA_CLOUD_PROM_USER | Same Prometheus page → "Username / Instance ID" (numeric) | Distinct from Loki user; Grafana issues a separate instance ID per service |
| GRAFANA_CLOUD_LOKI_USER | Same Loki page → "User" (numeric) | Same pattern as Prom user |
| GRAFANA_CLOUD_API_KEY | Grafana Cloud → Access Policies → New policy with metrics:write + logs:write → Add token | Single token authenticates both services; format glc_eyJ... |
| PODCAST_SENTRY_DSN_API | Sentry → api project → Settings → Client Keys (DSN) | URL like https://<key>@o<org>.ingest.de.sentry.io/<project> |
| PODCAST_SENTRY_DSN_PIPELINE | Sentry → pipeline project → Settings → Client Keys (DSN) | Different project from api so issues stay separable |
Operator note: if your existing prod VPS is already running and Grafana shipping is healthy, harmonizing Grafana env var names in repo templates is a no-op for that live host. It only takes effect on bootstrap/re-provision paths (for example new VPS from cloud-init, DR rebuild, or explicit host-metrics Alloy bootstrap updates).
Optional / advanced vars¶
| Var | Default | Effect when set |
|---|---|---|
| PODCAST_DEFAULT_PIPELINE_INSTALL_EXTRAS | llm (in prod overlay) | Host-wide fallback when operator YAML omits pipeline_install_extras. Must be ml or llm |
| PODCAST_IMAGE_TAG | main | Pin image to a specific tag (sha-<short> for rollback) |
| PODCAST_RELEASE | empty | Sentry release tag (defaults to image SHA via build-time injection) |
| PODCAST_METRICS_ENABLED | 1 | Set to 0 to disable api /metrics exposition |
| VIEWER_PORT | 8080 | Host port for the viewer container; only matters before tailscale serve is set up |
Verifying what's actually loaded¶
# Names only (no values), sorted, deduped
ssh deploy@prod-podcast.<tailnet> \
"awk -F= '/^[A-Z_][A-Z0-9_]*=/ {print \$1}' /srv/podcast-scraper/.env | sort -u"
# Confirm a specific var made it into the running api container
ssh deploy@prod-podcast.<tailnet> \
'docker exec compose-api-1 sh -c "for k in OPENAI_API_KEY GEMINI_API_KEY PODCAST_CORPUS_HOST_PATH; do
v=\$(printenv \"\$k\"); [ -z \"\$v\" ] && echo \"\$k=<EMPTY>\" || echo \"\$k=<set, length=\${#v}>\"
done"'
Observability setup walkthrough¶
Grafana Cloud (one-time, per-stack)¶
- Create a free Grafana Cloud account at https://grafana.com/auth/sign-up.
- Default stack is created on signup. Note the region (us, eu-west-2, etc.); it's baked into every endpoint URL.
- Get Prometheus credentials: stack page → click Prometheus → Send Metrics:
  - Copy "Remote Write Endpoint" → GRAFANA_CLOUD_PROM_URL
  - Copy "Username / Instance ID" (numeric, ~7 digits) → GRAFANA_CLOUD_PROM_USER
- Get Loki credentials: stack page → Loki → Send Logs:
  - Copy "Endpoint" → GRAFANA_CLOUD_LOKI_URL
  - Copy "User" (numeric) → GRAFANA_CLOUD_LOKI_USER
- Generate the write token: grafana.com top nav → Access Policies → Create access policy:
  - Name: podcast-scraper-agent-prod-write (or similar; easy to revoke later)
  - Realm: your stack
  - Scopes: check metrics:write AND logs:write
  - Save → click into the policy → Add token → name it → Generate → copy the glc_eyJ... value → GRAFANA_CLOUD_API_KEY
  - The token is shown ONCE. Save to your password manager immediately.
- After staging vars + recreating grafana-agent (see Operator hot-fix workflow), verify within 2 min:
  - Grafana Cloud → Explore → Prometheus → query up{component="api",env="prod"} → expect a single series, value 1
  - Grafana Cloud → Explore → Loki → query {env="prod"} → expect log lines from api, viewer, grafana-agent
Grafana dashboards and alert rules (env filter)¶
When prod and pre-prod both ship metrics and logs to the same Grafana
Cloud stack, any panel or alert that omits env will aggregate
both environments. Track importable JSON and operator steps in GitHub
issue #726.
Repo dashboards (config/grafana/): Re-import or overwrite panels
from git after changes. Each JSON dashboard in the podcast-scraper set
includes a template variable env (default prod; choose
preprod for Codespaces — these values match PODCAST_ENV in
compose/grafana-agent.yaml external_labels, not the prose spelling
"pre-prod"). Prometheus panels use env="$env" in selectors; Loki
panels already used the same pattern.
Grafana Cloud alert rules (not stored in this repo): Edit every
rule whose query touches podcast-scraper metrics and add env="prod"
to the PromQL selector so pre-prod traffic cannot fire prod alerts.
Log-based alerts should filter on {env="prod", ...} the same way.
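When auditing many alert rules, it can help to paste their queries into a text file and flag the ones with no env matcher. This is a rough sketch with a hypothetical helper name (queries_missing_env); it is a plain text match, not a PromQL parser, so a label whose name merely ends in env would also pass.

```shell
# Hypothetical lint over a file holding one PromQL or LogQL expression
# per line. Prints every expression that has no env="..." (or env=~"...")
# matcher. Text match only; this does not parse the query language.
queries_missing_env() {
  local query_file="$1"
  # grep exits 1 when every line has an env matcher; treat that as success.
  grep -vE 'env[[:space:]]*=~?[[:space:]]*"' "$query_file" || true
}
```

Any line it prints is a candidate for adding env="prod" to the selector.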
Sentry (one-time, per project)¶
- Create free Sentry account at https://sentry.io/signup/.
- Create two projects:
  - podcast-scraper-api: platform Python (FastAPI auto-detected)
  - podcast-scraper-pipeline: platform Python
- For each project: Settings → Client Keys (DSN) → copy the DSN value (full https://<key>@o<org>.ingest.<region>.sentry.io/<project>).
- Stage as PODCAST_SENTRY_DSN_API and PODCAST_SENTRY_DSN_PIPELINE in .env.
- After api recreate, verify with the Sentry validation ping in the Smoke validation block above. Expect the event in the api project under environment=prod within ~1 min.
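Before staging a DSN, a quick shape check can catch copy/paste truncation. This is a minimal sketch: looks_like_sentry_dsn is a hypothetical helper, and the pattern assumes a hexadecimal key, numeric org and project IDs, and an optional region label. Adjust it if your DSNs differ.

```shell
# Return 0 if the value looks like a Sentry DSN of the form
# https://<key>@o<org>.ingest.<region>.sentry.io/<project>.
# Assumption: the region label may be absent on some DSNs.
looks_like_sentry_dsn() {
  printf '%s\n' "$1" |
    grep -Eq '^https://[0-9a-f]+@o[0-9]+\.ingest\.([a-z0-9-]+\.)?sentry\.io/[0-9]+$'
}

# Usage:
#   looks_like_sentry_dsn "$PODCAST_SENTRY_DSN_API" || echo "DSN looks malformed"
```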
Sentry Slack routing (prod vs pre-prod, GH-725)¶
Decision (RFC-082 Open Question 3): Option B, which keeps a single
DSN per component and splits noise in Sentry using the
environment tag already set by init_sentry() from PODCAST_ENV
(prod on the VPS .env; preprod in the default Codespace — see
.devcontainer/devcontainer.json). Option A (separate prod DSNs) is
documented in the issue for teams that prefer separate Sentry projects.
Prod-only Slack path (Sentry UI, per project: api and pipeline):
- Ensure Slack is installed under Settings → Integrations → Slack for the Sentry org (or use the project-level Slack integration).
- Open Alerts → Create Alert (Issue alert) for podcast-scraper-api.
- Set When to the issue volume you want (e.g. new issue, or regressed).
- Under If, add a filter on event environment (wording varies by Sentry version: e.g. The event's environment attribute or An event's tags with key environment) equals prod (must match PODCAST_ENV on the VPS exactly).
- Under Then, choose Send a notification via an integration → Slack → channel #podcast-prod-alerts (or your prod channel).
- Repeat for podcast-scraper-pipeline if pipeline issues should also notify Slack.
Pre-prod path: create a second alert (or use the same rule with a
different Then branch if your Sentry plan supports it) with
If environment equals preprod targeting #podcast-preprod-alerts,
or do not attach Slack (pre-prod stays Sentry-email / UI only).
Acceptance checks (operator):
- From prod: run the Sentry validation ping in Smoke validation and confirm only the prod Slack route fires.
- From a Codespace (default PODCAST_ENV=preprod): send a test event (same Python snippet with init_sentry("api")) and confirm it does not hit the prod-only rule (it should match the pre-prod rule or stay unrouted).
Sentry product reference: Issue alerts and filter conditions on event attributes.
Viewer Sentry (build-time DSN, runtime env)¶
The viewer SPA's Sentry DSN is baked into the bundle at build time
(VITE_SENTRY_DSN_VIEWER build-arg). The env= tag is injected at
request time by nginx sub_filter reading the container's
PODCAST_ENV env (so one viewer image serves prod + preprod with
correct tags).
To wire viewer Sentry (one-time):
- Create a third Sentry project: podcast-scraper-viewer (platform JavaScript / Vue).
- Copy the DSN.
- Set GHA repo secret: gh secret set VITE_SENTRY_DSN_VIEWER --repo chipi/podcast_scraper --app actions --body 'https://...'.
- Wait for the next push to main; stack-test.yml's viewer publish step bakes the DSN into the new image.
- Pull + restart on prod: gh workflow run deploy-prod.yml.
DSNs are write-only public tokens designed to ship with frontend code — baking into the public GHCR image is the standard pattern.
Tailscale operations¶
Reaching the VPS¶
The VPS joins the tailnet at provision time (cloud-init runs
tailscale up --auth-key=$TS_AUTHKEY --hostname=prod-podcast). On
your laptop:
tailscale status # confirm laptop is on the tailnet
ssh deploy@prod-podcast.<tailnet>.ts.net
# -OR by tailnet IP (rarely needed; MagicDNS handles names):
ssh deploy@$(tailscale ip -4 prod-podcast)
If the hostname doesn't resolve: a prior failed deploy may have left
an orphan device on the tailnet, with the live VPS auto-named
prod-podcast-1 (or -2, etc.). Check tailscale status | grep prod-podcast
and use whatever the actual current name is, or run
bash scripts/ops/resolve_prod_tailnet_host.sh with
PROD_TAILNET_FQDN set to your canonical value (see When the live hostname
has a numeric suffix).
Clean up orphans in the Tailscale admin
console to reclaim the canonical
name.
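To find the live name quickly, the tailscale status output can be filtered. The sketch below is not the repo's resolve_prod_tailnet_host.sh; pick_prod_host is a hypothetical helper that assumes the default column layout (IP, hostname, owner, OS, status) and that orphaned devices show offline in their row.

```shell
# Read `tailscale status` text on stdin and print the first device named
# prod-podcast or prod-podcast-<n> whose row is not marked offline.
# Assumes the default column layout: IP, hostname, owner, OS, status.
pick_prod_host() {
  awk '$2 ~ /^prod-podcast(-[0-9]+)?$/ && $0 !~ /offline/ { print $2; exit }'
}

# Usage:
#   tailscale status | pick_prod_host
```

If this prints a suffixed name such as prod-podcast-1, the canonical name is probably still held by an orphan that needs removing in the admin console.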
HTTPS over the tailnet¶
tailscale serve exposes the in-container viewer (on port 8080) as
HTTPS port 443 with an auto-issued TLS cert from Tailscale's CA.
On new or reprovisioned VPS hosts, cloud-init installs
/usr/local/sbin/podcast-tailscale-serve.sh and podcast-scraper.service
runs it after docker compose up (and clears serve on compose down).
The canonical script source is infra/cloud-init/podcast-tailscale-serve.sh
(injected by OpenTofu into prod.user-data); deploy and drill CI can copy it
onto the host with sudo install … when /etc/sudoers.d/99-podcast-deploy-tailscale-serve
includes that install rule.
Existing servers created before that change keep whatever serve state they
had; run the manual steps below once if MagicDNS HTTPS still fails.
Set up manually on older hosts (or to verify):
ssh deploy@prod-podcast.<tailnet>
sudo tailscale serve --bg 8080
sudo tailscale serve status # confirm: https://prod-podcast.<tailnet> → 127.0.0.1:8080
After this, https://prod-podcast.<tailnet>/ works from any tailnet
device, with a real cert (no browser warnings).
Editing the ACL¶
ACL lives in tailscale/policy.hujson in the repo. infra-apply.yml
syncs it to the live tailnet via the terraform provider. Edit, open
PR, merge → next infra-apply run pushes the change.
For ad-hoc / urgent ACL changes (e.g. adding a new device), the admin
console at https://login.tailscale.com/admin/acls allows direct edits
— but those will be overwritten by the next infra-apply run unless
you also commit them to the file.
Tailscale auth key vs API key¶
Both expire ≤ 90 days on Free plan. See Credential rotation → Tailscale credentials for the rotation flow. Reminder: set a calendar event for ~80 days out.
Hetzner operations¶
Console access¶
If the VPS becomes unreachable over Tailscale (tailscaled crashed, network down, etc.), Hetzner's web console gives serial access:
- https://console.hetzner.cloud → Servers → podcast-scraper-prod
- Console tab (top right) → opens a noVNC session
- Log in with the deploy user via SSH key (your local ~/.ssh/id_ed25519 was injected by cloud-init); you'll need to upload it via the "Send key" button, or use root login if the deploy user is broken
- Last-resort: Rescue mode boots a recovery system to chroot the real disk
Volume management¶
Optional Hetzner Volume for the corpus is created when
volume_size_gb > 0 in infra/terraform/variables.tf. cloud-init
auto-detects /mnt/HC_Volume_* and symlinks it as
/srv/podcast-scraper/corpus. Verify via mount | grep HC_Volume.
If you upsize the volume:
- Hetzner console → Volumes → resize (online, no downtime)
- SSH in: sudo resize2fs /dev/disk/by-id/scsi-0HC_Volume_<id>
- Confirm with df -h /srv/podcast-scraper/corpus
Firewall¶
Hetzner's cloud firewall (managed via terraform) allows ONLY:
- inbound UDP 41641 (Tailscale WireGuard)
- inbound ICMP (ping)
- outbound: all (for image pulls, package updates, provider API calls)
Note: there's NO public TCP 80/443 rule. The viewer is reachable
ONLY over the tailnet (via tailscale serve). Adding public ingress
requires editing infra/terraform/main.tf and an explicit edge security
design (do not rely on nginx Basic Auth alone).
API token rotation¶
See Credential rotation → Hetzner API token.
Operator hot-fix workflow¶
Use this when you've made a local change to a compose file or non-image-baked
config (e.g. compose/docker-compose.prod.yml,
compose/grafana-agent.yaml, nginx-prod.conf.template) and want to
test it on prod without waiting for a full main-push +
publish + deploy cycle (~25 min).
Limit: files COPY'd into the published image (api/viewer/pipeline Python source, baked nginx config, Vue bundle) require an image rebuild. This workflow only handles bind-mounted / overlay-defined files.
# 1. From your laptop, scp the updated file(s) into the right place on the VPS:
scp compose/docker-compose.prod.yml \
compose/grafana-agent.yaml \
docker/viewer/nginx-prod.conf.template \
deploy@prod-podcast.<tailnet>:/srv/podcast-scraper/compose/
# (paths must match what the bind mounts in the YAML expect)
# 2. Recreate the affected service so it picks up the change.
# --env-file is REQUIRED when running compose directly as deploy user;
# see "Why --env-file?" in the FAQ.
ssh deploy@prod-podcast.<tailnet> 'bash -s' <<'REMOTE'
set -euo pipefail
cd /srv/podcast-scraper
COMPOSE="docker compose --env-file /srv/podcast-scraper/.env \
-f compose/docker-compose.stack.yml \
-f compose/docker-compose.prod.yml \
-f compose/docker-compose.vps-prod.yml"
$COMPOSE up -d --force-recreate api grafana-agent viewer
$COMPOSE ps
REMOTE
After verifying the fix works on prod, commit the same change to a
branch + open a PR — otherwise the next time someone re-runs cloud-init
or git pulls on the VPS, your hot-fix gets blown away. There's no
auto-pull on prod, so you have a window, but don't trust it.
FAQ / Troubleshooting¶
"curl http://127.0.0.1:8000/api/health fails on the VPS but the app works"¶
The api service listens on 8000 inside the container only. Stock
compose uses expose, not ports, so nothing listens on the host's
127.0.0.1:8000. Use a check from the API health checks by
context table (container exec, viewer
port, or tailnet HTTPS). See GitHub issue
745.
"Why --env-file?"¶
docker compose -f compose/docker-compose.stack.yml ... resolves the
project directory to dirname(first -f file) = /srv/podcast-scraper/compose/,
not /srv/podcast-scraper/. Compose's auto-load of .env searches
the project dir, so it looks at compose/.env (which doesn't exist)
and skips the actual file at /srv/podcast-scraper/.env.
The systemd unit avoids this because EnvironmentFile=/srv/podcast-scraper/.env
loads the env into the SERVICE environment, which the spawned compose
process inherits regardless of project-dir resolution.
When running compose directly (deploy / debugging / hot-fix), always
pass --env-file /srv/podcast-scraper/.env explicitly.
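The lookup described above can be illustrated with plain path arithmetic. This sketch only mirrors the rule as stated in this section (project dir = directory of the first -f file); default_env_path is a hypothetical helper and does not call Compose.

```shell
# Where the automatic .env lookup lands, given the first -f file:
# the directory that contains that file.
default_env_path() {
  printf '%s/.env\n' "$(dirname "$1")"
}

default_env_path /srv/podcast-scraper/compose/docker-compose.stack.yml
# prints /srv/podcast-scraper/compose/.env
```

The printed path is not the real /srv/podcast-scraper/.env, which is why the explicit --env-file flag is needed.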
Corpus directory (host vs /app/output in containers)¶
Source of truth on disk: the host directory in
PODCAST_CORPUS_HOST_PATH inside /srv/podcast-scraper/.env. Always
confirm before runbooks, backups, or one-off CLI:
grep '^PODCAST_CORPUS_HOST_PATH=' /srv/podcast-scraper/.env
The cheat sheet must not hard-code a corpus
path without telling operators to run that grep.
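For scripts that need the value rather than the whole line, a small reader function is one option. This is a sketch: env_value is a hypothetical helper, it takes the last matching line if the key is repeated (an assumption about how duplicates should resolve), and it strips one pair of surrounding double quotes.

```shell
# Print the value of KEY from an env file.
# If KEY appears more than once, the last line is used (assumption).
env_value() {
  grep "^$1=" "$2" | tail -n 1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/'
}

# Usage on the VPS:
#   HOST_CORPUS=$(env_value PODCAST_CORPUS_HOST_PATH /srv/podcast-scraper/.env)
```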
Inside api / pipeline containers the same tree is mounted at
/app/output (default serve --output-dir in docker/api/Dockerfile).
Compose may implement corpus_data as a named volume whose data
directory still reflects that host path from the initial bind
definition in compose/docker-compose.prod.yml. To see Docker's view,
check the Mounts section of docker inspect <api-container>.
Topic clusters missing after a successful pipeline run¶
Expected behaviour: POST /api/jobs runs
full_incremental_pipeline only. Finalize calls maybe_index_corpus
when vector_search is true in the profile (see
src/podcast_scraper/workflow/orchestration.py), but nothing in that
path runs topic-clusters. So search/topic_clusters.json can be
absent even when search/vectors.faiss exists.
Profiles: cloud_thin sets vector_search: false — no FAISS index
from the pipeline; clustering cannot run until an indexing-capable
profile has built vectors.faiss. See profile YAMLs under
config/profiles/.
What to run: follow Prod operator cheat sheet — Topic clusters
(host venv with .[search], or docker exec into the running api
container). Reference clustering design in
RFC-075 and the API in
Server guide — GET /api/corpus/topic-clusters.
Troubleshooting:
- topic-clusters exits with missing FAISS: ensure $HOST_CORPUS/search/vectors.faiss exists (or /app/output/search/vectors.faiss in-container).
- 404 on GET /api/corpus/topic-clusters: artifact not built or wrong path query versus server default.
- Host python3 -m venv fails (ensurepip): install python3.12-venv (or distribution equivalent) with sudo once, or use the docker exec route in the cheat sheet.
- Artifact owned by root after an in-container write: chown back to deploy (and corpus group) so backups and host tools stay consistent.
Reprocess a corpus without re-transcribing audio¶
Use this when extraction/summarization/index logic changed and you want to recompute derived outputs from existing transcript files.
Safe rule: keep transcripts/; rebuild downstream artifacts only.
- Resolve corpus root from host env:
grep '^PODCAST_CORPUS_HOST_PATH=' /srv/podcast-scraper/.env
- Optional rollback snapshot (derived outputs only), then remove stale derived layers (search/, old metadata/ dirs / old run outputs). Keep transcripts/ intact.
- Re-run pipeline with both flags:
  - --skip-existing (reuse transcript files already on disk)
  - --no-transcribe-missing (do not invoke Whisper/audio transcription)
Example (host Python route):
ssh deploy@prod-podcast.<tailnet>.ts.net
cd /srv/podcast-scraper
HOST_CORPUS=$(grep '^PODCAST_CORPUS_HOST_PATH=' .env | cut -d= -f2-)
.venv/bin/python -m podcast_scraper.cli \
--config "$HOST_CORPUS/viewer_operator.yaml" \
--feeds-spec "$HOST_CORPUS/feeds.spec.yaml" \
--output-dir "$HOST_CORPUS" \
--skip-existing \
--no-transcribe-missing
- Optional post-step: rebuild topic clusters if the index exists:
.venv/bin/python -m podcast_scraper.cli topic-clusters --output-dir "$HOST_CORPUS"
For a copy/paste operator sequence (including backup and cleanup commands), see Prod operator cheat sheet — Reprocess from existing transcripts (no re-transcription).
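Because the safe rule depends on transcripts/ actually being populated, a pre-flight count before removing derived layers is a cheap guard. The sketch below is an assumption-laden helper pair (count_transcripts and require_transcripts are hypothetical names); it treats any file under a directory named transcripts/ anywhere below the corpus root as a transcript.

```shell
# Count regular files under any transcripts/ directory below the corpus root.
# The directory layout is an assumption; adjust the -path pattern if needed.
count_transcripts() {
  find "$1" -type f -path '*/transcripts/*' | wc -l | tr -d ' '
}

# Refuse to continue when there is nothing for --skip-existing to reuse.
require_transcripts() {
  n=$(count_transcripts "$1")
  if [ "$n" -eq 0 ]; then
    echo "no transcript files under $1; aborting" >&2
    return 1
  fi
  echo "found $n transcript file(s) under $1"
}

# Usage:
#   require_transcripts "$HOST_CORPUS" || exit 1
```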
"Pipeline fails with PODCAST_CORPUS_HOST_PATH is missing a value"¶
The api spawned a nested docker compose run pipeline-llm but couldn't
resolve ${PODCAST_CORPUS_HOST_PATH} from compose/docker-compose.prod.yml.
Causes (in order of likelihood):
- .env doesn't have PODCAST_CORPUS_HOST_PATH=... (check with grep PODCAST_CORPUS_HOST_PATH /srv/podcast-scraper/.env)
- The api container is running an OLD image that doesn't pass --env-file to the nested compose AND doesn't have the var in its env passthrough (pull/recreate api with the latest image)
- The api passthrough block in compose/docker-compose.prod.yml doesn't include PODCAST_CORPUS_HOST_PATH: (verify the file on prod matches the latest main)
"Pipeline fails with OpenAI API key required for OpenAI providers"¶
The pipeline container's OPENAI_API_KEY resolved to empty. Causes:
- Wrong variable name in .env (the most common cause). Code expects OPENAI_API_KEY (with _API_); typos like OPENAI_KEY silently resolve to the empty default. Check with grep -E '^OPENAI(_API)?_KEY=' /srv/podcast-scraper/.env
- The api container env doesn't have the key (passthrough missing / stale image); see hot-fix workflow + recreate api
- The key value itself is malformed (e.g. base64-encoded by accident)
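The first cause can be told apart from a missing key with a slightly stricter check than a bare grep. A minimal sketch, where check_openai_key_name is a hypothetical helper that only inspects the env file (it cannot see what the api container actually received):

```shell
# Report whether an env file defines OPENAI_API_KEY with a non-empty value,
# and flag the OPENAI_KEY near-miss.
check_openai_key_name() {
  if grep -Eq '^OPENAI_API_KEY=.+' "$1"; then
    echo "ok"
  elif grep -q '^OPENAI_KEY=' "$1"; then
    echo "typo: found OPENAI_KEY, expected OPENAI_API_KEY"
  else
    echo "missing"
  fi
}

# Usage on the VPS:
#   check_openai_key_name /srv/podcast-scraper/.env
```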
"Pipeline fails with Docker pipeline jobs require top-level pipeline_install_extras"¶
The corpus's viewer_operator.yaml doesn't declare
pipeline_install_extras: ml or : llm, AND the host has no
PODCAST_DEFAULT_PIPELINE_INSTALL_EXTRAS env fallback set.
Fix (durable): in the viewer, Sources → Operator tab, add
pipeline_install_extras: llm to the YAML, Save. Now the field
is in the file.
Fix (host-wide): set PODCAST_DEFAULT_PIPELINE_INSTALL_EXTRAS=llm
in /srv/podcast-scraper/.env and recreate the api. Falls back for
every corpus that omits the field.
The prod overlay sets the env default to llm out of the box;
this error only appears if someone explicitly unset it.
"Viewer container shows unhealthy but the page loads fine"¶
The healthcheck probe (wget against /) accepts any HTTP/ status
line as alive (compose/docker-compose.stack.yml). If you still see
unhealthy, you're on an old viewer image — pull the latest.
"Grafana agent restarting / no data in Grafana Cloud"¶
In order of likelihood:
- Wrong username for one of Prom / Loki: they have separate instance IDs in Grafana Cloud. Check both GRAFANA_CLOUD_PROM_USER and GRAFANA_CLOUD_LOKI_USER are set distinctly.
- API key missing the right scope: must include both metrics:write AND logs:write to feed both endpoints with one token
- URL has the wrong region: prometheus-prod-65-prod-eu-west-2 vs prometheus-prod-13-prod-us-central-0 etc. Re-copy from the stack page
Diagnose with:
ssh deploy@prod-podcast.<tailnet> \
'docker logs --tail 50 compose-grafana-agent-1 | grep -iE "error|401|403|forbidden"'
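The most likely cause, identical or missing user IDs, can also be checked straight from the env file. This is a sketch with a hypothetical helper name (check_grafana_users); it compares the two values textually and cannot tell whether either ID is actually valid for your stack.

```shell
# Print a diagnosis for the two Grafana Cloud user IDs in an env file.
check_grafana_users() {
  prom=$(grep '^GRAFANA_CLOUD_PROM_USER=' "$1" | tail -n 1 | cut -d= -f2-)
  loki=$(grep '^GRAFANA_CLOUD_LOKI_USER=' "$1" | tail -n 1 | cut -d= -f2-)
  if [ -z "$prom" ] || [ -z "$loki" ]; then
    echo "missing: set both GRAFANA_CLOUD_PROM_USER and GRAFANA_CLOUD_LOKI_USER"
  elif [ "$prom" = "$loki" ]; then
    echo "suspicious: both users are $prom"
  else
    echo "ok"
  fi
}

# Usage on the VPS:
#   check_grafana_users /srv/podcast-scraper/.env
```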
"Cursor or automation cannot ssh deploy@prod"¶
First confirm Tailscale on the machine that runs the command:
tailscale status | head -5
You should see prod-podcast-1 (or similar) as active. SSH to the VPS is
intended over the tailnet only; inbound SSH on the public Hetzner IPv4 is
not exposed in the default firewall.
Then confirm OpenSSH has a usable key for that shell:
ssh-add -l
If this prints "The agent has no identities.", no key is loaded for that
process. Keys added only in another app (for example Terminal.app) do not
always end up on the same SSH_AUTH_SOCK that a Cursor agent subprocess
inherits.
Time-limited agent load (recommended for IDE / agent ssh to prod): in
Cursor’s integrated terminal for this workspace, load the operator private
key into ssh-agent with a lifetime so it is not left loaded indefinitely:
ssh-add -t 30m ~/.ssh/id_ed25519
ssh-add -l
After 30 minutes the key is removed from the agent again; ssh-add -l will
go back to empty and agent-driven ssh will fail until you re-run ssh-add
-t …. That matches “it worked earlier today, then stopped.”
If your operator key is not ~/.ssh/id_ed25519, substitute the path that
matches TF_VAR_ssh_public_key / secrets.OPERATOR_SSH_PUBLIC_KEY.
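When an agent subprocess fails and it is unclear whether the key expired or the agent socket is wrong, the exit status of ssh-add -l distinguishes the two. A small sketch, with describe_agent_status as a hypothetical helper; the status meanings follow the ssh-add manual page (0 success, 1 failure including an empty agent, 2 agent unreachable).

```shell
# Translate the exit status of `ssh-add -l` into a hint.
describe_agent_status() {
  case "$1" in
    0) echo "agent has at least one key loaded" ;;
    1) echo "agent is reachable but has no keys; re-run ssh-add -t 30m <key>" ;;
    2) echo "no agent reachable; check SSH_AUTH_SOCK in this shell" ;;
    *) echo "unexpected ssh-add status: $1" ;;
  esac
}

# Usage:
#   ssh-add -l >/dev/null 2>&1; describe_agent_status $?
```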
macOS Keychain persistence (optional): if you prefer the key to survive agent restarts until reboot:
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
ssh-add -l
Sanity check:
ssh -o IdentitiesOnly=yes -i ~/.ssh/<operator-private-key> \
deploy@prod-podcast-1.<tailnet> 'echo ok'
Replace <tailnet> with your MagicDNS suffix (for example tail6d0ed4.ts.net).
If this fails but tailscale status shows prod as active, you are using the
wrong private key file for deploy@.
GitHub Actions does not use your laptop ssh-agent; it uses
PROD_SSH_PRIVATE_KEY and .github/actions/prod-ssh-key (see
GitHub Actions SSH to prod).
"Tailnet hostname won't resolve from my laptop"¶
tailscale status # is the laptop on the tailnet?
tailscale up # if not, log back in
tailscale ip -4 prod-podcast-1 # try variants if -1 / -2 suffix
If laptop is logged in and IP resolves but ssh hangs, the VPS may
have lost its tailscale registration (auth key expired before
re-up). Use Hetzner console (see Hetzner ops) to tailscale up again
with a fresh TS_AUTHKEY.
"I can't sudo as deploy"¶
Cloud-init explicitly sets sudo: false for the deploy user. This
is intentional — deploy owns /srv/podcast-scraper and is in the
docker group, which covers everything operators normally need.
Operations requiring root (apt, systemctl, etc.) need the root
user via Hetzner console or via the same ~/.ssh/id_ed25519 key
which cloud-init also injected into root's authorized_keys as
emergency access.
"Operator YAML on VPS keeps reverting"¶
If you edit /srv/podcast-scraper/corpus/viewer_operator.yaml
directly via SSH, then click Save in the viewer's Operator tab,
the viewer overwrites your on-disk edit with whatever's in the
textarea (which was loaded BEFORE your SSH edit and stayed cached).
Always edit via the viewer Operator tab, not on disk — viewer Save preserves the textarea content verbatim.
"Pipeline image just Pull complete then immediately exits"¶
Read the pipeline container's stdout (logged by the api factory):
ssh deploy@prod-podcast.<tailnet> \
'ls -lt /srv/podcast-scraper/corpus/jobs/*/log.txt | head -3'
# then cat the most recent one
Common causes:
- LLM key missing/wrong name (see openai key FAQ)
- Profile in operator YAML names a provider whose key isn't set
- Corpus directory has a permission issue (run owner mismatch)
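To jump straight to the newest job log instead of eyeballing the ls output, a one-line helper is enough. A sketch only: latest_job_log is a hypothetical name, and the jobs/<id>/log.txt layout is taken from the ls command in this section.

```shell
# Print the most recently modified log.txt under <corpus>/jobs/*/.
# Prints nothing if no job logs exist.
latest_job_log() {
  ls -t "$1"/jobs/*/log.txt 2>/dev/null | head -n 1
}

# Usage on the VPS:
#   cat "$(latest_job_log /srv/podcast-scraper/corpus)"
```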
"Stale Docker volume after env path changes"¶
docker volume inspect compose_corpus_data — if the device: field
doesn't match PODCAST_CORPUS_HOST_PATH, the volume was created with
old config. compose won't recreate volumes whose YAML changed. Fix:
ssh deploy@prod-podcast.<tailnet>
cd /srv/podcast-scraper
COMPOSE="docker compose --env-file /srv/podcast-scraper/.env \
-f compose/docker-compose.stack.yml -f compose/docker-compose.prod.yml \
-f compose/docker-compose.vps-prod.yml"
$COMPOSE down
docker volume rm compose_corpus_data
$COMPOSE up -d
The bind path content survives (host dir is the source of truth); only the Docker-side volume metadata is rebuilt.
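To confirm the mismatch before tearing the stack down (or to verify the result afterwards), the two paths can be compared directly. This is a sketch under assumptions: volume_is_stale is a hypothetical helper, and the --format template in the usage comment assumes the volume carries a local-driver device option, as the device: field mentioned above suggests.

```shell
# Succeed (exit 0) when the device recorded on the Docker volume differs
# from the expected host path. A trailing slash on either side is ignored.
volume_is_stale() {
  [ "${1%/}" != "${2%/}" ]
}

# Usage on the VPS:
#   device=$(docker volume inspect --format '{{ .Options.device }}' compose_corpus_data)
#   expected=$(grep '^PODCAST_CORPUS_HOST_PATH=' /srv/podcast-scraper/.env | cut -d= -f2-)
#   if volume_is_stale "$device" "$expected"; then echo "stale: $device != $expected"; fi
```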
Constraints to know¶
These are intentional design choices that look like bugs at first glance. Knowing about them saves debugging time.
- deploy user has no sudo (cloud-init: sudo: false). Use root via Hetzner console for sudo-needed operations. See FAQ → I can't sudo as deploy.
- Direct shell docker compose MUST use --env-file because the project dir resolves to compose/, not /srv/podcast-scraper/. Systemd-spawned compose is unaffected (different env-loading path).
- Grafana Cloud has separate Prom + Loki user IDs. Single GRAFANA_CLOUD_USER doesn't work; split into PROM_USER / LOKI_USER.
- Viewer image is published once per main push and used by BOTH codespace preprod AND the prod VPS. PODCAST_ENV is injected at runtime by nginx sub_filter; DON'T add --build-arg VITE_PODCAST_ENV=... back to stack-test.yml.
- pipeline_install_extras is not in the profile YAMLs. It's an operator-set field separate from profile. Default fallback comes from PODCAST_DEFAULT_PIPELINE_INSTALL_EXTRAS=llm env (set in the prod overlay).
- No public ingress. Hetzner firewall blocks all inbound TCP. Viewer reachable only over tailnet via tailscale serve. Adding public exposure requires firewall + auth changes; don't add one without the other.
- Auth keys (Tailscale) expire every 90 days max on Free plan. Set a calendar reminder.
- Profile dropdown is filtered to published images via PODCAST_AVAILABLE_PROFILES. Don't add a profile to the allowlist whose backing pipeline image isn't published.
- The published api image only ships [dev] extras. Pipeline runs MUST go through Docker job mode (PODCAST_PIPELINE_EXEC_MODE=docker, set in prod overlay). In-process pipeline runs would crash on missing [llm] deps.
Cross-references¶
- RFC-082: design
- RFC-083: public edge + multi-compose (Draft)
- Prod operator cheat sheet
- VPS multi-app onboarding
- infra/: IaC code
- tailscale/policy.hujson: ACL
- .github/workflows/deploy-prod.yml
- .github/workflows/infra-apply.yml: manual apply gate
- .github/workflows/backup-corpus-prod.yml
- .github/workflows/prod-restore-corpus.yml: manual prod corpus restore from backup releases (confirm PROD_RESTORE)
- #714: account prereqs checklist
- #723: Phase B cutover
- #724: DR drill
- #751: DR drill prerequisites (before #724)