DR drill runbook (Hetzner + GitHub Actions)
This guide is drill-only: Hetzner workspace drill, Tailscale tag:dr-drill, and the
GitHub Actions workflows that target deploy@ on the drill VPS. Production VPS steps,
shared Tailscale credential setup, and full disaster-recovery narrative stay in
PROD_RUNBOOK.md (see GitHub Actions SSH to prod, GitHub Actions deploy to DR drill,
and Disaster recovery).
Full-system context (Tailscale + OpenTofu + GHA + Compose): Hosting and infrastructure.
IaC layout and OpenTofu workspace rules: infra/README.md (repo root) (section DR drill workspace).
Workflow index (CI file names and triggers): WORKFLOWS.md (OpenTofu / DR drill paragraph).
Stack contract (audit table, steady vs recovery): STACK_CONTRACT.md (ADR-093).
Timed exercise: use this runbook and GitHub #724; readiness is tracked in #751.
Steady-state vs full drill cycle
Routine validation on a drill host (deploy only): preflight → drill-deploy → drill-e2e
smoke and/or drill-stack-playwright. This follows the same stack discipline as prod; see
STACK_CONTRACT.md.
The full orchestrated exercise adds infra apply, a simulated corpus restore
(drill-restore-corpus), and an unconditional destroy at the end; see
Orchestrated full cycle below. Restore is not
implied for steady-state prod or Codespace bring-up.
When to use which doc
| Need | Doc |
|---|---|
| Prod deploy@, prod Tailscale, prod backup/restore | PROD_RUNBOOK.md |
| Drill deploy@, drill workflows, full-cycle orchestrator, drill destroy | This runbook + infra/README.md (repo root) |
| OpenTofu state encryption, terraform.tfstate.enc.drill | infra/README.md (repo root) |
| Whole-system narrative (diagrams, prod + drill) | Hosting and infrastructure |
| Corpus snapshot manifest, local Make vs Actions (prod / pre-prod / drill) | CORPUS_SNAPSHOT_MANIFEST_AND_RESTORE.md |
Keeping drill material here avoids duplicating long prod sections and keeps PROD_RUNBOOK shorter for day-to-day prod operators.
Repository secrets and variables (drill)
| Name | Kind | Used by |
|---|---|---|
| HCLOUD_TOKEN_DRILL | secret | drill-infra-plan, drill-infra-apply, drill-infra-destroy, orchestrator |
| TS_API_KEY | secret | Same infra workflows (Tailscale provider) |
| TFSTATE_AGE_KEY | secret | sops decrypt/encrypt for drill state |
| OPERATOR_SSH_PUBLIC_KEY | secret | Same as prod infra (public half only; log masking). First-boot deploy@ authorized_keys is written only from this value. |
| TS_AUTHKEY | secret | drill-deploy, drill-e2e, drill-restore-corpus, orchestrator app jobs |
| DRILL_DEPLOY_SSH_PRIVATE_KEY | secret | Same app jobs (PEM for deploy@ on drill) |
| BACKUP_REPO_TOKEN | secret (optional) | drill-restore-corpus when chipi/podcast_scraper-backup is private |
| TAILNET_NAME | variable | Infra TF_VAR_tailscale_tailnet |
| DRILL_TAILNET_FQDN | variable | Resolver input (e.g. dr-podcast.tail-xxxx.ts.net) |
ACL note: tag:gha-deployer → tag:dr-drill:22 must exist for CI SSH; prod infra-apply is
the usual path to land tailscale/policy.hujson changes.
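Before dispatching anything, it can help to confirm that the names in the table exist on the repository. The sketch below assumes an authenticated gh CLI that prints one item per line with the name in the first column; the function names are hypothetical and not part of the repo. Secret values cannot be read back, so only names are compared.

```shell
# Names from the table above. BACKUP_REPO_TOKEN is optional, so it is not listed.
REQUIRED_SECRETS="HCLOUD_TOKEN_DRILL TS_API_KEY TFSTATE_AGE_KEY OPERATOR_SSH_PUBLIC_KEY TS_AUTHKEY DRILL_DEPLOY_SSH_PRIVATE_KEY"
REQUIRED_VARIABLES="TAILNET_NAME DRILL_TAILNET_FQDN"

# missing_names REQUIRED PRESENT
# Prints each name in REQUIRED (space-separated) that does not appear in
# PRESENT (space- or newline-separated). Returns 1 if any name is missing.
missing_names() {
  required=$1
  present=$2
  missing=0
  for name in $required; do
    if ! printf '%s\n' $present | grep -Fqx -- "$name"; then
      printf '%s\n' "$name"
      missing=1
    fi
  done
  return "$missing"
}

# check_drill_repo_config REPO (hypothetical helper; needs an authenticated gh)
# Lists repository secrets and variables and reports any missing names.
check_drill_repo_config() {
  repo=$1
  rc=0
  secrets=$(gh secret list -R "$repo" | awk '{print $1}')
  variables=$(gh variable list -R "$repo" | awk '{print $1}')
  missing_names "$REQUIRED_SECRETS" "$secrets" || rc=1
  missing_names "$REQUIRED_VARIABLES" "$variables" || rc=1
  return "$rc"
}
```

Calling check_drill_repo_config chipi/podcast_scraper prints nothing and returns 0 when every required name is present.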
Public SSH on the drill VPS (break-glass)
Drill terraform.drill.ci.tfvars sets hcloud_inbound_ssh_troubleshoot_cidrs so the Hetzner
firewall allows inbound TCP/22 from the listed CIDRs (CI uses IPv4+IPv6 any). Use the
server’s public IPv4 and the same operator key as OPERATOR_SSH_PUBLIC_KEY:
ssh -i ~/.ssh/<your-operator-key> deploy@<PUBLIC_IPV4>
First-boot prod.user-data also installs that public key for root@<PUBLIC_IPV4> if you
prefer root while debugging.
Where to read <PUBLIC_IPV4>: after a successful drill-infra-apply or drill-infra-plan
run (when drill state already contains a server), open the workflow run’s Summary tab — the job
posts Break-glass SSH with the IPv4 and a copy-paste ssh deploy@… line. You can still use
Hetzner console, hcloud server list, or tofu output -raw ipv4_address in workspace drill
if you prefer.
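The tofu route can be scripted. This is a minimal sketch: it assumes the OpenTofu directory is infra/terraform, that the drill state has already been decrypted there, and that the helper names (hypothetical) suit your shell setup. It prints the ssh line instead of running it, so you can check it first.

```shell
# is_ipv4 ADDR: succeed only for a dotted quad with every octet in 0..255.
is_ipv4() {
  case $1 in
    '' | *[!0-9.]* | *..* | .* | *.) return 1 ;;
  esac
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into octets
  IFS=$old_ifs
  [ $# -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
}

# drill_public_ipv4 TF_DIR: read ipv4_address from workspace drill.
# Runs in a subshell so the caller's working directory is unchanged.
drill_public_ipv4() {
  (
    cd "$1" &&
      tofu workspace select drill >/dev/null &&
      tofu output -raw ipv4_address
  )
}

# break_glass_ssh_command KEY_PATH IPV4: print the ssh line for review.
break_glass_ssh_command() {
  is_ipv4 "$2" || { echo "not an IPv4 address: $2" >&2; return 1; }
  printf 'ssh -i %s deploy@%s\n' "$1" "$2"
}
```

A typical call chains the two: break_glass_ssh_command ~/.ssh/<your-operator-key> "$(drill_public_ipv4 infra/terraform)".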
Use this when deploy@<DRILL_TAILNET_FQDN> is not reachable because Tailscale has not come up.
Prod leaves hcloud_inbound_ssh_troubleshoot_cidrs empty (SSH only over the tailnet). For a
long-lived drill VM, override the CIDR list in a gitignored tfvars to your /32 instead of 0.0.0.0/0.
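The /32 override amounts to one tfvars line. The helper below is a hypothetical sketch that renders it; it assumes hcloud_inbound_ssh_troubleshoot_cidrs is a list of CIDR strings, which you should confirm against the variable definition in the repo.

```shell
# render_ssh_cidr_override IPV4
# Prints a tfvars line restricting break-glass SSH to a single host (/32).
# Basic guard only: the argument must consist of digits and dots.
render_ssh_cidr_override() {
  case $1 in
    '' | *[!0-9.]*)
      echo "expected a bare IPv4 address, got: $1" >&2
      return 1
      ;;
  esac
  printf 'hcloud_inbound_ssh_troubleshoot_cidrs = ["%s/32"]\n' "$1"
}
```

Redirect the output to a gitignored tfvars file of your choosing and pass that file to tofu with -var-file after the CI tfvars, so the later value wins.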
Typed confirms (copy exactly)
| String | Workflow | Meaning |
|---|---|---|
| DRILL_FULL_CYCLE or DRILL_EXERCISE | drill-exercise.yml | Gates full orchestrator (infra + app + always destroy) |
| APPLY | drill-infra-apply.yml (manual only) | OpenTofu apply workspace drill |
| DRILL_DESTROY | drill-infra-destroy.yml (manual only) | tofu destroy + Hetzner hcloud sweep + Tailscale device delete (tag:dr-drill or drill tailnet_hostname); standalone uses git state; orchestrator uses apply artifact (same run) |
| DRILL_RESTORE | drill-restore-corpus.yml (manual only) | Overwrite drill corpus/ from backup snapshot.tgz |
| DRILL_SMOKE | drill-e2e.yml (manual or orchestrator) | Read-only /api/health via tailnet SSH (host viewer :8080 adapter) |
| DRILL_STACK_PLAYWRIGHT | drill-stack-playwright.yml (manual; orchestrator uses skip_confirm) | Run tests/stack-test/stack-viewer.spec.ts against drill HTTPS |
The orchestrator calls drill-infra-apply and drill-infra-destroy with internal
skip_confirm: true after the single DRILL_FULL_CYCLE / DRILL_EXERCISE gate; standalone
runs still require APPLY / DRILL_DESTROY as above.
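Because the confirms must be copied exactly, a local wrapper can reject a mistyped string before a dispatch is ever sent. The sketch below encodes the table as a case statement and compares case-sensitively; the function names are hypothetical, and how each workflow validates its input internally is not shown in this runbook.

```shell
# expected_confirm WORKFLOW_FILE: print the confirm string from the table above.
expected_confirm() {
  case $1 in
    drill-exercise.yml)         echo DRILL_FULL_CYCLE ;;  # DRILL_EXERCISE also accepted
    drill-infra-apply.yml)      echo APPLY ;;
    drill-infra-destroy.yml)    echo DRILL_DESTROY ;;
    drill-restore-corpus.yml)   echo DRILL_RESTORE ;;
    drill-e2e.yml)              echo DRILL_SMOKE ;;
    drill-stack-playwright.yml) echo DRILL_STACK_PLAYWRIGHT ;;
    *) echo "no typed confirm known for $1" >&2; return 1 ;;
  esac
}

# confirm_matches WORKFLOW_FILE TYPED
# Exact comparison: wrong case or stray whitespace is rejected.
confirm_matches() {
  want=$(expected_confirm "$1") || return 1
  [ "$2" = "$want" ] && return 0
  [ "$1" = drill-exercise.yml ] && [ "$2" = DRILL_EXERCISE ]
}
```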
Orchestrated full cycle (drill-exercise.yml)
Ordered jobs:
1. drill-infra-plan: tofu fmt -check, tofu validate, tofu plan (no apply).
2. drill-infra-apply: tofu apply for workspace drill; uploads terraform-state-after-apply-drill.
3. drill-tfstate-bridge: caller job that re-downloads that artifact and uploads drill-tfstate-for-teardown so drill-infra-destroy (a reusable workflow) can read state without a git commit.
4. drill-deploy: infra/deploy/deploy.sh (in-container api:8000/api/health); workflow may also curl host viewer :8080 as an adapter smoke (see STACK_CONTRACT).
5. drill-restore-corpus: resolve newest compatible snapshot-prod-* via snapshot.manifest.json (or pin backup_tag); download_and_verify_snapshot.sh on the runner, then restore_corpus_from_tarball_host.sh on the drill host (manifest hub).
6. drill-e2e: tailnet SSH adapter smoke against host viewer :8080/api/health (read-only).
7. drill-stack-playwright: tests/stack-test/stack-viewer.spec.ts over HTTPS against the live drill host (browser + API + corpus).
8. finalize: runs with if: always() so the next job still runs when a middle job failed.
9. drill-infra-destroy: downloads drill-tfstate-for-teardown (or apply artifact as fallback), then tofu destroy, Hetzner API sweep, Tailscale API removal of drill devices (always after finalize).
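The control flow of the job list (stop at the first failure, but always tear down) can be illustrated locally. The sketch below is a shell analogy of the if: always() behavior, not the workflow's implementation; run_cycle and the step names passed to it are hypothetical.

```shell
# run_cycle TEARDOWN STEP...
# Runs each STEP (a command or function name) in order and stops at the first
# failure, then runs TEARDOWN regardless. Returns the first failing status,
# or TEARDOWN's status when every step succeeded.
run_cycle() {
  teardown=$1
  shift
  cycle_rc=0
  for step in "$@"; do
    echo "==> $step"
    "$step" || { cycle_rc=$?; echo "!! $step failed ($cycle_rc)"; break; }
  done
  echo "==> $teardown (always)"
  teardown_rc=0
  "$teardown" || teardown_rc=$?
  [ "$cycle_rc" -ne 0 ] || cycle_rc=$teardown_rc
  return "$cycle_rc"
}
```

With shell functions standing in for the jobs, run_cycle destroy plan apply deploy restore smoke still runs destroy when deploy fails, and reports the deploy status to the caller.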
Each job that uses GitHub Environment drill may require a separate approval if your org
configured reviewers on that environment.
Dispatch example:
gh workflow run drill-exercise.yml -R chipi/podcast_scraper \
-f confirm=DRILL_FULL_CYCLE
Optional inputs: backup_tag, backup_repo, override_image_sha (see workflow file).
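A dispatch that passes the optional inputs only when they are set might look like the sketch below. The helper names and the GH override (useful for a dry run with GH=echo) are assumptions; check the workflow file for the exact input semantics before relying on it. backup_repo can be added the same way.

```shell
# dispatch_drill_exercise REPO [BACKUP_TAG] [OVERRIDE_IMAGE_SHA]
# Builds the gh argument list and omits optional inputs that are empty.
# Set GH=echo to print the arguments instead of dispatching.
dispatch_drill_exercise() {
  repo=$1 backup_tag=${2:-} image_sha=${3:-}
  set -- workflow run drill-exercise.yml -R "$repo" -f confirm=DRILL_FULL_CYCLE
  [ -z "$backup_tag" ] || set -- "$@" -f "backup_tag=$backup_tag"
  [ -z "$image_sha" ] || set -- "$@" -f "override_image_sha=$image_sha"
  ${GH:-gh} "$@"
}

# watch_latest_drill_exercise REPO
# Follows the most recent drill-exercise run until it finishes. If someone else
# dispatched at the same moment, the latest run may not be yours.
watch_latest_drill_exercise() {
  run_id=$(gh run list -R "$1" --workflow drill-exercise.yml --limit 1 \
    --json databaseId --jq '.[0].databaseId') || return 1
  gh run watch "$run_id" -R "$1" --exit-status
}
```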
App-only path (no infra create/destroy)
Use individual workflows when you already have a drill VM and only want the app pipeline:
- drill-deploy.yml
- drill-restore-corpus.yml (confirm DRILL_RESTORE)
- drill-e2e.yml (confirm DRILL_SMOKE)
- drill-stack-playwright.yml (confirm DRILL_STACK_PLAYWRIGHT; runs stack-viewer.spec.ts against drill HTTPS)
Do not use drill-exercise.yml for that case: it always ends in drill-infra-destroy.
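The app-only sequence can be written out as explicit dispatch commands. In the sketch below the input name confirm is assumed to match the one used by drill-exercise.yml and drill-infra-destroy.yml, and drill-deploy.yml is assumed to need no inputs; verify both against the workflow files. The helper prints the commands and does not run them.

```shell
# app_only_commands REPO: print one gh command per app-only step, in order.
app_only_commands() {
  printf 'gh workflow run drill-deploy.yml -R %s\n' "$1"
  printf 'gh workflow run drill-restore-corpus.yml -R %s -f confirm=DRILL_RESTORE\n' "$1"
  printf 'gh workflow run drill-e2e.yml -R %s -f confirm=DRILL_SMOKE\n' "$1"
  printf 'gh workflow run drill-stack-playwright.yml -R %s -f confirm=DRILL_STACK_PLAYWRIGHT\n' "$1"
}
```

gh workflow run returns as soon as the dispatch is accepted, so wait for each run to finish (for example with gh run watch) before sending the next command.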
Drill host .env checklist (after first boot or restore)
On /srv/podcast-scraper/.env (mode 600), typical values include:
- PODCAST_ENV=dr-drill (or another non-prod label for Sentry/Grafana filters).
- PODCAST_CORPUS_HOST_PATH=/srv/podcast-scraper/corpus (bind mount source on host).
- PODCAST_DOCKER_PROJECT_DIR=/srv/podcast-scraper (compose bind mount for api).
- PODCAST_DEFAULT_CORPUS_PATH=/app/output: pre-fills the viewer corpus path (container view of the same bind mount as the api). See compose PODCAST_DEFAULT_CORPUS_PATH and nginx templates under docker/viewer/.
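The checklist can be verified on the host with a short script. This sketch checks only the file mode and that each listed key has a non-empty value; it does not validate the values themselves. check_drill_env is a hypothetical name, and stat -c is the GNU form found on a Linux host.

```shell
# check_drill_env [ENV_FILE]
# Prints one line per problem and returns 1 if anything is wrong.
check_drill_env() {
  env_file=${1:-/srv/podcast-scraper/.env}
  problems=0
  [ -f "$env_file" ] || { echo "missing file: $env_file"; return 1; }

  mode=$(stat -c '%a' "$env_file")   # GNU stat: octal permission bits
  if [ "$mode" != 600 ]; then
    echo "mode is $mode, expected 600"
    problems=1
  fi

  for key in PODCAST_ENV PODCAST_CORPUS_HOST_PATH \
             PODCAST_DOCKER_PROJECT_DIR PODCAST_DEFAULT_CORPUS_PATH; do
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "unset or empty: $key"
      problems=1
    fi
  done
  return "$problems"
}
```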
Grafana Cloud: if GRAFANA_CLOUD_* URLs are placeholders (for example invalid.invalid), the
Grafana Agent will not ship metrics; that is expected for isolated drills unless you point drill at
real Grafana Cloud endpoints.
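To see at a glance whether a drill host is still on placeholders, you can list the GRAFANA_CLOUD_* keys whose values mention invalid.invalid. The sketch assumes those keys live in the same .env file as the checklist above; grafana_placeholders is a hypothetical helper.

```shell
# grafana_placeholders [ENV_FILE]
# Prints the name of every GRAFANA_CLOUD_* key whose value contains the
# invalid.invalid placeholder. Prints nothing when all endpoints look real.
grafana_placeholders() {
  grep -E '^GRAFANA_CLOUD_[A-Z0-9_]*=.*invalid\.invalid' \
    "${1:-/srv/podcast-scraper/.env}" | cut -d= -f1
}
```

Any output means metrics are not being shipped, which is the expected state for an isolated drill.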
Manual infra destroy (without orchestrator)
gh workflow run drill-infra-destroy.yml -R chipi/podcast_scraper -f confirm=DRILL_DESTROY
The job runs tofu destroy, Hetzner hcloud sweep, and Tailscale DELETE /device for nodes with tag:dr-drill or hostname equal to tailnet_hostname in terraform.drill.ci.tfvars.
When you run drill-exercise, drill-infra-destroy downloads drill-tfstate-for-teardown from the drill-tfstate-bridge job (same workflow run), so destroy matches what apply created without committing terraform.tfstate.enc.drill to git. A standalone destroy (only this workflow) still uses the encrypted file from the git checkout; commit that file after standalone apply if you need destroy/plan to track that state.
Download the uploaded terraform-state-after-destroy-drill artifact when you need to commit an
updated infra/terraform/terraform.tfstate.enc.drill (same pattern as drill-infra-apply).
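Fetching the artifact and staging the state file can be sketched as follows. The run ID comes from the destroy run you want to track. The name of the file inside the artifact is not stated here, so inspect the download directory before copying. Both helper names are hypothetical, and GH=echo gives a dry run of the download.

```shell
# fetch_destroy_state RUN_ID REPO DEST_DIR
# Downloads the terraform-state-after-destroy-drill artifact from one run.
fetch_destroy_state() {
  ${GH:-gh} run download "$1" -R "$2" \
    -n terraform-state-after-destroy-drill -D "$3"
}

# stage_state_file DOWNLOADED_FILE REPO_ROOT
# Copies the encrypted state into place and stages it; committing is left
# to the operator so the diff can be reviewed first.
stage_state_file() {
  cp "$1" "$2/infra/terraform/terraform.tfstate.enc.drill" &&
    git -C "$2" add infra/terraform/terraform.tfstate.enc.drill
}
```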
Prod corpus restore (separate workflow)
prod-restore-corpus.yml is documented in PROD_RUNBOOK.md (corpus backups and
cross-references). It uses PROD_SSH_PRIVATE_KEY, Environment prod, and confirm PROD_RESTORE.
It is not part of the drill orchestrator.
See also
- PROD_OPERATOR_CHEAT_SHEET.md — short prod-oriented reminders.
- RFC-082 — Always-on hosting — design context.