DR drill runbook (Hetzner + GitHub Actions)¶

This guide is drill-only: Hetzner workspace drill, Tailscale tag:dr-drill, and the GitHub Actions workflows that target deploy@ on the drill VPS. Production VPS steps, shared Tailscale credential setup, and full disaster-recovery narrative stay in PROD_RUNBOOK.md (see GitHub Actions SSH to prod, GitHub Actions deploy to DR drill, and Disaster recovery).

Full-system context (Tailscale + OpenTofu + GHA + Compose): Hosting and infrastructure.

IaC layout and OpenTofu workspace rules: infra/README.md (repo root) (section DR drill workspace).

Workflow index (CI file names and triggers): WORKFLOWS.md (OpenTofu / DR drill paragraph).

Stack contract (audit table, steady vs recovery): STACK_CONTRACT.md (ADR-093).

Timed exercise: use this runbook and GitHub #724; readiness is tracked in #751.

Steady-state vs full drill cycle¶

Routine validation on a drill host (deploy only): preflight → drill-deploy → drill-e2e smoke and/or drill-stack-playwright — same stack discipline as prod; see STACK_CONTRACT.md.

Full orchestrated exercise adds infra apply, simulated corpus restore (drill-restore-corpus), and always destroy — see Orchestrated full cycle below. Restore is not implied for steady prod or Codespace bring-up.

When to use which doc¶

Need	Doc
Prod `deploy@`, prod Tailscale, prod backup/restore	PROD_RUNBOOK.md
Drill `deploy@`, drill workflows, full-cycle orchestrator, drill destroy	This runbook + `infra/README.md` (repo root)
OpenTofu state encryption, `terraform.tfstate.enc.drill`	`infra/README.md` (repo root)
Whole-system narrative (diagrams, prod + drill)	Hosting and infrastructure
Corpus snapshot manifest, local Make vs Actions (prod / pre-prod / drill)	CORPUS_SNAPSHOT_MANIFEST_AND_RESTORE.md

Keeping drill material here avoids duplicating long prod sections and keeps PROD_RUNBOOK shorter for day-to-day prod operators.

Repository secrets and variables (drill)¶

Name	Kind	Used by
`HCLOUD_TOKEN_DRILL`	secret	`drill-infra-plan`, `drill-infra-apply`, `drill-infra-destroy`, orchestrator
`TS_API_KEY`	secret	Same infra workflows (Tailscale provider)
`TFSTATE_AGE_KEY`	secret	sops decrypt/encrypt for drill state
`OPERATOR_SSH_PUBLIC_KEY`	secret	Same as prod infra (public half only; log masking). First-boot `deploy@` `authorized_keys` is written only from this value.
`TS_AUTHKEY`	secret	`drill-deploy`, `drill-e2e`, `drill-restore-corpus`, orchestrator app jobs
`DRILL_DEPLOY_SSH_PRIVATE_KEY`	secret	Same app jobs (PEM for `deploy@` on drill)
`BACKUP_REPO_TOKEN`	secret (optional)	`drill-restore-corpus` when `chipi/podcast_scraper-backup` is private
`TAILNET_NAME`	variable	Infra `TF_VAR_tailscale_tailnet`
`DRILL_TAILNET_FQDN`	variable	Resolver input (e.g. `dr-podcast.tail-xxxx.ts.net`)

ACL note: tag:gha-deployer → tag:dr-drill:22 must exist for CI SSH; prod infra-apply is the usual path to land tailscale/policy.hujson changes.

Public SSH on the drill VPS (break-glass)¶

Drill terraform.drill.ci.tfvars sets hcloud_inbound_ssh_troubleshoot_cidrs so the Hetzner firewall allows inbound TCP/22 from the listed CIDRs (CI uses IPv4+IPv6 any). Use the server’s public IPv4 and the same operator key as OPERATOR_SSH_PUBLIC_KEY:

ssh -i ~/.ssh/<your-operator-key> deploy@<PUBLIC_IPV4>

First-boot prod.user-data also installs that public key for root@<PUBLIC_IPV4> if you prefer root while debugging.

Where to read <PUBLIC_IPV4>: after a successful drill-infra-apply or drill-infra-plan run (when drill state already contains a server), open the workflow run’s Summary tab — the job posts Break-glass SSH with the IPv4 and a copy-paste ssh deploy@… line. You can still use Hetzner console, hcloud server list, or tofu output -raw ipv4_address in workspace drill if you prefer.

Use this when deploy@<DRILL_TAILNET_FQDN> is not reachable because Tailscale has not come up. Prod leaves hcloud_inbound_ssh_troubleshoot_cidrs empty (SSH only over the tailnet). For a long-lived drill VM, override the CIDR list in a gitignored tfvars to your /32 instead of 0.0.0.0/0.

Typed confirms (copy exactly)¶

String	Workflow	Meaning
`DRILL_FULL_CYCLE` or `DRILL_EXERCISE`	`drill-exercise.yml`	Gates full orchestrator (infra + app + always destroy)
`APPLY`	`drill-infra-apply.yml` (manual only)	OpenTofu apply workspace `drill`
`DRILL_DESTROY`	`drill-infra-destroy.yml` (manual only)	`tofu destroy` + Hetzner `hcloud` sweep + Tailscale device delete (`tag:dr-drill` or drill `tailnet_hostname`); standalone uses git state — orchestrator uses apply artifact (same run)
`DRILL_RESTORE`	`drill-restore-corpus.yml` (manual only)	Overwrite drill `corpus/` from backup `snapshot.tgz`
`DRILL_SMOKE`	`drill-e2e.yml` (manual or orchestrator)	Read-only `/api/health` via tailnet SSH (host viewer `:8080` adapter)
`DRILL_STACK_PLAYWRIGHT`	`drill-stack-playwright.yml` (manual; orchestrator uses `skip_confirm`)	Run `tests/stack-test/stack-viewer.spec.ts` against drill HTTPS

The orchestrator calls drill-infra-apply and drill-infra-destroy with internal skip_confirm: true after the single DRILL_FULL_CYCLE / DRILL_EXERCISE gate; standalone runs still require APPLY / DRILL_DESTROY as above.

Orchestrated full cycle (`drill-exercise.yml`)¶

Ordered jobs:

drill-infra-plan — tofu fmt -check, tofu validate, tofu plan (no apply).
drill-infra-apply — tofu apply for workspace drill; uploads terraform-state-after-apply-drill.
drill-tfstate-bridge — caller job: re-downloads that artifact and uploads drill-tfstate-for-teardown so drill-infra-destroy (a reusable workflow) can read state without a git commit.
drill-deploy — infra/deploy/deploy.sh (in-container api :8000 /api/health); workflow may also curl host viewer :8080 as an adapter smoke (see STACK_CONTRACT).
drill-restore-corpus — resolve newest compatible snapshot-prod-* via snapshot.manifest.json (or pin backup_tag); download_and_verify_snapshot.sh on the runner, then restore_corpus_from_tarball_host.sh on the drill host (manifest hub).
drill-e2e — tailnet SSH adapter smoke: host viewer :8080 /api/health (read-only).
drill-stack-playwright — tests/stack-test/stack-viewer.spec.ts over HTTPS against the live drill host (browser + API + corpus).
finalize — runs if: always() so the next step still runs when a middle job failed.
drill-infra-destroy — downloads drill-tfstate-for-teardown (or apply artifact as fallback), then tofu destroy, Hetzner API sweep, Tailscale API removal of drill devices (always after finalize).

Each job that uses GitHub Environment drill may require a separate approval if your org configured reviewers on that environment.

Dispatch example:

gh workflow run drill-exercise.yml -R chipi/podcast_scraper \
  -f confirm=DRILL_FULL_CYCLE

Optional inputs: backup_tag, backup_repo, override_image_sha (see workflow file).

App-only path (no infra create/destroy)¶

Use individual workflows when you already have a drill VM and only want the app pipeline:

drill-deploy.yml
drill-restore-corpus.yml (confirm DRILL_RESTORE)
drill-e2e.yml (confirm DRILL_SMOKE)
drill-stack-playwright.yml (confirm DRILL_STACK_PLAYWRIGHT — runs stack-viewer.spec.ts against drill HTTPS)

Do not use drill-exercise.yml for that case: it always ends in drill-infra-destroy.

Drill host `.env` checklist (after first boot or restore)¶

On /srv/podcast-scraper/.env (modes 600), typical values include:

PODCAST_ENV=dr-drill (or another non-prod label for Sentry/Grafana filters).
PODCAST_CORPUS_HOST_PATH=/srv/podcast-scraper/corpus (bind mount source on host).
PODCAST_DOCKER_PROJECT_DIR=/srv/podcast-scraper (compose bind mount for api).
PODCAST_DEFAULT_CORPUS_PATH=/app/output — pre-fills the viewer corpus path (container view of the same bind mount as the api). See compose PODCAST_DEFAULT_CORPUS_PATH and nginx templates under docker/viewer/.

Grafana Cloud: if GRAFANA_CLOUD_* URLs are placeholders (for example invalid.invalid), the Grafana Agent will not ship metrics; that is expected for isolated drills unless you point drill at real Grafana Cloud endpoints.

Manual infra destroy (without orchestrator)¶

gh workflow run drill-infra-destroy.yml -R chipi/podcast_scraper -f confirm=DRILL_DESTROY

The job runs tofu destroy, Hetzner hcloud sweep, and Tailscale DELETE /device for nodes with tag:dr-drill or hostname equal to tailnet_hostname in terraform.drill.ci.tfvars.

When you run drill-exercise, drill-infra-destroy downloads drill-tfstate-for-teardown from the drill-tfstate-bridge job (same workflow run), so destroy matches what apply created without committing terraform.tfstate.enc.drill to git. A standalone destroy (only this workflow) still uses the encrypted file from the git checkout; commit that file after standalone apply if you need destroy/plan to track that state.

Download the uploaded terraform-state-after-destroy-drill artifact when you need to commit an updated infra/terraform/terraform.tfstate.enc.drill (same pattern as drill-infra-apply).

Prod corpus restore (separate workflow)¶

prod-restore-corpus.yml is documented in PROD_RUNBOOK.md (corpus backups and cross-references). It uses PROD_SSH_PRIVATE_KEY, Environment prod, and confirm PROD_RESTORE. It is not part of the drill orchestrator.