RFC-083: Production Failover - Orchestration, Spare Stack, and Traffic Cutover
- Status: Draft
- Authors: Podcast Scraper Team
- Stakeholders: Operator (infra), on-call (future)
- Related PRDs: —
- Related ADRs (decisions extracted from this RFC; ratified as Accepted):
  - ADR-089: orchestrator boundary vs DR drill
  - ADR-090: DNS-first cutover on tailnet
  - ADR-091: GHA triggers and safety gates
- Related RFCs:
  - RFC-082: always-on hosting, drill workspace
  - RFC-084: snapshot manifest and newest-compatible restore (Completed on this branch; #763)
- Related documents:
  - GitHub #764: problem statement, DNS path, runbook intent
  - GitHub #762: stack contract vs adapters (ADR-093; operator hub under docs/guides/STACK_CONTRACT.md)
  - ADR-081, ADR-082, ADR-083, ADR-092
  - CORPUS_SNAPSHOT_MANIFEST_AND_RESTORE.md: shared restore entry points
  - PROD_RUNBOOK.md: tailnet health, tailscale serve (dedicated failover chapter still open)
Abstract
When production is unhealthy or untrusted, we need a repeatable path to stand up a spare environment (same stack contract as prod), restore corpus, validate (including browser/API gates), move traffic to the spare, then fail back and decommission—without tearing down the spare at the end of the run (unlike drill-exercise).
This RFC specifies GitHub Actions orchestration (manual and, later, automated triggers), phased gates, DNS-first cutover on the tailnet, OpenTofu prerequisites for a distinct spare host, and explicit non-goals for v1. ADRs 089–091 lock the architectural decisions; this RFC holds the full technical design and implementation checklist.
Problem Statement
DR drill (drill-exercise.yml) optimizes for proof and cost: it ends in drill-infra-destroy. That is correct for drills and wrong for a live incident, where the spare must stay up until failback.
Today we lack:
- A documented and automatable sequence from “declare incident” to “traffic on spare” aligned with #764.
- A single CI entry point that operators trust under stress (instead of ad hoc SSH + DNS).
- A clear split between stand up + validate (safe to drive from automation) and cutover (high blast radius).
Impact: slower recovery, higher error rate, and risk of confusing drill workflows with prod failover.
Goals
- Orchestrated GHA parent workflow (new family, names in implementation PR) with phases: provision spare infra (when Terraform supports it) → deploy app to spare → restore corpus → validate (HTTPS + tests/stack-test subset pattern as on drill) → gated cutover → (later) failback job family.
- Manual trigger via workflow_dispatch with typed confirms per dangerous phase.
- Optional automated trigger via repository_dispatch (monitoring → GitHub) with secret-verified payload; v1 does not auto-run DNS cutover without a human approval path (see ADR-091).
- DNS-first cutover for the canonical tailnet hostname (https://<tailnet_hostname>.<tailscale_tailnet>/ per infra/terraform/outputs.tf), with TTL/TLS prerequisites documented in runbooks (ADR-090). Issue #764 frames this as a stable hostname flip; in this repo that is MagicDNS on the tailnet, not a public internet A/AAAA record (ADR-083).
- Spare footprint reuses the existing drill Hetzner project, OpenTofu drill workspace, and drill Actions secrets (HCLOUD_TOKEN_DRILL, deploy/restore keys), with no new cloud project or token family (ADR-089).
- Same stack contract on spare as prod (ADR-093): compose overlays, in-container /api/health on port 8000, and tests/stack-test semantics for behavioral gates.
- Corpus restore on spare uses the same scripts/ops/corpus_snapshot/ and newest-compatible tag selection as prod/drill (ADR-092); an explicit backup_tag override remains available under incident stress.
- Operator-owned timing for cutover, failback, and spare teardown: automation prepares and validates; it does not schedule decommission or unattended traffic moves (ADR-091).
Non-goals (v1)
- Scheduled warm spare that runs full stack on a cron without an incident (separate future RFC/issue).
- Fully unattended DNS cutover on first external alert.
- Automated spare teardown at the end of a failover run (decommission stays a manual operator action).
- Multi-region active-active.
Constraints & Assumptions
- Tailnet-first prod per existing ops docs; public ingress is out of scope for cutover mechanics.
- Incident concurrency (split-brain): treat scheduled feed ingestion as the main dual-writer risk. Routine ingestion is not continuous; it fires only from scheduled_jobs: in the corpus viewer_operator.yaml (in-process APScheduler; see SERVER_GUIDE.md). Pragmatic policy: before cutover, freeze the prod instance (disable schedules / stop in-process scheduler); on the spare, do not enable schedules on first bring-up even when the restored backup had enabled: true entries. Manual POST /api/jobs remains possible on either side, so operators should avoid starting manual runs on both hosts during the overlap window.
- OpenTofu / Hetzner: the spare is the existing drill workspace footprint in the same Hetzner project and token scope as drill-exercise (ADR-081, ADR-089). On demand, drill-infra-apply stands up the throwaway VPS; there is no always-on spare and no second server in prod OpenTofu state. ACL rows for tag:dr-drill / drill MagicDNS remain owned by the prod workspace per ADR-081.
- Secrets: reuse drill Actions secrets for spare bring-up, deploy, and restore; cutover and teardown remain manual operator steps (runbook + optional typed confirms), not new secret families.
Branch status (2026-05-13)
Landed (this branch, not prod-failover-specific):
- Corpus snapshot manifest, validation, and newest-compatible restore (RFC-084 / ADR-092); prod/drill restore workflows and scripts/ops/corpus_snapshot/ + resolve_latest_snapshot_prod_tag.sh.
- Cross-surface stack contract hub and ADR-093; shared VPS restore host script path (#762).
- Drill reference orchestrator drill-exercise.yml (plan → apply → deploy → restore → e2e → Playwright → destroy) and drill-stack-playwright.yml HTTPS gate pattern to mirror on spare validation.
Not landed (tracked by #764 / this RFC):
- prod-failover-* workflow family composing existing drill reusables without drill-exercise or automated destroy.
- Dedicated prod failover chapter in PROD_RUNBOOK.md and WORKFLOWS.md index rows.
- Optional repository_dispatch helpers for stand-up / validate only (ADR-091).
Design & Implementation
1. Orchestrator shape (parent workflow)
- New top-level workflow(s), e.g. prod-failover-stand-up.yml (exact name in implementation), with concurrency: prod-failover and cancel-in-progress: false.
- Jobs compose existing drill reusables where safe: drill-infra-plan, drill-infra-apply, drill-deploy, drill-restore-corpus, drill-e2e, drill-stack-playwright. They use the same Hetzner project, workspace, and secrets as drill-exercise, with failover-specific typed confirms and run summaries.
- Never reference drill-exercise with uses as the parent graph, and never workflow_call drill-infra-destroy from this tree (ADR-089). The spare may stay up for minutes or days until the operator runs destroy manually.
- Cutover (canonical MagicDNS / hostname flip) and failback are runbook + manual steps in v1; the orchestrator may stop after phase D with spare FQDN and checklist output (ADR-091).
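The ADR-089 boundary could be enforced mechanically in CI. The following is a minimal sketch, not an existing script in this repo: it assumes the `uses:` references have already been extracted from the parsed workflow YAML, and the function name, constant, and path layout are hypothetical.

```python
# Hypothetical lint: flag failover workflow jobs that reference a forbidden drill workflow.
FORBIDDEN_WORKFLOWS = ("drill-exercise", "drill-infra-destroy")


def forbidden_uses(uses_refs):
    """Return the `uses:` references that point at a forbidden workflow.

    uses_refs: strings as they would appear under jobs.<id>.uses, for example
    "./.github/workflows/drill-deploy.yml" (the path layout is an assumption).
    """
    hits = []
    for ref in uses_refs:
        # Drop an optional "@ref" suffix, then compare on the file stem only.
        filename = ref.split("@", 1)[0].rsplit("/", 1)[-1]
        stem = filename.rsplit(".", 1)[0]
        if stem in FORBIDDEN_WORKFLOWS:
            hits.append(ref)
    return hits
```

A CI step could fail the build whenever `forbidden_uses(...)` returns a non-empty list for any prod-failover-* workflow.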
2. Phases and typed confirms (illustrative)
| Phase | Purpose | Example confirm | Notes |
|---|---|---|---|
| A | Provision spare (drill-infra-apply, workspace drill) | PROD_FAILOVER_PROVISION | Skipped if drill VPS already exists from a prior stand-up |
| B | Deploy image + compose on spare (drill-deploy) | inherit or PROD_FAILOVER_DEPLOY | Reuse deploy.sh on drill host |
| C | Restore corpus from backup repo (drill-restore-corpus) | PROD_FAILOVER_RESTORE or input backup_tag | Default tag via resolve_latest_snapshot_prod_tag.sh (ADR-092); shared restore_corpus_from_tarball_host.sh |
| D | Validate: in-container api /api/health on 8000, refresh tailscale serve, HTTPS probe on drill MagicDNS, then Playwright stack-viewer.spec.ts (drill-e2e + drill-stack-playwright) | automatic after C or sub-confirm | Schedules off on spare; prod frozen per §6 before cutover |
| E | Cutover: canonical hostname / MagicDNS | Manual runbook only in v1 | ADR-090; operator decides when to flip |
| F | Failback + decommission spare | Manual only | Repair prod, revert traffic if needed, then drill-infra-destroy.yml with DRILL_DESTROY when spare is no longer needed — not chained from the failover parent |
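The phase table can be read as a small gating model. The sketch below is illustrative only: the `Phase` type and `plan_run` function are hypothetical, the "inherit" and backup_tag alternatives for phases B and C are simplified to a single required confirm string, and phases E and F are modelled as never runnable by automation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    key: str
    reusable: Optional[str]  # drill reusable to call; None means manual runbook step
    confirm: Optional[str]   # typed confirm string; None means no typed confirm


PHASES = [
    Phase("A", "drill-infra-apply", "PROD_FAILOVER_PROVISION"),
    Phase("B", "drill-deploy", "PROD_FAILOVER_DEPLOY"),
    Phase("C", "drill-restore-corpus", "PROD_FAILOVER_RESTORE"),
    Phase("D", "drill-e2e + drill-stack-playwright", None),  # automatic after C
    Phase("E", None, None),  # cutover: manual runbook only in v1
    Phase("F", None, None),  # failback + decommission: manual only
]


def plan_run(confirms, spare_exists):
    """Return the phase keys automation may run, in order.

    confirms: mapping of phase key to the string the operator typed.
    spare_exists: True when the drill VPS is already up from a prior stand-up.
    Raises ValueError on the first missing or mistyped confirm.
    """
    runnable = []
    for phase in PHASES:
        if phase.reusable is None:
            break  # E and F are never driven by the orchestrator in v1
        if phase.key == "A" and spare_exists:
            continue  # table note: skip provisioning when the VPS already exists
        if phase.confirm is not None and confirms.get(phase.key) != phase.confirm:
            raise ValueError(f"phase {phase.key}: expected confirm {phase.confirm!r}")
        runnable.append(phase.key)
    return runnable
```

With all three confirms typed correctly and no existing spare, `plan_run` returns phases A through D and stops there, which matches the rule that automation ends after validation.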
3. Cutover (DNS path)
Full operator narrative (TTL where applicable, TLS on spare before flip, propagation checks, rollback) lives in #764 and PROD_RUNBOOK.md; normative decision is ADR-090.
v1 operator path (matches #764 non-goals): manual checklist and copy-paste checks for the hostname flip; automation stops after phase D with spare target, rollback notes, and cutover checklist — no unattended flip. Optional later job: DNS API token + script behind environment protection — not required for first runbook delivery.
Spare validation name: exercise the spare on the drill MagicDNS hostname (DRILL_TAILNET_FQDN / drill tailnet_hostname) before repointing the canonical prod name, so that a TLS or tailscale serve misconfiguration is not first discovered at cutover.
4. Triggers
Normative list: ADR-091.
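For the optional repository_dispatch path, "secret-verified payload" could be implemented in several ways; ADR-091 is normative and this RFC does not fix a scheme. As one possible sketch, assuming an HMAC-SHA256 signature over the raw body and a hypothetical `action` field limited to stand-up and validate:

```python
import hashlib
import hmac
import json

ALLOWED_ACTIONS = {"stand-up", "validate"}  # hypothetical action names


def verify_dispatch(body: bytes, signature_hex: str, secret: bytes) -> dict:
    """Check the HMAC-SHA256 signature, then return the parsed payload.

    Raises ValueError if the signature is wrong or the requested action is
    anything other than stand-up or validate (cutover stays manual in v1).
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("signature mismatch")
    payload = json.loads(body)
    if payload.get("action") not in ALLOWED_ACTIONS:
        raise ValueError("automated triggers may not request this action")
    return payload
```

Rejecting unknown actions at the verification step keeps a compromised or misconfigured monitor from requesting anything beyond stand-up and validation.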
5. Observability & audit
- Log backup tag, image SHA, resolved spare FQDN, and cutover timestamp in workflow summaries.
- Open a single GitHub issue per incident or attach to existing incident thread.
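A workflow step could emit these fields through the GitHub Actions job summary file. This is a minimal sketch with hypothetical function names; `GITHUB_STEP_SUMMARY` is the environment variable GitHub Actions sets to the summary file path, and the cutover timestamp is optional because cutover is manual in v1.

```python
import os


def render_summary(backup_tag, image_sha, spare_fqdn, cutover_at=None):
    """Build a Markdown table for the workflow run summary."""
    rows = [
        ("Backup tag", backup_tag),
        ("Image SHA", image_sha),
        ("Spare FQDN", spare_fqdn),
        ("Cutover timestamp", cutover_at or "not cut over (manual in v1)"),
    ]
    lines = ["| Field | Value |", "|---|---|"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines) + "\n"


def write_summary(text):
    """Append to the job summary when running inside a workflow; return whether it wrote."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(text)
    return bool(path)
```

The same rendered text can be pasted into the incident issue so the run summary and the incident thread carry identical values.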
6. Ingestion freeze and spare scheduler policy
Why this is enough for v1: without enabled scheduled_jobs, neither host runs unattended feed sweeps. Corpus backup and restore are operator-triggered workflows, not background timers on the API. The remaining overlap window is manual job starts and misfire catch-up (APScheduler may fire within 1 h after a host wakes if schedules were still enabled — another reason to freeze prod before overlap).
Prod (before cutover):
- Freeze: disable every scheduled_jobs entry (set enabled: false in viewer_operator.yaml via Configuration save / PUT /api/operator-config, or equivalent host edit) so the in-process scheduler stops firing. Confirm with GET /api/scheduled-jobs (scheduler_running: false or no enabled jobs with future next_run_at).
- Optional: wait for in-flight POST /api/jobs runs to finish; cancel stale jobs if the operator model supports it.
- Avoid starting a new prod backup tarball while the spare is being validated for cutover unless the runbook explicitly calls for a final snapshot.
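The freeze confirmation could be scripted against the GET /api/scheduled-jobs response. The sketch below is a guess at the response shape: `scheduler_running` and `next_run_at` come from the text above, while the top-level `jobs` list, the per-job `enabled` field, and ISO 8601 timestamps with an explicit UTC offset are assumptions.

```python
from datetime import datetime, timezone


def prod_is_frozen(status, now=None):
    """True when the scheduler is stopped or no enabled job has a future run.

    status: parsed JSON from GET /api/scheduled-jobs (shape partly assumed).
    now: timezone-aware datetime used for comparison; defaults to current UTC.
    """
    if status.get("scheduler_running") is False:
        return True
    now = now or datetime.now(timezone.utc)
    for job in status.get("jobs", []):
        next_run = job.get("next_run_at")
        if job.get("enabled") and next_run:
            # Assumes an offset-aware ISO 8601 string such as 2026-05-13T13:00:00+00:00.
            if datetime.fromisoformat(next_run) > now:
                return False
    return True
```

A runbook check would fetch the endpoint, parse the JSON, and refuse to proceed to cutover while `prod_is_frozen` returns False.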
Spare (after restore, before cutover):
- Do not turn on scheduled_jobs on first bring-up, even when the restored tarball carried enabled: true rows. Prefer a failover adapter step (workflow or runbook) that forces all schedule entries enabled: false before compose up / API restart, or keeps PODCAST_SERVE_ENABLE_JOBS_API off until post-cutover review.
- Validate read-only / smoke paths on drill MagicDNS with schedules off; enable schedules on the spare only after failback or an explicit operator decision to make the spare primary for ingestion.
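The failover adapter step that forces schedules off might look like the following. It is a sketch under assumptions: the config is taken as already parsed from viewer_operator.yaml into a dict, `scheduled_jobs` is assumed to be a list of mappings with an `enabled` flag, and the function name is hypothetical.

```python
import copy


def force_schedules_off(config):
    """Return a copy of the operator config with every schedule disabled.

    Assumes scheduled_jobs is a list of mappings that each carry an enabled
    flag; every other key passes through unchanged.
    """
    frozen = copy.deepcopy(config)
    for job in frozen.get("scheduled_jobs") or []:
        job["enabled"] = False
    return frozen
```

Returning a copy rather than mutating in place keeps the restored config available for comparison, so the operator can see which schedules were enabled in the backup before deciding what to re-enable after failback.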
After failback: re-enable schedules on one host only; keep the other frozen until decommission.
Testing & Validation
- Dry-run mode (optional input): run through SSH reachability + tofu plan only where supported.
- Phase D must fail if HTTPS :443 is not reachable from runner (same class of bug as drill stack Playwright before serve refresh).
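The fail-fast behaviour of phase D can be sketched as an ordered list of checks. Both helpers below are hypothetical: `tcp_reachable` only proves that a TCP connection to port 443 opens (it does not validate the TLS certificate), and the real gate is expected to reuse the drill-e2e and drill-stack-playwright jobs rather than this code.

```python
import socket


def tcp_reachable(host, port=443, timeout=5.0):
    """True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_phase_d(checks):
    """Run validation checks in order and stop at the first failure.

    checks: list of (name, callable) pairs; each callable returns True on
    success. Returns the names that passed; raises RuntimeError naming the
    first failed check so later checks never run against a broken spare.
    """
    passed = []
    for name, check in checks:
        if not check():
            raise RuntimeError(f"phase D failed at: {name}")
        passed.append(name)
    return passed
```

A caller would order the checks as in the phase table, for example api health first, then `("https-443", lambda: tcp_reachable(spare_fqdn))`, then Playwright, so that an unreachable :443 fails the phase before the browser suite starts.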
Implementation checklist (engineering)
- [ ] Parent prod-failover-*.yml composing drill reusables through validate; never call drill-exercise or drill-infra-destroy (ADR-089).
- [ ] docs/guides/PROD_RUNBOOK.md failover chapter (phases, drill vs incident table, ingestion freeze, manual cutover/decommission, DNS/TLS checklist) + WORKFLOWS.md + #764 cross-links to RFC-083 and ADR-089–091.
- [ ] Failover stand-up adapter: after restore, force scheduled_jobs off on spare before API serves traffic (workflow step or documented host edit).
- [ ] ADR-093 / stack contract audit row clarifying incident spare reuses the drill VPS row when stood up.
- [ ] Optional repository_dispatch for stand-up / validate only (ADR-091).
Corpus restore contract and shared ops scripts (RFC-084 / ADR-092) and stack contract hub (ADR-093) are landed on this branch; see Branch status above. Drill OpenTofu workspace, Hetzner project, and Actions secrets are reused for on-demand spare (ADR-089); no new cloud project or token family.