ADR-089: Prod Failover Orchestrator Is Separate from DR Drill
- Status: Accepted
- Date: 2026-05-12
- Authors: Podcast Scraper Team
- Related RFCs: RFC-083, RFC-082
- Related ADRs: ADR-081
Context
drill-exercise proves infra and app paths on a throwaway workspace and always runs drill-infra-destroy. Production failover must keep the spare running for an operator-chosen interval, use different confirms and run summaries, and must never be confused with the drill exercise parent workflow.
The spare is not always on. When needed, it is stood up on demand in the same Hetzner project and OpenTofu drill workspace already used for orchestrated DR drill, reusing existing drill Actions secrets — no new cloud project or token family.
Decision
- Prod failover is implemented as a separate GitHub Actions workflow family (parent + composed reusables), not a flag on drill-exercise.
- Spare infra uses the existing drill footprint: OpenTofu workspace drill, Hetzner project scoped to HCLOUD_TOKEN_DRILL, drill deploy/restore/e2e/Playwright reusables, and drill MagicDNS for pre-cutover validation. No second server in prod OpenTofu state and no new Hetzner project or secrets for spare bring-up.
- The failover parent may workflow_call drill-infra-plan, drill-infra-apply, drill-deploy, drill-restore-corpus, drill-e2e, and drill-stack-playwright as phased tools. It must not workflow_call drill-exercise or drill-infra-destroy. Decommission runs only when the operator manually dispatches drill-infra-destroy.yml (typed DRILL_DESTROY) after failback or when the spare is no longer needed.
- Failover phases that touch the stack (deploy, restore, validate) must follow the stack contract and shared scripts/ops/ paths documented in STACK_CONTRACT.md and ADR-093; only adapters (confirms, run summaries, manual cutover timing) differ from routine drill.
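The rule that the failover parent must not workflow_call drill-exercise or drill-infra-destroy could be enforced mechanically, for example by a small check run in CI. The following is a minimal sketch under stated assumptions, not part of the accepted decision: it assumes the reusables are referenced as local workflow files under .github/workflows/ with file names matching the workflow names listed above, and the function name forbidden_calls and the sample workflow text are hypothetical.

```python
import re

# Workflows the ADR says the failover parent must never workflow_call.
FORBIDDEN = {"drill-exercise", "drill-infra-destroy"}

# Matches job-level local reusable-workflow references such as
#   uses: ./.github/workflows/drill-deploy.yml
# ASSUMPTION: each reusable lives in a file named after the workflow.
USES_RE = re.compile(
    r"^[ \t]*uses:[ \t]*\./\.github/workflows/([\w.-]+?)\.ya?ml[ \t]*(?:#.*)?$",
    re.MULTILINE,
)


def called_reusables(workflow_text: str) -> list[str]:
    """Return the names of local reusable workflows referenced by `uses:`."""
    return USES_RE.findall(workflow_text)


def forbidden_calls(workflow_text: str) -> list[str]:
    """Return referenced workflows that the failover parent must not call."""
    return [name for name in called_reusables(workflow_text) if name in FORBIDDEN]


# Hypothetical failover parent used only to illustrate the check.
SAMPLE_PARENT = """
jobs:
  plan:
    uses: ./.github/workflows/drill-infra-plan.yml
    secrets: inherit
  teardown:
    uses: ./.github/workflows/drill-infra-destroy.yml
"""

violations = forbidden_calls(SAMPLE_PARENT)
```

For the sample above, violations contains drill-infra-destroy, because the hypothetical teardown job calls the destroy workflow instead of leaving decommission to a manual dispatch. A text scan like this is deliberately simple; a stricter check would parse the YAML and inspect each job.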
Consequences
- Positive: Reuses proven drill infra and secrets; clear separation from drill-exercise auto-destroy; operator controls spare lifetime.
- Negative: Failover and drill share one throwaway workspace: only one full spare stack at a time; operators must not run drill-exercise and incident stand-up concurrently on the same state.
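
The single-stack constraint can be expressed as a pre-flight check before either workflow touches the shared state. The sketch below is illustrative only: the workflow name prod-failover and the function blocking_runs are hypothetical, and how the list of in-progress runs is obtained is left open.

```python
# Workflows that operate on the shared drill OpenTofu workspace.
# ASSUMPTION: "prod-failover" is a placeholder name for the failover parent;
# only "drill-exercise" is named in this ADR.
SHARED_STATE_WORKFLOWS = {"drill-exercise", "prod-failover"}


def blocking_runs(requested: str, in_progress: list[str]) -> list[str]:
    """Return in-progress workflows that would contend for the drill workspace.

    `in_progress` lists the names of other runs currently executing and
    must not include the requesting run itself.
    """
    if requested not in SHARED_STATE_WORKFLOWS:
        return []  # the requested workflow does not touch the shared state
    return sorted({name for name in in_progress if name in SHARED_STATE_WORKFLOWS})
```

An empty result means stand-up may proceed; any returned name identifies a run the operator should wait for or cancel first.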