ADR-092: Corpus Snapshot Backup Manifest and Newest-Compatible Restore Default
- Status: Accepted
- Date: 2026-05-12
- Authors: Podcast Scraper Team
- Related RFCs: RFC-084
- Related PRDs: —
- Tracking: GitHub #763
Context
Corpus snapshot backups are tagged and restored in GitHub Actions. Compatibility between packed corpus bytes and deployed reader code is currently inferred from dates and human-written release notes. That inference fails for rollback, schema migration, and disaster recovery (DR), where “latest” is not necessarily safe to restore.
`corpus_manifest.json` at the corpus parent (ADR-074) serves multi-feed discovery inside a live tree. It is not a substitute for release artifact metadata on `snapshot.tgz`.
Decision
- Every published corpus snapshot ships `snapshot.manifest.json` with at minimum: `schema_version`, `corpus_format_version` (integer, bump only on breaking reader-facing on-disk or schema changes), `created_at`, `producer` identity (`git_sha` and/or `image_digest`), and `archive.relative_path` (+ recommended `archive.sha256`).
- Dual placement: the manifest exists inside the tarball at a documented path and as a separate sibling artifact next to `snapshot.tgz` so consumers can read compatibility without downloading the full archive.
- Default restore selection (when the operator does not pin a tag): choose the newest candidate backup for which `corpus_format_version` is supported by the deployed reader; if none match, fail closed and require an explicit backup pin / override.
- `corpus_format_version` bump rule: bump only when an older reader would error, mis-read, or corrupt data against the new dump; not for non-breaking bug fixes. “Migration releases” that read an old format and write a new one still follow this rule: bump when old readers cannot open new dumps.
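
As an illustration of the minimum fields listed above, the following Python sketch checks a manifest for them. It is not the project's validator: the nesting of `producer` and `archive` as JSON objects, the function name `validate_manifest`, and every value in the example manifest are assumptions made here; RFC-084 remains the normative source for the actual schema.

```python
import json
from typing import Any, Dict, List

# Required top-level fields per the Decision list. Treating "producer" and
# "archive" as nested objects is an assumption of this sketch.
REQUIRED_FIELDS = (
    "schema_version",
    "corpus_format_version",
    "created_at",
    "producer",
    "archive",
)


def validate_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the manifest passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest]

    if "corpus_format_version" in manifest:
        version = manifest["corpus_format_version"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(version, bool) or not isinstance(version, int):
            problems.append("corpus_format_version must be an integer")

    if "producer" in manifest:
        producer = manifest["producer"]
        if not isinstance(producer, dict):
            problems.append("producer must be an object")
        elif not (producer.get("git_sha") or producer.get("image_digest")):
            problems.append("producer needs git_sha and/or image_digest")

    if "archive" in manifest:
        archive = manifest["archive"]
        if not isinstance(archive, dict):
            problems.append("archive must be an object")
        elif not archive.get("relative_path"):
            problems.append("archive.relative_path is required")

    return problems


# Illustrative values only; none of these come from a real snapshot.
EXAMPLE_MANIFEST = json.loads("""
{
  "schema_version": 1,
  "corpus_format_version": 3,
  "created_at": "2026-05-12T00:00:00+00:00",
  "producer": {"git_sha": "0123abc"},
  "archive": {"relative_path": "snapshot.tgz"}
}
""")

print(validate_manifest(EXAMPLE_MANIFEST))  # prints []
```

Returning a list of problems rather than raising on the first one lets a workflow log every defect in a single run; that choice is a convenience of the sketch, not something the ADR prescribes.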
Rationale
- An integer format version keeps comparisons trivial in Actions and shell; a semver string adds parsing surface without clarifying semantics for corpus bytes.
- Sibling manifest asset avoids large downloads for compatibility checks.
- Newest compatible default matches operator intent under normal operations; fail closed prevents silent wrong-version restores during incidents.
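
The newest-compatible default with fail-closed behavior can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the repository's selection script: the `Candidate` type, the `select_backup` function, the `NoCompatibleBackup` exception, the tag names, and the choice to let an explicit pin bypass the compatibility check are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Collection, Iterable, Optional


class NoCompatibleBackup(RuntimeError):
    """Raised when selection must fail closed."""


@dataclass(frozen=True)
class Candidate:
    tag: str                    # hypothetical backup tag name
    created_at: str             # ISO 8601 timestamp with a UTC offset
    corpus_format_version: int  # integer read from the sibling manifest


def select_backup(
    candidates: Iterable[Candidate],
    supported_versions: Collection[int],
    pinned_tag: Optional[str] = None,
) -> Candidate:
    """Pick the pinned backup if given, else the newest compatible one."""
    pool = list(candidates)

    if pinned_tag is not None:
        # Assumption: an explicit pin is the operator override, so it is
        # honored without a compatibility check.
        for candidate in pool:
            if candidate.tag == pinned_tag:
                return candidate
        raise NoCompatibleBackup(f"pinned backup not found: {pinned_tag}")

    compatible = [c for c in pool if c.corpus_format_version in supported_versions]
    if not compatible:
        # Fail closed: never fall back to an unsupported format silently.
        raise NoCompatibleBackup(
            "no backup matches the reader's supported corpus_format_version; "
            "pin a backup explicitly"
        )
    # Parse timestamps so ordering does not depend on string formatting.
    return max(compatible, key=lambda c: datetime.fromisoformat(c.created_at))


# Illustrative candidates only.
CANDIDATES = [
    Candidate("backup-a", "2026-05-01T00:00:00+00:00", 2),
    Candidate("backup-b", "2026-05-10T00:00:00+00:00", 3),
    Candidate("backup-c", "2026-05-11T00:00:00+00:00", 4),
]

chosen = select_backup(CANDIDATES, supported_versions={2, 3})
print(chosen.tag)  # prints backup-b
```

Here `backup-c` is the newest backup overall, but a reader that supports only versions 2 and 3 gets `backup-b`. Because the format version is a plain integer, the compatibility test reduces to a membership check.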
Alternatives Considered
- Tags only / no manifest — Rejected; does not encode reader compatibility (#763).
- Manifest only inside tarball — Rejected as the only copy; forces full fetch to inspect metadata.
- Default to exact producer SHA match — Rejected for routine DR (too strict); may be added later as an optional strict mode.
Consequences
- Positive: Machine-readable compatibility, safer defaults, clearer audit trail (`producer`, optional `backup_workflow`).
- Negative: Backup and restore workflows must stay in sync with schema v1; future `schema_version` bumps need a small compatibility shim in consumers.
- Neutral: Distinct from `corpus_manifest.json`; operators must learn two manifests’ roles.
Implementation Notes
- Normative detail: RFC-084.
- Single implementation surface: repo scripts (emit / validate / select / download) plus thin `make` targets; GitHub Actions call those entrypoints so local restore and CI restores stay aligned (RFC §5).
- Workflows (current tree): `backup-corpus-prod.yml`, `backup-corpus.yml`, `drill-restore-corpus.yml`, `prod-restore-corpus.yml` (paths may change; RFC remains source for behavior).
- Code: landed in `scripts/ops/corpus_snapshot/`, thin `make` targets, and the workflows above; index Code column reads Yes (#763).
- Cross-surface steady vs recovery playbooks: ADR-093, STACK_CONTRACT.md.