RFC-084: Corpus Snapshot Backup Manifest and Version-Aware Restore
- Status: Completed
- Authors: Podcast Scraper Team
- Created: 2026-05-12
- Domain: Infrastructure / backups / disaster recovery
- Tracking: GitHub #763
- Related RFCs:
- RFC-081 — Phase 1E corpus backup context
- RFC-082 — always-on host backup story
- RFC-063 — corpus_manifest.json at corpus parent (distinct from snapshot manifest)
- RFC-004 — on-disk layout evolution
- Related ADRs:
- ADR-074 — operational corpus parent artifacts (not tarball metadata)
- ADR-092 — accepted policy and contract for this RFC
- Related Documents:
- .github/workflows/backup-corpus-prod.yml, backup-corpus.yml
- .github/workflows/drill-restore-corpus.yml, prod-restore-corpus.yml
- docs/guides/DR_DRILL_RUNBOOK.md, docs/guides/PROD_RUNBOOK.md
- docs/guides/CORPUS_SNAPSHOT_MANIFEST_AND_RESTORE.md: operator hub for local make vs GitHub Actions (prod, pre-prod, DR)
Abstract
Corpus releases are tagged by calendar date (snapshot-YYYYMMDD, snapshot-prod-YYYYMMDD).
Restore automation often defaults to the latest matching tag. When on-disk layout or JSON
schemas change with a software release, operators cannot see whether a given deployed image
can safely read a given tarball without implicit knowledge.
This RFC defines a small machine-readable manifest shipped with each backup and a default
restore selection policy: choose the newest backup whose corpus_format_version the deployed
reader supports, otherwise fail closed and require an explicit operator-selected tag.
Problem Statement
Today
- Compatibility of backup versus reader is implicit (dates, release notes, tribal knowledge).
- After rollback or during schema migration, “latest” may point at a corpus the rolled-back code cannot parse, or vice versa.
- Under incident stress, wrong tarball choice wastes time or extends outage.
Distinction
- corpus_manifest.json (ADR-074): operational index inside a live corpus parent (multi-feed, discovery).
- snapshot.manifest.json (this RFC): release artifact metadata describing the packed snapshot and the producer that wrote it.
They are complementary; do not overload one file for both roles.
Goals
- Explicit format identity: every published snapshot carries a bump-on-breaking-change corpus_format_version and producer identity (git SHA and/or image digest).
- Safe default restore: workflows select the newest compatible backup without operator input when possible.
- Fail closed: if no candidate matches, restore does not silently pick “latest”; logs explain the mismatch and point to explicit backup_tag / pin inputs.
- Audit: optional fields tie an artifact to a workflow run or script version.
Non-goals for v1:
- Exact producer match as default (too strict for routine DR; optional strict mode later).
- Encryption / compression metadata beyond what the tarball already implies (optional later).
Constraints and Assumptions
Constraints
- Manifest must be small, JSON, and easy to generate in GitHub Actions without new runtime services.
- Must not require reading the multi-gigabyte tarball to learn format version when a separate manifest asset is available (see placement).
- Drill and prod restore workflows already accept explicit tag overrides; default path must remain overridable.
Assumptions
- Reader support for format versions can be expressed as a range or table in code (e.g. min/max integer) owned by the application repo.
- Breaking layout or schema changes are rare and review-gated; bumping corpus_format_version is part of that change.
Design
1. Artifact: snapshot.manifest.json
MIME: application/json
Filename: snapshot.manifest.json (alongside snapshot.tgz or as sibling release asset).
Required fields (v1)
| Field | Type | Description |
|---|---|---|
| schema_version | integer | Version of this manifest schema (start at 1). |
| corpus_format_version | integer | Bumped only on breaking on-disk or schema changes older readers cannot handle. |
| created_at | string | RFC 3339 UTC timestamp when the manifest was written. |
| producer | object | Who produced the tarball (see below). |
| archive | object | What is inside the snapshot (see below). |
producer object (at least one of git or image identity)
| Field | Type | Description |
|---|---|---|
| git_sha | string | Full 40-hex commit SHA of the repo that wrote the corpus (preferred when backup runs from CI checkout). |
| image_digest | string | sha256:… of the pipeline image if backup runs from a published image (optional if git_sha suffices). |
archive object
| Field | Type | Description |
|---|---|---|
| relative_path | string | Path of the tarball relative to the manifest in the release layout (typically snapshot.tgz). |
| sha256 | string | Hex digest of the tarball (optional but recommended for integrity checks). |
Optional fields
| Field | Type | Description |
|---|---|---|
| backup_workflow | object | e.g. { "name": "backup-corpus-prod", "run_id": "…", "attempt": 1 } |
| notes | string | Short human-readable line (not a substitute for release notes). |
Example
{
  "schema_version": 1,
  "corpus_format_version": 1,
  "created_at": "2026-05-12T12:34:56Z",
  "producer": {
    "git_sha": "a1b2c3d4e5f6789012345678901234567890abcd"
  },
  "archive": {
    "relative_path": "snapshot.tgz",
    "sha256": "…"
  },
  "backup_workflow": {
    "name": "backup-corpus-prod",
    "run_id": "12345678901",
    "attempt": 1
  }
}
Semantics
- corpus_format_version: monotonic integer for compatibility, not feature richness. Patch releases that only fix bugs without changing on-disk contracts do not bump it. A release that changes layout so an old reader errors or mis-reads must bump it.
- SemVer strings were considered; an integer keeps comparisons trivial in shell and Actions.
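The required-field rules in the tables above can be sketched as a small validator. This is a minimal illustration, not the behavior of validate_snapshot_manifest.sh: the function name validate_manifest is hypothetical, and treating archive.relative_path as required and accepting upper-case hex in git_sha are assumptions.

```python
import re

# Required top-level fields and their JSON types, taken from the v1 table.
REQUIRED_TOP_LEVEL = {
    "schema_version": int,
    "corpus_format_version": int,
    "created_at": str,
    "producer": dict,
    "archive": dict,
}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest passes."""
    problems = []
    for field, expected in REQUIRED_TOP_LEVEL.items():
        value = manifest.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"{field}: missing or not {expected.__name__}")

    producer = manifest.get("producer")
    if isinstance(producer, dict):
        git_sha = producer.get("git_sha")
        # At least one of git or image identity must be present.
        if git_sha is None and producer.get("image_digest") is None:
            problems.append("producer: need git_sha or image_digest")
        if git_sha is not None and not re.fullmatch(r"[0-9a-fA-F]{40}", str(git_sha)):
            problems.append("producer.git_sha: expected 40 hex characters")

    archive = manifest.get("archive")
    # Assumption: relative_path is required, sha256 stays optional.
    if isinstance(archive, dict) and not isinstance(archive.get("relative_path"), str):
        problems.append("archive.relative_path: missing or not a string")
    return problems
```

Returning a list of problems rather than raising on the first one lets a workflow log every defect in a single run.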
2. Placement
Decision (see ADR-092)
- Inside the tarball: always place snapshot.manifest.json at a well-known path (e.g. tarball root) so an extracted tree is self-describing.
- Release / storage: also upload snapshot.manifest.json as a separate asset next to snapshot.tgz so consumers can read compatibility without downloading the full archive.
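For the in-tarball copy, a reader that already holds the archive might load the manifest without extracting the corpus. The sketch below assumes the manifest sits at the archive root; read_manifest_from_tarball is a hypothetical helper name, not one of the repo scripts.

```python
import json
import tarfile

MANIFEST_NAME = "snapshot.manifest.json"


def read_manifest_from_tarball(tar_path: str) -> dict:
    """Load the manifest stored at the archive root of a snapshot tarball."""
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            # Archives built with `tar -C dir .` prefix root entries with "./".
            if member.name.removeprefix("./") == MANIFEST_NAME and member.isfile():
                with tar.extractfile(member) as handle:
                    return json.load(handle)
    raise FileNotFoundError(f"{MANIFEST_NAME} not found at archive root of {tar_path}")
```

This still streams through the compressed archive until the manifest entry is found, which is why the sibling asset remains the cheaper path for compatibility checks.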
3. Reader compatibility
The application (or restore script) declares which corpus_format_version values it supports,
for example:
- Constants: CORPUS_FORMAT_MIN = 1, CORPUS_FORMAT_MAX = 2, or
- A small table keyed by major release.
Rules
- A backup is compatible if corpus_format_version is within the reader’s supported set (inclusive range unless documented otherwise).
- Newer backup with higher format than the reader supports: incompatible.
- Older backup within range: compatible; older backup below the reader’s minimum: incompatible.
Migration releases that read old and write new bump corpus_format_version only when old
readers cannot open new dumps (per ADR-092).
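The inclusive-range rule above can be expressed in a few lines. The constants mirror the example values given earlier in this section; the function name compatibility and its three return strings are illustrative assumptions.

```python
# Example reader range from this section; real values live in the application repo.
CORPUS_FORMAT_MIN = 1
CORPUS_FORMAT_MAX = 2


def compatibility(
    corpus_format_version: int,
    lo: int = CORPUS_FORMAT_MIN,
    hi: int = CORPUS_FORMAT_MAX,
) -> str:
    """Classify a backup as 'compatible', 'too old', or 'too new'."""
    if corpus_format_version < lo:
        return "too old"
    if corpus_format_version > hi:
        return "too new"
    return "compatible"  # lo <= version <= hi, both ends inclusive
```

Returning the reason rather than a bare boolean gives restore logs the "too new / too old" wording without a second comparison.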
4. Restore selection algorithm (default)
Given deploy reader compatibility [min,max] and a list of backup candidates newest first
(e.g. tags or release assets sorted by date or semver tag):
- For each candidate, fetch snapshot.manifest.json (prefer sibling asset; fallback: error if only tarball exists and operator did not opt in to download).
- Select the first candidate whose corpus_format_version is compatible.
- If none match: fail the workflow step with a clear message and require explicit backup_tag / URL override (existing workflow inputs).
Logging
- On success: print chosen tag, corpus_format_version, and producer.git_sha / image_digest.
- On skip: log incompatible candidate and reason (version too new / too old).
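The selection and logging rules above might look like the following. It is a sketch under stated assumptions: select_newest_compatible and NoCompatibleBackup are hypothetical names, and fetch_manifest is a caller-supplied callable that returns the parsed sibling manifest for a tag and raises on its own when none is available.

```python
from typing import Callable, Iterable


class NoCompatibleBackup(RuntimeError):
    """Raised so the caller fails closed instead of picking "latest"."""


def select_newest_compatible(
    tags_newest_first: Iterable[str],
    fetch_manifest: Callable[[str], dict],
    lo: int,
    hi: int,
) -> tuple[str, dict]:
    """Return the first (newest) tag whose format is within [lo, hi]."""
    skipped = []
    for tag in tags_newest_first:
        manifest = fetch_manifest(tag)  # errors from the fetch propagate
        version = manifest["corpus_format_version"]
        if lo <= version <= hi:
            producer = manifest.get("producer", {})
            identity = producer.get("git_sha") or producer.get("image_digest")
            print(f"selected {tag}: corpus_format_version={version} producer={identity}")
            return tag, manifest
        reason = "too new" if version > hi else "too old"
        print(f"skip {tag}: corpus_format_version={version} is {reason}")
        skipped.append(f"{tag} ({reason})")
    raise NoCompatibleBackup(
        f"no backup matches reader range [{lo},{hi}]; "
        f"pin backup_tag explicitly. Skipped: {', '.join(skipped) or 'none'}"
    )
```

Raising a dedicated exception keeps the fail-closed path explicit: the caller cannot fall through to an unchecked default tag.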
5. Shared implementation: scripts, Make, and GitHub Actions
Goal: define behavior once and reuse it from local make, drill/prod hosts, and
GitHub Actions, so restore checks do not drift from CI-only YAML.
| Layer | Responsibility |
|---|---|
| Repo scripts | Canonical helpers under scripts/ops/corpus_snapshot/: emit_manifest.sh, finalize_backup_bundle.sh, validate_snapshot_manifest.sh, select_release_tag.sh, download_and_verify_snapshot.sh, restore_corpus_release.sh; prod workflows call resolve_latest_snapshot_prod_tag.sh. |
| Makefile | corpus-snapshot-manifest-validate, corpus-snapshot-select-tag, corpus-snapshot-select-tag-prod, corpus-snapshot-selftest, restore-corpus (codespace layout), restore-corpus-prod (VPS layout). Unset PODCAST_BACKUP_TAG → newest-compatible selection; set pin → skip scan but still validate when a sibling manifest exists. codespace-backup-cloud / backup-corpus.yml own cloud emit. |
| Workflows | backup-corpus-prod.yml, backup-corpus.yml: after tarball build, finalize_backup_bundle.sh, upload snapshot.tgz + sibling snapshot.manifest.json when dry_run is false. drill-restore-corpus.yml, prod-restore-corpus.yml: resolve_latest_snapshot_prod_tag.sh (or pin), runner download_and_verify_snapshot.sh, host restore via restore_corpus_from_tarball_host.sh. |
| CI | Optional: make-driven check job (or workflow step) that runs manifest validation on fixtures and/or tests the selection helper with golden JSON; optional path guard when “breaking” corpus layout paths change without a documented corpus_format_version bump (see §9). |
Mapping to GitHub #763: backup writes manifest + dual placement; restore reads manifests and enforces newest-compatible / fail closed; scripts + Make + workflows together cover the deliverable — not Actions-only logic with no local equivalent.
Operator map (all surfaces): CORPUS_SNAPSHOT_MANIFEST_AND_RESTORE.md — Make for local validation/restore loop; Actions for prod, pre-prod/codespace backup, and drill restore.
6. Backup workflows
Scope: .github/workflows/backup-corpus-prod.yml, backup-corpus.yml, and any drill-specific
backup if present.
Steps (conceptual)
- After tarball is built and before upload: compute sha256 of snapshot.tgz.
- Emit snapshot.manifest.json with fields above (git_sha from ${{ github.sha }} or image digest from metadata) via the shared script or make target (§5).
- Ensure manifest is inside the tarball at the chosen path, then upload tarball and sibling manifest; keep naming stable for downstream scripts.
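The digest and emit steps above can be sketched as follows. This covers only the sibling manifest and is not the contents of emit_manifest.sh; sha256_of, build_manifest, and write_sibling_manifest are hypothetical names, and the git SHA is passed in by the caller (for example from the workflow environment).

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the tarball in chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tarball: Path, git_sha: str, corpus_format_version: int) -> dict:
    """Assemble the required v1 fields for a freshly built tarball."""
    return {
        "schema_version": 1,
        "corpus_format_version": corpus_format_version,
        # RFC 3339 UTC timestamp, e.g. 2026-05-12T12:34:56Z
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "producer": {"git_sha": git_sha},
        "archive": {"relative_path": tarball.name, "sha256": sha256_of(tarball)},
    }


def write_sibling_manifest(tarball: Path, manifest: dict) -> Path:
    """Write snapshot.manifest.json next to the tarball and return its path."""
    out = tarball.with_name("snapshot.manifest.json")
    out.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return out
```

A call such as write_sibling_manifest(tgz, build_manifest(tgz, sha, 1)) would leave both files side by side, ready for upload as separate assets.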
Sequencing (from issue #763): implement after dr-drill-exercise E2E is stable if concurrent risk is high; schema and RFC can land earlier.
7. Restore workflows and scripts
Scope: .github/workflows/drill-restore-corpus.yml, prod-restore-corpus.yml, shared restore
scripts called by them, and restore-corpus / restore-corpus-prod in the Makefile.
- Resolve reader supported range (from env baked into image, or from a small repo file read by the script).
- List candidates; fetch manifests; run newest compatible selection (shared implementation §5).
- Preserve operator override: explicit tag/env wins over default.
- Log chosen tag, corpus_format_version, and producer identity; on miss, fail with instruction to pin backup_tag / URL.
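The override rule in this list (explicit tag wins over the default) can be illustrated with a short helper. PODCAST_BACKUP_TAG is the variable named elsewhere in this RFC; resolve_backup_tag, its return shape, and treating an empty value as unset are assumptions for the sketch.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_backup_tag(
    select_default: Callable[[], str],
    env: Optional[Mapping[str, str]] = None,
) -> tuple[str, str]:
    """Return (tag, source); an explicit pin always wins over the default scan."""
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as "unset".
    pinned = (env.get("PODCAST_BACKUP_TAG") or "").strip()
    if pinned:
        return pinned, "pinned"
    # Only scan candidates when no pin was supplied.
    return select_default(), "newest-compatible"
```

Passing the default selection as a callable means the candidate scan is skipped entirely when a pin is present, so a pinned restore does not depend on listing releases.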
8. Documentation
- DR and prod runbooks: when newest-compatible default is wrong (rollback, format bump, mixed-age hosts) — see CORPUS_SNAPSHOT_MANIFEST_AND_RESTORE.md.
- Cross-link this RFC and ADR-092.
- Document restore-corpus, restore-corpus-prod, and env vars (PODCAST_BACKUP_TAG, PODCAST_BACKUP_REPO) vs default selection behavior after implementation.
9. Testing and CI (optional follow-up)
- Unit or script tests for selection given fixture manifests.
- Optional lint: if a PR changes documented “breaking” corpus layout paths or schemas, require a bump to corpus_format_version / changelog entry (lightweight guardrail).
Alternatives Considered
- Calendar tags only, no manifest — Rejected; fails under rollback and schema migration (GitHub #763).
- SemVer for corpus_format_version — Rejected for v1; integer comparisons are simpler in Actions and shell.
- Only manifest inside tarball — Rejected as sole option; forces full download to inspect compatibility. Dual placement keeps small-metadata fetches cheap.
- Reuse corpus_manifest.json as tarball metadata — Rejected; different lifecycle and purpose (ADR-074).
Rollout
- Schema + example in repo; ADR ratified (ADR-092).
- Shared scripts + make targets for emit, validate, and select-newest-compatible (§5).
- Backup workflows call shared emit path; upload tarball + sibling manifest.
- Restore workflows and restore-corpus / restore-corpus-prod call shared selection and download paths; explicit pins unchanged.
- Docs and runbooks updated; optional CI check (§9).
Open Questions
Resolved in implementation:
- snapshot.manifest.json at tarball archive root (alongside corpus/ or .codespace_corpus/).
- Drill restore uses snapshot-prod-* tags via scripts/ops/resolve_latest_snapshot_prod_tag.sh (same as prod).
- Reader range in config/corpus_snapshot_reader_support.json; producer format in config/corpus_snapshot_format.json.