RFC-081: Pre-prod Environment, Minimal Observability, and Mobile Control Plane
- Status: Draft
- Authors: Marko + Claude
- Created: 2026-04-27
- Domain: Infrastructure / DevOps / Observability
- Related RFCs:
- RFC-079 — full-stack compose topology (the artefact this RFC deploys)
- RFC-078 — CI ephemeral smoke (the test side of the same compose stack)
- RFC-077 — viewer feeds + jobs API (the runtime surface)
- Tracking:
- Phase 1 umbrella (Codespaces deploy + devcontainer + corpus backup + secrets): TBD
- Phase 1 core prep (slim profile preload + `.env` leak assessment + log hygiene gate / audit): TBD
- Grafana Cloud observability wiring: TBD
- Sentry integration (FastAPI + Vue): TBD
- Critical + integration + e2e test suite review (parallel pre-work): TBD
- Implementation: not started; this RFC is the design.
- Scope boundary: this RFC covers Phase 1 (Codespaces) only. Always-on hosting, VPS provisioning, Tailscale, and the long-running corpus story are deferred to a follow-up RFC ("RFC-08X: Always-on pre-prod") to be drafted once Phase 1 ships.
Abstract
Today the project has a working local stack (`make stack-test-ml-ci`)
and a CI gate (Stack-test workflow) but no persistent, remotely reachable
environment. There is nowhere to point a phone, paste a feed URL, and watch
a real pipeline run from outside the laptop.
This RFC proposes a Phase 1 pre-prod environment that:
- Reuses the same Docker Compose topology from RFC-079 — no new image surface, no new code paths.
- Auto-deploys on every push to `main` that passes Stack-test.
- Runs entirely on GitHub Codespaces — no third-party host, GitHub-native auth, $0/mo within the free-tier envelope.
- Streams metrics + logs into hosted Grafana Cloud (free tier) so the operator can monitor everything from a phone without running a self-hosted Prometheus stack.
- Sits behind GitHub-authenticated forwarded ports — no open inbound ports on the public internet, single GitHub login from any device.
- Notifies a Slack channel on deploy / pipeline / cost events, making the same surface usable from home-automation hooks (iOS / Android push, optional Home Assistant webhook later).
Phase 2 — an always-on host (likely a small VPS) with its own auth wall and persistence story — is out of scope for this RFC and will be specified in a follow-up RFC after Phase 1 has shipped and the operator has real signal on what does and doesn't translate.
Problem Statement
Current state:
- Local laptop: `make serve-api`, `make serve-ui`, occasionally `make stack-test-ml-ci`. Fine for development, gone when the laptop closes.
- CI Stack-test: spins up the full compose stack on every main push, runs Playwright, tears down. Validates correctness, but the environment is dead by the time the workflow finishes.
- No always-on URL for: trying a new feed against the real ingestion path, observing GIL/KG output drift over time, sharing a graph with a collaborator, or hitting the pipeline from a phone "while watching TV on the couch".
Pain points the RFC needs to address:
- No place to play with the stack from outside the laptop. Tooling for pre-prod / staging is the missing rung between "works on my machine" and "works in production".
- No persistent observability. When a pipeline run misbehaves on CI we get a 5-minute log and a teardown; if it misbehaves between CI runs, we have nothing.
- No mobile-friendly control plane. GitHub mobile shows workflow status, but the api / viewer / pipeline jobs are not reachable from the phone without ad-hoc tunnels.
- No automation hook. No way to wire "pipeline failed" or "yesterday's cost broke $X" into Slack / Home Assistant / the operator's existing notification habits.
Goals
- Single-host deployment of the RFC-079 stack that survives between CI runs and across reboots.
- Push-to-deploy from `main`: a passing Stack-test triggers a deploy job; no manual SSH.
- Container image registry (GHCR) so deploys are `docker compose pull`, not `docker compose build`: fast, repeatable, and the same bits the CI gate validated.
- Minimal observability that covers metrics + logs + alerts without a self-hosted Prometheus/Grafana/Loki stack.
- One-login control plane reachable from desktop and phone, with no public inbound ports on the host.
- Slack notifications on deploy + pipeline + cost events, with a structure that lets future Home Assistant / phone-automation hooks subscribe to the same source of truth.
- Total cost: $0/month at typical hobby load. A clear ceiling (and a clear paid-tier exit door) when usage exceeds free.
Non-Goals
- Multi-region / HA — single Codespace is fine.
- Multi-tenant — the operator and at most a handful of collaborators (collaborators reach the env via the same Codespaces forwarded URL with GitHub auth, no separate identity layer).
- Replacing CI Stack-test — pre-prod is downstream of Stack-test, not a replacement.
- Always-on hosting. Codespaces auto-suspends; that is acceptable for Phase 1 hobby use. Always-on is the trigger for the follow-up RFC.
- Open public endpoints. Everything sits behind GitHub auth.
- Long-term log retention beyond the free tier (~14 days for Grafana Cloud, 90 days for Slack Free).
- Production hardening (log shipping to S3, secrets rotation, blue/green deploys). Pre-prod is "stable enough to play with", not "five nines".
Use Cases
- Operator on couch with phone: opens the bookmarked Codespaces forwarded URL on iOS, signs in once via GitHub, drops a new feed URL, watches the pipeline run, sees the resulting GI/KG graph.
- Operator monitoring: opens Grafana Cloud app on phone, sees recent pipeline cost, today's job latency p95, recent error logs.
- Slack alert: pipeline job exits with `exit_code_1`; operator gets a Slack push notification on the phone with a link to the failed job's log in Grafana.
- Pre-merge UAT: collaborator wants to validate a feature against real corpus data before approving a PR; operator grants codespace access via GitHub repo permissions, collaborator opens the same forwarded URL.
- Home automation hook (future): Home Assistant subscribes to the same job-state webhook surface; out of scope for Phase 1 but the webhook surface this RFC defines is the contract a future automation can subscribe to.
Design
The design is layered: deployment → observability → control plane → notifications. Each layer is independent of the others and can be implemented as its own sub-step.
Layer 1 — Deployment (GitHub Codespaces)
Why Codespaces: the operator already has GitHub credentials, the same SSO controls deploy + access + secrets, and there is no third-party vendor sign-up. This is the cheapest possible "does the pipeline run end-to-end and can I reach it from my phone?" loop.
Spec match (with the published image set restricted to api / viewer
/ pipeline-llm — the heavy pipeline-ml is never published, so disk
pressure on the codespace is much lower than the raw 15.5 GB CI
image suggests):
| Codespaces detail | Value | Fit for our stack |
|---|---|---|
| Free hours / month (Personal) | 120 core-hours; a 4-core machine burns 4 core-hours per hour → ~30 hrs/mo (~60 hrs/mo on 2-core) | Enough for hobby UAT |
| Machine | 4 vCPU / 16 GB RAM / 32 GB SSD | RAM and disk fit comfortably with the slimmer published image set (~2 GB total) |
| Storage | 15 GB-month free (workspace + prebuilds combined) | Enough; prune images when stale |
| Port forwarding auth | Private by default, GitHub auth required | Acts as the control plane — no extra VPN needed |
| Mobile | Browser-only via vscode.dev / github.dev (no native app) | Acceptable; the viewer + Grafana are HTTPS, both render in mobile Safari |
| Idle stop | 30 min default, max 240 min | Fine — auto-suspend is the cost floor |
| Wake from Actions | `gh codespace start <name>` or `POST /user/codespaces/{name}/start` | Allows scheduled wakes / on-deploy refresh |
| Prebuilds | Build the devcontainer + image pull on push to main, cached | Drops cold start from ~15 min to ~1-2 min |
| Auth scope for automation | Needs PAT / fine-grained token with `codespaces` scope (the default `GITHUB_TOKEN` does not work) | One-time setup |
Codespaces wiring:
- `.devcontainer/devcontainer.json` runs `docker compose up -d` for the api + viewer + mock-feeds (no pipeline service in the always-running set; pipeline is launched on demand by the api job factory).
- `.devcontainer/prebuild.yml` triggers prebuilds on every push to `main` so the published images (~2 GB total; the 15.5 GB `pipeline-ml` image is never published) are pre-pulled into the prebuild cache. Cold-start cost on a fresh codespace drops dramatically.
- A new GHA `deploy` job (gated on Stack-test green) triggers a rebuild-and-wake of the named pre-prod codespace via the `gh codespace rebuild --full` CLI, using a fine-grained PAT stored as a GHA secret.
- Operator workflow: open the codespace from `github.com/chipi/podcast_scraper/codespaces`, browse the forwarded port for the viewer, paste a feed URL, watch the pipeline run.
Caveats Phase 1 must accept:
- Auto-suspend means it's "always-resumable" not "always-on". Phone notifications via Slack still arrive when the codespace is asleep (because GHA + Grafana Cloud are independent of the host); the viewer just takes ~30 s to wake when tapped.
- 120 core-hours at 4-core is ~30 hrs/mo ≈ 1 hr/day. Heavy real-feed playtesting will blow through this — when that happens, the follow-up RFC takes over with an always-on host.
- 32 GB disk is tight if the published image set ever grows back toward including ML models. With the cloud-only published set decided below, this is comfortable.
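As a sanity check on the free-tier budget, here is a minimal sketch of the core-hour arithmetic. It assumes the allowance is consumed at one core-hour per core per hour of runtime; the function name is illustrative and not part of the project.

```python
def wall_clock_hours(core_hours: float, cores: int) -> float:
    """Hours of runtime that a core-hour allowance buys on a `cores`-core machine."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return core_hours / cores


FREE_CORE_HOURS = 120  # Personal free allowance quoted in the spec table

if __name__ == "__main__":
    for cores in (2, 4, 8):
        hours = wall_clock_hours(FREE_CORE_HOURS, cores)
        # Spread evenly over a 30-day month to get a rough daily budget.
        print(f"{cores}-core: {hours:.0f} h/month, about {hours / 30:.1f} h/day")
```

The same function gives the remaining budget when the machine size changes, which is the trade-off to weigh before moving to a larger tier.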
Image registry + published image set
GHCR (ghcr.io/chipi/podcast-scraper-stack-*). Free for public images,
native GitHub auth, same package surface as the source repo.
Critical scoping decision: only the cloud variant is published. The
ML pipeline image (pipeline-ml, ~15.5 GB with HuggingFace cache baked
in) is not redistributed. It is built on CI runners for Stack-test,
used during the test run, and discarded. This sidesteps two real risks
in one move:
- Llama 3.2 license — `DISLab/SummLlama3.2-3B` (the heavy local summarization model used by the `airgapped` profile) carries the Llama 3.2 Community License, which restricts redistribution without attribution. By not publishing the image, the question doesn't arise.
- Image size + pre-prod scope — pre-prod runs against `cloud_balanced` (Gemini Flash) by default, with `cloud_thin` as a lighter no-vector-index variant. Both target a 100-200 episode test corpus and are ~30× faster wall-time than local-ML on a 4-core CPU. The published `pipeline-llm` image is ~1.5 GB — cloud SDKs plus the minimum local ML needed to build a FAISS vector index (`sentence-transformers` + `faiss-cpu` + `torch` CPU; no spaCy / Whisper / Pegasus / llama-cpp). Fits the Codespaces 32 GB SSD comfortably.
| Image | Published to GHCR? | Used where |
|---|---|---|
| `api` | ✅ yes | Codespaces pre-prod (cloud_balanced + cloud_thin profiles) |
| `viewer` | ✅ yes | Codespaces pre-prod |
| `pipeline-llm` | ✅ yes | Codespaces pre-prod ([llm] + [search]; runs both cloud_balanced and cloud_thin) |
| `pipeline` (ML variant) | ❌ no | stack-test on CI runners only; local dev where redistribution doesn't apply |
Tags on published images:
- `:main` — latest passing main (auto-promoted when Stack-test goes green).
- `:sha-<short>` — every successful main build, retained for rollback.
- `:pr-<num>` — optional, for collaborator UAT.
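The tagging scheme can be sketched as a small pure function. This is an illustration only: the function name, the 7-character short SHA, and the rule that PR builds get only the `:pr-<num>` tag are assumptions, not decisions recorded in this RFC.

```python
from typing import List, Optional

REGISTRY_PREFIX = "ghcr.io/chipi/podcast-scraper-stack-"  # prefix named in this RFC


def image_tags(image: str, sha: str, on_main: bool,
               pr_number: Optional[int] = None) -> List[str]:
    """Return the full image references a build of `image` would be pushed under."""
    short = sha[:7]  # assumption: 7-character short SHA
    suffixes: List[str] = []
    if on_main:
        suffixes.append(f":sha-{short}")  # retained for rollback
        suffixes.append(":main")          # moving pointer to the latest passing main
    elif pr_number is not None:
        suffixes.append(f":pr-{pr_number}")  # optional collaborator UAT tag
    return [f"{REGISTRY_PREFIX}{image}{suffix}" for suffix in suffixes]
```

Keeping the immutable `:sha-<short>` tag alongside the moving `:main` tag is what makes a rollback a plain re-pull of an older reference.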
Image hygiene gates (CI assertions before any push):
- `.dockerignore` excludes `.env*`, `**/.env*`, `*.pem`, `*.key`, plus the existing fixture-related paths. This is the static guard.
- A workflow assertion runs `docker run --rm <img> sh -c 'find / -name ".env*" 2>/dev/null'` against each candidate image and fails the publish job if any path matches. Belt-and-suspenders against the static guard drifting.
- License audit on the `pipeline-llm` image is trivial: it bakes in Python SDKs only (OpenAI, Gemini SDK, Anthropic SDK, Mistral) — all permissive licenses. No attribution-strings-attached models.
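The runtime `.env` assertion reduces to "inspect a file listing and fail if any basename matches". A minimal sketch, assuming the listing is the captured stdout of the `find` command above, one path per line; the function name and the default pattern tuple are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List, Sequence


def leaked_paths(paths: Iterable[str],
                 patterns: Sequence[str] = (".env*",)) -> List[str]:
    """Return every path whose basename matches one of the forbidden patterns."""
    hits: List[str] = []
    for raw in paths:
        path = raw.strip()
        name = PurePosixPath(path).name
        # fnmatchcase keeps matching case-sensitive on every platform.
        if name and any(fnmatchcase(name, pattern) for pattern in patterns):
            hits.append(path)
    return hits


if __name__ == "__main__":
    import sys

    offenders = leaked_paths(sys.stdin)
    for path in offenders:
        print(f"forbidden file in image: {path}")
    sys.exit(1 if offenders else 0)
```

Run as a script, it exits non-zero when anything matches, which is enough to fail a workflow step. Extending `patterns` to `*.pem` or `*.key` would likely flag legitimate system certificates, so that is left to the static `.dockerignore` guard.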
Codespaces pulls these three images into its devcontainer at boot; the same tags will be reused unchanged by the follow-up always-on RFC, so there is no re-tagging churn when Phase 2 lands.
Layer 2 — Observability (minimal)
Three signal types, three free hosted services, zero self-hosted storage:
| Signal | Source | Pipe | Sink | Retention (free) |
|---|---|---|---|---|
| Metrics | api `/metrics` (FastAPI Prometheus exporter) + node-exporter on host | Grafana Agent (single binary) | Grafana Cloud | 14 days, 10 k active series |
| Logs | `docker compose logs` for api + viewer + pipeline | Grafana Agent (Promtail mode) | Grafana Cloud Loki | 14 days, 50 GB/month ingest |
| Healthchecks | external probe of `/api/health` | Grafana Synthetic Monitoring | Grafana Cloud | 30 day uptime history |
Why Grafana Cloud instead of self-hosted Prometheus + Grafana + Loki:
- No storage or dashboard containers added to the compose stack; the agent is the only extra process, ~30 MB RAM.
- Single login, single dashboard URL, mobile app exists.
- The free tier (14 day retention, 10 k series, 50 GB logs) is more than enough for hobby load.
- We can graduate to self-hosted later by pointing the agent at a different endpoint; nothing in the stack changes.
The api already has /metrics (or trivially can — FastAPI
prometheus middleware is a 3-line add). Pipeline runs as one-shot
containers; the agent picks up their stdout/stderr via the Docker
logging driver, no app changes needed.
Detailed sink contract deferred to a follow-up issue. The exact set of metric names + cardinality budget + log streams + log hygiene gates (no leaking secrets / PII / copyrighted-fragment text via Loki) is non-trivial and warrants its own review pass. A separate GitHub issue tracks the detailed observability spec — see the Tracking header. RFC-081 itself only commits to the three sink categories (metrics / logs / healthchecks), the free-tier targets, and the principle that nothing ships off-host until the log hygiene audit is signed off. Phase 1B's acceptance is gated on that issue closing first.
A further sink, Sentry free tier (5 k errors/mo, 10 k transactions,
1 user), is also planned for error tracking + performance traces.
Complementary to Grafana metrics + logs: Sentry captures stack traces
on unhandled exceptions and JS errors with full source context.
Wiring (sentry-sdk[fastapi] for api + @sentry/vue for viewer) is
tracked as a separate GitHub issue under the same parent.
Layer 3 — Control plane
The auth wall changes shape between phases.
GitHub-native auth on forwarded ports. Codespaces port forwarding
is private by default; reaching viewer:80 from a phone requires the
operator's GitHub login. No third-party VPN, no DNS plumbing, no
inbound port. The downside is mobile UX: vscode.dev / github.dev
work in mobile Safari but the GitHub mobile app does not open
Codespaces. For Phase 1 hobby use that is acceptable — the operator
bookmarks the forwarded URL and signs in once per session.
The follow-up RFC for an always-on host will introduce its own auth wall (likely Tailscale or Cloudflare Tunnel + Access). That choice is deferred until Phase 1 is operational and we have signal on what the operator actually wants from a remote viewing surface.
Layer 4 — Notifications + mobile / home automation
A single notification spine, multiple subscribers:
GitHub Actions ─┐
├─► Slack channel #pipeline-events ─► iOS / Android push
Grafana Cloud ──┤ ─► Home Assistant webhook
api `/api/jobs` ┘ ─► future automations
(job state webhook)
Slack is the central fan-out because:
- The operator already uses it; one app fewer to install.
- Per-message threading + mute schedules + rich formatting are mature.
- Both GHA and Grafana Cloud have first-class Slack senders.
- Home Assistant has a `slack_event` integration for inbound webhooks if the operator wants the same events to hit a smart light, a speaker, or a Shortcuts automation.
Event taxonomy (initial, expandable):
| Event | Source | Severity | Channel |
|---|---|---|---|
| Deploy succeeded / failed | GHA `deploy` job | INFO / ERROR | #pipeline-events |
| Stack-test failed | GHA Stack-test | ERROR | #pipeline-events |
| Pipeline job exit_code != 0 | api → Slack webhook | ERROR | #pipeline-events |
| Daily cost > $X | Grafana alert on `total_cost_usd` | WARNING | #pipeline-events |
| `/api/health` 5-min downtime | Grafana Synthetic | ERROR | #pipeline-events |
The api gets a thin webhook emitter (env-configured, defaults off) for job-state transitions. It does not call Slack directly — it posts to a generic webhook URL the operator sets to point at Slack, Home Assistant, or a Shortcuts handler. Same surface, different sinks.
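A minimal sketch of such an emitter, using only the standard library. The environment variable name `JOB_WEBHOOK_URL`, the payload fields, and the function names are all hypothetical; the RFC fixes none of them.

```python
import json
import os
import urllib.request
from typing import Callable, Optional

WEBHOOK_ENV_VAR = "JOB_WEBHOOK_URL"  # hypothetical name; unset means the emitter is off


def build_payload(job_id: str, state: str, exit_code: Optional[int] = None) -> dict:
    """Generic job-state event. The `text` field is included because Slack
    incoming webhooks read the message body from it."""
    return {
        "event": "job_state",
        "job_id": job_id,
        "state": state,
        "exit_code": exit_code,
        "text": f"Job {job_id} is now {state} (exit_code={exit_code})",
    }


def _post_json(url: str, payload: dict, timeout: float = 5.0) -> None:
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass  # the response body is not needed


def emit_job_state(job_id: str, state: str, exit_code: Optional[int] = None,
                   post: Callable[[str, dict], None] = _post_json) -> bool:
    """Send the event if a webhook URL is configured. Returns True when sent."""
    url = os.environ.get(WEBHOOK_ENV_VAR, "").strip()
    if not url:
        return False  # defaults off
    try:
        post(url, build_payload(job_id, state, exit_code))
    except OSError:
        return False  # a failed notification should not fail the job itself
    return True
```

Injecting `post` keeps the emitter testable without network access, and swallowing transport errors keeps a flaky sink from changing job outcomes.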
Security
- No public inbound ports. Codespaces port forwarding is private-by-default with GitHub auth, no inbound port on the public internet.
- Published GHCR images are scope-limited to api / viewer / pipeline-llm — never the ML pipeline image. The publish workflow job's image list is the explicit enforcement boundary against accidental redistribution of attribution-restricted models.
- `.env` leakage CI assertion runs against every candidate image before push (see Layer 1 image hygiene gates).
- Codespaces automation token is a fine-grained PAT scoped to the `codespaces` permission only, used by the deploy workflow to rebuild + start the named pre-prod codespace. The default `GITHUB_TOKEN` does not have this scope.
- Secrets (provider API keys, Slack webhook URL, Grafana Cloud token, Sentry DSN, backup-repo PAT) live in:
  - Codespaces Secrets — Settings → Codespaces → Secrets, user-level or repo-level. Injected as env vars at codespace boot; compose interpolates `${OPENAI_API_KEY:-}` etc.
  - GitHub Actions Secrets (deploy / backup workflows).

  Never in compose files, never in images.
Costs
| Component | Plan | Cost |
|---|---|---|
| GitHub Codespaces | Personal free, 120 core-hrs/mo, 15 GB-month storage | $0 |
| GHCR (public images) | included with GitHub | $0 |
| Grafana Cloud Free | 14 day metrics, 50 GB logs/mo, 1 synthetic | $0 |
| Sentry Free | 5 k errors/mo, 10 k transactions, 1 user | $0 |
| GitHub Releases (private backup repo) | ~10 GB soft cap, 2 GB per asset | $0 |
| Slack Free | 90-day message history, unlimited channels | $0 |
| Total | | $0 |
The only non-zero cost is the pipeline run itself — provider API
charges (OpenAI Whisper + Gemini Flash Lite) on the operator's own
keys, against either cloud_balanced (default; with vector index) or
cloud_thin (no vector index). A 200-episode test corpus is ~$40
one-time on either profile.
Paid-tier exits if hobby grows
- Grafana Cloud Pro: $19/mo for 1 yr retention + 100 GB logs.
- Codespaces beyond 120 core-hrs/mo: $0.09 per core-hour (2-core ≈ $0.18/hr, 4-core ≈ $0.36/hr).
- Always-on hosting: covered by the follow-up RFC.
Phased Rollout
Two phases. Each is a complete, usable environment; the second is a lift-and-shift of the first onto a different host once the operator has outgrown Codespaces. Within each phase, the four layers (deploy / observability / control plane / notifications) ship as sub-steps so the operator can stop at any sub-step with a working artefact.
Phase 1 — Codespaces (the validation rung)
Goal: prove the deploy → run → observe → alert chain works end-to-end using only GitHub-native primitives, before introducing a new vendor.
1A — Build, test, publish (workflow restructure):
The current stack-test.yml workflow is a single job that builds + tests.
Restructure into three jobs with explicit gating so build/test failures
attribute cleanly and a flaky test doesn't trigger an image rebuild:
build: # builds api + viewer + pipeline-llm + (locally-only) pipeline-ml
# uses gha buildx cache for layer reuse
# outputs: image digests + sha tags
# no push yet
stack-test: # needs: build
# runs Playwright against the locally-built images
# exercises pipeline-ml (airgapped_thin profile) — never published
publish: # needs: stack-test
# if: success() && github.ref == 'refs/heads/main'
# re-tags + pushes ONLY api / viewer / pipeline-llm to GHCR
# runs the .env / license assertions before push
The pipeline-ml image is built (so stack-test exercises the airgapped
path) but never reaches the publish step. The publish job's image
list is the explicit safety boundary against accidental redistribution.
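One way to make that boundary explicit is an allowlist the publish job consults, rather than a denylist that has to remember every unpublishable image. A sketch under that assumption, with hypothetical names:

```python
from typing import Iterable, List

# The only images this RFC allows onto GHCR.
PUBLISHABLE = frozenset({"api", "viewer", "pipeline-llm"})


def select_for_publish(built: Iterable[str]) -> List[str]:
    """Filter the build job's outputs down to the allowlisted images.

    Anything not allowlisted (pipeline-ml today, any future image tomorrow)
    is dropped by default. A missing allowlisted image is treated as an error
    so that a partial publish cannot go unnoticed.
    """
    built = list(built)
    selected = [name for name in built if name in PUBLISHABLE]
    missing = PUBLISHABLE - set(selected)
    if missing:
        raise RuntimeError(f"build did not produce: {sorted(missing)}")
    return selected
```

With an allowlist, adding a new heavyweight image to the build job cannot leak it to the registry; publishing it requires a deliberate change to `PUBLISHABLE`.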
Profiles in pre-prod: cloud_balanced (default) + cloud_thin.
Reasoning recap:
- Speed: 100-200 episode test corpus runs in ~30 min wall-time vs ~10 hrs on Codespaces 4-core CPU with full local ML.
- Image size: `pipeline-llm` is ~1.5 GB (cloud SDKs + minimum [search] extras: faiss + sentence-transformers + torch CPU) vs `pipeline-ml`'s 15.5 GB — fits Codespaces 32 GB SSD with headroom.
- License: zero attribution-required models in the published image (no SummLlama, no Pegasus, no transformer NER).
- Cost: ~$40 one-time for 200 episodes (OpenAI Whisper + Gemini Flash Lite); orthogonal to host cost.

`cloud_balanced` is the default because it ships with the local vector index populated, so the viewer's Search/Index surface is usable from day one. `cloud_thin` (no vector_search) is kept as a lighter variant for operators who want minimum image weight.
Codespace devcontainer:
- `.devcontainer/devcontainer.json` runs `docker compose -f compose/docker-compose.stack.yml -f compose/docker-compose.prod.yml up -d` for api + viewer; the api job factory launches the pipeline on demand using the `pipeline-llm` service (no `pipeline-ml` in this compose set).
- `.devcontainer/prebuild.yml` triggers prebuilds on push to `main` so GHCR pulls land in the prebuild cache. With the slimmed published image set, the cold-start hit drops from ~15 min (pulling 15.5 GB) to ~1-2 min (pulling ~2 GB total across the three published images).
Separate deploy-codespace workflow (`.github/workflows/deploy-codespace.yml`):
- Triggers: `workflow_dispatch` (manual) + `workflow_run` (auto on Stack-test success on `main`).
- Steps: `gh codespace rebuild --full` against the named pre-prod codespace via a fine-grained PAT (`codespaces` scope).
- Healthcheck: poll the forwarded `/api/health` from the runner with a 5-minute budget; fail the workflow if the new stack doesn't come up.
- `make deploy-codespace` Makefile target as the manual escape hatch.
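The healthcheck step amounts to a bounded polling loop. A minimal sketch with an injectable probe, clock, and sleep so the loop can be exercised without a network; the 5-second interval and all names are assumptions, and how the runner authenticates to a private forwarded port is not specified in this RFC and is left out.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True when a GET on `url` answers 200; connection and HTTP errors count as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False


def wait_until_healthy(probe: Callable[[], bool],
                       budget_s: float = 300.0,   # the 5-minute budget
                       interval_s: float = 5.0,   # assumption: poll every 5 s
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds or the next attempt would exceed the budget."""
    deadline = clock() + budget_s
    while True:
        if probe():
            return True
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

In a workflow step, the probe would be something like `lambda: http_ok(health_url)`, with the script exiting non-zero when `wait_until_healthy` returns `False`.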
Acceptance: push to `main` → build/test/publish all green → published
images appear at `ghcr.io/chipi/podcast-scraper-stack-{api,viewer,pipeline-llm}:main`
→ deploy-codespace workflow fires → codespace wakes with the new bits →
forwarded `:8090` → `/api/health` returns 200 to mobile Safari → loading a
feed kicks off a job that uses the `cloud_balanced` profile end-to-end
(vector index populated, viewer Search works on first run).
1B — Observability (Grafana Cloud free):
- Grafana Agent added as a devcontainer feature (no compose change).
- FastAPI `/metrics` endpoint via `prometheus-fastapi-instrumentator` (~3-line add).
- Agent ships container logs (Docker logging driver) + metrics + `/api/health` synthetic monitor to Grafana Cloud.
- Grafana Cloud free tier credentials live in GHA Secrets, injected into the codespace via `secrets:` in `devcontainer.json`.
- Acceptance: phone Grafana Cloud app shows last 24 h of api requests, last failed pipeline job's stdout, current daily cost rollup.
1C — Control plane:
- Codespaces port forwarding is the auth wall: GitHub login required, no inbound ports exposed.
- Operator bookmarks the forwarded viewer URL (mobile Safari bookmark, plus iOS share-sheet to home screen for "app-like" feel).
- Grafana Cloud SSO bound to the operator's GitHub identity.
- Acceptance: phone on cellular hits the forwarded URL, signs in once via GitHub, lands on the viewer.
1D — Notifications + automation:
- Slack incoming webhook URL stored in GHA Secrets + Grafana Cloud.
- GHA notify step on deploy success / failure / Stack-test failure → `#pipeline-events`.
- Grafana alert rules (cost threshold, healthcheck downtime, error spike) routed to the same Slack channel via Grafana Cloud's contact points.
- api gets an optional job-state webhook emitter (env-configured, defaults off; reused unchanged in Phase 2).
- Acceptance: trigger a deliberate failure (rename a feed URL), observe the Slack push reach the phone within 60 s.
1E — Corpus persistence + backup:
Codespaces' default delete-after-30-days-inactivity is a real data-loss risk for an accumulating corpus. Two-layer mitigation, GitHub-native all the way (no third-party storage vendor):
- In-codespace persistence: bind-mount the corpus to `/workspaces/podcast_scraper/.codespace_corpus/` (workspace path, survives suspend) instead of an anonymous Docker volume (which survives suspend but is harder to back up).
- Off-Codespace backup to a private GitHub Releases repo:
  - One-time setup: create `chipi/podcast_scraper-backup` as a private repository (free for personal accounts; private so backup contents are never publicly readable).
  - New GHA workflow `.github/workflows/backup-corpus.yml`, triggered by the operator via `workflow_dispatch` when a snapshot is wanted (no cron; the codespace is often shut down).
  - Wakes the codespace, runs `tar -czf snapshot.tgz .codespace_corpus`, then `gh release create snapshot-$(date -u +%Y%m%d) snapshot.tgz --repo chipi/podcast_scraper-backup --notes "auto"`.
  - Retention: keep 7 daily + 4 weekly snapshots; pruner step removes older releases via `gh release delete`. Soft repo cap is ~10 GB (matches what we'd have had on R2 anyway).
  - The workflow uses a fine-grained PAT with `contents: write` on the backup repo only — stored as `BACKUP_REPO_TOKEN` in GHA Secrets.
  - No new vendor sign-up. Same GitHub identity, same Codespaces Secrets surface, same `gh` CLI.
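The retention rule leaves room for interpretation, so here is one possible reading as a sketch: keep the 7 newest snapshots, then the newest snapshot from each of the next 4 ISO weeks that still have one, and prune the rest. That interpretation and the function names are assumptions; only the `snapshot-YYYYMMDD` tag format comes from the workflow above.

```python
from datetime import date, datetime
from typing import Iterable, List

PREFIX = "snapshot-"


def parse_tag(tag: str) -> date:
    """Extract the date from a `snapshot-YYYYMMDD` tag (raises ValueError if malformed)."""
    return datetime.strptime(tag[len(PREFIX):], "%Y%m%d").date()


def snapshots_to_prune(tags: Iterable[str], daily: int = 7, weekly: int = 4) -> List[str]:
    """Return the tags that fall outside the retention window, newest first."""
    ordered = sorted((t for t in tags if t.startswith(PREFIX)),
                     key=parse_tag, reverse=True)
    keep = set(ordered[:daily])          # the newest `daily` snapshots
    weeks_kept = []
    for tag in ordered[daily:]:
        week = tuple(parse_tag(tag).isocalendar()[:2])  # (ISO year, ISO week)
        if week in weeks_kept:
            continue                     # an older snapshot from a week already covered
        if len(weeks_kept) == weekly:
            break                        # everything older than this is pruned
        weeks_kept.append(week)
        keep.add(tag)                    # newest snapshot of this week
    return [t for t in ordered if t not in keep]
```

Each returned tag would then be passed to `gh release delete` by the pruner step. Because backups are operator-triggered rather than scheduled, "daily" here effectively means "most recent", which is why the sketch counts snapshots instead of calendar days.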
Restore path:
- `make restore-corpus` Makefile target: pulls the newest compatible `snapshot-YYYYMMDD` via `gh release download` (sibling manifest when present) and untars into `.codespace_corpus/`.
- Works from a fresh codespace (resurrected after delete).
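Selecting the snapshot to restore can be sketched as a filter plus a maximum. Here "compatible" is reduced to "the tag matches `snapshot-YYYYMMDD` exactly", which is an assumption; any manifest-based compatibility check is out of scope for this sketch, and the function name is hypothetical.

```python
import re
from typing import Iterable, Optional

SNAPSHOT_TAG = re.compile(r"^snapshot-\d{8}$")


def newest_snapshot(tags: Iterable[str]) -> Optional[str]:
    """Return the most recent well-formed snapshot tag, or None if there is none.

    The fixed-width YYYYMMDD suffix means lexicographic order equals date order.
    """
    candidates = [tag for tag in tags if SNAPSHOT_TAG.match(tag)]
    return max(candidates, default=None)
```

Returning `None` instead of raising lets the Makefile target print a clear "no snapshot found" message on a backup repo that has never been written to.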
Why GitHub Releases for Phase 1:
- Persistent (no time-based expiry, unlike Actions artifacts).
- No new vendor. Same GitHub identity, same `gh` CLI, same Codespaces Secrets surface.
- ~10 GB soft cap on a free private repo, 2 GB per asset — matches what we'd get on R2's free tier anyway, and is plenty for corpus snapshots in the Phase 1 window.
Phase 2 backup is a separate decision, not a lift-and-shift of this. When the always-on RFC is drafted, it will likely want a real object store (Cloudflare R2 / Backblaze B2 / Storj) because always-on changes the constraints: the corpus accumulates without the Codespaces 30-day reset; a DB backup (when one lands) wants point-in-time semantics that align better with object versioning; and cross-host restore performance + bandwidth caps start to matter. The follow-up RFC owns that decision; the GitHub Releases mechanism here does not silently graduate.
Exit criteria for Phase 1 (triggers for the follow-up RFC):
- Codespaces free-tier limit (120 core-hrs/mo, ~30 hrs at 4-core) consistently hit.
- Auto-suspend latency annoying for casual phone use.
- Operator wants to run a real feed continuously for days.
When any of those bites, draft the always-on RFC; until then, Phase 1 is the contract.
Alternatives Considered
- Self-hosted Prometheus + Grafana + Loki: rejected as more moving parts than the hobby goal warrants. Free tier of Grafana Cloud removes ~3 services from the codespace without losing capability.
- Always-on host (any VPS) instead of Codespaces for Phase 1: rejected for this RFC's scope — would force a vendor sign-up and a control-plane decision (Tailscale / Cloudflare Tunnel / etc.) before we have signal on whether the deploy → run → observe → alert chain even works for the operator. Phase 1 on Codespaces lets us validate the chain end-to-end with only GitHub primitives; the always-on follow-up RFC then makes a host decision with real usage data.
- Self-hosted everything (Codespaces dev container running its own Prometheus / Grafana / Loki): rejected. The codespace is meant to be cheap and transient; pinning a stateful observability stack to it makes both observability and the codespace heavier.
- GitHub Pages / Render free / Fly.io free for the published viewer: rejected because they don't run the api or the pipeline, only static assets. We need the whole compose stack reachable, not just the viewer.
- Docker Swarm / Kubernetes: comically over-spec for a single-codespace hobby env.
Open Questions
- Codespaces image-cache pressure on 32 GB SSD. The 4-core tier has 32 GB host disk. With the slimmed published image set (no `pipeline-ml`, ~2 GB total) the disk is comfortable, but the prebuild cache + workspace checkout + dev deps still adds up. Lean: start on 4-core; if `df -h` inside the codespace consistently >85 %, escalate to 8-core (64 GB disk, spends 4× faster but lasts longer per use).
- Codespaces auto-suspend latency. First-tap-to-loaded after a suspended codespace is ~30-60 s with prebuilds, longer without. Acceptable for hobby phone use; chronic annoyance is the trigger for the always-on follow-up RFC.
- Cost-spike alert thresholds. What is "$X" for the cost alert? Pulled from RFC-068 (cost observability) — the operator sets a per-day ceiling. Default: ~$1/day; tune based on observed baseline.
- Backup retention policy. Default proposal: keep 7 daily + 4 weekly snapshots in the private Releases repo, prune older. Tune once we see actual corpus growth. If the ~10 GB soft cap on the backup repo gets uncomfortable, that is itself signal that the always-on RFC's object-store decision is overdue.
- Codespace deletion threshold. Codespaces auto-delete after 30 days of inactivity. With weekly Releases snapshots that's a soft loss (the corpus restores from the backup repo), but the codespace identity itself resets — the deploy workflow must handle "named codespace doesn't exist, create one" as a path, not just "rebuild existing".
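The create-or-rebuild branch in the last item can be sketched as a pure decision over the current codespace listing. The record shape (dictionaries with a `name` key, as a parsed JSON listing might provide) and the function name are assumptions.

```python
from typing import Iterable, Mapping


def deploy_action(codespaces: Iterable[Mapping[str, str]], target: str) -> str:
    """Decide what the deploy workflow should do for the named pre-prod codespace.

    Returns "rebuild" when a codespace called `target` exists and "create" when
    it does not, for example after the 30-day inactivity deletion.
    """
    for codespace in codespaces:
        if codespace.get("name") == target:
            return "rebuild"
    return "create"
```

Keeping the decision separate from the CLI calls makes the "codespace was deleted" path testable without touching a real account; a create path would also need to run the corpus restore before the healthcheck.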