
PRD-016 · Science Overlay & Episode System (audio)

Status · Draft v0.4 (all 8 v1 architectural decisions resolved 2026-05-16)
Date · 2026-05-16
Owner · Marko
Closes into · RFC-019
Slice gate · v0.9 (after PRD-014 surface-hotspots and PRD-015 mobile wrapper)

Why this is a PRD. A persistent narrated audio layer turns Orrery from a visual reference into an editorial experience, closer to a museum than a documentation site. The decision touches every screen's UI, the build pipeline, the asset budget (~91 MB of v1 audio in the GH Pages build; ~800 MB total repo size at the full 12-locale × 33-episode hierarchy), the running cost (paid TTS providers, free-tier accounting), and the per-locale content authoring effort (12 locales × 3 voice personas × 33 episodes = 1188 audio assets at full corpus). It needs a product gate before any audio asset, player component, or TTS API key lands.


§why

The original orrery — the 18th-century brass instrument the product is named after — was an object with sound. Brass gears, wood, the click of position. Orrery the web app has rebuilt the visual language of that instrument with planet textures, orbits, mission arcs, and 3D landing sites. The thing it does not have is a voice.

Museums solved this problem fifty years ago: the audio guide. A voice that doesn't replace the artefact but accompanies it — telling you what you're looking at, why it matters, what the silent visual leaves out. Done well (think the Hayden Planetarium, the Cité de l'espace, the Mudam wall texts), the audio doesn't compete with the visual; it deepens it. The visitor's attention stays on the object; the voice supplies the editorial layer.

Orrery has the editorial scope but not the editorial layer. A first-time visitor lands on /explore, sees Saturn rendered to scale, and reads HUD numbers. They learn nothing about scale, nothing about why Saturn matters, nothing about how we got there. A narrator could tell them. It could spend ninety seconds on the absurd ratio of Saturn's volume to the spacecraft that crossed it, and turn the visit from "interactive, but..." into a moment they remember.

This PRD scopes that layer. It is one feature, not eleven (one per screen): a single audio system, three voiced personalities, three depths of content, all 12 supported locales, generated at build time, served as static files, and presented through a single overlay component that is the same on every screen.


§audiences

| Audience | Why audio helps them |
|---|---|
| Curious learner (most-served audience) | Doesn't read long-form. A 90-second narrated episode while watching the planets move is the format they already consume on YouTube and podcasts. |
| Space enthusiast | Already knows the visuals; comes for the editorial framing: the story around an artefact. Audio is the editorial layer. |
| First-time visitor | The landing-page autoplay-blocked silent video problem solves itself: there's a play button on the home screen that does something audible and tells them what they're looking at. |
| Educator / journalist | Cites a specific narration moment ("Orrery's Cernan-last-words segment frames Apollo 17 as…") as a reference point. Audio is shareable as a per-episode link. |
| Vision-impaired user | The narration is the alt-text-at-scale: every visual moment in /explore, /fly, /earth, /moon, /mars, /iss, /tiangong has a spoken counterpart. Not a substitute for proper accessibility (ADR-025), but a layer that helps. |

§what's already shipped (audio-readiness inventory)

Before scoping the work, the parts that are already in place:

| Capability | Status | Source |
|---|---|---|
| 12-locale Paraglide message bundle | shipped (v0.4–v0.5) | src/lib/paraglide/messages/*.json, ADR-031/032/033 |
| 11 primary nav routes (incl. 7 with 3D scenes) | shipped (v0.6) | src/routes/+layout.svelte, TA.md route inventory |
| PWA service worker (registerType: 'autoUpdate') | shipped (v0.5.x) | vite.config.ts, ADR-029 |
| Cookie-based locale persistence (orrery_locale) | shipped | ADR-057 (only persistent storage allowed; localStorage forbidden) |
| Capacitor mobile wrapper (Android-first) | planned (PRD-015 / RFC-018, v0.8) | |
| Bundle-slimming machinery (MOBILE=1 env, lazy locale chunks) | planned (RFC-018 §4, v0.8) | |
| External-link click delegation (src/routes/+layout.svelte) | shipped (v0.6) | will be reused for the audio overlay's "more about this" links |

What this means: the infrastructure for shipping a 12-locale audio system that lazy-loads per-locale assets and survives both web and Capacitor distribution will be in place by v0.8, once the two planned items above land. PRD-016's work begins after that foundation lands.


§goal

Ship a persistent audio episode system across all 11 routes, generated at build time from human-edited markdown scripts, narrated by 3 voice personas, in all 12 locales, presented through one overlay component that is identical on every screen. The first cut ships in v0.9, scoped to fit inside the GH Pages soft 1 GB repo limit (≈ 91 MB of v1 audio added to the existing ~355 MB build).

Phasing (v1 = "first audio ship"; v1.x adds depth):

| Phase | Audio scope | Audio MB added | Repo total |
|---|---|---|---|
| v1 (v0.9) | English full hierarchy (33 episodes) + Curator Full Tour in 12 locales (8 segments × 12) | ~91 MB | ~445 MB |
| v1.1 | Guide-level episodes in priority locales (en, es, fr, de, it, pt, ja): 11 routes × 7 locales | +~140 MB | ~585 MB |
| v1.2 | Enthusiast object-level episodes in priority locales | +~120 MB | ~705 MB |
| v1.3 | Long-tail locales (zh, ko, ru, sr-Cyrl, hi) at all levels | +~95 MB | ~800 MB |
| v2 | VPS docker-compose migration (planned for v1.0 of overall product per Marko 2026-05-16) absorbs whatever audio corpus has accumulated; CDN trigger no longer GH-Pages-bound | | |

Mobile (Capacitor) ships only the user's locale of audio, which adds ~67 MB to the ~85 MB Capacitor budget per RFC-018 §4. Other locales lazy-fetch from chipi.github.io on locale switch.
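The running totals in the phasing table can be re-derived from the per-phase additions. The sketch below is illustrative only: the constant and function names are hypothetical, the figures are the PRD's own estimates, and the 1 GB soft limit is taken as 1000 MB for simplicity. The exact sums land about 1 MB above the table's rounded values.

```typescript
// Cumulative repo size per phase, using the estimates from the phasing table.
// All names here are illustrative, not part of the Orrery codebase.
const BASE_BUILD_MB = 355;            // existing build size quoted in the goal
const GH_PAGES_SOFT_LIMIT_MB = 1000;  // assumption: soft 1 GB limit read as 1000 MB

interface Phase {
  name: string;
  audioAddedMb: number;
}

const phases: Phase[] = [
  { name: "v1 (v0.9)", audioAddedMb: 91 },
  { name: "v1.1", audioAddedMb: 140 },
  { name: "v1.2", audioAddedMb: 120 },
  { name: "v1.3", audioAddedMb: 95 },
];

// Returns the repo total after each phase has landed, in table order.
function cumulativeTotals(baseMb: number, list: Phase[]): { name: string; totalMb: number }[] {
  let running = baseMb;
  return list.map((p) => {
    running += p.audioAddedMb;
    return { name: p.name, totalMb: running };
  });
}

const totals = cumulativeTotals(BASE_BUILD_MB, phases);
for (const t of totals) {
  console.log(`${t.name}: ${t.totalMb} MB total, ${GH_PAGES_SOFT_LIMIT_MB - t.totalMb} MB headroom`);
}
```

Keeping the deltas as data rather than hard-coding totals makes it cheap to re-check headroom whenever an estimate changes.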


§user-stories

US-1 — Screen narration (Guide voice). Visitor opens /explore, taps the waveform icon in the nav, and a 5–8 minute episode plays explaining what they're looking at — the scale, the time, the planets in their orbits — while the scene continues to render. The narration does not interrupt camera controls or planet hover. The user can pause, scrub, change speed, and switch locale without losing position.

US-2 — Object narration (Enthusiast voice). Visitor selects Mars on /explore, taps "More about this object" inside the planet detail panel, and a 90-second episode plays specifically about Mars: its orbital eccentricity, why missions launch in 26-month windows, the roughly 22-minute one-way signal delay at maximum range. Numbers are spoken with their unit; equations are voiced as their physical meaning, not as letters.

US-3 — Full Tour (Curator playlist). Visitor on / taps "Take the tour" and gets a ~90 minute documentary-order playlist that walks them through the whole product. The Curator voice acts as a docent: introduces each section, hands off to Guide for screen-level narration, hands off to Enthusiast for object-level deep-dives, and closes with a Sagan-register epilogue. Resumable, scrubbable, exits cleanly to free navigation if interrupted.

US-4 — Episode inventory. From the audio overlay, the user can see "all the audio for this screen" plus "all episodes across the product, grouped by route" plus "what I've already heard". The heard-tracking is in-memory only (lost on reload — ADR-057 forbids localStorage); a future v1.1 may opt-in a single-cookie heard-bitset if usage data justifies the persistence cost.
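A minimal sketch of what in-memory heard-tracking could look like under the ADR-057 constraint. The class name, method names, and episode IDs are hypothetical; the only property taken from the PRD is that state lives in memory and disappears on reload.

```typescript
// In-memory heard-state: nothing is written to localStorage or cookies,
// so the set is empty again after every page reload.
class HeardTracker {
  private heard = new Set<string>();

  markHeard(episodeId: string): void {
    this.heard.add(episodeId);
  }

  hasHeard(episodeId: string): boolean {
    return this.heard.has(episodeId);
  }

  // Splits an episode inventory into heard / unheard, preserving input order,
  // for an overlay view such as "what I've already heard".
  partition(allIds: string[]): { heard: string[]; unheard: string[] } {
    const heard: string[] = [];
    const unheard: string[] = [];
    for (const id of allIds) {
      (this.heard.has(id) ? heard : unheard).push(id);
    }
    return { heard, unheard };
  }
}

// Usage with placeholder episode IDs.
const tracker = new HeardTracker();
tracker.markHeard("explore-guide");
console.log(tracker.partition(["explore-guide", "mars-enthusiast"]));
```

If a later release opts into the single-cookie bitset, only the backing store inside the class would need to change; callers would keep the same three methods.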

US-5 — i18n parity. Every voiced episode that exists in en-US must exist in all 12 supported locales by the time that locale tier ships (per the phasing table). A locale-switch event (?lang=es URL change or LocaleSwitcher click) restarts the currently-playing episode in the new locale at the matching timestamp (best-effort match, not bit-perfect).
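One way to read "best-effort match" is a proportional mapping, which is how S4 later phrases it: carry over the fraction of the episode already played rather than the absolute time, since the same script runs to different lengths in different languages. The function name below is hypothetical.

```typescript
// Maps a playback position in the old-locale audio onto the new-locale audio
// by preserving the fraction of the episode already played.
function matchTimestamp(currentSec: number, oldDurationSec: number, newDurationSec: number): number {
  // Guard against unknown or zero durations (e.g. metadata not loaded yet).
  if (!(oldDurationSec > 0) || !(newDurationSec > 0)) return 0;
  const fraction = Math.min(Math.max(currentSec / oldDurationSec, 0), 1);
  return fraction * newDurationSec;
}

// Halfway through a 300 s English episode maps to halfway through a 360 s translation.
console.log(matchTimestamp(150, 300, 360)); // 180
```

A proportional match can land mid-sentence; snapping to the nearest caption cue start would be a possible refinement, but that is not something the PRD specifies.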

US-6 — Offline playback. On the web, the PWA service worker caches whatever audio the user has played (per ADR-029, autoUpdate semantics). On mobile (Capacitor), the user's primary locale is bundled at install time; airline / no-signal use cases play the full episode set without network. Switching to a non-bundled locale on mobile fetches that locale's assets and caches them for offline replay.

US-7 — Caption and transcript surface. Every audio episode has a synced caption track (WebVTT) and a downloadable plain-text transcript in the playing locale. Caption rendering happens inside the overlay; transcript downloads as .txt. Required for accessibility (ADR-025) and for users who prefer reading.
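As a rough illustration of how the caption track and the plain-text transcript could come from the same cue list, the sketch below serialises cues to WebVTT and to .txt. The Cue shape and function names are assumptions, and the cue text is placeholder copy, not script content.

```typescript
// A timed caption cue; times are in seconds from the start of the episode.
interface Cue {
  startSec: number;
  endSec: number;
  text: string;
}

// Formats seconds as a WebVTT timestamp: hh:mm:ss.ttt
function formatTimestamp(sec: number): string {
  const totalMs = Math.round(sec * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Serialises cues as a WebVTT file: a "WEBVTT" header, then cues separated by blank lines.
function toWebVtt(cues: Cue[]): string {
  const blocks = cues.map(
    (c) => `${formatTimestamp(c.startSec)} --> ${formatTimestamp(c.endSec)}\n${c.text}`
  );
  return ["WEBVTT", ...blocks].join("\n\n") + "\n";
}

// The transcript is the same text without timing, one cue per line.
function toTranscript(cues: Cue[]): string {
  return cues.map((c) => c.text).join("\n") + "\n";
}

const demoCues: Cue[] = [
  { startSec: 0, endSec: 2.5, text: "Placeholder line one." },
  { startSec: 2.5, endSec: 75.5, text: "Placeholder line two." },
];
console.log(toWebVtt(demoCues));
console.log(toTranscript(demoCues));
```

Deriving both files from one cue list keeps captions and transcript from drifting apart when a script is revised.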


§must-have requirements

| ID | Requirement |
|---|---|
| M1 | Single audio overlay component, presented as right-panel on desktop and bottom-sheet on mobile (<800 px viewport). Same component, two layouts. Triggered by waveform icon in Nav.svelte. |
| M2 | Three voice personas (Curator / Guide / Enthusiast) implemented as separate voice IDs per locale, curated in static/data/audio/voices.json. Per-locale voice testing required before that locale ships. |
| M3 | Episode taxonomy covers all 11 routes (not just 6 as in the original draft): /, /explore, /missions, /fly, /earth, /moon, /mars, /iss, /tiangong, /science, /fleet. Each route gets at least one Guide-level screen episode in en-US for v1; full per-locale rollout per phasing table. |
| M4 | Build-time-only TTS generation. No runtime API calls to TTS providers from the browser. All audio is .mp3 files served as static assets. Provider credentials never reach the client. |
| M5 | TTS provider abstraction (TtsProvider interface in RFC-019 §provider-abstraction). v1 ships with ElevenLabs as the anchor; the system can swap to OpenAI / Google Cloud / Azure / Coqui-local with an env-var change + new voice IDs in voices.json. No pipeline rewrite. |
| M6 | Audio asset packaging: MP3 96 kbps mono (Opus 32 kbps where browser support is universal; likely v1.1+). Target average episode size ~2 MB / 5–8 min for screen episodes, ~600 KB / 90 s for object episodes. |
| M7 | All audio assets live under static/audio/{locale}/{persona}/{episode-id}.mp3. Web build serves them as static files; Capacitor sync includes only the user's locale (RFC-018 §4 lazy-loading pattern). Hosting is host-agnostic: the same paths work on GH Pages today and on the planned VPS docker-compose at v1.0. |
| M8 | Heard-state tracking is in-memory only, runtime-scoped (lost on reload). ADR-057 forbids localStorage; the only persistent storage allowed is the orrery_locale cookie. v1 does not persist heard episodes. (v1.x may opt in a single-cookie bitset if data justifies it.) |
| M9 | Caption tracks (WebVTT) generated alongside every audio asset, served from static/audio/{locale}/{persona}/{episode-id}.vtt. Captions render inside the overlay; toggle defaults to on when ANY of: prefers-reduced-motion set, screen-reader detected, Audio.muted == true, OR effective bandwidth < 1 Mbps (navigator.connection.effectiveType heuristic, best-effort). |
| M10 | Transcripts (plain text) downloadable from the overlay; lives at static/audio/{locale}/{persona}/{episode-id}.txt. |
| M11 | Async generation pipeline runs both locally (Marko's machine, manual npm run audio:build) and in GitHub Actions (free-tier minutes, automated on script-PR merge). Same script, same outputs, same cache keys. RFC-019 §async-generation. |
| M12 | Cost telemetry: every TTS API call records (provider, locale, persona, char count, $cost) into static/data/audio/cost-ledger.json. Free-tier accounting per provider is the ledger's job, not the pipeline's. |
| M13 | Service-worker cache strategy: audio fetched once, cached forever. Cache invalidation by content hash in the asset URL ({episode-id}.{hash8}.mp3). |
| M14 | The waveform icon's behaviour adapts to device input: on touch devices, opens the overlay as bottom-sheet; on desktop with keyboard focus, opens as right-panel and traps focus inside. |
| M15 | Episode-share-link: /?audio=explore-guide-en-US deep-links to "open /explore, autoplay this episode"; works on web and through Capacitor's deep-link handler (RFC-018 §7). |
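The path conventions in M7, M9, M10, and M13, and the share-link parameter in M15, can be sketched as two small helpers. This is a sketch under assumptions: the function names are hypothetical, lowercase persona directory names are a guess, and the hash is taken as an input rather than computed. The deep-link value is returned as an opaque ID because the PRD does not spell out how it decomposes.

```typescript
// Assumption: persona directories use lowercase names.
type Persona = "curator" | "guide" | "enthusiast";
type AssetKind = "mp3" | "vtt" | "txt";

// Builds static/audio/{locale}/{persona}/{episode-id}.{ext}, optionally with the
// {hash8} content-hash segment that M13 uses for cache invalidation.
function assetPath(
  locale: string,
  persona: Persona,
  episodeId: string,
  kind: AssetKind,
  hash8?: string
): string {
  const file = hash8 ? `${episodeId}.${hash8}.${kind}` : `${episodeId}.${kind}`;
  return `static/audio/${locale}/${persona}/${file}`;
}

// Extracts the ?audio= value from a share link, or null when absent or empty.
// The base URL only exists so that relative links like "/?audio=..." parse.
function parseAudioParam(link: string): string | null {
  const value = new URL(link, "https://example.invalid").searchParams.get("audio");
  return value ? value : null;
}

console.log(assetPath("en-US", "guide", "explore", "mp3", "a1b2c3d4"));
console.log(assetPath("en-US", "guide", "explore", "vtt"));
console.log(parseAudioParam("/?audio=explore-guide-en-US"));
```

Because the hash sits in the file name, a "cached forever" service-worker policy stays safe: a regenerated episode gets a new URL instead of overwriting a cached one.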

§should-have requirements

| ID | Requirement |
|---|---|
| S1 | Speed control (0.75x / 1x / 1.25x / 1.5x) inside the overlay. |
| S2 | "Continue where I left off": purely runtime, not persisted; if the user reloads, position resets. |
| S3 | Visual cue on the screen during playback: the route's HUD shows a discreet pulse-bar so the user remembers the audio is on. |
| S4 | Locale-switch mid-playback restarts the current episode in the new locale at the proportionally-matched timestamp (best-effort). |
| S5 | "Skip to next episode" + "skip to previous episode" within Full Tour playlist. |
| S6 | Per-screen autoplay-prompt: first time the user lands on /explore (or /fly, etc.) the overlay shows a non-modal "1-tap to listen to this screen's episode" toast. Dismissed → never shown again that session. |

§will-not-have (v1)

  • Real-time TTS. No browser-side voice generation. All audio is pre-generated.
  • User-recorded audio / community contributions. Editorial control stays in-house.
  • Music bed / ambient layer. A v2 candidate; out of v1 scope to keep ship date honest.
  • Push notifications for "new episode". PWA push is out of project scope (CLAUDE.md).
  • Persistent heard-state. Per ADR-057 storage rules; revisit in v1.x.
  • Voice persona names surfaced in UI. The personas (Curator/Guide/Enthusiast) are an internal editorial tool, not a user-facing taxonomy. The user just hears "the right voice for this moment."
  • iOS-only voice cloning of a "signature Orrery voice". Optional v2 candidate via Coqui-local + ElevenLabs voice cloning; out of v1.

§success-criteria

Editorial:

  1. A first-time visitor on / who taps "Take the tour" stays for ≥ 10 minutes (proxy for the museum-grade atmosphere goal).
  2. Each of the 8 "Atmospheric Moves" identified in RFC-019 (signal-delay, porkchop, pale blue dot, 14.5-second delay, capability ladder, Cernan's last words, far side, Curiosity persistence) lands as a recognisable beat — verified by Marko + 3 reviewer listens before the locale ships.

Technical:

  3. v1 web build stays under 500 MB total repo size (current 355 MB + ~91 MB v1 audio + headroom for poster updates).
  4. v1 mobile (Capacitor) bundle stays under ~150 MB installed (RFC-018 M11 ceiling), with the user's locale of audio bundled.
  5. Cold-start to first-audible-narration < 1.5 s on 4G (target 800 ms cached).
  6. PWA service-worker hit-rate for audio replays > 90 % (audio plays once, cached forever).

Operational:

  7. Async generation pipeline completes a full English-locale rebuild (~33 episodes ≈ 264k chars) in < 15 min on Marko's M-series Mac, < 30 min in GH Actions.
  8. TTS cost for v1 priority cut (~1.4 M chars total: full English + Curator Tour × 12 locales) lands under $50 one-shot (provider-dependent; see RFC-019 §cost-analysis).
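The operational figures imply two derived numbers: about 8,000 characters per English episode, and a blended ceiling of roughly $35.71 per million characters if the whole v1 cut is to stay under $50. The sketch below works those out and shows how a ledger with the M12 fields could be totalled. Field names, provider IDs, and the sample costs are placeholders, not real provider pricing.

```typescript
// One row per TTS call, mirroring the tuple listed in M12.
// The field names are an assumption about how cost-ledger.json might be shaped.
interface LedgerEntry {
  provider: string;
  locale: string;
  persona: string;
  charCount: number;
  costUsd: number;
}

interface LedgerSummary {
  totalChars: number;
  totalUsd: number;
  usdByProvider: Record<string, number>;
}

// Totals characters and spend, overall and per provider.
function summarizeLedger(entries: LedgerEntry[]): LedgerSummary {
  const summary: LedgerSummary = { totalChars: 0, totalUsd: 0, usdByProvider: {} };
  for (const e of entries) {
    summary.totalChars += e.charCount;
    summary.totalUsd += e.costUsd;
    summary.usdByProvider[e.provider] = (summary.usdByProvider[e.provider] ?? 0) + e.costUsd;
  }
  return summary;
}

// Figures quoted in the operational success criteria.
const ONE_SHOT_BUDGET_USD = 50;
const V1_TOTAL_CHARS = 1400000;
const ENGLISH_CHARS = 264000;
const ENGLISH_EPISODES = 33;

const avgCharsPerEpisode = ENGLISH_CHARS / ENGLISH_EPISODES;                      // 8000
const maxUsdPerMillionChars = ONE_SHOT_BUDGET_USD / (V1_TOTAL_CHARS / 1000000);   // about 35.71

// Placeholder entries: the costs are made up for illustration.
const sampleLedger: LedgerEntry[] = [
  { provider: "google-cloud", locale: "en-US", persona: "guide", charCount: 8000, costUsd: 0 },
  { provider: "elevenlabs", locale: "en-US", persona: "curator", charCount: 8000, costUsd: 2 },
];
const ledgerSummary = summarizeLedger(sampleLedger);
console.log(avgCharsPerEpisode, maxUsdPerMillionChars.toFixed(2));
console.log(ledgerSummary.totalUsd <= ONE_SHOT_BUDGET_USD ? "within budget" : "over budget");
```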


§dependencies

  • PRD-015 / RFC-018 (mobile wrapper) must ship first; the audio system's mobile bundle strategy depends on RFC-018 §4 lazy-locale machinery.
  • PRD-014 (surface hotspots) should ship before this; otherwise we're authoring narration scripts about landing sites whose visual model is still in flight.
  • ADR-029 (PWA service worker) — already in place.
  • ADR-057 (no localStorage) — constrains heard-state design.
  • ADR-031 / ADR-032 / ADR-033 (i18n strategy) — defines the 12-locale catalog.

§resolved decisions

Resolved 2026-05-16 in conversation with Marko.

  1. v1 ambition — RESOLVED: Full hierarchy × 12 locales as the editorial goal; phased ship per the §goal table. Audio v1 (product v0.9) = English full hierarchy + 12-locale Curator Full Tour; v1.x adds depth.
  2. TTS provider — RESOLVED: ElevenLabs as anchor, with provider abstraction (M5) so swap to OpenAI / Google / Azure / Coqui-local is an env-var change. No upfront work to integrate alternates; the abstraction earns the optionality.
  3. v1 provider sequencing — RESOLVED: Hybrid. Google Cloud TTS (free tier) generates the bulk of the corpus at $0; ElevenLabs voices the 8 Atmospheric Moves anchor episodes only (~$15 total). Mixed-provider via voices.json. RFC-019 §4.3.
  4. Mobile audio v1 — RESOLVED: Bundle the user's locale only (~67 MB add to the ~85 MB Capacitor budget). Other locales lazy-fetch on switch. Locale's WebVTT captions bundled too (~1 MB add).
  5. Audio asset hosting — RESOLVED: GH Pages site (static/audio/) for v1. Re-evaluation triggered by Marko's planned v1.0 VPS docker-compose migration, not by an audio-size threshold; design keeps audio host-agnostic so the migration is a config change, not a rewrite.
  6. Async generation — RESOLVED: Both local AND GH Actions, same pipeline. Marko triggers locally for iteration; GH Actions runs on script-PR merge for completeness.
  7. Voice persona surfacing in UI — RESOLVED: Implicit. No badge, no Curator/Guide/Enthusiast label in the overlay. The user just hears the right voice for the moment. Keeps focus on content; matches museum-audio-guide UX. Lets us re-cast personas later without UI churn.
  8. Curator Full Tour ordering — RESOLVED: Documentary order. Curator opens (pale-blue-dot register) → Solar System big picture → closer to home (Earth, Moon) → missions sent → people in space (ISS, Tiangong) → Mars + future → Curator close. Optimised for ~90-min listen-through. NOT nav order.
  9. Re-translation strategy — RESOLVED: Full episode re-translate on source change. Cost ≈ $0.50 per episode revision (Claude API). Paragraph-diff optimisation deferred to v1.1 if revision frequency justifies it.
  10. Caption auto-on triggers — RESOLVED: ALL FOUR. prefers-reduced-motion set, screen-reader detected, Audio.muted == true, OR effective bandwidth < 1 Mbps. M9 covers this.
  11. Cost-budget thresholds — RESOLVED: $50/mo soft warn, $200/mo hard halt. Looser than initial recommendation; gives headroom for one-shot rebuilds during iteration.
  12. Mobile VTT bundling — RESOLVED: Yes. Captions bundled alongside the user's locale of audio (~1 MB total per locale). Accessibility parity with web; deaf / hard-of-hearing + Audio.muted + airplane-mode users all covered.
  13. Per-locale voice-quality review — RESOLVED: Defer non-en review until v1.1. Audio ships in all 12 locales for v1 with a "beta" UI flag on non-en locales (small chip on the overlay header for affected locales). Reviewers recruited in v1.1; non-en flagged as beta until reviewed. Faster ship; honest about the quality gap.
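Decisions 3, 10, and 11 above each reduce to a small pure function. The sketch below is one possible reading, not the RFC-019 design. All names are hypothetical, the provider IDs are placeholders, treating "slow-2g", "2g", and "3g" as "below 1 Mbps" is an assumed mapping of effectiveType, and treating the dollar thresholds as inclusive is also an assumption.

```typescript
type ProviderId = "google-cloud" | "elevenlabs";

// Decision 3: anchor episodes go to ElevenLabs, everything else to Google Cloud TTS.
function pickProvider(episodeId: string, anchorEpisodeIds: ReadonlySet<string>): ProviderId {
  return anchorEpisodeIds.has(episodeId) ? "elevenlabs" : "google-cloud";
}

// Decision 10: inputs are plain values so the rule can be tested without a browser.
interface CaptionSignals {
  prefersReducedMotion: boolean;
  screenReaderDetected: boolean;
  audioMuted: boolean;
  effectiveType?: string; // navigator.connection.effectiveType, undefined where unsupported
}

// Assumption: these effectiveType values stand in for "effective bandwidth < 1 Mbps".
const SLOW_CONNECTION_TYPES: ReadonlySet<string> = new Set(["slow-2g", "2g", "3g"]);

// Captions default to on when any one of the four triggers fires.
function captionsDefaultOn(s: CaptionSignals): boolean {
  const slowNetwork = s.effectiveType !== undefined && SLOW_CONNECTION_TYPES.has(s.effectiveType);
  return s.prefersReducedMotion || s.screenReaderDetected || s.audioMuted || slowNetwork;
}

// Decision 11: $50/mo soft warn, $200/mo hard halt (thresholds assumed inclusive).
type BudgetState = "ok" | "warn" | "halt";

function budgetState(monthToDateUsd: number): BudgetState {
  if (monthToDateUsd >= 200) return "halt";
  if (monthToDateUsd >= 50) return "warn";
  return "ok";
}

const anchors = new Set(["pale-blue-dot"]); // placeholder anchor episode ID
console.log(pickProvider("pale-blue-dot", anchors), pickProvider("explore-guide", anchors));
console.log(captionsDefaultOn({
  prefersReducedMotion: false,
  screenReaderDetected: false,
  audioMuted: false,
  effectiveType: "3g",
}));
console.log(budgetState(75));
```

Where navigator.connection is unavailable, effectiveType stays undefined and the bandwidth trigger simply never fires, which matches the "best-effort" wording in M9.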

§open questions

All v1 architectural questions resolved. Operational follow-ups:

  1. "Beta" UI flag visual treatment for non-en locales. Small chip + tooltip ("Voice quality reviewed in en-US only; other locales pending v1.1 review") in the overlay header. Treatment + copy can be polished at implementation; not blocking.
  2. Music bed — v2 candidate, not v1. Re-open as a separate PRD if v1 surfaces "the silence between Curator segments feels empty."

PRD-016 · Orrery · Science Overlay & Episode System · Drafted 2026-05-16 · Closes-into-RFC-019
