RFC-077: Viewer feeds API, operator config, job runner, and process hygiene (serve)
- Status: Draft
- Authors: Podcast Scraper Team
- Stakeholders: Maintainers; viewer operators
- Tracking: GitHub #626
- Related PRDs:
- PRD-030
- PRD-003
- PRD-025
- Related ADRs:
- ADR-064 — one server; feature-flagged routes (see § ADR for whether a new ADR is needed for subprocess jobs)
- Related RFCs:
- RFC-008: Config / validation
- RFC-062: viewer + serve
- RFC-063
- RFC-065 — CLI monitor only; not authoritative for HTTP jobs
- RFC-007
- Related UX specs:
- VIEWER_IA.md
- UXS-001
- UXS-006
- Related Documents:
- SERVER_GUIDE.md
- CONFIGURATION.md
- GitHub #593: an optional root-level profile: key merges packaged config/profiles/<name>.yaml defaults (Config._resolve_profile)
Abstract
This RFC specifies Phase 1: (a) opt-in GET/PUT /api/feeds for structured corpus feeds.spec.yaml (root { feeds: [...] } — JSON on the wire; extension selects YAML vs JSON on disk when using --feeds-spec); (b) opt-in GET/PUT /api/operator-config for a viewer-safe operator YAML file whose location follows precedence: path from podcast serve --config-file when the server was started with it, else a fixed basename under the resolved corpus root (e.g. viewer_operator.yaml). Secrets never belong in that file — only environment variables; PUT and (optionally) GET enforce forbidden secret keys and forbidden feed-list keys at the YAML root (see § Phase 1b), aligned with RFC-008. GET also returns available_profiles for the viewer profile picker (#593).
Phase 2 specifies HTTP-triggered pipeline jobs (child OS subprocess of serve — see § Phase 2 — Architecture decision for why and what we rejected), a durable job registry, stale and orphan detection, cancel semantics, and operator-facing reconciliation so many runs do not leave ambiguous background state. GET /api/health gains only capability booleans (feeds_api, operator_config_api, jobs_api); job payloads never live on health.
Product choices (from stakeholder, #626): secrets live only in env; the config path comes from serve --config-file when set, else the corpus default.
Problem Statement
Operators need feeds + config + job clarity without leaving the viewer. Without hygiene, many subprocess jobs create zombie/stale records, orphan PIDs, and unclear “what is still running” at end of day. Without config in the UI, operators over-type paths and paste secrets into YAML — we forbid the latter in the operator file and steer env instead.
Goals
- Feeds API + health flag + serve opt-in (Phase 1a).
- Operator config API + validation + health flag + serve opt-in (Phase 1b — may ship same or next PR).
- Job runner + registry + cancel + stale/reconcile + Dashboard surfacing (Phase 2).
- Document forbidden keys and path resolution in SERVER_GUIDE.
Constraints & Assumptions
- Corpus path resolution reuses resolve_corpus_path_param in src/podcast_scraper/server/pathutil.py (same as other viewer routes).
- Forbidden keys on PUT: maintain FORBIDDEN_OPERATOR_CONFIG_KEYS (or generate from Config fields documented as secrets in CONFIGURATION.md / field names matching *_api_key, api_key, openai_api_key, etc.). The implementation picks one approach; tests must cover rejection.
- GET may return 409 / structured error if the on-disk file already contains forbidden keys (legacy). The operator fixes the file out-of-band, or the server offers a one-time strip behind an explicit ?force_sanitize=1 (optional; the default stays safe).
Design & Implementation
Phase 1a — Feeds (structured spec)
- Canonical basename: feeds.spec.yaml at corpus root (writer default). --feeds-spec PATH accepts .yaml, .yml, or .json; parser choice follows the file extension (same pattern as main --config / load_config_file).
- Document schema: top-level object with a single feeds array (required). Each element is either a string (RSS URL) or an object with required url (http/https) and optional keys that are an explicit allowlist of per-inner-run Config fields (download resilience, timeouts, user agent, episode window; see podcast_scraper.rss.feeds_spec.RSS_FEED_ENTRY_OVERRIDE_KEYS). Unknown keys on feed objects are rejected (extra="forbid"). Unknown top-level keys besides feeds and optional _comment* are rejected.
- GET/PUT /api/feeds?path= (JSON): GET returns path, file_relpath, feeds (mixed strings and objects; URL-only entries may be returned as bare strings). PUT body { "feeds": [...] }; server validates, dedupes by URL (first-seen order), writes UTF-8 YAML atomically to feeds.spec.yaml.
- Merge order (CLI and jobs): packaged profile: (if any) → global operator Config → per-feed entry fields from the spec file for overlapping keys on each inner run_pipeline config (model_copy update).
- feeds_api on GET /api/health reflects app.state.feeds_api_enabled (from create_app(..., enable_feeds_api=...)).
- create_app(..., enable_feeds_api=False); PODCAST_SERVE_ENABLE_FEEDS_API for the uvicorn reload factory (serve_feature_kwargs_from_environ).
- Fix typo in any prior notes: health route is GET /api/health, not /api/handler.
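The PUT-side handling of the feeds array (validate each entry, dedupe by URL in first-seen order, hand URL-only entries back as bare strings) can be sketched roughly as follows. This is a minimal illustration, not the shipped code: normalize_feeds is a hypothetical helper, and ALLOWED_OVERRIDE_KEYS is a placeholder for RSS_FEED_ENTRY_OVERRIDE_KEYS, whose real members are not listed in this RFC.

```python
from urllib.parse import urlparse

# Placeholder allowlist (assumed names); the real set is RSS_FEED_ENTRY_OVERRIDE_KEYS.
ALLOWED_OVERRIDE_KEYS = {"timeout", "user_agent"}


def normalize_feeds(feeds: list) -> list:
    """Validate feed entries and dedupe by URL, keeping first-seen order."""
    seen: set[str] = set()
    out: list = []
    for entry in feeds:
        if isinstance(entry, str):
            obj = {"url": entry}
        elif isinstance(entry, dict):
            obj = dict(entry)
        else:
            raise ValueError(f"feed entry must be a string or object: {entry!r}")
        url = obj.get("url")
        if not isinstance(url, str) or urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"feed url must be http/https: {url!r}")
        unknown = set(obj) - {"url"} - ALLOWED_OVERRIDE_KEYS
        if unknown:
            raise ValueError(f"unknown feed keys: {sorted(unknown)}")
        if url in seen:
            continue  # later duplicates of the same URL are dropped
        seen.add(url)
        # URL-only entries go back out as bare strings.
        out.append(url if set(obj) == {"url"} else obj)
    return out
```

In this sketch the first occurrence of a URL wins, so per-feed overrides on a later duplicate are discarded; whether the real server merges or rejects such duplicates is not stated here.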
Phase 1b — Operator config file
Path precedence (computed once at server startup — not chosen by the browser):
- If podcast serve was started with --config-file <PATH> (wired through parse_serve_argv → create_app(..., operator_config_file=PATH) and PODCAST_SERVE_CONFIG_FILE for reload factory), set app.state.operator_config_path = Path(PATH).expanduser().resolve().
- Else set app.state.operator_config_path = output_dir.resolve() / "viewer_operator.yaml" (basename documented in SERVER_GUIDE).
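The precedence rule is small enough to express as one function. The sketch below assumes a hypothetical helper name (resolve_operator_config_path) and plain arguments; in the server the result would be stored on app.state.operator_config_path once at startup.

```python
from pathlib import Path
from typing import Optional


def resolve_operator_config_path(config_file: Optional[str], output_dir: Path) -> Path:
    """Pick the operator config path once, at startup."""
    if config_file:
        # --config-file wins when the server was started with it.
        return Path(config_file).expanduser().resolve()
    # Otherwise fall back to the fixed basename under the corpus root.
    return output_dir.resolve() / "viewer_operator.yaml"
```

Because the browser never contributes to this computation, a request cannot steer reads or writes to another file.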
Security: GET/PUT accept the same path= corpus query as other viewer routes (for anchor checks and consistency), but the YAML bytes are read/written only at operator_config_path — the client never supplies a filesystem path for the config file itself. That honors “prefer --config-file when set, else corpus default” without path injection.
Wire podcast serve: add optional --config-file; set env for reload; pass into create_app.
HTTP:
- GET /api/operator-config?path=<corpus>: resolve corpus with resolve_corpus_path_param; JSON matches OperatorConfigGetResponse: corpus_path, operator_config_path (absolute path string), content, available_profiles (sorted list of packaged preset names without .yaml extension). Side effect: if the operator file is missing or whitespace-only and cloud_balanced is in available_profiles, the handler creates the file with profile: cloud_balanced before returning (so the viewer profile picker has a sane default without an extra PUT). If cloud_balanced is not packaged in the environment, content may stay empty until the operator PUTs.
- PUT /api/operator-config?path=<corpus>: same corpus resolution; body = YAML string; reject forbidden secret keys and forbidden feed-list top-level keys (see § Feeds vs operator YAML); atomic write to operator_config_path only.
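One common way to get the atomic write that PUT calls for is to write a temporary file in the same directory and rename it over the target. The sketch below shows that pattern with a hypothetical helper name (atomic_write_text); it is an illustration of the technique, not necessarily how the server implements it.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    """Write text to target so readers never observe a partial file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, target)  # swaps the new file in as a single rename
    except BaseException:
        # Leave no stray temp file behind if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```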
Health: operator_config_api: bool reflects app.state.operator_config_api_enabled.
Viewer: status bar Config next to Feeds; shared modal with tabs (Feeds | Config) per UXS-001; Config tab = Profile <select> (from available_profiles + “None”) + monospace overrides YAML + Save; surface validation errors from 422 / forbidden-key 400 bodies.
Profile composition (#593)
Operator YAML may include a single top-level profile: <preset> line. At pipeline load, Config merges config/profiles/<preset>.yaml as defaults, then applies explicit keys from the same operator file (explicit wins). The profile key is consumed and is not a persisted Config field. available_profiles on GET /api/operator-config lists packaged presets by unioning *.yaml stems from every config/profiles directory that exists: (1) cwd-relative config/profiles (same as Config._resolve_profile’s first candidate), (2) repo-root config/profiles next to the installed sources, de-duplicated by resolved path. Excludes *.example.yaml (stem ends with .example). If no directories exist, available_profiles is an empty array.
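The available_profiles rule (union of *.yaml stems across the candidate directories that exist, minus *.example.yaml, sorted) might look like the sketch below. The function name is hypothetical, and it reads "de-duplicated by resolved path" as skipping a directory that two candidates resolve to; the real implementation may de-duplicate differently.

```python
from pathlib import Path
from typing import Iterable, List


def list_available_profiles(candidate_dirs: Iterable[Path]) -> List[str]:
    """Union *.yaml stems from existing profile dirs, skipping *.example.yaml."""
    seen_dirs = set()
    names = set()
    for directory in candidate_dirs:
        if not directory.is_dir():
            continue  # missing candidates contribute nothing
        resolved = directory.resolve()
        if resolved in seen_dirs:
            continue  # same directory reached via two candidates
        seen_dirs.add(resolved)
        for path in resolved.glob("*.yaml"):
            if path.stem.endswith(".example"):
                continue
            names.add(path.stem)
    return sorted(names)
```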
Feeds vs operator YAML
Canonical viewer workflow: feeds live in corpus feeds.spec.yaml (Feeds API; CLI --feeds-spec with the same resolved path). Operator YAML must not duplicate feed sources at the root mapping. PUT /api/operator-config rejects these top-level keys (normalized): rss, rss_url, rss_urls, feeds — response detail.error = forbidden_operator_feed_keys when every forbidden key is a feed-list key; forbidden_operator_keys when any other forbidden key appears (including a mix of feed keys and secrets — clients should use detail.keys for specifics; same HTTP 400 shape with keys list).
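The choice between the two error codes can be sketched as below. The feed-list key set comes from this section; the secret-key test (api_key or any name ending in _api_key) and the normalization (strip + lowercase) are assumptions made for illustration, and classify_forbidden_keys is a hypothetical name.

```python
from typing import Iterable, Optional

FEED_LIST_KEYS = {"rss", "rss_url", "rss_urls", "feeds"}


def is_secret_key(key: str) -> bool:
    # Assumed pattern: api_key itself, or anything ending in _api_key.
    return key == "api_key" or key.endswith("_api_key")


def classify_forbidden_keys(top_level_keys: Iterable[str]) -> Optional[dict]:
    """Return a 400-style detail dict, or None when the root mapping is clean."""
    normalized = [str(k).strip().lower() for k in top_level_keys]
    forbidden = sorted({k for k in normalized if k in FEED_LIST_KEYS or is_secret_key(k)})
    if not forbidden:
        return None
    # The feed-specific code is used only when every offending key is a feed-list key.
    only_feed_keys = all(k in FEED_LIST_KEYS for k in forbidden)
    error = "forbidden_operator_feed_keys" if only_feed_keys else "forbidden_operator_keys"
    return {"error": error, "keys": forbidden}
```

A mixed payload such as rss_urls plus openai_api_key yields forbidden_operator_keys with both names in keys, which is why clients should read detail.keys rather than branch only on the code.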
Phase 2 jobs: when feeds.spec.yaml exists under the corpus anchor, child argv passes --feeds-spec <absolute path>; --config (same flag the main CLI uses) points at operator_config_path, so the effective pipeline config includes the profile: merge as above. Legacy --rss-file remains on the main CLI for line lists but is not what the job runner or Feeds API emit.
Phase 2 — Jobs and subprocess
- POST /api/jobs (enqueue), GET /api/jobs (list), GET /api/jobs/{id} (detail), POST /api/jobs/{id}/cancel, POST /api/jobs/reconcile (registry + PID hygiene).
- jobs_api on GET /api/health reflects app.state.jobs_api_enabled.
Phase 2 — Architecture decision: where the pipeline runs (review this)
Chosen approach (v1): start the pipeline as an OS child process of the uvicorn parent, typically via asyncio.create_subprocess_exec running sys.executable -m podcast_scraper.cli … (the same module operators run as python -m podcast_scraper.cli …; exact argv mirrors what operators would type). The HTTP handler returns 202 + job_id immediately; the parent records PID, redirects child stdout/stderr to a corpus-local log file, and a background asyncio task awaits the child to update the job registry (succeeded / failed / exit code).
Why we decided this (fit for podcast serve + ADR-064):
- CLI parity & debuggability: failures reproduce with the same entrypoint as manual runs; support and docs already center on podcast … argv.
- Crash and memory isolation: pipeline OOM or hard abort in native code should not tear down the HTTP server that serves search, artifacts, and the SPA.
- CPU / event loop: the scraper is CPU- and I/O-heavy; running it synchronously inside an async def route blocks the event loop. Starlette BackgroundTasks still run in-process and are widely treated as appropriate only for lightweight work; they do not add crash isolation. A separate interpreter process avoids sharing fate with the API worker.
- Alignment with “one server” (ADR-064): we add routes + a small runner inside the existing FastAPI app instead of introducing a second always-on service (broker worker) for the local operator story.
Compose / full-stack Docker (#660, RFC-079 Phase 2)
When the API runs inside RFC-079 compose and PODCAST_PIPELINE_EXEC_MODE=docker, GitHub #660 replaces the in-container sys.executable subprocess with a factory that invokes docker compose run into the pipeline or pipeline-llm service image. That path is additive: native make serve-api (full venv on the host) keeps the subprocess + argv model above unchanged. CONFIG_FILE / corpus profile (what the pipeline does) and viewer_operator.yaml → pipeline_install_extras (which Compose service / image tier runs the job) are separate; the server does not infer the latter from the profile name — keep them consistent (see make verify-stack-profiles). Docker-only operator metadata is validated only when enqueueing in Docker mode — not for laptop-only operators. See RFC-079 §Native vs Docker.
Alternatives considered (explicit non-picks for v1):
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| In-process service.run() on a thread pool (run_in_executor) | Easiest to attach Python loggers; no argv marshalling | Same process as serve: OOM / fatal error can kill the viewer; GIL contention with request threads; uvicorn --reload restarts abort in-flight runs; cancel is cooperative only | Not v1 default; acceptable only for very short dev experiments |
| Starlette BackgroundTasks | Minimal code | Runs after response but still in-process; same crash/GIL/reload issues; no strong job lifecycle without extra plumbing | Rejected for multi-hour pipelines |
| External task queue (Celery, RQ, Dramatiq, Redis + separate worker) | Retries, persistence, horizontal workers | New operational dependencies; second deployable; overkill for single-machine 127.0.0.1 workflows; duplicates what ADR-064 deferred as “platform” until needed | Future if we ship a hosted multi-tenant product |
| multiprocessing.Process (fork/spawn API) | In-Python IPC | Pickling large Config, spawn cost and macOS/Windows quirks; worse argv story than “just run CLI” | Subprocess CLI preferred for parity |
| External only (cron, systemd timer, file watcher) | Zero serve CPU | Disjoint from “Run job” in the UI; harder to attach job_id, cancel, and Dashboard rows to one product | Out of scope for interactive Phase 2 |
Implementation notes tied to this choice:
- Non-blocking spawn: use create_subprocess_exec (not subprocess.run() inside async def without offload) so the worker does not stall the event loop while the child runs for hours.
- Cancel (v1 shipped): parent sends SIGTERM to the child PID only (no timed grace → SIGKILL loop yet). Windows differs (TerminateProcess / taskkill); v1 is POSIX-first. Future revision may add grace + SIGKILL after documented PODCAST_* tuning.
- Durability caveat: until registry + replay exist, server restart may orphan in-flight children or lose queued rows. v1 documents local dev expectations; production hardening may later move toward option (c), broker-backed workers (the external task queue row in the alternatives table).
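The spawn, wait, and cancel notes above can be sketched with the standard asyncio subprocess API. Function names here (spawn_job, wait_for_exit, request_cancel) are hypothetical, registry updates are left out, and the SIGTERM path is POSIX-oriented as the notes say; treat it as a rough illustration rather than the shipped runner.

```python
import asyncio
import signal
import sys


async def spawn_job(argv: list[str], log_path: str) -> asyncio.subprocess.Process:
    """Start the child without blocking the event loop; output goes to a log file."""
    with open(log_path, "ab") as log:
        # The child inherits the descriptor, so the parent's handle can close right away.
        return await asyncio.create_subprocess_exec(
            sys.executable, *argv, stdout=log, stderr=asyncio.subprocess.STDOUT
        )


async def wait_for_exit(proc: asyncio.subprocess.Process) -> str:
    """Background waiter: map the exit code to a terminal status."""
    code = await proc.wait()
    return "succeeded" if code == 0 else "failed"


def request_cancel(proc: asyncio.subprocess.Process) -> None:
    """v1 cancel: SIGTERM to the child only, no grace period or SIGKILL."""
    if proc.returncode is None:
        proc.send_signal(signal.SIGTERM)
```

A handler would typically call spawn_job, record proc.pid, schedule wait_for_exit with asyncio.create_task, and return 202. In this sketch a terminated child simply reports failed; a real runner would consult the cancel_requested flag to decide whether the row ends as cancelled.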
Validation — why subprocess + local registry is sound (and where it stops)
This subsection records design validation (including informal cross-check against common FastAPI/Starlette practice). It is not a formal benchmark or security audit.
What we believe is correct for v1 (podcast serve, local operator, single machine):
- Subprocess is the right default for long, CPU-heavy, failure-prone work sitting beside an ASGI app: the pipeline can OOM, hit native extension faults, or run for hours without taking down HTTP routes the operator still needs (health, search, artifacts).
- CLI-shaped argv is the right integration surface for this repo: support and debugging already assume python -m podcast_scraper.cli …; reproducing a “bad job” does not require a different entrypoint than manual runs.
- Middle ground is intentional: community guidance typically buckets work into (a) in-process / BackgroundTasks for light post-response tasks, (b) separate process for isolation, (c) broker-backed workers (Celery, RQ, etc.) when persistence, retries, and horizontal workers matter. For local serve, (b) + a durable-enough registry matches product scope without (c)’s operational cost.
- ADR-064 alignment: we keep one long-lived server module and feature-flag job routes; we do not require a second always-on service for the first shipping slice.
Known limits (honest scope — not hidden “gotchas”):
- Durability across serve restart is weaker than a broker: in-flight or queued jobs may be orphaned or lost until replay / reconciliation semantics exist; v1 targets local dev clarity, not cloud SLA.
- uvicorn --workers > 1 requires cross-process locking (or documented single-worker constraint for job submission v1).
- Horizontal scale-out (many API replicas sharing one job queue) is out of scope for this design; escalation path is the external queue row in the alternatives table above.
- Windows cancel and signal semantics differ from POSIX; implementation must document behavior; v1 may be POSIX-first with best-effort Windows.
When we would revisit the decision: sustained need for cross-restart durability, multi-tenant job isolation, or multiple concurrent heavy jobs per corpus under many uvicorn workers → prefer broker + worker (or a small embedded queue with a second process) and supersede this section with a new ADR (see below).
ADR: do we need a separate, new ADR?
Default: no — not required to ship. RFC-077 plus existing ADR-064 (“one server, feature-flagged route groups”) are sufficient design records for the first implementation: the process model and alternatives live in this RFC’s § Phase 2 — Architecture decision and § Validation.
Add a new ADR (recommended later if any trigger applies):
| Trigger | Why extract an ADR |
|---|---|
| Cross-reference load | Another RFC or subsystem needs a one-line “Accepted: pipeline jobs run as subprocess of serve” without importing all of RFC-077. |
| Immutability ritual | Stakeholders want an Accepted ADR row in docs/adr/index.md as the canonical “this is how jobs run” pin. |
| Supersession | We adopt Celery/RQ (or similar) later — a new ADR should supersede the subprocess-first decision and point at the broker RFC. |
If we add an ADR: use the next free ADR number from docs/adr/index.md, keep this RFC as the detailed spec (API, registry, hygiene), and make the ADR a short decision + consequences with links back here.
Phase 2 — Job registry and hygiene (“end of day clear”)
Child argv (reference): subprocess includes --output-dir = resolved corpus anchor, --config = str(app.state.operator_config_path) (operator file may set profile: — merged with packaged preset before Config validation), and --feeds-spec = absolute path to feeds.spec.yaml under anchor when present (plus any agreed pipeline flags from RFC / CLI). Feed entries for the run come from that spec file, not from operator YAML root keys.
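Assembling that argv could look like the following sketch. The flags and module name are the ones named in this section; build_child_argv is a hypothetical helper, and the list is meant to be passed after sys.executable when spawning the child.

```python
from pathlib import Path


def build_child_argv(corpus_root: Path, operator_config_path: Path) -> list[str]:
    """Assemble CLI arguments for the pipeline child (passed after sys.executable)."""
    argv = [
        "-m", "podcast_scraper.cli",
        "--output-dir", str(corpus_root),
        "--config", str(operator_config_path),
    ]
    feeds_spec = corpus_root / "feeds.spec.yaml"
    if feeds_spec.is_file():
        # Only passed when the spec exists under the corpus anchor.
        argv += ["--feeds-spec", str(feeds_spec.resolve())]
    return argv
```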
Registry storage: under corpus root .viewer/jobs.jsonl (or SQLite if concurrency demands — start JSONL + file lock).
Job record fields (minimum): job_id, created_at, started_at, ended_at, status (queued | running | succeeded | failed | cancelled | stale), pid (nullable), argv_summary, exit_code, log_relpath, last_progress_at (optional heartbeat from worker thread writing sidecar).
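A JSONL registry with these fields might be sketched as below. The append-only layout where the last line per job_id wins is an assumption made for illustration, the helper names are hypothetical, and the file lock mentioned above is deliberately omitted to keep the example short.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class JobRecord:
    job_id: str
    created_at: str
    status: str = "queued"  # queued | running | succeeded | failed | cancelled | stale
    started_at: Optional[str] = None
    ended_at: Optional[str] = None
    pid: Optional[int] = None
    argv_summary: str = ""
    exit_code: Optional[int] = None
    log_relpath: Optional[str] = None
    last_progress_at: Optional[str] = None


def append_record(registry: Path, record: JobRecord) -> None:
    """Append one JSON line; file locking is omitted from this sketch."""
    registry.parent.mkdir(parents=True, exist_ok=True)
    with registry.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


def latest_by_job(registry: Path) -> dict:
    """Replay the log; the last line for each job_id wins."""
    latest: dict = {}
    if registry.exists():
        for line in registry.read_text(encoding="utf-8").splitlines():
            if line.strip():
                row = json.loads(line)
                latest[row["job_id"]] = row
    return latest
```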
Stale detection:
- Wall-clock timeout T (configurable, default e.g. 24h or product-set): if running and now - started_at > T → auto-mark stale or require POST /api/jobs/reconcile to promote. This RFC picks auto-mark stale + visible banner in Dashboard.
- PID liveness: periodic lightweight check (optional background task in server) using os.kill(pid, 0); if dead but status still running → orphan_reconciled → failed with reason.
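The two checks combine into a small decision function. In the sketch below the names (pid_alive, next_status) are hypothetical, the 24h default is only the example value mentioned above, and checking for a dead PID before the timeout is an ordering choice made for illustration. The os.kill(pid, 0) probe is POSIX-only; on Windows signal 0 has a different meaning.

```python
import os
from datetime import datetime, timedelta
from typing import Optional


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def next_status(status: str, started_at: datetime, pid: Optional[int],
                now: datetime, timeout: timedelta = timedelta(hours=24)) -> str:
    """Return the status a running job should move to, or the same status."""
    if status != "running":
        return status  # only running rows are candidates
    if pid is not None and not pid_alive(pid):
        return "failed"  # orphan: the child is gone but the row still says running
    if now - started_at > timeout:
        return "stale"
    return status
```

Note that a recycled PID can make an unrelated process look like a live job, so a liveness probe alone is a heuristic rather than proof.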
Cancel: POST /api/jobs/{id}/cancel sends SIGTERM (v1); registry records cancel_requested; if the child is still alive the row may remain running with cancel_requested: true until the process exits and the waiter updates status — clients should treat that as “cancel pending,” not necessarily terminal cancelled yet. Idempotent if already terminal.
Reconcile: POST /api/jobs/reconcile (operator-triggered) scans registry + PIDs, returns summary { "updated": n, "details": [...] } for UI toast and Pipeline refresh.
Multi-job policy: default one active running job per corpus (file lock); queue additional POSTs as queued or return 409 — RFC recommends queue with visible position in Pipeline table.
Multi-worker uvicorn: file lock on registry append + job spawn; document single worker for v1 if lock complexity deferred.
Distinction from RFC-065
| Aspect | RFC-065 | RFC-077 Phase 2 |
|---|---|---|
| Surface | CLI + terminal | HTTP + Vue + Dashboard |
| State | .pipeline_status.json | Job registry + API |
Optional bridge: child pipeline may still emit .pipeline_status.json; server may attach last lines to job record — not required v1.
Documentation deliverables
Phase 1a: SERVER_GUIDE (feeds), VIEWER_IA, UXS-001 (Feeds).
Phase 1b: SERVER_GUIDE (operator-config path rules, forbidden keys), UXS-001 (tabs / Config editor).
Phase 2: UXS-006 (Pipeline: status, stale, cancel, reconcile), UXS-001 (optional job chip), SERVER_GUIDE (job API).
Testing strategy
- Integration: feeds + operator-config PUT rejection on forbidden secret and forbidden feed keys; GET legacy file with secret → error path; GET includes available_profiles; feeds GET/PUT round-trip feeds JSON vs feeds.spec.yaml on disk; structured spec parser rejects unknown top-level keys and unknown per-feed keys.
- Job hygiene: unit/integration for stale transition + cancel mock subprocess.
- Playwright: mocked APIs for dialogs.
Risks & mitigations
| Risk | Mitigation |
|---|---|
| Secrets leak via GET | Forbidden-key policy; optional redact read |
| Config path confusion | Single operator_config_path on app.state at startup; SERVER_GUIDE path rules |
| Orphan subprocess | PID checks + cancel + reconcile |
Revision history
| Date | Change |
|---|---|
| 2026-04-19 | Initial Draft |
| 2026-04-19 | Operator config (path precedence + no secrets in file); job registry + stale/cancel/reconcile; health flags |
| 2026-04-19 | ADR-style section: child subprocess for pipeline vs in-process / BackgroundTasks / broker; alternatives table + external validation note |
| 2026-04-19 | § Validation expanded (soundness + limits); § ADR — default RFC-only, when to add new ADR |
| 2026-04-19 | Align with shipped code: app.state.*_enabled, operator_config_path, GET JSON shape, Phase 2 route list + reconcile, cancel v1 (SIGTERM only, cancel_pending semantics) |
| 2026-04-20 | Profiles (#593): available_profiles on GET; PUT rejects root feed keys; viewer profile picker + overrides; § Profile composition + feeds vs operator YAML |
| 2026-04-21 | available_profiles union cwd + repo config/profiles/; mixed PUT errors documented; viewer profile select sole source of truth on save |
| 2026-04-20 | Feeds spec: canonical feeds.spec.yaml, { feeds: [...] } schema, --feeds-spec, GET/PUT JSON shape (feeds not urls), per-feed override allowlist + merge order; jobs argv --config + --feeds-spec (replaces --rss-file / --config-file for child CLI) |
| 2026-04-22 | § Compose / #660: native subprocess jobs unchanged; Docker stack (#660) additive; pipeline_install_extras Docker-path-only — cross-ref RFC-079 §Native vs Docker |