Server Guide¶
The FastAPI server layer for the GI/KG viewer and future platform APIs.
Overview¶
podcast_scraper follows a "one pipeline core, multiple shells" philosophy:
the same Python library that powers the CLI, service API, and batch workflows
also backs an HTTP server.
The server exposes corpus artifacts, semantic search, GI explore,
Cross-layer CIL queries (person/topic arcs over *.bridge.json + GI/KG siblings),
Corpus Library catalog APIs, and vector index stats / rebuild controls
through a JSON API consumed by the Vue 3 SPA
(web/gi-kg-viewer/).
Corpus lifecycle: routes read current files under the configured corpus root. Re-running extraction, the bridge builder, or corpus topic clustering can change canonical ids and cluster compounds; responses are not pinned to an older ingest. See Operational note (canonical identity).
Repo layout: Python at the root, Node UI in web/gi-kg-viewer/. See
Polyglot repository guide for env files and Makefile targets (make serve,
make test-ui, etc.).
The server is described in
GI/KG viewer v2 and is implemented in
src/podcast_scraper/server/.
Route groups are additive: viewer routes ship by default (artifacts,
search, explore, index stats/rebuild, corpus library, corpus digest, CIL, …).
RFC-077 opt-in routes (feeds, operator-config, pipeline jobs)
mount separately via create_app flags (see RFC-077
and ADR-064 addendum).
Future megasketch platform routes (#50, #347) remain behind enable_platform when implemented.
Quick start¶
1. Install the [dev] extra¶
pip install -e '.[dev]'
This adds fastapi and uvicorn[standard].
2. Build the frontend (once)¶
cd web/gi-kg-viewer && npm install && npm run build
The build produces web/gi-kg-viewer/dist/.
When that directory exists, the server mounts it at / automatically.
3. Start the server¶
podcast serve --output-dir /path/to/corpus/output
4. Open the browser¶
Navigate to http://127.0.0.1:8000.
Set Corpus root folder in the UI to the same --output-dir path,
then List files, select artifacts, and Load selected into graph.
Use the Library tab to browse feeds and episodes (Corpus Library).
Architecture¶
App factory¶
app.py exposes two entry points:
| Function | Purpose |
|---|---|
create_app(output_dir, *, static_dir) |
Build the FastAPI instance with viewer routes and optional static assets. |
create_app_for_uvicorn() |
Factory for uvicorn --factory (reload mode). Reads PODCAST_SERVE_OUTPUT_DIR from the environment. |
create_app stores output_dir on app.state so route handlers can
fall back to it when the caller omits the ?path= query parameter.
When output_dir is set, any ?path= value must resolve to that directory
or a subdirectory (see server/pathutil.py). Overrides are rejected if the
server has no default corpus root, so callers cannot aim the API at arbitrary
paths on disk.
Route groups¶
Routers are included with prefix="/api" and organized by domain:
routes/
health.py # viewer — always available
artifacts.py # viewer — list / load GI & KG JSON
index_stats.py # viewer — FAISS vector index metrics + staleness
index_rebuild.py # viewer — POST /index/rebuild (background job)
search.py # viewer — semantic corpus search
explore.py # viewer — GI explore + UC4 NL query
cil.py # viewer — CIL position arc, person profile, topic timeline (#527)
corpus_library.py # viewer — /corpus/* catalog + similar episodes
corpus_text_file.py # viewer — GET /corpus/text-file (transcript / caption files under corpus root)
corpus_digest.py # viewer — GET /corpus/digest
corpus_topic_clusters.py # viewer — GET /corpus/topic-clusters
feeds.py # optional — GET/PUT /feeds (feeds.spec.yaml); enable_feeds_api
operator_config.py # optional — GET/PUT /operator-config; enable_operator_config_api
jobs.py # optional — POST/GET /jobs, cancel, reconcile; enable_jobs_api
scheduled_jobs.py # optional — GET /scheduled-jobs (cron list + next-run); enable_jobs_api
platform/ # reserved stub package (#50, #347); not used for RFC-077
RFC-077 routers (feeds, operator_config, jobs) mount when their
matching enable_*_api flags are true in create_app (or the equivalent
PODCAST_SERVE_ENABLE_* env vars for uvicorn --factory reload).
The routes/platform/ package is not mounted today; it is reserved for
future megasketch platform work
(ADR-064).
Static file mounting¶
When a built SPA exists at web/gi-kg-viewer/dist/, create_app mounts
it at / via StaticFiles(directory=..., html=True).
Pass --no-static (or static_dir=False in code) to skip mounting and
run the API standalone.
CORS¶
The CORS middleware allows http://127.0.0.1:5173 and
http://localhost:5173 so the Vite dev server can call the API
during development.
API reference¶
All endpoints live under the /api prefix. With the server running, OpenAPI is at /docs (Swagger UI) and /openapi.json.
Authentication¶
Local dev server: no auth. Treat production deployments as out-of-scope for this guide unless you add your own reverse proxy or middleware.
| Method | Path | Tag | Description | Key query params |
|---|---|---|---|---|
| GET | /api/health |
health | Liveness and capability flags: status; core viewer artifacts_api, search_api, explore_api, index_routes_api, corpus_metrics_api, cil_queries_api (default true when mounted); catalog corpus_library_api, corpus_digest_api, corpus_binary_api. RFC-077 (default false): feeds_api, operator_config_api, jobs_api when those route groups are mounted. Omit digest flag on older builds → clients treat digest as unavailable. |
— |
| GET, PUT | /api/feeds |
feeds | Read/write structured feeds.spec.yaml under the resolved corpus root (JSON { "feeds": [...] } on PUT). |
path |
| GET, PUT | /api/operator-config |
operator_config | Read/write viewer-safe operator YAML at the server-resolved path. GET returns content, operator_config_path, and available_profiles (union of packaged config/profiles/*.yaml names from cwd and repo roots, same as Config preset resolution; excluding *.example.yaml). When the file is missing or whitespace-only and the packaged preset cloud_balanced exists, GET writes a minimal profile: cloud_balanced file first (idempotent if already that content). PUT rejects forbidden secret keys and top-level feed keys (rss, rss_url, rss_urls, feeds) — use /api/feeds / feeds.spec.yaml for feeds. See RFC-077 (Phase 1b) and PRD-030. |
path |
| GET, POST | /api/jobs |
jobs | List (GET) or enqueue (POST) pipeline jobs for the corpus; GET /api/jobs/{id}, POST /api/jobs/{id}/cancel, POST /api/jobs/reconcile. |
path |
| GET | /api/scheduled-jobs |
scheduled-jobs | List in-process cron schedules from viewer_operator.yaml scheduled_jobs: (#708). Each row carries name, cron, enabled, next_run_at (UTC ISO; null when disabled or invalid cron). Mounts only when enable_jobs_api is on. Operators add/remove schedules by editing the YAML via PUT /api/operator-config, which triggers a scheduler reload in-process. |
path |
| GET | /api/artifacts |
artifacts | List *.gi.json, *.kg.json, and *.bridge.json (recursive); each item includes mtime_utc (#507) and publish_date (YYYY-MM-DD from episode metadata when present, else UTC calendar day from file mtime). |
path (required) — corpus output directory |
| GET | /api/artifacts/{path} |
artifacts | Load and return a single artifact JSON by relative path. | path (required) — corpus root for the relative lookup |
| GET | /api/index/stats |
index | FAISS index stats, staleness heuristics, and rebuild job flags (rebuild_in_progress, rebuild_last_error; #507). |
path, embedding_model (optional; compare index to this id, else Config default) |
| POST | /api/index/rebuild |
index | Queue background index_corpus (202); mutex per corpus. Poll GET /api/index/stats. |
path, rebuild, embedding_model, vector_index_path, vector_faiss_index_mode, vector_index_types (comma-separated) |
| GET | /api/search |
search | Semantic corpus search via FAISS + sentence embeddings. Transcript hits may include optional lifted: overlapping Quote → Insight plus speaker / topic display names from bridge.json when present (Semantic Search Guide — lift). Insight hits may include supporting_quotes (from indexer enrichment): quote speaker_id / speaker_name mirror .gi.json — often null / absent when segments lack diarization labels (GitHub #541; canonical rules: Development Guide — GI quote speaker_id). Successful responses include optional lift_stats: transcript_hits_returned and lift_applied for the returned page (after top_k). |
q (required), path, type, feed, since, speaker, grounded_only, top_k, embedding_model, dedupe_kg_surfaces (default true: merge same-text kg_entity / kg_topic rows) |
| GET | /api/explore |
explore | GI cross-episode explore (filter mode) or UC4 natural-language query. Insight rows may include supporting_quotes with speaker_id / speaker_name mirroring .gi.json (often absent without diarization — GitHub #541; Development Guide — GI quote speaker_id). |
path, question / q, topic, speaker, grounded_only, min_confidence, sort_by, limit, strict |
| GET | /api/persons/{person_id}/positions |
cil | Position arc — chronological insights for a person and topic across episodes. Scans **/*.bridge.json with sibling GI/KG. |
topic (required), path, insight_types (comma-separated; omit → claim only; all / * → no filter) |
| GET | /api/persons/{person_id}/brief |
cil | Person profile — insights grouped by topic plus quotes for that person. | path |
| GET | /api/persons/{person_id}/topics |
cil | Distinct topic ids for that person (from brief keys). | path |
| GET | /api/topics/{topic_id}/timeline |
cil | Topic timeline — insights about the topic per episode. Each episode may include episode_title, feed_title, episode_number, and artwork fields (episode_image_url, episode_image_local_relpath, feed_image_url, feed_image_local_relpath) from sibling *.metadata.json. |
path, insight_types (omit → all types; all / * → all) |
| POST | /api/topics/timeline |
cil | Merged topic timeline — same Pattern C rules as GET, but one corpus scan for multiple topic_ids in the JSON body (cluster scope). Response includes topic_ids (deduped, canonical order) and episodes (merged per episode_id, insight nodes deduped). |
JSON body: topic_ids (required), path (optional if default output dir), insight_types (optional; same semantics as GET) |
| GET | /api/topics/{topic_id}/persons |
cil | Distinct person: ids that discuss the topic via grounded quotes. |
path |
| POST | /api/corpus/resolve-episode-artifacts |
corpus | Map logical episode_id values (from metadata) to corpus-relative gi_relative_path / kg_relative_path / bridge_relative_path from one catalog scan. Rows without GI on disk are listed in missing_episode_ids. The GI/KG viewer’s topic-cluster sibling auto-load builds candidate ids from topic_clusters.json → clusters[].members[].episode_ids (union per touched cluster). |
JSON body: episode_ids (required), path (optional if server default set) |
| POST | /api/corpus/node-episodes |
corpus | Progressive graph expansion (cross-episode): for a canonical node_id (person:, org:, or topic:), scan *.bridge.json bundles (no full triple-read), then resolve sibling *.gi.json / *.kg.json by stem. Returns episodes (metadata-relative paths when present, optional episode_id, GI/KG flags) and truncated when the cap is hit. The GI/KG viewer uses this on eligible nodes (double-tap expand). |
JSON body: node_id (required), path (optional if server default set), limit (optional cap) |
| GET | /api/corpus/feeds |
corpus | Aggregate feeds from episode metadata under the corpus root. | path (optional if server default set) |
| GET | /api/corpus/episodes |
corpus | Paginated episode list (newest-first scan); optional filters. Each item includes cil_digest_topics as an empty array (CIL pills are returned on digest rows and episode detail only, to avoid per-row bridge reads in the list). GI/KG path fields support graph loads after opening an episode. When topic_cluster_only=true, the server keeps rows whose bridge.json topic appears on a multi-member topic-cluster member that lists this episode's episode_id in members[].episode_ids (reads search/topic_clusters.json per request; pipeline-built clusters include those ids). |
path, feed_id, q (title substring), topic_q, since (YYYY-MM-DD), topic_cluster_only (boolean), limit (1–200), cursor |
| GET | /api/corpus/episodes/detail |
corpus | Episode row + summary bullets + GI/KG/bridge paths + cil_digest_topics (same shape as list / digest). |
path, metadata_relpath (required) |
| GET | /api/corpus/episodes/similar |
corpus | FAISS semantic peers for an episode; 200 with error when index missing. |
path, metadata_relpath (required), top_k (1–25) |
| GET | /api/corpus/digest |
corpus | Feed-diverse recent episodes (metadata + GI/KG flags), optional cil_digest_topics per row (bridge + topic-cluster order; omitted when compact=true), and optional semantic topic bands. compact=true forces 24h, smaller cap, no topic bands and no CIL pill enrichment. |
path, window (24h / 7d / 1mo / since), since (required if window=since; 1mo = previous calendar month, UTC), compact, include_topics, max_rows |
| GET | /api/corpus/topic-clusters |
corpus | Topic clustering artifact: returns search/topic_clusters.json when present (404 with available: false when missing). |
path (optional if server default set) |
| GET | /api/corpus/text-file |
corpus | Inline file under the corpus root for browser viewing (.txt, .md, .vtt, .srt, .json). Graph Quote node detail uses this for the in-app transcript viewer and the Open raw transcript in new tab header link. Pipeline runs that direct-download WebVTT/SubRip normalize to transcripts/… .txt (plus *.segments.json for GI timing); metadata usually points at that .txt. If relpath ends with .txt (and is not already *.cleaned.txt) and that file is missing, the server tries the sibling stem.cleaned.txt (metadata often still references the raw Whisper path). |
path (optional if server default set), relpath (required) |
| GET | /api/corpus/stats |
corpus | Publish-month histogram (YYYY-MM → episode count) from one catalog scan; GI/KG Dashboard. |
path |
| GET | /api/corpus/documents/manifest |
corpus | Return corpus_manifest.json at corpus root (404 if missing). |
path |
| GET | /api/corpus/documents/run-summary |
corpus | Return corpus_run_summary.json at corpus root (404 if missing). |
path |
| GET | /api/corpus/runs/summary |
corpus | Discover run.json under the tree (capped), compact metrics per file for Dashboard. |
path |
| GET | /api/corpus/coverage |
corpus | GI/KG sibling-file presence per episode; aggregates by publish month and feed (Dashboard). | path |
| GET | /api/corpus/persons/top |
corpus | Top speakers by grounded insight count (scans *.gi.json under catalog). |
path, limit |
Design and response field semantics: Corpus Digest. Topic strings: repo config/digest_topics.yaml.
Response models¶
Pydantic response schemas are defined in
schemas.py:
HealthResponseArtifactListResponse/ArtifactItemIndexStatsEnvelope/IndexStatsBody/IndexRebuildAcceptedCorpusSearchApiResponse/SearchHitModel(optionalliftedon transcript rows when lift applies; optionalsupporting_quoteson insight rows — quote speaker fields follow GI segment/diarization rules, issue #541)ExploreApiResponse(insightsupporting_quotesspeaker fields follow GI segment/diarization rules, issue #541)CilArcEpisodeBlock/CilPositionArcResponse/CilPersonProfileInsightRow/CilPersonProfileQuoteRow/CilPersonProfileResponse/CilTopicTimelineResponse/CilTopicTimelineMergeRequest/CilTopicTimelineMergedResponse/CilIdListResponseCorpusResolveEpisodesRequest/CorpusResolveEpisodesResponse/CorpusResolvedEpisodeArtifactCorpusFeedsResponse/CorpusFeedItemCorpusEpisodesResponse/CorpusEpisodeListItemCorpusEpisodeDetailResponseCorpusSimilarEpisodesResponse/CorpusSimilarEpisodeItemCorpusDigestResponse/CorpusDigestRow/CorpusDigestTopicBand/CorpusDigestTopicHit(digest rows:summary_bullet_graph_topic_idsparallel tosummary_bullets_preview; topic bands:graph_topic_id—topic:{slug}from the band label for Graph focus)CorpusStatsResponse/CorpusRunsSummaryResponse/CorpusRunSummaryItem
Route conventions¶
- One file per domain —
routes/<domain>.py(e.g.,routes/health.py,routes/search.py). - Each file exports a
router— anAPIRouterinstance with atagslist for OpenAPI grouping. - No prefix on the router itself — the
/apiprefix is applied inapp.pyviaapp.include_router(router, prefix="/api"). - Corpus resolution helper — routes that accept an optional
?path=parameter resolve it againstrequest.app.state.output_diras a fallback. The private_resolve_corpus_root(path, fallback)pattern is repeated inindex_stats.py,index_rebuild.py,search.py,explore.py,cil.py,corpus_library.py,corpus_digest.py, andcorpus_metrics.py. - RFC-077 routes live as top-level
routes/*.pyfiles (same router pattern).routes/platform/is reserved for future #50/#347 surfaces, not feeds/jobs v1.
Adding new routes¶
Step-by-step¶
- Create the route file —
src/podcast_scraper/server/routes/<domain>.py:
from fastapi import APIRouter
router = APIRouter(tags=["<domain>"])
@router.get("/<domain>")
async def my_endpoint() -> dict:
return {"hello": "world"}
- Add Pydantic response model(s) to
schemas.py:
class MyResponse(BaseModel):
hello: str
- Include the router in
app.py:
from podcast_scraper.server.routes import my_domain
app.include_router(my_domain.router, prefix="/api")
- Write unit tests in
tests/unit/podcast_scraper/server/test_viewer_<domain>.pyusingFastAPI'sTestClient:
pytest.importorskip("fastapi")
from fastapi.testclient import TestClient
from podcast_scraper.server.app import create_app
def test_my_endpoint(tmp_path):
app = create_app(tmp_path, static_dir=False)
client = TestClient(app)
resp = client.get("/api/<domain>")
assert resp.status_code == 200
- Update this guide — add the new endpoint to the API reference table.
Configuration¶
CLI flags¶
| Flag | Default | Description |
|---|---|---|
--output-dir |
(required) | Corpus output directory containing metadata/*.gi.json, metadata/*.kg.json, and search/. |
--host |
127.0.0.1 |
Bind address. |
--port |
8000 |
TCP port. |
--reload |
off | Dev only: restart workers when Python files change (uses uvicorn --factory). |
--no-static |
off | Do not mount built SPA assets even if web/gi-kg-viewer/dist/ exists. |
Environment variable¶
| Variable | Purpose |
|---|---|
PODCAST_SERVE_OUTPUT_DIR |
Set automatically by run_serve(). Used by create_app_for_uvicorn() in reload mode. Can also be set manually when running uvicorn directly. |
PODCAST_SCHEDULER_TZ |
Timezone for the in-process feed-sweep scheduler (#708). Defaults to TZ env, then UTC. |
CLI entry point¶
podcast serve --output-dir /path/to/output [--host 0.0.0.0] [--port 9000] [--reload]
The serve sub-command is handled by
cli_handlers.py (parse_serve_argv + run_serve).
Scheduled feed sweeps (#708)¶
Optional in-process cron scheduler that fires the same pipeline-job path
as POST /api/jobs on the operator's chosen schedule. Built on
APScheduler (3.x; in the
[dev] extra). Mounts whenever enable_jobs_api=True and the operator
YAML contains scheduled_jobs:.
Why API-level (not host-side cron)¶
Works on Codespace pre-prod (no systemd) and VPS prod alike, persists with the corpus (no host state to migrate on redeploy), and routes failures through the existing job-state webhook surface so Slack and Grafana already see the events. See GH #708.
Schedule definition¶
Add a top-level scheduled_jobs: list to viewer_operator.yaml (the
packaged example ships a commented hint):
scheduled_jobs:
- name: morning-feed-sweep
cron: "0 4 * * *" # standard 5-field cron (m h dom mon dow)
enabled: true
- name: evening-sweep
cron: "0 20 * * *"
enabled: false # loaded but not fired
Each schedule reuses the corpus's standing feeds.spec.yaml + this same
operator YAML — there is no per-schedule profile / feeds / max_episodes
override in V1 (use multiple schedules if you need different cadences for
different feed sets, V2 will revisit per-schedule overrides).
name is used as the job id, the Prometheus label, and shows up in logs
and any Slack alerts; keep it short and stable. Allowed characters:
letters, digits, -, _. Names must be unique within a corpus.
Reload behavior¶
- App startup (FastAPI lifespan): scheduler starts if at least one enabled job exists.
PUT /api/operator-config: scheduler reloads the YAML and rebuilds triggers in-process. No restart needed — operators can add/remove schedules from the viewer Configuration tab and they take effect on Save.- App shutdown: scheduler stops cleanly via the same lifespan hook.
Misfire grace is 1 hour: a schedule whose trigger time was missed (host suspended / rebooting) will fire on wakeup if within 1 h of the nominal time, then skip silently if not.
Inspecting state¶
GET /api/scheduled-jobs?path=<corpus> returns the parsed schedule list
with each job's next-run preview:
{
"path": "/corpora/main",
"scheduler_running": true,
"timezone": "UTC",
"jobs": [
{"name": "morning-feed-sweep", "cron": "0 4 * * *", "enabled": true,
"next_run_at": "2026-05-03T04:00:00Z"},
{"name": "evening-sweep", "cron": "0 20 * * *", "enabled": false,
"next_run_at": null}
]
}
Failure handling¶
A scheduled fire is indistinguishable from a manual POST /api/jobs once
it lands — the job appears in the same JSONL registry, Slack/HA webhooks
fire on terminal state through emit_job_state_change, and the row is
visible at GET /api/jobs. Two scheduler-specific Prometheus counters
flank this:
| Counter | Labels | Increment trigger |
|---|---|---|
podcast_scheduled_jobs_triggered_total |
name |
Cron fired (job_id may or may not have been issued yet) |
podcast_scheduled_jobs_failed_total |
name, reason |
Spawn raised before a job was registered (e.g. invalid cron, event-loop unavailable) |
Both are no-ops when prometheus_client is unavailable (i.e. without
the [dev] extra).
Limitations (V1)¶
- Per-schedule overrides (profile / feeds / max_episodes) are not wired — use multiple schedules with different operator YAMLs if you need them.
- No calendar-aware schedules (
every 3rd Tuesday); standard cron only. - No Scheduled tab in the viewer — operators edit
scheduled_jobs:via the existing Configuration YAML editor. TheGET /api/scheduled-jobsendpoint exists so a future viewer tab can list schedules + next-run previews; that UI is a follow-up of #708.
Development workflow¶
Three Makefile targets¶
| Target | What it does | When to use |
|---|---|---|
make serve |
Runs serve-api and serve-ui in parallel (make -j2). |
Day-to-day full-stack development. |
make serve-api |
Starts the FastAPI server on port 8000. | Backend-only work, or when running Vite separately. |
make serve-ui |
Runs npm run dev inside web/gi-kg-viewer/ (Vite on port 5173). |
Frontend-only work, or when running the API separately. |
Override the corpus directory:
make serve SERVE_OUTPUT_DIR=/path/to/output
Vite proxy¶
During development, Vite proxies /api/* requests to
http://127.0.0.1:8000.
Open http://127.0.0.1:5173 in the browser (not port 8000) to get
hot module replacement for Vue components while the API runs separately.
Hot reload (Python)¶
Pass --reload to restart uvicorn workers when Python files change:
podcast serve --output-dir ./output --reload
In reload mode, uvicorn uses create_app_for_uvicorn as a factory and
reads PODCAST_SERVE_OUTPUT_DIR from the environment.
Testing¶
Unit tests¶
Located in tests/unit/podcast_scraper/server/:
| File | Coverage |
|---|---|
test_viewer_api_m1.py |
Health endpoint, artifact listing, artifact loading, path traversal rejection. |
test_viewer_index_stats.py |
Index stats with no corpus, no index, and mocked FAISS store. |
test_viewer_search.py |
Search with no corpus, mocked run_corpus_search results. |
test_viewer_explore.py |
Explore filter mode and UC4 natural-language mode with mocked GI functions. |
test_cil_queries.py |
CIL query helpers over synthetic bridge + GI/KG bundles (#527). |
test_corpus_catalog.py |
Catalog rows, filters, and pagination helpers for Corpus Library. |
test_index_rebuild_gate.py |
Per-corpus rebuild mutex (CorpusRebuildGate). |
test_index_staleness.py |
Index staleness helpers used by index stats. |
Lift and offsets (search): tests/unit/podcast_scraper/search/test_transcript_chunk_lift.py
exercises chunk-to-Insight lift (#528); test_gil_chunk_offset_verify.py covers Quote vs
indexed chunk alignment helpers used by verify-gil-chunk-offsets.
All test files guard on pytest.importorskip("fastapi") so they stay
runnable in intentionally stripped environments (the default test-unit
job installs .[dev], which includes FastAPI).
Tests use FastAPI's synchronous TestClient and tmp_path fixtures.
Run them with (requires [dev] so FastAPI is present):
pip install -e '.[dev]' # if needed
make test-unit -k server
Integration tests¶
tests/integration/server/test_server_api.py exercises the wired app with real filesystem
artifacts (no mocking of route internals). Tests cover health, artifact listing/loading,
path-traversal blocking, index stats, search (no-index graceful error), explore (filter +
NL modes), and app factory edge cases. Marked @pytest.mark.integration.
Additional integration modules under tests/integration/server/ include
test_viewer_corpus_library.py (Corpus Library routes),
test_viewer_corpus_digest.py (GET /api/corpus/digest),
test_cil_api.py (CIL /api/persons/* and /api/topics/* routes),
test_index_rebuild.py (POST /api/index/rebuild gate and acceptance),
test_viewer_index_stats.py, test_viewer_api.py, test_server_app.py, and
test_server_package_init.py.
pytest tests/integration/server/test_server_api.py -v
Frontend unit tests (Vitest)¶
Pure TypeScript utility logic (parsing, merge, metrics, formatting, colors, visual groups, search-focus mapping) is covered by Vitest unit tests co-located with source:
make test-ui # ~150 ms, no browser
Tests: web/gi-kg-viewer/src/utils/*.test.ts. Config: vite.config.ts test block.
Browser E2E (Playwright)¶
ADR-066 chose
Playwright for browser-level E2E testing.
Tests live in web/gi-kg-viewer/e2e/ and run against a Vite dev server
on port 5174 (separate from the dev port 5173).
make test-ui-e2e
This target installs npm dependencies, installs the Firefox browser, and
runs npm run test:e2e inside web/gi-kg-viewer/.
CI vs serve: Playwright in CI runs the SPA on Vite with mocked or spec-level /api/*
handling — fast and deterministic. It is not a substitute for smoke-testing cli serve
(or make serve), where FastAPI serves built dist/ and real viewer APIs against a
corpus. See Testing Guide — Browser E2E.
See the E2E Testing Guide and Testing Guide for more detail.
Platform evolution¶
Mounted by default: viewer routers (health, artifacts, index_stats,
index_rebuild, search, explore, cil, corpus_library, and related /api/corpus/*
routes) as wired in create_app.
RFC-077 (opt-in): routes/feeds.py, routes/operator_config.py, and routes/jobs.py
mount when enable_feeds_api, enable_operator_config_api, and enable_jobs_api are
true (CLI flags on podcast serve or PODCAST_SERVE_ENABLE_* for reload). These are
not the historical routes/platform/* stubs — see RFC-077
and ADR-064 addendum.
Job runner operations note: the in-process job registry + spawn path targets
single uvicorn worker local use. Multiple workers or hosts do not share job state;
use one worker for predictable POST /api/jobs behavior unless a future design adds
a broker. Cancel is POSIX-first (SIGTERM to the child); Windows semantics differ.
Still reserved: routes/platform/ and enable_platform for future megasketch
catalog / DB-backed APIs (#50, #347) — not the RSS-list or RFC-077 job runner.
Batch pipeline JSONL and Grafana Loki (#746)¶
Long CLI / docker batch runs can stream structured events through
workflow/jsonl_emitter.py (run_started, episode_finished, run_finished)
to run.jsonl when jsonl_metrics_enabled is true. For Grafana Loki, each
line should be ingested as one JSON log line. The optional flag
jsonl_metrics_echo_stdout duplicates every JSONL line to stdout so the
existing Grafana Agent docker log pipeline (compose/grafana-agent.yaml)
ships them with stable labels (env, release, app=podcast_scraper,
compose service surfaced as service_name in Grafana Cloud). Use LogQL
| json | event_type="run_finished" in Explore or the imported dashboard
config/grafana/grafana-dashboard-pipeline-execution.json (also linked from
config/manual/). Prometheus /metrics on the API container remains
separate from this JSONL path.
Related docs¶
| Document | Description |
|---|---|
| GI/KG viewer v2 | Viewer design (milestones, success criteria). |
| Corpus Library | Catalog API and viewer integration. |
| Corpus Digest | Digest API & viewer (GET /api/corpus/digest, Digest tab, Library glance). |
| ADR-064 | Canonical server layer with feature-flagged route groups. |
| RFC-077 | Opt-in feeds file, operator YAML, and HTTP pipeline jobs on serve. |
| ADR-065 | Vue 3 + Vite + Cytoscape.js frontend stack decision. |
| ADR-066 | Playwright for UI E2E testing. |
| Testing Guide | Commands, markers, and browser E2E section. |
| E2E Testing Guide | Playwright browser E2E details. |
| Development Guide | Dev environment, make serve / serve-api / serve-ui. |
| Configuration API | Env vs YAML, twelve-factor alignment (config). |
| Viewer README | Frontend SPA setup, dev workflow, build instructions. |
src/podcast_scraper/server/ |
Server source code. |
Version: 1.3 Created: 2026-04-04 Updated: 2026-05-05 — Batch pipeline JSONL / Loki (#746)