ADR-086 — LLM-vision-as-provider for AI-derived masks¶
Status · Accepted
Date · 2026-05-08
TA anchor · /components/masking · /contracts/mcp-tools · /constraints/byoa
Related RFC · RFC-026 (LLM-vision-as-provider for AI-derived masks)
Context¶
ADR-076 retired the PNG-mask protocol after discovering that darktable-cli does not consume raster mask bytes. ADR-084 / RFC-029 closed the build-by-words spatial mask workflow. ADR-085 / RFC-024 closed the parametric range-filter refinement. The remaining content-derived masking gap — "lift the iguana's face," "deepen the sky," "trace the manta" — needs vision over the actual photo content. The naive shape (deploying a SAM-class segmentation model) would require every photographer to install and run extra software.
Modern chat clients (Claude.ai, ChatGPT, Claude Code) ship vision-capable LLMs in the conversation surface. When the photographer attaches a photo or chemigram surfaces a render via render_preview, the LLM sees it and can identify subjects, estimate bounding boxes, trace coarse polygons, and suggest color/luminance ranges. The full deliberation lives in RFC-026; this ADR captures the closing decision.
Decision¶
The LLM in the photographer's chat client is the AI mask provider for v1.9.0. No engine code changes; no new MCP tools. The integration is documentation + workflow patterns, leveraging:
- render_preview(image_id) to surface a current-state JPEG (already shipped).
- The chat client's image rendering (Claude Code via Read; Claude Desktop / ChatGPT inline) — surfaces the image to the LLM's vision.
- The mask_spec wire (RFC-029 / ADR-084 + RFC-024 / ADR-085) — accepts dt_form (rectangle, ellipse, path) and range_filter (luminance, color_h/s/l) constructed from the LLM's spatial reasoning.
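To make the wire concrete, here is a sketch of what an LLM-constructed spatial mask might look like, assuming the dt_form shape names quoted above (rectangle, ellipse, path); the normative schema lives in RFC-029, and the helper name and coordinate convention here are illustrative:

```python
# Illustrative sketch: an LLM's coarse polygon trace expressed as a
# mask_spec-style dict. Field names follow the dt_form shapes named in
# this ADR; the exact wire schema is defined in RFC-029 and may differ.

def polygon_mask_spec(vertices):
    """Build a path-form mask spec from an LLM's coarse vertex estimate.

    Coordinates are assumed normalized to [0, 1] relative to the image.
    """
    if len(vertices) < 3:
        raise ValueError("a path needs at least 3 vertices")
    return {
        "dt_form": {
            "kind": "path",
            "vertices": [{"x": x, "y": y} for x, y in vertices],
        }
    }

# An LLM tracing "the iguana's face" might hand back ~8-15 vertices:
spec = polygon_mask_spec([(0.42, 0.18), (0.55, 0.20), (0.58, 0.33),
                          (0.50, 0.41), (0.40, 0.36), (0.38, 0.25)])
```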
The closing artifact is docs/guides/llm-vision-for-masks.md — a pattern library showing photographer phrase → LLM reasoning → resulting mask_spec, covering subject region, sky/foreground split, color-range estimation, and polygon trace.
Precision-tier use cases (pixel-perfect silhouettes, dense spot enumeration, depth maps) where LLM-vision quality is insufficient route to RFC-030 (deferred), which holds the deployed-sibling-provider scaffolding.
Rationale¶
The wire is already there. RFC-029 / ADR-084 ships dt_form for spatial shapes including arbitrary N-vertex polygons. RFC-024 / ADR-085 ships range_filter for content-derived pixel selection (luminance bands, HSL hue/sat/lightness ranges). Anything the LLM might construct from looking at a photo — a bounding box, a coarse silhouette, a hue range — has a direct mapping into these schemas.
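The hue-range mapping can be sketched the same way — a minimal example assuming the range_filter field names (kind, min, max) quoted later in this ADR; the band values are invented for illustration and the normative schema lives in RFC-024:

```python
# Illustrative sketch: an LLM eyeballs the sky and suggests a hue band.
# Field names follow the range_filter usage quoted in this ADR; the
# normative schema is defined in RFC-024.

def hue_range_spec(min_hue, max_hue):
    """Build a color-hue range_filter spec from an LLM's estimate.

    Hue values are assumed normalized to [0, 1] around the color wheel.
    """
    if not (0.0 <= min_hue <= 1.0 and 0.0 <= max_hue <= 1.0):
        raise ValueError("hue bounds must lie in [0, 1]")
    return {"range_filter": {"kind": "color_h", "min": min_hue, "max": max_hue}}

# "Deepen the sky" -> the LLM proposes a blue band:
sky = hue_range_spec(0.55, 0.70)
```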
The deployed-provider arc (separate sibling project, model weights, MCP scaffolding) is materially heavier and forces installation friction on every photographer who wants content-derived masks. LLM-as-provider unlocks the workflow at zero deployment cost. Quality is bounded — LLMs produce coarse polygons (~8-15 vertices, not pixel-precise) and bounding-box-quality region estimates — but coarse masks cover roughly 70% of real workflows: subject regions, sky/foreground splits, color-range estimation, iterative refinement.
The 30% that LLM-vision cannot serve (single-strand hair, hundreds of small spots, per-pixel depth) earns the precision-tier RFC-030 work when photographer evidence demands it. Splitting LLM-vision (now) from deployed-providers (when needed) lets each ship on its own evidence.
Alternatives considered¶
- Skip LLM-as-provider; ship deployed-provider only (the original RFC-026 v0.1 shape). Rejected — forces every photographer to install a sibling project just to get coarse subject masks. LLM-vision covers ~70% of use cases at zero cost.
- Skip the deployed path entirely; LLM-vision forever. Rejected — precision-tier workflows (portrait retouching, wildlife pixel-precision) need it eventually. RFC-030 keeps the path open.
- Bundle SAM into chemigram core. Rejected — violates ADR-007.
- Lightweight Anthropic-API / GPT-4-Vision wrapper as a default sibling. Considered. Adds API key / billing / rate-limit UX surface that v1.9.0 doesn't need. Reconsider in RFC-030.
- Single combined RFC-026 covering both LLM-vision and deployed. Considered, drafted briefly, rejected. Confused the priority signal — shipping LLM-vision MVP is a different decision from designing deployed-provider scaffolding. Splitting (RFC-026 LLM-vision + RFC-030 deployed) lets each close on its own evidence + timeline.
Consequences¶
Positive:
- Content-derived masks ship for every photographer with a vision-capable chat client. Zero deployment cost, zero infrastructure, zero new core dependencies.
- MCP surface stays narrow. No new tools; ADR-033 preserved.
- BYOA principle taken further. The "provider" is the chat client the photographer already chose; chemigram never specifies a model.
- Composes with the existing wire. LLM-vision constructs flow through the same mask_spec apply path as drawn masks (RFC-029) and parametric filters (RFC-024). No fork in the architecture.
- Workflow patterns are documentable. A pattern library guide gives the LLM stable grounding so different sessions produce consistent mask designs.
Negative:
- Precision is bounded. Coarse bounding boxes and ~12-vertex polygons; not pixel-precise. Single-strand hair, complex silhouettes, fine edges all degrade. Mitigated by RFC-030 as the upgrade path.
- Workflow is conversation-shaped. Pure-CLI photographers without an LLM in context can't use this RFC's path. Drawn masks (RFC-029) cover their core needs; RFC-030's deployed providers eventually serve them too.
- No deterministic reproducibility. Different LLM responses to the same photo may yield slightly different mask coordinates. Mitigated by chemigram's snapshot history capturing the actual coordinates used.
- Quality varies by chat-client LLM. Better LLMs → better masks. Same architectural shape; the photographer's choice of chat client determines vision quality.
- Image surfacing depends on chat-client capability. Claude Code's Read works today; other MCP clients may need MCP image content blocks. Mitigated by drag-drop / paste workflows and a future enhancement for inline-image MCP support.
Implementation notes¶
No engine code changes. The closing artifact is docs/guides/llm-vision-for-masks.md — pattern library covering:
- Subject region (rectangle / ellipse / path from LLM bounding-box-or-polygon estimate).
- Sky / foreground split (gradient + color_h refinement).
- Color-range estimation (LLM examines image, suggests
range_filter.kind=color_h, min, max). - Polygon trace (path form from coarse vertex estimate).
- Iterative refinement (render preview, LLM re-examines, adjust mask, re-apply — the conversation IS the refinement loop).
Each pattern shows: photographer's phrase → LLM's expected reasoning → resulting mask_spec JSON.
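A sketch of the triple's shape — the phrase, reasoning text, and band values here are invented for illustration; the shipped guide carries the vetted patterns:

```python
# Illustrative pattern entry: photographer phrase -> expected LLM
# reasoning -> resulting mask_spec. All values invented for illustration.
pattern = {
    "phrase": "deepen the sky",
    "reasoning": "sky occupies the upper region and reads as a blue hue band",
    "mask_spec": {
        "range_filter": {"kind": "color_h", "min": 0.55, "max": 0.70},
    },
}
```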
Image surfacing today¶
render_preview(image_id) returns a JPEG path. Claude Code clients use Read to surface the image to the conversation. Claude Desktop / ChatGPT users can drag-drop or paste the rendered JPEG. Future enhancement: a sibling tool returning base64-encoded JPEG content as an MCP image content block for clients that support it. Not blocking ADR-086 — the workflow works today via the path-based mechanism.
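The future enhancement could look roughly like this — a minimal sketch only; the helper name is hypothetical and the content-block fields are assumptions about MCP image blocks, not a shipped chemigram API:

```python
import base64

# Hypothetical sketch of the future sibling tool: read the JPEG that
# render_preview already produces and wrap it as an MCP-style image
# content block. The function name and block fields are assumptions,
# not a shipped chemigram API.

def render_preview_image_block(jpeg_path):
    with open(jpeg_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image", "mimeType": "image/jpeg", "data": data}
```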
What this ADR explicitly does NOT settle¶
- Deployed-sibling-provider protocol design (RFC-030).
- AI content-aware spot detection (RFC-025's deferred AI sub-path; routes to RFC-030).
- Depth-mask encoding (RFC-030).
- Cloud-API wrapper as a default provider (RFC-030 open question).
When LLM-vision becomes the bottleneck on real workflows, RFC-030 unfreezes; until then, this ADR is the v1.9.0 ship.