
ADR-086 — LLM-vision-as-provider for AI-derived masks

Status · Accepted
Date · 2026-05-08
TA anchor · /components/masking · /contracts/mcp-tools · /constraints/byoa
Related RFC · RFC-026 (LLM-vision-as-provider for AI-derived masks)

Context

ADR-076 retired the PNG-mask protocol after discovering that darktable-cli does not consume raster mask bytes. ADR-084 / RFC-029 closed the build-by-words spatial mask workflow. ADR-085 / RFC-024 closed the parametric range-filter refinement. The remaining content-derived masking gap — "lift the iguana's face," "deepen the sky," "trace the manta" — needs vision over the actual photo content. The naive shape (deploy a SAM-class segmentation model) requires every photographer to install something.

Modern chat clients (Claude.ai, ChatGPT, Claude Code) ship vision-capable LLMs in the conversation surface. When the photographer attaches a photo or chemigram surfaces a render via render_preview, the LLM sees it and can identify subjects, estimate bounding boxes, trace coarse polygons, and suggest color/luminance ranges. The full deliberation lives in RFC-026; this ADR captures the closing decision.

Decision

The LLM in the photographer's chat client is the AI mask provider for v1.9.0. No engine code changes; no new MCP tools. The integration is documentation + workflow patterns, leveraging:

  • render_preview(image_id) to surface a current-state JPEG (already shipped).
  • The chat client's image rendering (Claude Code via Read; Claude Desktop / ChatGPT inline), which surfaces the image to the LLM's vision.
  • The mask_spec wire (RFC-029 / ADR-084 + RFC-024 / ADR-085) — accepts dt_form (rectangle, ellipse, path) and range_filter (luminance, color_h/s/l) constructed from the LLM's spatial reasoning.

The closing artifact is docs/guides/llm-vision-for-masks.md — a pattern library showing photographer phrase → LLM reasoning → resulting mask_spec, covering subject region, sky/foreground split, color-range estimation, and polygon trace.

Precision-tier use cases (pixel-perfect silhouettes, dense spot enumeration, depth maps) where LLM-vision quality is insufficient route to RFC-030 (deferred), which holds the deployed-sibling-provider scaffolding.

Rationale

The wire is already there. RFC-029 / ADR-084 ships dt_form for spatial shapes including arbitrary N-vertex polygons. RFC-024 / ADR-085 ships range_filter for content-derived pixel selection (luminance bands, HSL hue/sat/lightness ranges). Anything the LLM might construct from looking at a photo — a bounding box, a coarse silhouette, a hue range — has a direct mapping into these schemas.
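To make the mapping concrete, here is a sketch of what those constructs might look like on the wire. The field names (kind, form, vertices, channel, min, max) are illustrative assumptions only; the authoritative schemas are defined by RFC-029 (dt_form) and RFC-024 (range_filter).

```python
# Hypothetical mask_spec payloads illustrating the LLM-to-wire mapping.
# Field names are assumptions for illustration, not the real schemas.

# An LLM bounding-box estimate ("the iguana's face, roughly here")
# maps onto a dt_form rectangle in normalized image coordinates.
subject_box = {
    "kind": "dt_form",
    "form": "rectangle",
    "x": 0.42, "y": 0.18, "width": 0.21, "height": 0.24,
}

# A coarse silhouette trace maps onto an N-vertex path.
silhouette = {
    "kind": "dt_form",
    "form": "path",
    "vertices": [(0.40, 0.15), (0.55, 0.12), (0.63, 0.30),
                 (0.58, 0.44), (0.44, 0.41), (0.38, 0.28)],
}

# A hue-range estimate ("the sky's blues") maps onto a color_h filter.
sky_hues = {
    "kind": "range_filter",
    "channel": "color_h",
    "min": 0.55, "max": 0.68,
}
```

The point is structural: every output an LLM can produce from looking at a photo is either a handful of coordinates or a numeric range, and both already have a slot in the shipped wire.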

The deployed-provider arc (separate sibling project, model weights, MCP scaffolding) is materially heavier and forces installation friction for every photographer who wants content-derived masks. LLM-as-provider unlocks the workflow at zero deployment cost. Quality is bounded — LLMs produce coarse polygons (~8-15 vertices, not pixel-precise) and bounding-box-quality region estimates — but covers ~70% of real workflows: subject regions, sky/foreground splits, color-range estimation, iterative refinement.

The 30% that LLM-vision cannot serve (single-strand hair, hundreds of small spots, per-pixel depth) earns the precision-tier RFC-030 work when photographer evidence demands it. Splitting LLM-vision (now) from deployed-providers (when needed) lets each ship on its own evidence.

Alternatives considered

  • Skip LLM-as-provider; ship deployed-provider only (the original RFC-026 v0.1 shape). Rejected — forces every photographer to install a sibling project just to get coarse subject masks. LLM-vision covers ~70% of use cases at zero cost.
  • Skip the deployed path entirely; LLM-vision forever. Rejected — precision-tier workflows (portrait retouching, wildlife pixel-precision) need it eventually. RFC-030 keeps the path open.
  • Bundle SAM into chemigram core. Rejected — violates ADR-007.
  • Lightweight Anthropic-API / GPT-4-Vision wrapper as a default sibling. Considered. Adds API key / billing / rate-limit UX surface that v1.9.0 doesn't need. Reconsider in RFC-030.
  • Single combined RFC-026 covering both LLM-vision and deployed. Considered, drafted briefly, rejected. Confused the priority signal — shipping LLM-vision MVP is a different decision from designing deployed-provider scaffolding. Splitting (RFC-026 LLM-vision + RFC-030 deployed) lets each close on its own evidence + timeline.

Consequences

Positive:

  • Content-derived masks ship for every photographer with a vision-capable chat client. Zero deployment cost, zero infrastructure, zero new core dependencies.
  • MCP surface stays narrow. No new tools; ADR-033 preserved.
  • BYOA principle taken further. The "provider" is the chat client the photographer already chose; chemigram never specifies a model.
  • Composes with the existing wire. LLM-vision constructs flow through the same mask_spec apply path as drawn masks (RFC-029) and parametric filters (RFC-024). No fork in the architecture.
  • Workflow patterns are documentable. A pattern library guide gives the LLM stable grounding so different sessions produce consistent mask designs.

Negative:

  • Precision is bounded. Coarse bounding boxes and ~8-15-vertex polygons; not pixel-precise. Single-strand hair, complex silhouettes, fine edges all degrade. Mitigated by RFC-030 as the upgrade path.
  • Workflow is conversation-shaped. Pure-CLI photographers without an LLM in context can't use this RFC's path. Drawn masks (RFC-029) cover their core needs; RFC-030's deployed providers eventually serve them too.
  • No deterministic reproducibility. Different LLM responses to the same photo may yield slightly different mask coordinates. Mitigated by chemigram's snapshot history capturing the actual coordinates used.
  • Quality varies by chat-client LLM. Better LLMs → better masks. Same architectural shape; the photographer's choice of chat client determines vision quality.
  • Image surfacing depends on chat-client capability. Claude Code's Read works today; other MCP clients may need MCP image content blocks. Mitigated by drag-drop / paste workflows and a future enhancement for inline-image MCP support.

Implementation notes

No engine code changes. The closing artifact is docs/guides/llm-vision-for-masks.md — pattern library covering:

  • Subject region (rectangle / ellipse / path from LLM bounding-box-or-polygon estimate).
  • Sky / foreground split (gradient + color_h refinement).
  • Color-range estimation (LLM examines image, suggests range_filter.kind=color_h, min, max).
  • Polygon trace (path form from coarse vertex estimate).
  • Iterative refinement (render preview, LLM re-examines, adjust mask, re-apply — the conversation IS the refinement loop).

Each pattern shows: photographer's phrase → LLM's expected reasoning → resulting mask_spec JSON.
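A sketch of what one such pattern entry might contain, using the sky/foreground split as the example. The mask_spec field names (form, refine, channel) are illustrative assumptions; the real guide would use the schemas from RFC-029 / RFC-024.

```python
# Hypothetical pattern-library entry for "deepen the sky".
# All mask_spec field names below are illustrative assumptions.
pattern = {
    "phrase": "deepen the sky",
    "reasoning": (
        "Sky occupies roughly the top 40% of the frame; a gradient "
        "covers it spatially, refined by a hue band around the sky's "
        "blues so foreground elements poking into the gradient are "
        "excluded."
    ),
    "mask_spec": {
        "kind": "dt_form",
        "form": "gradient",            # coarse spatial split
        "rotation": 0.0,
        "refine": {                    # content-derived refinement
            "kind": "range_filter",
            "channel": "color_h",
            "min": 0.55,
            "max": 0.70,
        },
    },
}
```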

Image surfacing today

render_preview(image_id) returns a JPEG path. Claude Code clients use Read to surface the image to the conversation. Claude Desktop / ChatGPT users can drag-drop or paste the rendered JPEG. Future enhancement: a sibling tool returning base64-encoded JPEG content as an MCP image content block for clients that support it. Not blocking ADR-086 — the workflow works today via the path-based mechanism.
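The future enhancement could be sketched roughly as below. The tool name and its plumbing are hypothetical; only the content block shape (type / data / mimeType) follows the MCP specification's image content block.

```python
import base64
from pathlib import Path

def render_preview_inline(jpeg_path: str) -> dict:
    """Hypothetical sibling to render_preview: returns the rendered
    JPEG as an MCP image content block instead of a file path, for
    chat clients that support inline image content."""
    data = base64.b64encode(Path(jpeg_path).read_bytes()).decode("ascii")
    return {
        "type": "image",          # MCP image content block
        "data": data,             # base64-encoded JPEG bytes
        "mimeType": "image/jpeg",
    }
```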

What this ADR explicitly does NOT settle

  • Deployed-sibling-provider protocol design (RFC-030).
  • AI content-aware spot detection (RFC-025's deferred AI sub-path; routes to RFC-030).
  • Depth-mask encoding (RFC-030).
  • Cloud-API wrapper as a default provider (RFC-030 open question).

When LLM-vision becomes the bottleneck on real workflows, RFC-030 unfreezes; until then, this ADR is the v1.9.0 ship.