Using your LLM's vision to design masks¶
Companion to RFC-026 / ADR-086. The build-by-vision workflow: when the photographer says "lift the iguana's face," the LLM in the chat client looks at the photo and constructs a `mask_spec` from what it sees. Sister to `mask-shapes-from-words.md` (build-by-words, ADR-084) and `mask-applicable-controls.md` (which primitives can be applied through a mask).
The workflow shape¶
- Photographer says: "lift the iguana's face by half a stop".
- Agent calls `render_preview(image_id)` to surface a current-state JPEG of the photo.
- Chat client renders the JPEG inline (Claude Code via `Read`; Claude Desktop / ChatGPT inline; agent tooling that supports MCP image blocks via image content).
- The LLM in the conversation looks at the photo. It identifies the iguana's face, estimates a bounding region, and decides whether ellipse / rectangle / polygon is the right shape.
- LLM constructs `mask_spec` and calls `apply_primitive("exposure", 0.5, mask_spec={...})`.
- Photographer reviews the result. If it's wrong, the LLM re-examines and adjusts.
The "AI provider" here is the chat-client LLM itself. No sibling project, no model deployment, no inference server. The vision capability is already in the conversation surface the photographer chose.
When this works (~70% of content-derived masking)¶
| Workflow | LLM-vision can do this | Notes |
|---|---|---|
| Coarse subject region | ✓ | Bounding box / ellipse around the subject |
| Sky / foreground split | ✓ | Gradient at horizon + optional color_h refinement |
| Color-range estimation from the photo | ✓ | "What hue range covers the warm tones?" |
| Coarse polygon trace (8-15 vertices) | ✓ | Manta silhouette, building outline, etc. |
| Subject vs background routing | ✓ | Apply edit + apply complementary edit with invert: true |
| Iterative refinement | ✓ | LLM re-examines render, adjusts mask coordinates |
| Single-strand hair / fur edges | ✗ | Use drawn radial approximation; precision-tier RFC-030 will lift this ceiling |
| 200+ scattered small spots | ✗ | RFC-025 manual marking + future AI variant in RFC-030 |
| Per-pixel depth maps | ✗ | Needs MiDaS-class model — RFC-030 |
Pattern library¶
Each pattern shows: photographer's phrase → LLM's expected reasoning → resulting mask_spec.
Pattern 1: Subject region (rectangle or ellipse)¶
Photographer: "Lift the iguana's face."
LLM reasoning (after looking at the photo via render_preview):
"The iguana's head is in the center-right of the frame, occupying roughly the middle horizontally and the upper third vertically. An ellipse centered around (0.6, 0.35) with radii (0.18, 0.22) and a soft border (0.06) covers the head with natural feathering."
mask_spec:
{
"dt_form": "ellipse",
"dt_params": {
"center_x": 0.6, "center_y": 0.35,
"radius_x": 0.18, "radius_y": 0.22,
"border": 0.06
}
}
Apply:
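apply_primitive("exposure", 0.5, mask_spec={
    "dt_form": "ellipse",
    "dt_params": {
        "center_x": 0.6, "center_y": 0.35,
        "radius_x": 0.18, "radius_y": 0.22,
        "border": 0.06
    }
})
# 0.5 matches the requested half stop; the mask_spec is the ellipse estimated above.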
When to use rectangle instead of ellipse: subject is rectangular (a building, a sign, a structured object) or the photographer wants hard edges.
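No rectangle-form spec appears elsewhere in this guide, so here is a sketch; the `dt_params` names are assumed, not confirmed (`mask-shapes-from-words.md` holds the actual parameter conventions):

apply_primitive("exposure", 0.5, mask_spec={
    "dt_form": "rectangle",
    "dt_params": {
        # Hypothetical parameter names; see mask-shapes-from-words.md
        # for the real rectangle conventions.
        "center_x": 0.6, "center_y": 0.35,
        "width": 0.3, "height": 0.4,
        "border": 0.02
    }
})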
Pattern 2: Polygon trace for irregular subjects¶
Photographer: "Mask the manta."
LLM reasoning:
"The manta is in the lower half of the frame, diamond-shaped with wing tips extending. Tracing the rough silhouette as 10 vertices: top of head, right shoulder, right wing tip, right tail, bottom center, left tail, left wing tip, left shoulder, head center, back to top."
mask_spec:
{
"dt_form": "path",
"dt_params": {
"vertices": [
[0.5, 0.3], [0.6, 0.4], [0.85, 0.55],
[0.7, 0.7], [0.5, 0.75],
[0.3, 0.7], [0.15, 0.55],
[0.4, 0.4], [0.5, 0.32]
],
"border": 0.04
}
}
LLM polygon estimates degrade past ~15 vertices and on subjects with very irregular boundaries. For pixel-precise silhouettes (single-strand hair, fur edges), drop to RFC-030's deployed-provider arc when it ships.
Pattern 3: Sky / foreground split (gradient + color_h)¶
Photographer: "Deepen the sky's blue without affecting the foreground."
LLM reasoning:
"The horizon line sits at roughly y=0.45. Above is sky (cyan-blue, hue range ~0.55-0.7); below is foreground (warmer tones). Two ways to localize: (a) drawn gradient with anchor at horizon, light side up — affects everything in the upper region uniformly; (b) gradient + color_h range filter — affects only the blue pixels in the upper region. Option (b) is more precise."
mask_spec (drawn gradient + parametric color refinement, intersection):
{
"dt_form": "gradient",
"dt_params": {
"anchor_x": 0.5, "anchor_y": 0.45,
"rotation": 0.0, "compression": 0.5, "state": 2
},
"range_filter": {
"kind": "color_h",
"min": 0.55, "max": 0.7,
"feather": 0.05
}
}
This is the canonical compositional case: spatial mask defines the region (upper half), parametric filter refines to which pixels (cyan-blue hues only).
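A matching apply call, as a sketch. The `exposure` primitive and the -0.4 value are illustrative: darkening through the composed mask deepens the sky's blue, though a saturation-type primitive (see `mask-applicable-controls.md`) would be the more literal fit for this request.

apply_primitive("exposure", -0.4, mask_spec={
    "dt_form": "gradient",
    "dt_params": {"anchor_x": 0.5, "anchor_y": 0.45,
                  "rotation": 0.0, "compression": 0.5, "state": 2},
    "range_filter": {"kind": "color_h",
                     "min": 0.55, "max": 0.7, "feather": 0.05}
})
# Only pixels that are BOTH in the upper gradient region AND inside the
# cyan-blue hue band receive the edit (intersection semantics).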
Pattern 4: Color-range estimation (parametric only)¶
Photographer: "Mute the saturated reds in the photo."
LLM reasoning:
"The saturated reds in this photo cluster around hue 0.0-0.05 with high saturation (>0.6). I'll use a parametric
color_sfilter to target high-saturation pixels, no spatial mask needed since the photographer wants the effect everywhere reds appear."
mask_spec (parametric only):
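One plausible shape; this assumes the schema accepts a `range_filter` with no drawn form (RFC-024 defines the parametric-only convention), with values taken from the reasoning above:

{
    "range_filter": {
        "kind": "color_s",
        "min": 0.6, "max": 1.0,
        "feather": 0.05
    }
}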
For more precise hue-specific targeting, combine with a hue range:
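A plausible sketch, assuming one `range_filter` per `mask_spec` (so this hue band and the saturation band above would ride separate apply calls, or a future compose schema):

{
    "range_filter": {
        "kind": "color_h",
        "min": 0.0, "max": 0.05,
        "feather": 0.03
    }
}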
(Note: hue wraps, so reds span 0.95-1.0 and 0.0-0.05. For wrap-around ranges, multiple apply calls or a future compose schema may be needed; for now, pick the side with more red pixels.)
Pattern 5: Subject vs background routing¶
Photographer: "Brighten the iguana, dim everything else."
LLM reasoning:
"Two complementary edits. First: lift exposure on the subject region (ellipse around the iguana). Second: dim exposure on everything else — same region, inverted via the
range_filter.invertflag (or apply the complementary edit through a separatemask_specif drawn-only). Or: use just the parametric inversion via a luminance band wider than the whole image, marking the inside as the iguana's region."
Apply (two calls):
# First: brighten the subject
apply_primitive("exposure", 0.4, mask_spec={
"dt_form": "ellipse",
"dt_params": {"center_x": 0.6, "center_y": 0.35, "radius_x": 0.2, "radius_y": 0.25, "border": 0.08}
})
# Second: dim the background — same shape, but apply a NEGATIVE edit
# (the mask_spec is identical; deterministic-hash means same mask_id;
# darktable composes the two edits)
apply_primitive("exposure", -0.3, mask_spec={
"dt_form": "ellipse",
"dt_params": {"center_x": 0.6, "center_y": 0.35, "radius_x": 0.2, "radius_y": 0.25, "border": 0.08},
"range_filter": {"kind": "luminance", "min": 0.0, "max": 1.0, "feather": 0.0, "invert": True}
})
(The invert: true on a full-luminance band effectively inverts the mask; apply order + edit polarity together produce the subject-vs-background contrast.)
Pattern 6: Iterative refinement¶
Photographer: "That mask is too small — extend it to include the iguana's tail."
LLM reasoning:
"The iguana's tail extends to the lower-right, around (0.85, 0.6). Current ellipse is at (0.6, 0.35) with radii (0.18, 0.22) — too small and not covering the tail. Switching to a path with vertices that include the tail: head ellipse-like vertices plus tail vertices."
Apply (revised mask):
apply_primitive("exposure", 0.5, mask_spec={
"dt_form": "path",
"dt_params": {
"vertices": [
[0.45, 0.2], [0.75, 0.25], [0.85, 0.4],
[0.95, 0.55], [0.85, 0.7],
[0.7, 0.55], [0.45, 0.5], [0.4, 0.3]
],
"border": 0.05
}
})
Photographer sees the new render, gives feedback, LLM iterates. The conversation IS the refinement loop.
Pattern 7: Named-mask escalation (RFC-032 + LLM-vision)¶
Photographer: "Cool down the sky."
Context: chemigram ships named maskdefs (RFC-032) — mask_sky, mask_subject, mask_eye_region — that carry both a parametric fallback (luminance / hue band approximation) AND an llm_vision_prompt field. The fallback works for grand-vista landscapes with clean skies; it leaks badly on sunsets, partial clouds, and trees against bright sky. The named maskdef is a hint that LLM-vision is the right tool here.
LLM reasoning:
"The photographer asked for sky. The vocabulary ships
mask_sky— a named maskdef with both a parametric fallback and an LLM-vision prompt. Because this is a complex sky (clouds + horizon + tree silhouettes), the parametric fallback would leak onto the trees. I'll render a preview, examine the actual sky boundary, and construct a path-form mask using the maskdef's canonical prompt. The maskdef's parametric fallback is useful as a sanity check, not as the final mask."
Workflow:
- `vocab show-mask mask_sky` — read the canonical `llm_vision_prompt`.
- `render_preview(image_id)` — surface a current-state JPEG.
- Examine the photo against the prompt: "Select the sky region — including clouds, atmosphere, and visible portions of the upper atmosphere. Exclude horizon-bordering land, mountains, or trees that protrude into the sky. Feather the bottom edge slightly where land meets sky."
- Construct a path-form mask (typically 8-15 vertices tracing the horizon and any tree silhouettes that protrude into the sky).
- Apply with the constructed `mask_spec` instead of `{"kind": "named", "name": "mask_sky"}`.
mask_spec (constructed; replaces the named reference):
{
"dt_form": "path",
"dt_params": {
"vertices": [
[0.0, 0.0], [1.0, 0.0],
[1.0, 0.45], [0.78, 0.42], [0.65, 0.48],
[0.42, 0.39], [0.28, 0.45], [0.0, 0.42]
],
"feather": 0.04
}
}
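The contrast as apply calls, with a hypothetical `temperature` primitive standing in for whatever "cool down" maps to in the actual vocabulary:

# Named reference: rides the parametric fallback (fine for clean skies).
apply_primitive("temperature", -0.3, mask_spec={"kind": "named", "name": "mask_sky"})

# Vision-constructed path: replaces the named reference on this complex sky.
apply_primitive("temperature", -0.3, mask_spec={
    "dt_form": "path",
    "dt_params": {
        "vertices": [
            [0.0, 0.0], [1.0, 0.0],
            [1.0, 0.45], [0.78, 0.42], [0.65, 0.48],
            [0.42, 0.39], [0.28, 0.45], [0.0, 0.42]
        ],
        "border": 0.04
    }
})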
When to use the named reference vs. construct via vision:
| Situation | Use named reference | Construct via LLM-vision |
|---|---|---|
| Grand-vista landscape, clean sky, no foreground intrusion | ✓ — parametric fallback is fine | overkill |
| Sunset / sunrise — sky has bright orange/red zones (color_h fallback fails) | ✗ — fallback leaks | ✓ — construct path |
| Trees / mountains against bright sky | ✗ — fallback includes tree silhouettes | ✓ — trace horizon explicitly |
| Portrait — subject region (mask_subject) | ✗ — center-bias luminance fallback is very coarse | ✓ — almost always upgrade |
| Eye region (mask_eye_region) | rarely (highlight-luminance fallback misses iris detail) | ✓ — almost always upgrade |
| Skin region (mask_skin_region) | ✓ — color_h on orange band IS the canonical move | rarely needed |
| Foliage (mask_foliage_green) | ✓ — color_h band is the move | rarely needed |
| Luminosity bands (mask_luminosity_*) | ✓ — these are parametric by design | N/A — no LLM-vision prompt on these |
The maskdefs that ship llm_vision_prompt are the ones where the named reference is "good enough sometimes, but a vision-constructed mask is sharper." The maskdefs without llm_vision_prompt (luminosity bands, skin/foliage/water hue bands) are inherently parametric — no escalation path.
Composition with apply_per_region (RFC-031): named-mask references and constructed mask_specs both work in batched calls. Mix them: per-region dodge-and-burn with one region using mask_sky (parametric fallback fine for that one) and another region using a constructed path (precision needed there).
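As a sketch of that mix (the `apply_per_region` call shape below is hypothetical; RFC-031 defines the real batched schema):

apply_per_region(regions=[
    # Sky region: the named reference's parametric fallback is fine here.
    {"primitive": "exposure", "value": -0.3,
     "mask_spec": {"kind": "named", "name": "mask_sky"}},
    # Subject region: precision matters, so a vision-constructed path instead.
    {"primitive": "exposure", "value": 0.4,
     "mask_spec": {"dt_form": "path",
                   "dt_params": {"vertices": [[0.45, 0.2], [0.75, 0.25],
                                              [0.85, 0.4], [0.7, 0.55],
                                              [0.45, 0.5]],
                                 "border": 0.05}}},
])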
How to ground the LLM in this workflow¶
When a photographer asks for a content-derived mask, the agent's reasoning chain should be:
- Surface the photo — call `render_preview(image_id)` to get a current-state JPEG.
- Look at it — the chat client renders the JPEG; the LLM's vision sees it.
- Identify what the photographer described — subject, sky, color region, etc.
- Pick the right shape — rectangle for hard regions, ellipse for circular subjects, path for irregular silhouettes, `range_filter` for tonal/color selection, both for spatial+content intersection.
- Estimate coordinates — use the photo's spatial layout, not assumed positions.
- Construct `mask_spec` — see `mask-shapes-from-words.md` for parameter conventions.
- Apply — `apply_primitive(name, value, mask_spec=...)`.
- Verify — render again, examine, adjust if needed.
Image-surfacing in different chat clients¶
| Client | Mechanism |
|---|---|
| Claude Code (this is the primary path) | Agent calls Read on the JPEG path returned by render_preview; image content surfaces in the conversation |
| Claude Desktop | Drag-drop the rendered JPEG; LLM sees attached images directly |
| ChatGPT (Plus / Team) | Same — drag-drop the JPEG file |
| Anthropic Console / API users | Pass image bytes as MCP image content blocks (future enhancement; works via custom tooling today) |
Limitations of LLM-vision (when to fall back)¶
LLM-vision is coarse. It is not the right tool for:
- Pixel-perfect silhouettes. Single-strand hair, fur edges, glasses frames against a complex background.
- Dense spot enumeration. "Find all 200+ small spots" — LLMs lose count and miss spots.
- Per-pixel depth. "Far mountains only" — no spatial-depth estimation in vision LLMs at usable quality.
- Sub-pixel registration. "Move the mask 3 pixels left" — LLMs estimate at percentage-of-image granularity, not pixel granularity.
For these, the photographer falls back to drawn approximations today, and to RFC-030's deployed providers when that lands.
See also¶
- Mask shapes from words — spatial-English-to-`mask_spec` mapping for build-by-words workflows (RFC-029 / ADR-084)
- Mask-applicable controls — which vocabulary primitives can be applied through a mask
- ADR-076 (drawn-mask-only architecture)
- ADR-084 / RFC-029 (compositional masks at apply time — the wire format LLM-vision uses)
- ADR-085 / RFC-024 (parametric range-filter — composes with LLM-vision for content+region intersection)
- ADR-086 / RFC-026 (this guide's closing ADR / RFC)
- RFC-030 (deferred precision-tier deployed-sibling-provider scaffolding)
- `src/chemigram/mcp/tools/rendering.py` — `render_preview` (image-surfacing tool)