RFC-030 — Deployed sibling-provider scaffolding for precision-tier AI masks¶
Status · Draft v0.1 (deferred)
TA anchor · /components/masking · /contracts/mcp-tools · /constraints/byoa
Related · RFC-026 / ADR-086 (LLM-vision-as-provider; the MVP this RFC layers above), ADR-076 (drawn-mask-only architecture), ADR-007 (BYOA principle), ADR-084 / RFC-029 (compositional masks at apply time; the wire), ADR-085 / RFC-024 (parametric range-mask substrate), RFC-025 (spot removal; AI content-aware variants land here)
Closes into · ADR-NNN (pending — provider Protocol contract); ADR-NNN (pending — `detect_*` MCP tool surface); possible additional ADRs for vocabulary-surface integration
Why this is an RFC · RFC-026 ships LLM-vision-as-provider as the MVP for AI-derived masks. That covers the dominant content-derived masking use cases (coarse subject regions, sky/foreground splits, color-range estimation) at zero deployment cost. It does not cover precision-tier use cases: pixel-perfect silhouettes (single-strand hair, fine edges), dense content-aware spot detection (200+ small regions), and depth-range masks (per-pixel depth estimation). Those need deployed ML models — SAM-class for subjects, MiDaS-class for depth, specialized detectors for spots. RFC-030 holds the architectural design for that deployed-sibling-provider scaffolding. It is deferred to a future milestone — not blocking v1.9.0 — because the LLM-vision path already covers the workflows the user named as priority. RFC-030 ships when (a) photographers hit the LLM-vision precision ceiling on real workflows, and (b) the workflow gap justifies the cost of installing a sibling project.
The question (deferred)¶
When LLM-vision-as-provider (RFC-026) isn't precise enough, what's the architectural shape for deployed sibling providers that produce pixel-precise masks?
Three sub-arcs with materially different cost shapes:
- Subject-mask precision tier. SAM / BiRefNet / U2-Net class. Polygon output via Douglas–Peucker simplification at the provider boundary; vertex budget 50-1000 per mask. Quality: pixel-precise.
- Spot detection. Smaller specialized model. Returns a list of `(x, y, radius)` tuples. RFC-025 owns the integration into retouch byte serialization.
- Depth masks. MiDaS / ZoeDepth class. Returns either a depth raster the engine thresholds into a far/near polygon, or a pre-thresholded polygon directly.
All three follow the same provider Protocol shape: separate MCP server, polygon/region output, ADR-076-compliant (drawn-form geometry darktable consumes).
Use cases (precision-tier, beyond LLM-vision)¶
- Pixel-perfect subject silhouette. Portrait where the photographer wants to lift the face exactly — hair edges, glasses frame, fine features. SAM/BiRefNet returns an ~500-vertex polygon; LLM-vision returns ~12 vertices that only approximate. RFC-030.
- Content-aware spot detection. Manta's belly with 200+ scattered white spots. An LLM can't reliably enumerate hundreds of small region centers. A specialized spot detector returns the list; the engine emits N retouch forms via RFC-025's path. RFC-030.
- Depth-range mask. "Dehaze only the far mountains." Needs per-pixel depth. LLMs cannot estimate depth well; MiDaS / ZoeDepth can. Returns a depth raster the engine thresholds, or a pre-thresholded far-band polygon. RFC-030.
- Multi-subject scene with overlapping subjects. "Brighten only the foreground person, not the person in the background." Needs subject detection with instance segmentation. RFC-030.
- Real-time iterative refinement. The photographer iterates on a SAM mask through edge-prompt feedback ("trim left edge"); the provider re-runs and returns an updated polygon. Long-tail UX. RFC-030.
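The depth-range case above amounts to thresholding a depth raster into a far-band binary mask. A pure-Python illustration (a real provider would threshold a MiDaS-class raster and then vectorize to a polygon):

```python
def far_band_mask(depth: list[list[float]], far_threshold: float) -> list[list[int]]:
    """Mark pixels whose normalized depth meets the threshold as 'far' (1)."""
    return [[1 if d >= far_threshold else 0 for d in row] for row in depth]


# A tiny 3x3 normalized depth raster: larger values are farther away.
depth = [
    [0.1, 0.2, 0.9],
    [0.3, 0.8, 0.95],
    [0.2, 0.4, 0.85],
]
mask = far_band_mask(depth, far_threshold=0.8)
# → [[0, 0, 1], [0, 1, 1], [0, 0, 1]] — the far band hugs the right column.
```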
Goals (deferred design)¶
- Pick the provider Protocol shape that handles subject masks, depth masks, and content-aware spot detection under one interface.
- Honor ADR-076's structural lesson — provider output must land as bytes darktable consumes via the existing `mask_spec` wire.
- Honor ADR-007 (BYOA) — providers are sibling projects; `chemigram.core` never imports torch / onnx / etc.
- Bound vertex explosion. SAM-class masks are million-pixel rasters. Polygon simplification (Douglas–Peucker) at the provider boundary, with a target vertex budget ≤1000.
- Compose with RFC-026's LLM-vision workflow. RFC-030's MCP tools (`detect_subjects`, etc.) should be drop-in replacements when photographers want precision — same `mask_spec` shape, just a different polygon source.
- Minimize installation friction. Realistic UX: photographers shouldn't need to manage torch versions, model checkpoints, or CUDA. The distribution mechanism (pip install, Docker, prebuilt binary) is part of this RFC.
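The Douglas–Peucker pass named in the vertex-budget goal could look like the recursive sketch below; a production pass would additionally iterate on the tolerance until the ≤1000-vertex budget is met:

```python
import math


def _point_line_dist(p, a, b):
    """Distance from point p to the line through a and b (|p-a| if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    chord = math.hypot(dx, dy)
    if chord == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / chord


def douglas_peucker(points, tolerance):
    """Drop vertices that deviate from the simplified chain by <= tolerance."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the chord between the endpoints.
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= tolerance:
        return [points[0], points[-1]]
    # Keep the farthest vertex; recurse on both halves and splice.
    left = douglas_peucker(points[: idx + 1], tolerance)
    right = douglas_peucker(points[idx:], tolerance)
    return left[:-1] + right
```

Running at the provider boundary keeps raster-to-polygon conversion out of `chemigram.core` entirely: the engine only ever sees a chain already under budget.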
Constraints (carry-over from RFC-026 v0.1)¶
- ADR-076: provider output must be drawn-form bytes darktable's mask system consumes.
- ADR-007: no AI dependencies in `chemigram.core`. Providers are sibling projects.
- ADR-033 (narrow MCP tool surface): RFC-030 may add 1-3 tools (`detect_subjects` / `detect_spots` / `detect_depth_band`, or a unified `mask.detect`). Each addition must justify its presence.
- ADR-006 (single-process MCP): the provider runs in its own process and talks via MCP. The engine never invokes the provider in-process.
- CLAUDE.md three foundational disciplines: agent-only-writer, darktable-does-the-photography, BYOA.
Proposed approach (sketch — full design when this RFC unfreezes)¶
Sibling MCP project pattern. A subject-detection provider is a separate Python package (e.g., `chemigram-masker-sam`, `chemigram-masker-birefnet`) that:
- Registers as an MCP server under `chemigram.mask_provider.v1`.
- Exposes `detect`, taking an image path + optional query.
- Returns `[{label, vertices, confidence, bbox}, ...]` — vertices in normalized [0, 1], pre-simplified to the vertex budget.
- Owns its model weights and dependencies (torch, onnxruntime, etc.).
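An illustrative `detect` response under this sketch, plus the kind of cheap sanity check the engine could run before wrapping the polygon (field names come from the bullet above; everything else is an assumption until the ADR lands):

```python
# Hypothetical detect() response for query="fish": one pre-simplified polygon,
# vertices normalized to [0, 1] so the engine can scale to any raster size.
response = [
    {
        "label": "fish",
        "vertices": [[0.31, 0.42], [0.55, 0.38], [0.62, 0.51], [0.44, 0.60]],
        "confidence": 0.94,
        "bbox": [0.31, 0.38, 0.62, 0.60],  # [x_min, y_min, x_max, y_max]
    }
]


def validate_region(region: dict) -> bool:
    """Engine-side sanity check: coordinates and confidence stay in [0, 1]."""
    ok_vertices = all(0.0 <= c <= 1.0 for v in region["vertices"] for c in v)
    return ok_vertices and 0.0 <= region["confidence"] <= 1.0
```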
MCP tool surface (sketch)¶
```python
@tool
def detect_subjects(image_id: str, query: str | None = None,
                    *, provider: str | None = None,
                    max_subjects: int = 5) -> list[DetectedSubject]:
    """Run the configured subject-detection provider; return polygons."""


@tool
def detect_spots(image_id: str, *, provider: str | None = None,
                 target: str | None = None) -> list[DetectedSpot]:
    """Run content-aware spot detection; return (x, y, radius) regions."""


@tool
def detect_depth_band(image_id: str, *, band: str,  # "near" | "far" | etc.
                      provider: str | None = None) -> dict:
    """Run depth-aware band detection; return the polygon for the band."""
```
(Final shape: one tool with a `kind` parameter, or three separate tools — TBD when this RFC unfreezes.)
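The return types in the sketch above could be plain dataclasses along these lines (hypothetical shapes; the real ones close into the pending tool-surface ADR):

```python
from dataclasses import dataclass


@dataclass
class DetectedSubject:
    """One subject polygon from detect_subjects (vertices normalized [0, 1])."""
    label: str
    vertices: list[tuple[float, float]]
    confidence: float
    bbox: tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


@dataclass
class DetectedSpot:
    """One spot region from detect_spots, feeding RFC-025's retouch path."""
    x: float
    y: float
    radius: float
    confidence: float = 1.0
```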
Vocabulary-surface integration¶
When deployed providers ship, the agent flow becomes: `detect_subjects(image_id, "fish")` → polygon vertices → wrap as `mask_spec` with `dt_form="path"` → `apply_primitive(..., mask_spec=...)`. Same wire as RFC-026; just the polygon source upgrades.
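A sketch of that flow, assuming a hypothetical dict shape for `mask_spec` (the real wire format belongs to RFC-029 / ADR-084):

```python
def polygon_to_mask_spec(vertices: list[tuple[float, float]]) -> dict:
    """Wrap provider polygon output as a drawn-path mask_spec (shape assumed)."""
    return {"dt_form": "path", "vertices": vertices}


# Hypothetical end-to-end flow: a precision provider swaps in for LLM-vision,
# same mask_spec wire, different polygon source.
subjects = [{"label": "fish", "vertices": [(0.3, 0.4), (0.5, 0.4), (0.4, 0.6)]}]
spec = polygon_to_mask_spec(subjects[0]["vertices"])
# apply_primitive(image_id, "exposure", mask_spec=spec)  # same call as RFC-026
```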
Per-image caching¶
`masks/<image_id>/ai_cache.json`, keyed by `(provider, model_version, query)` — avoids re-running expensive inference on every apply call.
Provider configuration¶
`~/.chemigram/providers.toml` registers known providers with model paths, default selection, and fallback behavior. Schema TBD.
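One possible shape for that file — purely illustrative, every key here is an assumption until the schema is actually designed:

```toml
# ~/.chemigram/providers.toml — hypothetical schema sketch
default_subject_provider = "sam"

[providers.sam]
kind = "subject"
model_path = "~/.chemigram/models/sam_vit_b.pth"
fallback = "llm-vision"   # drop back to RFC-026 if the model is unavailable

[providers.midas]
kind = "depth"
model_path = "~/.chemigram/models/midas_v3.onnx"
```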
Alternatives (sketch)¶
- Skip RFC-030 entirely; LLM-vision is enough forever. Rejected — precision-tier workflows are real (portrait retouchers, wildlife photographers wanting exact silhouettes). LLM-vision's ~12-vertex polygon ceiling won't satisfy them.
- Bundle SAM in core. Rejected — violates ADR-007.
- One unified `chemigram-mask-provider` for all three (subject / depth / spots). Possible, but it couples model lifecycles. Probably three sibling projects, each with focused scope.
- Cloud API providers (Anthropic, OpenAI, etc.) instead of local models. API keys / billing / latency / privacy trade-offs. Worth considering when this RFC unfreezes.
When does this RFC unfreeze¶
Triggers — any one of:
- Photographer evidence. Real workflow logs show photographers hitting LLM-vision precision limits and requesting precision-tier masks.
- Compositional gap from RFC-024 / RFC-025. A compositional case (e.g., "AI-subject ∩ drawn-radial") becomes the dominant ask, and LLM-vision polygons aren't precise enough for the intersection to make sense.
- Spot-detection demand. RFC-025's user-marked-spot path proves valuable but photographers want "find all the spots automatically" for high-volume retouching.
- Depth-mask demand. Photographer evidence for "dehaze only the far mountains" / "blur only the background" patterns where the drawn approximation is unsatisfying.
Until one of these triggers, RFC-030 stays Draft and deferred. Phase 2 carry-over content from earlier RFC-026 v0.1 drafts is preserved here for when work begins.
Open questions (deferred)¶
- Provider Protocol details: polygon-vs-raster output, vertex budget ceiling, MCP tool surface shape (one unified vs three separate), provider config schema, per-image caching policy, AI-detection cost amortization across multiple apply calls.
- Distribution mechanism: pip install + bundled checkpoints? Docker images? CUDA-vs-CPU support? macOS Metal support?
- Cloud-API path: lightweight Anthropic-API / GPT-4-Vision-API wrapper as an "easier to install" option distinct from the deployed-local-model arc?
- Cost model: how to surface inference cost (latency, $) to the agent so it can choose between LLM-vision (fast, free within the conversation) and deployed providers (slower, with deployment cost and possibly API fees).
How this closes¶
When this RFC unfreezes:
- ADR-NNN — Deployed-sibling-provider Protocol contract. The provider interface (registration, `detect` shape, polygon output, model-version metadata).
- ADR-NNN — `detect_*` MCP tool surface. The agent-facing tools that route to providers.
- Possibly an ADR for the cloud-API path (Anthropic / OpenAI wrapper) if that path proves preferable.
Links¶
- TA/components/masking
- TA/contracts/mcp-tools
- TA/constraints/byoa
- ADR-076 (drawn-mask only architecture)
- ADR-007 (BYOA principle)
- ADR-084 / RFC-029 (compositional masks at apply time)
- ADR-085 / RFC-024 (parametric mask encoding)
- ADR-086 / RFC-026 (LLM-vision-as-provider; the MVP this RFC layers above)
- RFC-025 (spot removal; AI variants land here)
- capability-survey.md § 7