ADR-049: Vocabulary-starter ships within chemigram (clarifies ADR-032)
Status · Accepted
Date · 2026-04-28
TA anchor · /stack
Clarifies · ADR-032 (does not supersede; re-affirms the location decision and locks the PyPI boundary)
Context
ADR-032 ("Distribution split") established that the chemigram public repo is a monorepo containing the engine source, the starter vocabulary at vocabulary/starter/, and community packs at vocabulary/packs/. Only chemigram-masker-sam was named as an "optional sibling project" — a separate repo with its own release cadence — because it brings PyTorch and SAM model weights into a dependency graph the engine deliberately keeps out of itself (per ADR-007 BYOA, ADR-014).
Subsequent ADRs (ADR-034 build system, ADR-041 versioning scheme, ADR-042 distribution) referred in passing to chemigram-vocabulary-starter as a "sibling project" alongside chemigram-masker-sam. That phrasing was drift — those ADRs were focused on other concerns (build backend, SemVer rules, PyPI mechanics) and treated the vocabulary as a hypothetically-separate distribution without ever explicitly making that decision. ADR-032's monorepo decision was never overturned.
This ADR clarifies: there is one repo, one PyPI distribution, and the starter vocabulary ships inside both. Only chemigram-masker-sam is a true sibling project.
Decision
The starter vocabulary lives at vocabulary/starter/ in the chemigram repo and ships as part of the chemigram PyPI distribution. It is not a separate distribution.
chemigram/                        # the OSS monorepo
├── src/chemigram/                # engine + MCP server (per ADR-034)
├── vocabulary/
│   ├── starter/                  # the starter pack; ships in the `chemigram` wheel
│   │   ├── manifest.json
│   │   ├── exposure/
│   │   │   ├── expo_-0.5.dtstyle
│   │   │   └── ...
│   │   ├── wb/
│   │   ├── colorcal/
│   │   └── ...                   (per docs/starter-vocabulary.md)
│   └── packs/                    # community packs (per ADR-032)
│       └── <pack_name>/
│           ├── manifest.json
│           ├── ATTRIBUTION.md
│           └── *.dtstyle
└── ...

# Sibling repos (separate distributions)
chemigram-masker-sam/             # per ADR-032, ships PyTorch + SAM weights
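The layout above implies a few structural expectations for a pack directory. A minimal sketch of a layout check follows; `check_pack_layout` and its `require_attribution` flag are hypothetical names introduced here for illustration, and the rules are inferred only from the tree (a `manifest.json`, at least one `.dtstyle` file, and `ATTRIBUTION.md` for community packs), not from any documented validator.

```python
from pathlib import Path


def check_pack_layout(pack_dir, require_attribution=False):
    """Return a list of layout problems for one vocabulary pack directory.

    Hypothetical helper: the checks mirror the directory tree in this ADR only.
    """
    pack = Path(pack_dir)
    problems = []

    if not (pack / "manifest.json").is_file():
        problems.append("missing manifest.json")

    # The tree lists ATTRIBUTION.md under community packs, not under the starter.
    if require_attribution and not (pack / "ATTRIBUTION.md").is_file():
        problems.append("missing ATTRIBUTION.md")

    # Styles sit in subfolders for the starter and at the top level for packs,
    # so search recursively.
    if not any(pack.rglob("*.dtstyle")):
        problems.append("no .dtstyle files found")

    return problems
```

An empty list means the directory matches the shape shown in the tree; for a community pack, pass `require_attribution=True`.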
PyPI implications:
- `chemigram` (the package): includes `vocabulary/starter/` as package data. `pip install chemigram` gets the engine + the starter pack out of the box.
- `chemigram-masker-sam` (separate distribution, separate repo): ships when Phase 4 lands (per ADR-032).
- `chemigram-vocabulary-starter` is not a PyPI name we'll claim or use. Earlier mentions of this name in ADR-034, ADR-041, ADR-042 were drift; those ADRs are not edited (append-only) but readers should treat their "sibling project" mentions of vocabulary-starter as obsolete.
Engine API:
- `chemigram.core.vocab.load_starter()` resolves to the bundled starter pack at runtime via package-data lookup. No environment configuration needed; no separate install step; no `pip install chemigram-vocabulary-starter`.
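A package-data lookup of this kind could be sketched with the standard library's `importlib.resources`. This is an illustration under stated assumptions, not the engine's actual code: `starter_root` and `load_pack` are hypothetical names, the in-package location `vocabulary/starter` is assumed, and nothing is assumed about the manifest schema beyond it being JSON.

```python
import json
from importlib import resources
from pathlib import Path


def starter_root(package="chemigram"):
    """Locate the bundled starter pack inside an installed package.

    Assumption: the wheel places the pack at <package>/vocabulary/starter.
    """
    return resources.files(package).joinpath("vocabulary").joinpath("starter")


def load_pack(root):
    """Read manifest.json and index every .dtstyle file under root.

    Converting to Path assumes a regular on-disk install (not a zipped one).
    """
    root = Path(str(root))
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    styles = {
        path.relative_to(root).as_posix(): path
        for path in sorted(root.rglob("*.dtstyle"))
    }
    return {"manifest": manifest, "styles": styles}
```

With the package installed, `load_pack(starter_root())` would need no environment variables or extra install step, which is the property this section describes.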
Rationale
The case for the merge (and against treating vocabulary-starter as a separate distribution):
- ADR-032 already said this. The drift in ADR-034/041/042 was unintentional. We are re-affirming, not changing direction.
- Community-pack ecosystem is aspirational. The argument for separating vocabulary-starter ("it sets the pattern for community packs") only matters if community packs materialize. At v1, they don't exist. Splitting now prepares for a future that may never arrive at the cost of friction now.
- Empty action space is not a desirable default. A `pip install chemigram` that produces a working engine with no vocabulary surfaces gaps immediately. Bundling the starter means first-time users have something to do; the path to "what does this thing actually do?" is one step shorter.
- Iteration cost. Engine-and-starter cross-changes are common during Phase 1 (a new pipeline stage may require a new vocabulary entry to validate end-to-end). One-repo iteration is cheaper than coordinating two repos and two release cycles.
- `chemigram-masker-sam` is unaffected. It stays separate because PyTorch + model weights belong outside the engine's dependency graph. That argument is technical and immediate; the vocabulary-starter argument was aspirational, neither technical nor immediate.
Alternatives considered
- Keep vocabulary-starter as a separate sibling distribution (the position implied by drift in ADR-034/041/042): rejected. Premature for v1; conflicts with ADR-032's explicit monorepo decision; adds release-coordination friction.
- Vocabulary as user-supplied only, no starter ships: considered briefly, rejected for the same reason ADR-032 rejected it — first-time users face an empty action space; the starter is the on-ramp.
- Vocabulary-starter as a separate `extras_require` rather than separate package (`pip install chemigram[starter]`): considered and rejected as pointlessly clever. The starter is small (~50 KB of `.dtstyle` files), useful by default, and there's no dependency cost to pulling it in. Just include it.
Consequences
Positive:
- One repo, one CI, one release process for the engine + starter
- `pip install chemigram` works out of the box
- Engine-and-starter cross-changes ship together (common during Phase 1)
- Re-affirms ADR-032's position cleanly; resolves drift in later ADRs
- Less infrastructure to maintain (no second repo, no second PyPI listing)
Negative:
- If the community-pack ecosystem ever materializes, the starter sitting inside chemigram becomes asymmetric vs community packs in their own repos. Mitigation: at that point, splitting is a small refactor; ship it then if needed, not now.
- The `chemigram-vocabulary-starter` PyPI name remains available for a squatter. Mitigation: optionally claim it as a placeholder (a 0.0.0 wheel pointing at `chemigram` itself) at first release. Cheap, defensive, no commitment.
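The "works out of the box" consequence holds only if the starter files actually land in the built wheel. One way a release check could confirm that is sketched below. `wheel_ships_starter` is a hypothetical helper, and the archive prefix `chemigram/vocabulary/starter/` is an assumption about where the build places the pack; a wheel is a zip archive, so the standard `zipfile` module is enough.

```python
import zipfile

# Assumed location of the starter pack inside the wheel archive.
STARTER_PREFIX = "chemigram/vocabulary/starter/"


def starter_files_in_wheel(wheel_path, prefix=STARTER_PREFIX):
    """List archive members (files only) under the starter-pack prefix."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name
            for name in wheel.namelist()
            if name.startswith(prefix) and not name.endswith("/")
        )


def wheel_ships_starter(wheel_path, prefix=STARTER_PREFIX):
    """True if the wheel has the starter manifest and at least one .dtstyle."""
    names = starter_files_in_wheel(wheel_path, prefix)
    has_manifest = (prefix + "manifest.json") in names
    has_styles = any(name.endswith(".dtstyle") for name in names)
    return has_manifest and has_styles
```

Running this against the built wheel before upload would catch a build config that silently dropped the vocabulary.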
Implementation notes
- `pyproject.toml` is updated to include `vocabulary/starter/` in the wheel build (via hatchling's `[tool.hatch.build.targets.wheel]` `force-include` directive). Implementation lands in Slice 1 of Phase 1 alongside the rest of the bootstrap config.
- Slice 6 of Phase 1 captures the actual `.dtstyle` files into `vocabulary/starter/` per the spec at `docs/starter-vocabulary.md`. No sibling repo is created.
- `docs/IMPLEMENTATION.md` Slice 6 wording is updated: distribution happens via the main `chemigram` package, not via a separate `chemigram-vocabulary-starter` package.
- `docs/starter-vocabulary.md` is updated: references to "the OSS `chemigram-vocabulary-starter` package" become "the starter pack at `vocabulary/starter/`."
- `docs/CONTRIBUTING.md` already documents two contribution paths (code, vocabulary) within the same repo. No changes needed.
- ADR-034, ADR-041, ADR-042 are not edited. They were correct on their actual subjects; their references to vocabulary-starter as a sibling project were drift, and ADR-049 makes the correct position explicit.
- `chemigram-masker-sam` continues unchanged per ADR-032: separate repo, separate distribution, deferred to Phase 4.
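For the first note, the hatchling configuration might look roughly like the fragment below. It is a sketch, not the project's actual `pyproject.toml`: the `force-include` table maps a source path in the repo to a destination path inside the wheel, and the destination `chemigram/vocabulary/starter` is an assumption chosen so the files sit inside the installed package.

```toml
# Sketch only: the destination path on the right-hand side is an assumption.
[tool.hatch.build.targets.wheel.force-include]
"vocabulary/starter" = "chemigram/vocabulary/starter"
```

Whatever destination is chosen, the runtime package-data lookup has to resolve the same path.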