Chip Huyen ML / AI Critique Guide¶

A sparring partner for ML and AI product design: datasets, model choice, RAG, evaluation, deployment, monitoring, and cost. When asked to critique a pipeline, provider setup, eval plan, or “LLM feature” design, score it against these themes and return specific, actionable changes — preferably as config diffs, test additions, or documented decisions.

This rubric is inspired by the ideas in Chip Huyen’s books (O’Reilly):

Designing Machine Learning Systems (2022) — end-to-end ML systems, data, training/serving, iteration.
AI Engineering: Building Applications with Foundation Models — prompts, RAG, fine-tuning, agents, evaluation under open-ended outputs, production tradeoffs.

It is not a summary of either book and not a substitute for reading them. Author page and pointers: huyenchip.com — Books.

For charts and dashboards, use Tufte Chart Critique. For runbooks, deploy paths, and reliability culture, use SRE Book Infra Critique.

How to Use This File¶

In Cursor: @docs/guides/CHIP_HUYEN_ML_AI_CRITIQUE.md or @CHIP_HUYEN_ML_AI_CRITIQUE.md, then paste or describe the artifact (profile YAML, provider choice, eval protocol, RAG sketch, agent flow, monitoring plan).

Ask things like:

"Critique this summarization profile against the Chip Huyen rubric"
"Is our eval plan honest for open-ended LLM outputs?"
"Quick ML-systems check — where’s the biggest architectural smell?"

Repo anchors: Experiment Guide, AI Provider Comparison, Pipeline and Workflow, Configuration API.

Two lenses (same seven themes)¶

Chip Huyen’s material spans offline design and live systems; the themes below are shared. You do not need two separate repo files unless you strongly want two different @ mentions in the editor — splitting duplicates the “evaluation vs deployment” bridge and doubles maintenance.

When to emphasize which themes

Lens	Typical artifact	Emphasize themes
Experiments and offline	Held-out sets, baselines, promotion rules, prompt/RAG sketches before ship	1–4 (outcomes, data, adaptation tier, eval honesty)
Production code and operations	Pipeline code paths, retries, monitoring, operator UX, cost in prod	4–7 (eval that matches prod reality, economics, drift, failure UX)

Theme 4 is the hinge: offline eval can be rigorous yet still lie about production if latency, tooling, and provider failures are never exercised. For a full design review, score all seven.

The 7 Themes (with scoring)¶

Each theme scores Pass / Warn / Fail. At the end, give an overall verdict and a prioritized fix list.

1. Outcomes Before Models¶

Pick architectures from measurable product or research outcomes, not from the most impressive model card.

Check for:

“We’ll use the biggest model” before the success metric exists
Offline accuracy celebrated when the user-facing task is different
No explicit good enough bar (latency, cost, quality floor)

Pass: Metric + threshold stated; model tier is a dependent variable. Warn: Metrics exist but live only in chat, not in code or docs. Fail: Model name is the requirement.

2. Data and Context Engineering¶

Garbage in, garbage out — including prompts, retrieved chunks, and training/eval splits.

Check for:

Leakage (eval episodes in train context, future information in “historical” features)
Stale corpus, unversioned snapshots, no reproducible “what text did the model see?”
RAG without chunking strategy, citation discipline, or failure when retrieval misses

Pass: Data lineage and refresh story are documented; eval held-out is honest. Warn: Mostly sound; a few gray zones called out with backlog. Fail: “We’ll eyeball it”; same rows in train and eval.

3. Right Adaptation Tier¶

Prompting, RAG, fine-tuning, and model swap are tools with different cost, risk, and maintenance profiles — not a ladder you climb by default.

Check for:

Fine-tuning proposed when prompt + eval tightening would suffice
RAG bolted on without measuring retrieval precision/recall for the task
Agent + tools where a single batched call would meet SLO

Pass: Alternatives considered; choice tied to metrics and ops burden. Warn: Reasonable default with a dated plan to validate cheaper tiers. Fail: Complexity is the feature.

4. Evaluation Matches Deployment¶

Metrics must reflect real inputs, real latency, and real failure modes — especially for open-ended generation.

Check for:

Single-number leaderboard without variance, slice, or human spot-check
Tests that never exercise timeouts, empty retrieval, or provider errors
“LLM-as-judge” without calibration, baseline, or dispute handling

Pass: Offline + targeted online/human checks; regressions block promotion where agreed. Warn: Good offline suite; prod behavior partially unknown. Fail: Ship on demo vibes.

5. Production Economics¶

Latency, throughput, and $ per unit are part of the design — not post-hoc surprises.

Check for:

No budget for tokens, GPU minutes, or API calls per episode
Synchronous chains that violate viewer or batch SLOs
Missing caching, idempotency, or backoff where providers rate-limit

Pass: Cost and latency estimated per path; fallbacks defined. Warn: Estimates rough but tracked after launch. Fail: “We’ll optimize later” with no measurement hook.

6. Monitoring, Drift, and Iteration¶

Models decay when the world and data shift; systems need feedback loops, not launch-day optimism.

Check for:

No slice of quality by feed, language, length, or provider
Retrain or prompt-update triggers undefined
Incidents that don’t update eval fixtures or acceptance tests

Pass: Dashboards or jobs tied to product metrics; change log when pipeline shifts. Warn: Manual periodic review; partial automation. Fail: Ship and forget.

7. Failure Modes and User Experience¶

When the model is wrong, something predictable still happens — for operators and end users.

Check for:

Hallucinated facts presented without grounding or confidence cues where needed
Agent loops without max steps, tool allowlists, or circuit breakers
Opaque errors to the user when the provider fails

Pass: Degraded modes documented; dangerous outputs bounded by design. Warn: Some sharp edges; mitigations planned. Fail: Best-case-only UX.

Critique Output Format¶

Use one of the shapes below (full is default when both offline and prod matter).

Full (all themes)¶

## Chip Huyen ML / AI Critique

| Theme                     | Score | Notes                          |
|---------------------------|-------|--------------------------------|
| Outcomes before models    |       |                                |
| Data / context            |       |                                |
| Adaptation tier           |       |                                |
| Evaluation vs deployment  |       |                                |
| Production economics      |       |                                |
| Monitoring / drift        |       |                                |
| Failure modes / UX        |       |                                |

**Verdict:** [one paragraph]

## Priority Fixes

1. [CRITICAL] …
2. [HIGH] …
3. [MEDIUM] …
4. [LOW] …

## Suggested edits

[concrete YAML, test cases, eval protocol, or doc bullets — not generic advice]

Experiments and offline only (themes 1–4)¶

## Chip Huyen critique — experiments / offline

| Theme                    | Score | Notes |
|--------------------------|-------|-------|
| Outcomes before models   |       |       |
| Data / context           |       |       |
| Adaptation tier          |       |       |
| Evaluation vs deployment |       |       |

**Verdict:** …
## Priority Fixes
…

Production code and operations only (themes 4–7)¶

## Chip Huyen critique — production / live

| Theme                    | Score | Notes |
|--------------------------|-------|-------|
| Evaluation vs deployment |       |       |
| Production economics     |       |       |
| Monitoring / drift       |       |       |
| Failure modes / UX       |       |       |

**Verdict:** …
## Priority Fixes
…

Quick Reference Cheatsheet¶

Avoid	Prefer
Model-first requirements	Outcome metrics + thresholds first
One aggregate score	Slices, variance, spot-checks
Default to largest LLM	Cheapest tier that meets the bar
Train/eval overlap	Explicit splits and lineage
RAG without retrieval metrics	Measure recall/precision for the task
Launch without cost estimate	Per-unit $ and latency envelope
Silent wrong answers	Grounding, bounds, or honest degradation

Tone for Critiques¶

Be direct. ML and AI systems fail in boring, expensive ways — data bugs, skew, eval lying, unbounded agents. Name the risk, tie it to a theme above, and propose a specific mitigation (test, config, process). Do not substitute stack hype for engineering.

Chip Huyen ML / AI Critique Guide¶

How to Use This File¶

Two lenses (same seven themes)¶

The 7 Themes (with scoring)¶

1. Outcomes Before Models¶

2. Data and Context Engineering¶

3. Right Adaptation Tier¶

4. Evaluation Matches Deployment¶

5. Production Economics¶

6. Monitoring, Drift, and Iteration¶

7. Failure Modes and User Experience¶

Critique Output Format¶

Full (all themes)¶

Experiments and offline only (themes 1–4)¶

Production code and operations only (themes 4–7)¶

Quick Reference Cheatsheet¶

Tone for Critiques¶

See Also¶