Aaron Cavano
Note · Feb 17, 2026

Design systems need an eval pipeline, not better documentation

The rules that govern how a design system gets used rarely make it into the spec — not because designers don't know them, but because authoring and evaluating are different cognitive tasks.

The rules that actually govern a design system (which button variant in a dense layout, which token belongs on this surface, which combination quietly breaks the visual language) rarely make it into the spec. Not because designers don't know them. Because enumerating rules in the abstract is a different cognitive task than making a judgment call on a specific case.

Writing a component spec is doable. Documenting every context that spec applies to, every edge case where the rule bends, every implicit constraint a senior designer would catch instantly — that's harder. And there's no great tooling for it.

The better approach is closer to how preference collection for LLM alignment works. You don't ask annotators to write rules. You show them outputs and ask which is better and why. Preferences become the model. Apply that to design: ingest the design system's tokens and components, generate composition examples — real contexts, real component combinations — and have the right person evaluate them. Designer for taste and cohesion. PM for requirement fit. QA for rule violations. Each annotation adds a dimension. The rules emerge from the corpus instead of being authored upfront.
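
To make that concrete, here's a minimal sketch of what the corpus could look like, in TypeScript. Every name in it (CompositionExample, Annotation, emergentRules, the dimension strings, the sample data) is made up for illustration; it's not the schema of any existing tool.

```ts
type Role = "designer" | "pm" | "qa";

// A generated composition: a real context plus the components and tokens used in it.
interface CompositionExample {
  id: string;
  context: string;                 // e.g. "dense settings table"
  components: string[];            // component/variant names in the composition
  tokens: Record<string, string>;  // token name -> value applied on this surface
}

// One human judgment on one example. The rationale is the signal the spec never captures.
interface Annotation {
  exampleId: string;
  role: Role;
  verdict: "pass" | "fail";
  dimension: string;               // "taste/cohesion" | "requirement fit" | "rule violation"
  rationale: string;
  preferredOver?: string;          // id of the example this one beat, if judged pairwise
}

// Group rationales by dimension so they read as candidate rules.
// The rules emerge from the corpus rather than being authored upfront.
function emergentRules(annotations: Annotation[]): Map<string, string[]> {
  const rules = new Map<string, string[]>();
  for (const a of annotations) {
    const existing = rules.get(a.dimension) ?? [];
    existing.push(`[${a.role}/${a.verdict}] ${a.rationale}`);
    rules.set(a.dimension, existing);
  }
  return rules;
}

// Hypothetical data: one generated composition and one judgment on it.
const generated: CompositionExample = {
  id: "settings-table-01",
  context: "dense settings table",
  components: ["Button/primary", "Button/primary", "Table/compact"],
  tokens: { "surface.background": "neutral.50" },
};

const judgment: Annotation = {
  exampleId: generated.id,
  role: "qa",
  verdict: "fail",
  dimension: "rule violation",
  rationale: "Two primary buttons in one dense row; only one primary action per surface.",
};

console.log(emergentRules([judgment]));
```

The rationale field is where the "why" lives; everything else is just enough structure to group judgments by dimension later.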

This matters because the cognitive task finally matches the work. Designers already do this — in Figma comments, in design critique, in the Slack message that says "that's not quite right." A generative eval pipeline just captures that signal instead of letting it evaporate.

The compounding value is the judge: once you have enough annotated examples, an LLM can evaluate new generated outputs against the accumulated rules without a human in the loop. The annotation experience is what gets you there. Design critique, structured.
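
A sketch of the judge, under the same assumptions: the accumulated annotations get folded into the prompt, and callModel is a stand-in for whatever LLM API you'd actually use, not a real client.

```ts
// A past human judgment, carried forward as a rule for the judge.
interface JudgedRule {
  dimension: string;
  verdict: "pass" | "fail";
  rationale: string;
}

// Stand-in for a real LLM client; swap in your provider's SDK call here.
async function callModel(prompt: string): Promise<string> {
  return `DRY RUN:\n${prompt.slice(0, 200)}`;
}

// Evaluate a new generated composition against the accumulated rules,
// without a human in the loop.
async function judgeComposition(
  rules: JudgedRule[],
  compositionDescription: string
): Promise<string> {
  const ruleText = rules
    .map((r) => `- (${r.dimension}, past verdict: ${r.verdict}) ${r.rationale}`)
    .join("\n");

  const prompt = [
    "You are reviewing a generated design-system composition.",
    "These judgments came from human annotations; treat them as the rules:",
    ruleText,
    "Composition under review:",
    compositionDescription,
    "For each dimension, answer pass or fail and cite the rule that applies.",
  ].join("\n\n");

  return callModel(prompt);
}

// Usage, with hypothetical data.
judgeComposition(
  [{ dimension: "rule violation", verdict: "fail", rationale: "Only one primary action per surface." }],
  "Dense settings row with two primary buttons."
).then(console.log);
```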