Aaron Cavano
Feb 15, 2026

Axial: qualitative evals for LLM-as-a-Judge

Designing a multi-agent evaluation tool that auto-extracts open codes and clusters them axially — the qualitative signal that Langfuse and LangSmith don't give you.

The gap

LLM evaluation tooling is good at metrics. Pass/fail rates, latency, token cost, ROUGE scores. Langfuse and LangSmith are solid for tracing and scoring, but they operate on the assumption that you already know what to measure.

Most teams don't. Especially early. You're shipping prompts, watching outputs, and trying to understand why the model behaves the way it does, not just whether it passes a threshold.

Qualitative analysis of LLM outputs has no good tooling. Teams either skip it or do it by hand: sampling outputs, reading through them, trying to find patterns. It's slow, inconsistent, and doesn't scale past a few hundred traces.

Axial is the tool I designed to close that gap.

[Image: Marketing site — axial.dev]

What it does

Axial runs a two-stage analysis pipeline on top of your existing trace data.

Open coding. A secondary LLM agent reads sampled outputs and extracts open codes: short descriptive labels for what's happening in each output. Not evaluations. Observations. "Model deflects with uncertainty language." "Response includes unsolicited caveats." "Factual claim made without hedging."
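
To make that concrete, here is a rough sketch of what the open-coding agent could look like, assuming an OpenAI-style chat API. The prompt, model name, and helper function are illustrative assumptions, not a spec.

```python
# Hypothetical sketch of the open-coding agent. The prompt, model name, and
# output shape are assumptions, not Axial's actual implementation.
import json

from openai import OpenAI

client = OpenAI()

CODING_PROMPT = """You are performing open coding for qualitative analysis.
Read the LLM output below and return 1-5 short descriptive codes as a JSON
array of strings. Describe what is happening; do not judge quality.

Output:
{output}
"""

def extract_open_codes(trace_output: str, model: str = "gpt-4o-mini") -> list[str]:
    """Ask a secondary LLM to label one sampled output with open codes."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": CODING_PROMPT.format(output=trace_output)}],
    )
    # e.g. ["deflects with uncertainty language", "includes unsolicited caveats"]
    return json.loads(response.choices[0].message.content)
```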

Axial clustering. Codes are embedded and clustered into higher-order theme families. "Uncertainty language," "over-hedging," and "deflection patterns" might all resolve into a single axis: model confidence calibration. That's the signal you actually care about.
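
A minimal sketch of that clustering stage, assuming sentence-transformers embeddings and scikit-learn's agglomerative clustering; the embedding model and distance threshold are illustrative choices, not Axial's actual stack.

```python
# Sketch of the axial-clustering stage. Model name and threshold are
# illustrative; any embedding model and clustering algorithm would do.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def cluster_codes(codes: list[str], distance_threshold: float = 0.35) -> dict[int, list[str]]:
    """Embed open codes and group them into higher-order theme clusters."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedder.encode(codes, normalize_embeddings=True)

    # Cosine-distance agglomerative clustering with no fixed cluster count;
    # the threshold decides how broad each theme family is.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    labels = clustering.fit_predict(embeddings)

    clusters: dict[int, list[str]] = defaultdict(list)
    for code, label in zip(codes, labels):
        clusters[label].append(code)
    return dict(clusters)

# "uncertainty language", "over-hedging", and "deflection patterns" would land
# in one cluster: the confidence-calibration axis.
```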

[Image: Cluster map — confidence calibration, 312 codes, 9 clusters]

The design problem

The core tension: this is a research-grade analytical tool, but its users are product and ML teams who already live in Langfuse or LangSmith. It has to feel like a layer on top of what already exists, not a replacement.

That shaped every decision. The integration model is pull-based. Connect your workspace, configure a sampling rate, and Axial runs passively. No changes to your logging setup. No new SDKs. The first time you open a dashboard, you already have data.
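
Conceptually, the integration is little more than a polling loop over the workspace's trace export, plus a sampling rate. Everything below is hypothetical: fetch_recent_traces and analyze are placeholders, and the config keys are illustrative.

```python
# Hypothetical pull-based integration loop. fetch_recent_traces() stands in
# for the tracing platform's export API; the config keys are illustrative.
import random
import time

AXIAL_CONFIG = {
    "sampling_rate": 0.05,         # analyze roughly 5% of new traces
    "poll_interval_seconds": 300,  # pull every five minutes
}

def fetch_recent_traces(since: float) -> list[dict]:
    """Placeholder for pulling traces logged after `since` from the workspace."""
    raise NotImplementedError

def analyze(trace: dict) -> None:
    """Placeholder for the open-coding and clustering pipeline described above."""
    raise NotImplementedError

def run_passively(config: dict = AXIAL_CONFIG) -> None:
    """Poll the connected workspace and analyze a sampled subset of traces."""
    last_poll = time.time()
    while True:
        traces = fetch_recent_traces(since=last_poll)
        last_poll = time.time()
        for trace in traces:
            if random.random() < config["sampling_rate"]:
                analyze(trace)
        time.sleep(config["poll_interval_seconds"])
```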

The main view is a cluster map, not a table. Tables are for metrics. This is about pattern recognition, and spatial layout communicates proximity better than rows. Clusters that share thematic weight sit close. Outlier behaviors live at the edges.
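
One plausible way to produce that layout is to project the code embeddings down to two dimensions before rendering; a sketch with scikit-learn's t-SNE, purely illustrative.

```python
# Illustrative: project code embeddings to 2D so related clusters sit close
# on the map. t-SNE is one option; UMAP would work the same way.
import numpy as np
from sklearn.manifold import TSNE

def layout_cluster_map(embeddings: np.ndarray) -> np.ndarray:
    """Return (n_codes, 2) coordinates for plotting the cluster map."""
    tsne = TSNE(n_components=2, perplexity=30, metric="cosine", init="random")
    return tsne.fit_transform(embeddings)
```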

Design decisions

No color for status. Every eval tool uses red/green for pass/fail. Axial doesn't make that judgment; it surfaces patterns and lets the team decide what's good or bad. Using color would imply a verdict the tool doesn't have. Everything is grayscale, with density and proximity doing the communicative work.

The coding agent is visible, not hidden. You can see exactly what the agent extracted and why. Codes are editable. Clusters can be manually merged or split. The AI is a starting point, not an oracle.

Sampling is a first-class concept. You can't run deep qualitative analysis on every trace; it's expensive and unnecessary. Axial makes sampling strategy explicit: stratified by output length, random, or seeded by judge score distribution. The sampling config is always visible so you know what you're looking at.
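
As an example of what an explicit strategy means in practice, here is a sketch of stratification by output length; the bucket edges and per-bucket sample size are assumptions.

```python
# Illustrative stratified sampler: bucket traces by output length, then
# sample from each bucket so long and short outputs are both represented.
import random
from collections import defaultdict

def stratified_by_length(traces: list[dict], per_bucket: int = 50,
                         edges: tuple[int, ...] = (200, 800, 2000)) -> list[dict]:
    """Sample up to `per_bucket` traces from each output-length bucket."""
    buckets: dict[int, list[dict]] = defaultdict(list)
    for trace in traces:
        length = len(trace["output"])
        bucket = sum(length > edge for edge in edges)  # 0..len(edges)
        buckets[bucket].append(trace)

    sample: list[dict] = []
    for bucket_traces in buckets.values():
        k = min(per_bucket, len(bucket_traces))
        sample.extend(random.sample(bucket_traces, k))
    return sample
```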

[Image: Code explorer + judge calibration — 1,847 codes, systematic drift analysis]
[Image: Trace feed — per-trace codes, extracted labels, score comparison]

Outcome

Axial is a concept I built out fully: product definition, design system, interface, and marketing site. The technical architecture is grounded in established practice: open coding is a standard method in qualitative research, running it through a secondary LLM agent is a natural extension, and the clustering approach maps cleanly onto existing embedding infrastructure.

The gap it addresses is real and getting more acute as teams build more complex LLM systems. Metrics tell you what happened. Axial tells you why.

This is a vision piece. The design is complete; the product is not built.