Eval your LLMs like a researcher, not a guesser.

Axial extracts open codes, clusters them axially, and surfaces the qualitative signal your metrics miss.

Runway · Cohere · Mistral · Adept · Together AI · Weights & Biases

The qualitative layer
your eval stack is missing.

LLM-as-a-Judge gives you scores. Axial tells you why those scores exist — and where they drift.

01 / Extract

Open Code Extraction

Auto-surfaces latent themes from LLM outputs without manual labeling. Grounded theory methodology applied at scale across your trace logs.

02 / Cluster

Axial Clustering

Groups related open codes into higher-order category families using embedding similarity and hierarchical clustering. Reveals structure in the noise.
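The clustering step can be sketched with a small, self-contained example. This is not Axial's implementation: real open codes would be embedded by a model, while the toy 3-dimensional vectors, the single-linkage strategy, and the 0.15 distance cutoff below are all illustrative choices.

```python
from itertools import combinations

# Toy open-code embeddings (in practice these come from an embedding model).
codes = {
    "uncertainty language":    (0.9, 0.1, 0.0),
    "hedged factual claim":    (0.8, 0.2, 0.1),
    "unsolicited caveat":      (0.7, 0.2, 0.2),
    "refuses edge case":       (0.1, 0.9, 0.1),
    "redirects to disclaimer": (0.2, 0.8, 0.2),
    "lists without synthesis": (0.1, 0.1, 0.9),
}

def cosine_dist(a, b):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return 1.0 - dot / (na * nb)

def agglomerate(points, threshold=0.15):
    """Single-linkage agglomerative clustering with a distance cutoff."""
    clusters = [[name] for name in points]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(cosine_dist(points[a], points[b])
                    for a in clusters[i] for b in clusters[j])
            if d < threshold and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters  # no pair closer than the cutoff remains
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]

families = agglomerate(codes)
# The hedging-related codes group together, the deflection codes group
# together, and the formatting code stays a singleton: three families.
```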

03 / Calibrate

Judge Calibration

Compares your LLM judge scores against cluster-level consensus to surface systematic drift. Catch grade inflation before it corrupts your eval pipeline.
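The drift computation described here reduces to a per-cluster comparison of mean judge score against mean consensus score. A minimal sketch, with toy scores chosen to mirror the numbers shown in the product screenshots:

```python
# Per-trace judge scores and consensus labels, grouped by cluster.
# Toy numbers; real scores would come from your eval pipeline.
judge_scores = {
    "confidence calibration": [0.74, 0.76, 0.72],
    "factual assertion":      [0.60, 0.58, 0.62],
}
consensus_scores = {
    "confidence calibration": [0.61, 0.63, 0.59],
    "factual assertion":      [0.62, 0.60, 0.64],
}

def mean(xs):
    return sum(xs) / len(xs)

def drift_by_cluster(judge, consensus):
    """Positive drift = the judge grades higher than the cluster consensus."""
    return {c: round(mean(judge[c]) - mean(consensus[c]), 2) for c in judge}

drift = drift_by_cluster(judge_scores, consensus_scores)
# → {"confidence calibration": 0.13, "factual assertion": -0.02}
```

A judge that drifts +0.13 on one cluster while tracking consensus on another is miscalibrated in a topic-specific way, which a single aggregate accuracy number would hide.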

[Product UI · app.axial.dev/clusters · Cluster Map · gpt-4o · 7d · 4,218 traces · 1,847 codes · 9 clusters]

  Cluster                  Codes
  Confidence calibration     312
  Deflection patterns        278
  Over-hedging               241
  Factual assertion          198
  Instruction follow         187
  Format deviation           143
  Scope creep                119
  Refusal cascade             94
  Verbosity drift             88

  Selected cluster: Confidence calibration · 312 codes · judge avg 0.74 · consensus 0.61 · drift +0.13 ↑
  Top terms: hedging, assertion, certainty, caveat, qualify, deflect

Four steps from traces
to qualitative signal.

01

Connect your trace logs

Point Axial at your Langfuse project or LangSmith run. OAuth or API key — no infrastructure changes required. Axial samples at a configurable rate to keep costs low.
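Configurable-rate sampling can be as simple as a stratified sampler over trace metadata. The helper below is an illustrative sketch, not Axial's API; stratifying (here by model) keeps low-volume models from being drowned out by high-volume ones:

```python
import random

def stratified_sample(traces, rate=0.12, key=lambda t: t["model"], seed=7):
    """Sample roughly `rate` of traces from each stratum (e.g. per model)."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    strata = {}
    for t in traces:
        strata.setdefault(key(t), []).append(t)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * rate))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# 300 gpt-4o traces and 100 o3 traces; a 12% stratified sample keeps
# 36 of the former and 12 of the latter, so both models stay visible.
traces = [{"id": i, "model": "gpt-4o" if i % 4 else "o3"} for i in range(400)]
picked = stratified_sample(traces)
```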

02

Open coding across sampled outputs

Axial runs iterative open coding over sampled LLM outputs — the same inductive process a qualitative researcher would use, but applied at the scale of thousands of traces per hour.
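The shape of that coding loop can be illustrated with a stub coder. In Axial the code assignment is presumably LLM-driven; the keyword heuristic below is only a stand-in for that call, and the cue lists are invented for the example:

```python
def open_code(output: str, codebook: set[str]) -> list[str]:
    """Assign open codes to one LLM output. Stand-in for an LLM-driven
    coder: a keyword heuristic illustrates the shape of the step."""
    cues = {
        "uncertainty language": ("not certain", "might", "possibly"),
        "unsolicited caveat":   ("however", "keep in mind", "recommend consulting"),
        "refusal":              ("i can't", "i cannot", "unable to help"),
    }
    found = [code for code, kws in cues.items()
             if any(k in output.lower() for k in kws)]
    codebook.update(found)  # the codebook grows as new codes surface
    return found

codebook: set[str] = set()
outputs = [
    "I'm not certain, but it might work; I'd recommend consulting a specialist.",
    "I can't help with that request.",
]
coded = [open_code(o, codebook) for o in outputs]
# First output picks up two hedging codes; second is tagged as a refusal.
```

The inductive part is the growing codebook: codes are not fixed up front but accumulate as new patterns appear in sampled outputs.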

03

Axial clustering into theme families

Open codes are embedded and clustered into axial categories — higher-order themes that reveal structural patterns in how your models respond, fail, or drift over time.

04

Judge calibration scores surface eval drift

Axial compares LLM-as-a-Judge scores against cluster consensus to surface systematic miscalibration — grade inflation, topic-specific bias, temporal drift — with full audit trails.
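Temporal drift, in particular, reduces to tracking the judge-vs-consensus gap across time windows. A toy sketch (the weekly numbers are invented for illustration):

```python
# Weekly (judge_mean, consensus_mean) pairs for one cluster; toy data
# showing a gap that widens over time, i.e. temporal drift.
weeks = [
    ("W1", 0.70, 0.68),
    ("W2", 0.72, 0.67),
    ("W3", 0.74, 0.63),
    ("W4", 0.76, 0.61),
]

gaps = [(label, round(judge - consensus, 2)) for label, judge, consensus in weeks]
widening = all(b[1] > a[1] for a, b in zip(gaps, gaps[1:]))
# gaps: W1 +0.02, W2 +0.05, W3 +0.11, W4 +0.15 → strictly widening
```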

[Product UI · app.axial.dev/codes · open-code table · filters: all / conf. / defl.]

  Code                      Freq   Cluster   J. score
  uncertainty language       148   conf.     0.71
  unsolicited caveat         122   conf.     0.68
  redirects to disclaimer    109   defl.     0.55
  hedged factual claim        98   conf.     0.74
  lists without synthesis     87   fmt.      0.60
  refuses edge case           81   ref.      0.49
  unprompted disclaimer       74   defl.     0.52
  scope narrowing             68   scope     0.63
[Product UI · app.axial.dev/calibration · Judge Calibration · gpt-4o-judge · 7d · drift by cluster (judge score vs cluster consensus)]

  Confidence calib.     +0.13
  Deflection patterns   +0.09
  Factual assertion     −0.02
  Instruction follow    +0.02
  Refusal cascade       +0.14
  Verbosity drift       −0.05
[Product UI · app.axial.dev/traces · Trace Feed · conf. calibration · stratified sampling at a 12% rate]

  Trace ID        Model    Cluster   Score   Δ
  trace_8f2a3c    gpt-4o   conf.     0.58    +0.16
  trace_1d9b7e    gpt-4o   conf.     0.62    +0.11
  trace_4a6f1b    gpt-4o   defl.     0.51    +0.21
  trace_c3e8a2    gpt-4o   conf.     0.69    +0.08
  trace_7b2d9f    gpt-4o   hedge     0.55    +0.18
  trace_0e5c3d    gpt-4o   conf.     0.73    +0.04
  trace_9a1b4e    gpt-4o   defl.     0.44    +0.28

  Selected trace: trace_8f2a3c · 2025-01-14 09:23:41 · conf. calibration · gpt-4o · temp 0.7
  "I'm not entirely certain, but it's possible that the approach you're describing might work in some cases, though I'd recommend consulting with..."
  Codes: uncertainty language (conf.) · unsolicited caveat (conf.) · redirects to disclaimer (defl.)
  Judge score 0.74 (gpt-4o-judge) · cluster consensus 0.58

Drops into your
existing stack.

Connect in under five minutes. No new infrastructure — Axial reads directly from your existing observability tools.

  Langfuse     Observability / Tracing   Native integration
  LangSmith    Eval / Tracing            Native integration
  OpenAI       Model provider            Supported
  Anthropic    Model provider            Supported
  Custom       REST / Webhook            JSON schema ingest

Also works with: Helicone, Braintrust, Arize Phoenix, W&B Weave.
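For the custom REST / webhook path, a trace payload might look like the sketch below. The field names are hypothetical, invented for illustration, and not a published Axial schema:

```python
import json

# Hypothetical trace payload for a custom JSON ingest endpoint.
# Every field name here is an assumption, not Axial's documented schema.
trace = {
    "trace_id": "trace_8f2a3c",
    "model": "gpt-4o",
    "timestamp": "2025-01-14T09:23:41Z",
    "input": "Will this approach work for my use case?",
    "output": "I'm not entirely certain, but it's possible that ...",
    "judge": {"name": "gpt-4o-judge", "score": 0.74},
    "metadata": {"temperature": 0.7},
}
payload = json.dumps(trace)  # serialized body for an HTTP POST
```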

Simple, trace-based pricing.
No surprises.

Starter
Free
forever
  • 10,000 traces / mo
  • Open code extraction
  • Basic axial clustering
  • 1 integration
  • 7-day data retention
  • Community support
Get Started
Enterprise
Custom
contact us
  • Unlimited traces
  • Self-hosted option
  • Custom code taxonomies
  • SLA guarantees
  • Unlimited retention
  • SSO / SAML
  • Dedicated support
Talk to Sales

Start seeing what your
metrics can't tell you.

No credit card required. Connects to your existing Langfuse or LangSmith account.