Eval your LLMs like a researcher, not a guesser.

Axial extracts open codes, clusters them axially, and surfaces the qualitative signal your metrics miss.

Runway · Cohere · Mistral · Adept · Together AI · Weights & Biases

The qualitative layer
your eval stack is missing.

LLM-as-a-Judge gives you scores. Axial tells you why those scores exist — and where they drift.

01 / Extract

Open Code Extraction

Auto-surfaces latent themes from LLM outputs without manual labeling. Grounded theory methodology applied at scale across your trace logs.

02 / Cluster

Axial Clustering

Groups related open codes into higher-order category families using embedding similarity and hierarchical clustering. Reveals structure in the noise.
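The clustering step can be sketched with a small, self-contained example. This is not Axial's implementation: real open codes would be embedded by a model, while the toy 3-dimensional vectors, the single-linkage strategy, and the 0.15 distance cutoff below are all illustrative choices.

```python
from itertools import combinations

# Toy open-code embeddings (in practice these come from an embedding model).
codes = {
    "uncertainty language":    (0.9, 0.1, 0.0),
    "hedged factual claim":    (0.8, 0.2, 0.1),
    "unsolicited caveat":      (0.7, 0.2, 0.2),
    "refuses edge case":       (0.1, 0.9, 0.1),
    "redirects to disclaimer": (0.2, 0.8, 0.2),
    "lists without synthesis": (0.1, 0.1, 0.9),
}

def cosine_dist(a, b):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return 1.0 - dot / (na * nb)

def agglomerate(points, threshold=0.15):
    """Single-linkage agglomerative clustering with a distance cutoff."""
    clusters = [[name] for name in points]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(cosine_dist(points[a], points[b])
                    for a in clusters[i] for b in clusters[j])
            if d < threshold and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters  # no pair closer than the cutoff remains
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]

families = agglomerate(codes)
# The hedging-related codes group together, the deflection codes group
# together, and the formatting code stays a singleton: three families.
```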

03 / Calibrate

Judge Calibration

Compares your LLM judge scores against cluster-level consensus to surface systematic drift. Catch grade inflation before it corrupts your eval pipeline.
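The drift computation described here reduces to a per-cluster comparison of mean judge score against mean consensus score. A minimal sketch, with toy scores chosen to mirror the numbers shown in the product screenshots:

```python
# Per-trace judge scores and consensus labels, grouped by cluster.
# Toy numbers; real scores would come from your eval pipeline.
judge_scores = {
    "confidence calibration": [0.74, 0.76, 0.72],
    "factual assertion":      [0.60, 0.58, 0.62],
}
consensus_scores = {
    "confidence calibration": [0.61, 0.63, 0.59],
    "factual assertion":      [0.62, 0.60, 0.64],
}

def mean(xs):
    return sum(xs) / len(xs)

def drift_by_cluster(judge, consensus):
    """Positive drift = the judge grades higher than the cluster consensus."""
    return {c: round(mean(judge[c]) - mean(consensus[c]), 2) for c in judge}

drift = drift_by_cluster(judge_scores, consensus_scores)
# → {"confidence calibration": 0.13, "factual assertion": -0.02}
```

A judge that drifts +0.13 on one cluster while tracking consensus on another is miscalibrated in a topic-specific way, which a single aggregate accuracy number would hide.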

[Product UI · app.axial.dev/clusters · Cluster Map · gpt-4o · 7d · 4,218 traces · 1,847 codes · 9 clusters]

  Cluster                  Codes
  Confidence calibration     312
  Deflection patterns        278
  Over-hedging               241
  Factual assertion          198
  Instruction follow         187
  Format deviation           143
  Scope creep                119
  Refusal cascade             94
  Verbosity drift             88

  Selected cluster: Confidence calibration · 312 codes · judge avg 0.74 · consensus 0.61 · drift +0.13 ↑
  Top terms: hedging, assertion, certainty, caveat, qualify, deflect

Four steps from traces
to qualitative signal.

01

Connect your trace logs

Point Axial at your Langfuse project or LangSmith run. OAuth or API key — no infrastructure changes required. Axial samples at a configurable rate to keep costs low.
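Configurable-rate sampling can be as simple as a stratified sampler over trace metadata. The helper below is an illustrative sketch, not Axial's API; stratifying (here by model) keeps low-volume models from being drowned out by high-volume ones:

```python
import random

def stratified_sample(traces, rate=0.12, key=lambda t: t["model"], seed=7):
    """Sample roughly `rate` of traces from each stratum (e.g. per model)."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    strata = {}
    for t in traces:
        strata.setdefault(key(t), []).append(t)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * rate))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# 300 gpt-4o traces and 100 o3 traces; a 12% stratified sample keeps
# 36 of the former and 12 of the latter, so both models stay visible.
traces = [{"id": i, "model": "gpt-4o" if i % 4 else "o3"} for i in range(400)]
picked = stratified_sample(traces)
```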

02

Open coding across sampled outputs

Axial runs iterative open coding over sampled LLM outputs — the same inductive process a qualitative researcher would use, but applied at the scale of thousands of traces per hour.
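The shape of that coding loop can be illustrated with a stub coder. In Axial the code assignment is presumably LLM-driven; the keyword heuristic below is only a stand-in for that call, and the cue lists are invented for the example:

```python
def open_code(output: str, codebook: set[str]) -> list[str]:
    """Assign open codes to one LLM output. Stand-in for an LLM-driven
    coder: a keyword heuristic illustrates the shape of the step."""
    cues = {
        "uncertainty language": ("not certain", "might", "possibly"),
        "unsolicited caveat":   ("however", "keep in mind", "recommend consulting"),
        "refusal":              ("i can't", "i cannot", "unable to help"),
    }
    found = [code for code, kws in cues.items()
             if any(k in output.lower() for k in kws)]
    codebook.update(found)  # the codebook grows as new codes surface
    return found

codebook: set[str] = set()
outputs = [
    "I'm not certain, but it might work; I'd recommend consulting a specialist.",
    "I can't help with that request.",
]
coded = [open_code(o, codebook) for o in outputs]
# First output picks up two hedging codes; second is tagged as a refusal.
```

The inductive part is the growing codebook: codes are not fixed up front but accumulate as new patterns appear in sampled outputs.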

03

Axial clustering into theme families

Open codes are embedded and clustered into axial categories — higher-order themes that reveal structural patterns in how your models respond, fail, or drift over time.

04

Judge calibration scores surface eval drift

Axial compares LLM-as-a-Judge scores against cluster consensus to surface systematic miscalibration — grade inflation, topic-specific bias, temporal drift — with full audit trails.
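Temporal drift, in particular, reduces to tracking the judge-vs-consensus gap across time windows. A toy sketch (the weekly numbers are invented for illustration):

```python
# Weekly (judge_mean, consensus_mean) pairs for one cluster; toy data
# showing a gap that widens over time, i.e. temporal drift.
weeks = [
    ("W1", 0.70, 0.68),
    ("W2", 0.72, 0.67),
    ("W3", 0.74, 0.63),
    ("W4", 0.76, 0.61),
]

gaps = [(label, round(judge - consensus, 2)) for label, judge, consensus in weeks]
widening = all(b[1] > a[1] for a, b in zip(gaps, gaps[1:]))
# gaps: W1 +0.02, W2 +0.05, W3 +0.11, W4 +0.15 → strictly widening
```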

[Product UI · app.axial.dev/codes · open-code table · filters: all / conf. / defl.]

  Code                      Freq   Cluster   J. score
  uncertainty language       148   conf.     0.71
  unsolicited caveat         122   conf.     0.68
  redirects to disclaimer    109   defl.     0.55
  hedged factual claim        98   conf.     0.74
  lists without synthesis     87   fmt.      0.60
  refuses edge case           81   ref.      0.49
  unprompted disclaimer       74   defl.     0.52
  scope narrowing             68   scope     0.63
[Product UI · app.axial.dev/calibration · Judge Calibration · gpt-4o-judge · 7d · drift by cluster (judge score vs cluster consensus)]

  Confidence calib.     +0.13
  Deflection patterns   +0.09
  Factual assertion     −0.02
  Instruction follow    +0.02
  Refusal cascade       +0.14
  Verbosity drift       −0.05
[Product UI · app.axial.dev/traces · Trace Feed · conf. calibration · stratified sampling at a 12% rate]

  Trace ID        Model    Cluster   Score   Δ
  trace_8f2a3c    gpt-4o   conf.     0.58    +0.16
  trace_1d9b7e    gpt-4o   conf.     0.62    +0.11
  trace_4a6f1b    gpt-4o   defl.     0.51    +0.21
  trace_c3e8a2    gpt-4o   conf.     0.69    +0.08
  trace_7b2d9f    gpt-4o   hedge     0.55    +0.18
  trace_0e5c3d    gpt-4o   conf.     0.73    +0.04
  trace_9a1b4e    gpt-4o   defl.     0.44    +0.28

  Selected trace: trace_8f2a3c · 2025-01-14 09:23:41 · conf. calibration · gpt-4o · temp 0.7
  "I'm not entirely certain, but it's possible that the approach you're describing might work in some cases, though I'd recommend consulting with..."
  Codes: uncertainty language (conf.) · unsolicited caveat (conf.) · redirects to disclaimer (defl.)
  Judge score 0.74 (gpt-4o-judge) · cluster consensus 0.58

Drops into your
existing stack.

Connect in under five minutes. No new infrastructure — Axial reads directly from your existing observability tools.

  Langfuse     Observability / Tracing   Native integration
  LangSmith    Eval / Tracing            Native integration
  OpenAI       Model provider            Supported
  Anthropic    Model provider            Supported
  Custom       REST / Webhook            JSON schema ingest

Also works with: Helicone, Braintrust, Arize Phoenix, W&B Weave.
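For the custom REST / webhook path, a trace payload might look like the sketch below. The field names are hypothetical, invented for illustration, and not a published Axial schema:

```python
import json

# Hypothetical trace payload for a custom JSON ingest endpoint.
# Every field name here is an assumption, not Axial's documented schema.
trace = {
    "trace_id": "trace_8f2a3c",
    "model": "gpt-4o",
    "timestamp": "2025-01-14T09:23:41Z",
    "input": "Will this approach work for my use case?",
    "output": "I'm not entirely certain, but it's possible that ...",
    "judge": {"name": "gpt-4o-judge", "score": 0.74},
    "metadata": {"temperature": 0.7},
}
payload = json.dumps(trace)  # serialized body for an HTTP POST
```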

Simple, trace-based pricing.
No surprises.

Starter
Free
forever
  • 10,000 traces / mo
  • Open code extraction
  • Basic axial clustering
  • 1 integration
  • 7-day data retention
  • Community support
Get Started
Enterprise
Custom
contact us
  • Unlimited traces
  • Self-hosted option
  • Custom code taxonomies
  • SLA guarantees
  • Unlimited retention
  • SSO / SAML
  • Dedicated support
Talk to Sales

Start seeing what your
metrics can't tell you.

No credit card required. Connects to your existing Langfuse or LangSmith account.