gpt-4o · 7d · 1,847 codes
Code Explorer
1,847 codes · 9 clusters · 312 flagged
All Confidence calibration Deflection Over-hedging Drift ↑
Code                     Freq  Score  Drift
Overstated certainty      214   0.82  +0.18
Hedging without basis     189   0.76  +0.11
Implicit deflection       176   0.71  −0.04
Unsolicited caveats       152   0.65  +0.09
Factual overreach         143   0.61  +0.02
Scope expansion           119   0.54  −0.07
Format non-compliance      98   0.47  −0.12
Instruction omission       88   0.43  +0.06
Refusal without cause      77   0.38  +0.14
Verbose non-answer         71   0.34  −0.02
Epistemic mismatch         64   0.31  +0.08
Soft assertion drift       58   0.27  +0.03
Judge Calibration
Systematic drift analysis
By cluster By judge Timeline
Confidence calibration 312 · Judge avg 0.74 · Consensus 0.61
Deflection patterns 278 · Judge avg 0.68 · Consensus 0.64
Over-hedging 241 · Judge avg 0.82 · Consensus 0.70
Factual assertion 198 · Judge avg 0.59 · Consensus 0.57
Instruction follow 187 · Judge avg 0.55 · Consensus 0.49
Max drift +0.18 confidence cal.
Avg drift +0.09 across 9 clusters
Below threshold 4 of 9 clusters
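
The Judge Calibration panel lists a judge average and a consensus score for five clusters but does not say how the two are compared. As a minimal sketch, the Python below computes the simple difference (judge average minus consensus) for those five clusters. The names `CLUSTERS` and `judge_consensus_gap` are hypothetical, and the "gap" is an illustrative assumption, not the dashboard's own drift formula.

```python
# Judge average and consensus score per cluster, copied from the panel.
# Assumption: "gap" = judge avg - consensus. The dashboard does not define
# its drift metric, so this is illustrative only.
CLUSTERS = {
    "Confidence calibration": (0.74, 0.61),
    "Deflection patterns": (0.68, 0.64),
    "Over-hedging": (0.82, 0.70),
    "Factual assertion": (0.59, 0.57),
    "Instruction follow": (0.55, 0.49),
}


def judge_consensus_gap(clusters):
    """Return {cluster: judge_avg - consensus}, rounded to two decimals."""
    return {
        name: round(judge - consensus, 2)
        for name, (judge, consensus) in clusters.items()
    }


gaps = judge_consensus_gap(CLUSTERS)
widest = max(gaps, key=gaps.get)  # cluster where judges exceed consensus most

# Print clusters from widest to narrowest gap.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {gap:+.2f}")
```

Under this assumption the widest gap falls on Confidence calibration, but its size does not match the reported max drift of +0.18, so the dashboard's drift figure is presumably derived some other way.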