Don't ask the model. Measure it.
The model's self-report is one signal. We add two more — variance against the next-best alternative, and history against your past traces.
{
  "traceId": "trc_847b3f...",
  "confidenceScore": 0.46,
  "pillars": {
    "base": 0.55,
    "variance": 0.25,
    "historical": 0.55
  },
  "flags": ["LOW_CONFIDENCE", "HIGH_AMBIGUITY"],
  "suggestedStatus": "flagged"
}
Only confidenceScore and flags persist on the trace. Pillars are computed at ingest, then discarded. The algorithm and weights are documented and stable, so the breakdown is always reproducible from the same inputs.
Three pillars. Three weights. One score.
The model's self-reported confidence is a sample of how the model feels — not a measurement. Adjudon takes it as one signal of three. The other two: how clear the top decision was vs the next-best alternative, and how often a near-identical past decision succeeded. Four blocks, one ingestion.
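The sample response at the top of the page can be reproduced from its three pillar values. A minimal sketch, assuming the 40/30/30 weights and the thresholds documented on this page; `triangulate` is a hypothetical helper name, not part of any published API, and the rounding is for display only:

```javascript
// Hypothetical helper: combines three pillar values that are already in [0, 1].
function triangulate({ base, variance, historical }) {
  // 40/30/30 weighted sum, as documented on this page.
  const score = base * 0.4 + variance * 0.3 + historical * 0.3;

  const flags = [];
  if (score < 0.6) flags.push('LOW_CONFIDENCE');
  if (variance < 0.3) flags.push('HIGH_AMBIGUITY');

  const suggestedStatus =
    score < 0.4 ? 'escalated'
    : (score < 0.7 || flags.length > 0) ? 'flagged'
    : 'success';

  // Round to two decimals for display (an assumption; the page does not specify rounding).
  return { confidenceScore: Math.round(score * 100) / 100, flags, suggestedStatus };
}

const result = triangulate({ base: 0.55, variance: 0.25, historical: 0.55 });
console.log(result);
// { confidenceScore: 0.46, flags: [ 'LOW_CONFIDENCE', 'HIGH_AMBIGUITY' ], suggestedStatus: 'flagged' }
```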
The first pillar is the model's own confidence — read off the trace, defaulted to 0.5 when the model didn't report. It carries 40 % of the final score. The other 60 % is what we measure.
// Pillar 1 — Base Probability · weight 40 %
const rawBase = trace.outputDecision?.confidenceScore
  ?? trace.confidence;
const base = rawBase != null ? parseFloat(rawBase) : 0.5;
The second pillar measures how clear the top decision was versus the next-best alternative. Wide gap, decisive. Tied vote, ambiguous. When no alternatives are provided, the pillar defaults to 0.8: the model wasn't asked to enumerate, so we don't punish it.
// Pillar 2 — Variance · weight 30 %
const altScore = trace.alternatives?.[0]?.confidence ?? 0;
const delta = Math.max(0, base - altScore);
const variance = trace.alternatives?.length
  ? Math.min(1.0, 0.5 + delta * 1.5)
  : 0.8;
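To make the Pillar 2 mapping concrete, here is a small self-contained sketch that wraps the same arithmetic in a function and runs it on three cases. The name `varianceScore` and the sample inputs are illustrative assumptions; the sketch also assumes the first entry in `alternatives` is the strongest runner-up.

```javascript
// Hypothetical wrapper around the Pillar 2 arithmetic shown above.
function varianceScore(base, alternatives) {
  // No alternatives supplied: neutral default of 0.8.
  if (!alternatives || alternatives.length === 0) return 0.8;
  const altScore = alternatives[0]?.confidence ?? 0;
  // Gap between the chosen decision and the runner-up, floored at zero.
  const delta = Math.max(0, base - altScore);
  return Math.min(1.0, 0.5 + delta * 1.5);
}

const tied = varianceScore(0.55, [{ confidence: 0.55 }]);
const wideGap = varianceScore(0.9, [{ confidence: 0.2 }]);
const noAlternatives = varianceScore(0.9, []);

console.log({ tied, wideGap, noAlternatives });
// { tied: 0.5, wideGap: 1, noAlternatives: 0.8 }
```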
The third pillar looks up the three closest past traces by input-vector cosine similarity (≥ 0.7) and counts how many succeeded — no human override, no flagged status. Zero similar traces returns 0.6 plus a NOVEL_SITUATION flag. The absence of precedent is itself a signal.
// Pillar 3 — Historical Precedent · weight 30 %
const text = [trace.triggeringCondition, trace.inputContext]
  .filter(Boolean).join(' ');
const queryVec = await vectorMemory.generateEmbedding(text);
const similar = await vectorMemory.findSimilarTraces(orgId, queryVec, {
  topK: 3, minScore: 0.7, workspaceId
});
const historical = similar.length
  ? similar.filter(t => !t.humanOverride && t.status !== 'flagged').length / similar.length
  : 0.6; // novel situation
The three pillars combine in a 40/30/30 weighted sum. A score under 0.6 raises LOW_CONFIDENCE; variance under 0.3 raises HIGH_AMBIGUITY. The suggested status follows: under 0.4 escalates, under 0.7 or any flag flags, otherwise success. The Policy Engine reads this status next; the HTTP code is its decision, not ours.
// Triangulation
const score = base * 0.4 + variance * 0.3 + historical * 0.3;
// Flags
const flags = [];
if (score < 0.6) flags.push('LOW_CONFIDENCE');
if (variance < 0.3) flags.push('HIGH_AMBIGUITY');
// suggestedStatus (the Policy Engine reads this next)
const suggestedStatus =
  score < 0.4 ? 'escalated'
  : (score < 0.7 || flags.length > 0) ? 'flagged'
  : 'success';
The score isn't the verdict. The calibrated score is.
A model says 90 %. Of 100 such decisions, how many were correct? Maybe 60. We measure it per organisation, on your own reviews.
Brier · ECE · Reliability diagram
Brier (Glenn Brier, 1950) is the mean squared error between predicted confidence and outcome. ECE (expected calibration error) bins decisions by confidence and averages the gap between confidence and accuracy across bins, weighted by bin size. The reliability diagram is the canonical visualisation (Niculescu-Mizil & Caruana, ICML 2005), with 95 % Wilson CIs (Edwin Wilson, Harvard, 1927) on every point. A Brier score below 0.10 rates as excellent; an ECE below 0.03 rates as well calibrated.
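The three metrics can be sketched in a few lines. This is a textbook illustration using the standard formulas, not the engine's code: the record shape `{ p, y }`, the ten equal-width bins and the function names are assumptions. The data is the "says 90 %, right 60 times out of 100" case from above.

```javascript
// Each record: predicted confidence p in [0, 1] and reviewed outcome y (1 = correct, 0 = not).
function brierScore(records) {
  return records.reduce((sum, r) => sum + (r.p - r.y) ** 2, 0) / records.length;
}

// ECE with equal-width bins: size-weighted mean of |accuracy - mean confidence| per bin.
function expectedCalibrationError(records, numBins = 10) {
  const bins = Array.from({ length: numBins }, () => ({ n: 0, sumP: 0, sumY: 0 }));
  for (const { p, y } of records) {
    const i = Math.min(numBins - 1, Math.floor(p * numBins));
    bins[i].n += 1;
    bins[i].sumP += p;
    bins[i].sumY += y;
  }
  return bins.reduce((ece, b) => {
    if (b.n === 0) return ece; // empty bins contribute nothing
    return ece + (b.n / records.length) * Math.abs(b.sumY / b.n - b.sumP / b.n);
  }, 0);
}

// Wilson score interval for k successes out of n (z = 1.96 for roughly 95 %).
function wilsonInterval(k, n, z = 1.96) {
  const pHat = k / n;
  const denom = 1 + (z * z) / n;
  const centre = (pHat + (z * z) / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((pHat * (1 - pHat)) / n + (z * z) / (4 * n * n));
  return [centre - half, centre + half];
}

// 100 decisions reported at 0.9 confidence, 60 of them correct.
const records = [
  ...Array.from({ length: 60 }, () => ({ p: 0.9, y: 1 })),
  ...Array.from({ length: 40 }, () => ({ p: 0.9, y: 0 })),
];

const brier = brierScore(records);             // about 0.33
const ece = expectedCalibrationError(records); // about 0.30
const [lo, hi] = wilsonInterval(60, 100);      // about [0.50, 0.69]
console.log(brier.toFixed(2), ece.toFixed(2), lo.toFixed(2), hi.toFixed(2));
```

On this data both metrics sit far above the thresholds quoted above, and the Wilson interval around the observed 60 % accuracy does not reach the reported 90 %.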
Per-org isotonic calibration map
Pool Adjacent Violators (Ayer-Brunk-Ewing-Reid-Silverman, 1955) builds an order-preserving correction curve from your reviewed decisions. We blend it with a cross-org prior via hierarchical shrinkage w_org = n_org / (n_org + 500) — at 50 reviews you pull 91 % from the prior; at 5 000 reviews, 91 % from your own data. Refit nightly at 03:30 UTC.
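A minimal sketch of both ideas, under stated assumptions: `fitIsotonic` is a plain stack-based Pool Adjacent Violators pass that returns a step function (no interpolation, no special tie handling), and `calibrate` blends the per-org and prior maps by mixing their outputs with the shrinkage weight. The function names, the flat prior and the way the blend is applied are illustrative guesses; only the weight formula comes from the text above.

```javascript
// Pool Adjacent Violators: fit a non-decreasing step function to (x, y) points.
function fitIsotonic(points) {
  const sorted = [...points].sort((a, b) => a.x - b.x);
  const blocks = [];
  for (const { x, y } of sorted) {
    blocks.push({ xMin: x, xMax: x, sum: y, n: 1 });
    // Merge backwards while the newest block's mean falls below its predecessor's.
    while (blocks.length > 1) {
      const last = blocks[blocks.length - 1];
      const prev = blocks[blocks.length - 2];
      if (prev.sum / prev.n <= last.sum / last.n) break;
      blocks.splice(blocks.length - 2, 2, {
        xMin: prev.xMin, xMax: last.xMax, sum: prev.sum + last.sum, n: prev.n + last.n,
      });
    }
  }
  return blocks.map(b => ({ xMin: b.xMin, xMax: b.xMax, value: b.sum / b.n }));
}

// Step lookup: value of the last block that starts at or below x.
function applyIsotonic(blocks, x) {
  let value = blocks[0].value;
  for (const b of blocks) if (b.xMin <= x) value = b.value;
  return value;
}

// Shrinkage weight from the text: w_org = n_org / (n_org + 500).
function orgWeight(nOrg, k = 500) {
  return nOrg / (nOrg + k);
}

// Assumption: the blend mixes the two calibrated outputs linearly.
function calibrate(raw, orgBlocks, priorBlocks, nOrg) {
  const w = orgWeight(nOrg);
  return w * applyIsotonic(orgBlocks, raw) + (1 - w) * applyIsotonic(priorBlocks, raw);
}

// Reviewed decisions: x = raw score, y = 1 if the decision held up in review.
const reviewed = [
  { x: 0.2, y: 0 }, { x: 0.4, y: 1 }, { x: 0.6, y: 0 }, { x: 0.8, y: 1 }, { x: 0.9, y: 1 },
];
const orgBlocks = fitIsotonic(reviewed);
const priorBlocks = [{ xMin: 0, xMax: 1, value: 0.7 }]; // hypothetical flat cross-org prior

console.log(orgWeight(50).toFixed(2), orgWeight(5000).toFixed(2)); // "0.09" "0.91"
console.log(calibrate(0.5, orgBlocks, priorBlocks, 50).toFixed(3)); // "0.682"
```

The out-of-order pair at 0.4 and 0.6 is pooled into one block with value 0.5, which is what keeps the curve order-preserving. With only 50 reviews the blended result stays close to the prior's 0.7.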
Conformal coverage badge
Vovk conformal prediction (Algorithmic Learning in a Random World, 2005) is the only confidence construct in the engine that comes with a theorem rather than a heuristic: distribution-free, finite-sample, decidable. We publish an embeddable SVG badge at /badge.svg?alpha=0.1, publicly cacheable, showing your live empirical coverage against the target α.
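For readers new to the technique, here is a generic split-conformal sketch, not a description of how the badge is computed. It assumes each reviewed decision has a nonconformity score (for example, 1 minus the calibrated probability assigned to the outcome that turned out true); the function names and the sample numbers are made up for illustration. Under exchangeability, thresholding at the ceil((n + 1)(1 − α))-th smallest calibration score gives marginal coverage of at least 1 − α.

```javascript
// Conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
function conformalThreshold(calScores, alpha) {
  const sorted = [...calScores].sort((a, b) => a - b);
  // The small epsilon guards against floating-point overshoot before rounding up.
  const rank = Math.ceil((sorted.length + 1) * (1 - alpha) - 1e-9);
  // With too few calibration points the rank exceeds n; nothing can then be excluded.
  return rank > sorted.length ? Infinity : sorted[rank - 1];
}

// Empirical coverage: share of held-out items whose true-outcome score is within the threshold.
function empiricalCoverage(testScores, threshold) {
  return testScores.filter(s => s <= threshold).length / testScores.length;
}

// 19 hypothetical calibration scores and 10 hypothetical held-out scores.
const calScores = [
  0.02, 0.05, 0.07, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22,
  0.25, 0.30, 0.33, 0.35, 0.40, 0.45, 0.52, 0.61, 0.80,
];
const heldOut = [0.03, 0.10, 0.30, 0.50, 0.61, 0.55, 0.20, 0.15, 0.40, 0.90];

const threshold = conformalThreshold(calScores, 0.1); // 18th smallest of 19
const coverage = empiricalCoverage(heldOut, threshold);
console.log(threshold, coverage); // 0.61 0.9
```

The coverage figure is the kind of number a badge at α = 0.1 would compare against the 90 % target; on a sample this small it will fluctuate around that target.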
Drift detection · Bias stratification
Three drift triggers run nightly at 02:30 UTC: Kolmogorov-Smirnov on score distribution (D > 0.10), ECE jump month-over-month (≥ 0.03), and Brier-score regression (≥ 15 %). Bias stratification computes per-protected-group Brier + ECE on customer-supplied opaque labels — Adjudon never infers group membership from input context. EU AI Act Art. 10(2)(g) + 15(4) evidence.
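The three triggers can be sketched as one check over two reporting windows. This is an illustration under assumptions: the two-sample Kolmogorov-Smirnov statistic is the standard one, the Brier "regression ≥ 15 %" is read as a relative increase, and the trigger names and input shape are hypothetical.

```javascript
// Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs.
function ksStatistic(a, b) {
  const xs = [...a].sort((p, q) => p - q);
  const ys = [...b].sort((p, q) => p - q);
  let i = 0, j = 0, d = 0;
  while (i < xs.length && j < ys.length) {
    const v = Math.min(xs[i], ys[j]);
    // Step past every value equal to v in both samples so ties move together.
    while (i < xs.length && xs[i] === v) i++;
    while (j < ys.length && ys[j] === v) j++;
    d = Math.max(d, Math.abs(i / xs.length - j / ys.length));
  }
  return d;
}

// Hypothetical trigger names; thresholds are the ones quoted in the text.
function driftTriggers({ prevScores, currScores, prevEce, currEce, prevBrier, currBrier }) {
  const triggers = [];
  if (ksStatistic(prevScores, currScores) > 0.10) triggers.push('KS_SCORE_DISTRIBUTION');
  if (currEce - prevEce >= 0.03) triggers.push('ECE_JUMP');
  // Assumption: "regression" means a relative increase in Brier score.
  if (prevBrier > 0 && (currBrier - prevBrier) / prevBrier >= 0.15) triggers.push('BRIER_REGRESSION');
  return triggers;
}

const fired = driftTriggers({
  prevScores: [0.2, 0.4, 0.6, 0.8],
  currScores: [0.6, 0.7, 0.8, 0.9],
  prevEce: 0.02, currEce: 0.06,     // jump of 0.04
  prevBrier: 0.10, currBrier: 0.11, // 10 % worse, below the 15 % line
});
console.log(fired); // [ 'KS_SCORE_DISTRIBUTION', 'ECE_JUMP' ]
```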
We do not currently offer a financially backed Brier-score guarantee. Such a guarantee would require third-party reinsurance underwriting, which is not in place. We document this here so procurement does not arrive expecting a refund mechanism that does not exist. The full white paper (target FTML 2026) and conformal-coverage badge are the evidence layer customers can verify offline.
Frankfurt-only. End to end.
The trace lands at our Frankfurt eu-central-1 edge. The PII scrubber and all three pillars run in-region — Pillar 3 embedding generation is served by a self-hosted TEI sidecar in the same Frankfurt Fly.io region (replaces the prior OpenAI Embeddings call, retired 2026-05-11). The vector search runs against MongoDB Atlas Frankfurt. No customer data leaves the EU.
| # | Step | Region | External call |
|---|---|---|---|
| 01 | Trace ingest + PII scrub | Frankfurt eu-central-1 | — |
| 02 | Pillars 1 + 2 — Base + Variance | Frankfurt eu-central-1 | — |
| 03 | Pillar 3 — embedding generation | Frankfurt eu-central-1 | TEI sidecar (Fly.io, in-region) · all-MiniLM-L6-v2 |
| 04 | Pillar 3 — vector neighbour lookup | MongoDB Atlas Frankfurt | — |
| 05 | Triangulate + flags + suggestedStatus | Frankfurt eu-central-1 | — |
Article 13 + 14, satisfied per trace.
Article 13 demands the deployer can read how an output was produced. Every trace persists the final score and any raised flags; the three-pillar weights and thresholds are public, so the algorithm itself is auditable. Article 14 demands human-oversight feasibility — the suggestedStatus mechanism auto-routes flagged and escalated traces to the Review Queue.
| Article | Obligation | Confidence Engine artefact |
|---|---|---|
| Article 13 | Deployer must be able to read how outputs are produced. | confidenceScore (Float 0–1) + tags (flags) persisted per trace; weights + thresholds documented at docs.adjudon.com/concepts/traces-and-confidence. |
| Article 14 | Human oversight on AI decisions must be feasible. | A suggestedStatus of 'flagged' or 'escalated' auto-routes the trace into the Review Queue. |
| Article 13(3)(b)(ii) | Instructions for use must declare accuracy, foreseeable impacts, and methodology. | Article 13 IFU generator produces JSON + Markdown with declared accuracy, reliability diagram, and methodology references — downloadable from /dashboard/calibration. |
| Article 10(2)(g) | Examination in view of possible biases. | Bias-stratified report computes per-protected-group Brier + ECE on customer-supplied opaque labels. Max-between-group ECE gap surfaced as the bias signal. |
| Article 15(4) | High-risk systems must be resilient to errors and drift. | Three drift triggers nightly: Kolmogorov-Smirnov, ECE jump m-o-m, Brier regression. Persists CalibrationDriftAlert + dispatches calibration.drift_detected webhook. |
The model says 0.55. We don't take its word — and we don't take ours either.
Three pillars at compute-time, then calibration, coverage, drift, bias. Send a test trace — first call lands at the engineer who built the engine, not an SDR.