Triangulated, not trusted

Don't ask the model. Measure it.

The model's self-report is one signal. We add two more — variance against the next-best alternative, and history against your past traces.

Three pillars · Per-org calibration · Conformal coverage
Engine output · compute-time
{
  "traceId":         "trc_847b3f...",
  "confidenceScore": 0.46,
  "pillars": {
    "base":       0.55,
    "variance":   0.25,
    "historical": 0.55
  },
  "flags":           ["LOW_CONFIDENCE", "HIGH_AMBIGUITY"],
  "suggestedStatus": "flagged"
}
Only confidenceScore and flags persist on the trace. Pillars are computed at ingest, then discarded — the algorithm and weights are documented and stable, so the breakdown is always reproducible from the same inputs.

Three pillars. Three weights. One score.

The model's self-reported confidence is a sample of how the model feels — not a measurement. Adjudon takes it as one signal of three. The other two: how clear the top decision was vs the next-best alternative, and how often a near-identical past decision succeeded. Four blocks, one ingestion.

01 Pillar — Base Probability · weight 40 %

The first pillar is the model's own confidence — read off the trace, defaulted to 0.5 when the model didn't report. It carries 40 % of the final score. The other 60 % is what we measure.

// Pillar 1 — Base Probability  ·  weight 40 %
const rawBase = trace.outputDecision?.confidenceScore
              ?? trace.confidence;
const base = rawBase != null ? parseFloat(rawBase) : 0.5;
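
As a minimal sketch of the fallback behaviour, the snippet above can be wrapped in a function and run against a few hand-made traces. The function name `baseProbability` and the sample traces are illustrative assumptions, not part of the engine.

```javascript
// Hypothetical wrapper around the Pillar 1 snippet; the name is illustrative.
function baseProbability(trace) {
  const rawBase = trace.outputDecision?.confidenceScore ?? trace.confidence;
  // `??` only falls through on null/undefined, so a reported 0 stays 0.
  return rawBase != null ? parseFloat(rawBase) : 0.5;
}

// Made-up sample traces for illustration.
console.log(baseProbability({ outputDecision: { confidenceScore: '0.82' } })); // 0.82
console.log(baseProbability({ confidence: 0.7 }));                             // 0.7
console.log(baseProbability({ outputDecision: { confidenceScore: 0 } }));      // 0
console.log(baseProbability({}));                                              // 0.5
```

A reported confidence of 0 is kept rather than replaced by the default, because the nullish operator treats 0 as a real value.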

02 Pillar — Variance · weight 30 %

The second pillar measures how clear the top decision was vs the next-best alternative, read from the first entry in the trace's alternatives list. Wide gap, decisive. Tied vote, ambiguous. When no alternatives are provided, the pillar defaults to 0.8: the model wasn't asked to enumerate, so we don't punish it.

// Pillar 2 — Variance  ·  weight 30 %
const altScore = trace.alternatives?.[0]?.confidence ?? 0;
const delta    = Math.max(0, base - altScore);
const variance = trace.alternatives?.length
  ? Math.min(1.0, 0.5 + delta * 1.5)
  : 0.8;
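
The three cases described above (wide gap, tie, no alternatives) can be checked with a small sketch. The wrapper name `variancePillar` is hypothetical, and the sketch assumes the alternatives list is ordered with the next-best candidate first, since the snippet reads index 0.

```javascript
// Hypothetical wrapper around the Pillar 2 snippet; `base` is passed in.
function variancePillar(base, alternatives) {
  const altScore = alternatives?.[0]?.confidence ?? 0;
  const delta = Math.max(0, base - altScore);
  return alternatives?.length ? Math.min(1.0, 0.5 + delta * 1.5) : 0.8;
}

console.log(variancePillar(0.9, [{ confidence: 0.1 }])); // wide gap, capped: 1
console.log(variancePillar(0.5, [{ confidence: 0.5 }])); // tie, delta is 0: 0.5
console.log(variancePillar(0.9, undefined));             // no alternatives: 0.8
console.log(variancePillar(0.9, []));                    // empty list: 0.8
```

An empty alternatives array is treated the same as a missing one, because a length of 0 is falsy.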

03 Pillar — Historical Precedent · weight 30 %

The third pillar looks up the three closest past traces by input-vector cosine similarity (≥ 0.7) and takes the share that succeeded, meaning no human override and no flagged status. With zero similar traces it returns 0.6 plus a NOVEL_SITUATION flag. The absence of precedent is itself a signal.

// Pillar 3 — Historical Precedent  ·  weight 30 %
const text     = [trace.triggeringCondition, trace.inputContext]
                   .filter(Boolean).join(' ');
const queryVec = await vectorMemory.generateEmbedding(text);
const similar  = await vectorMemory.findSimilarTraces(orgId, queryVec, {
  topK: 3, minScore: 0.7, workspaceId
});

const historical = similar.length
  ? similar.filter(t => !t.humanOverride && t.status !== 'flagged').length / similar.length
  : 0.6;  // novel situation
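
To make the lookup concrete without a vector database, here is a minimal in-memory sketch of the same idea. `findSimilarInMemory` is a hypothetical stand-in for `vectorMemory.findSimilarTraces`, the two-dimensional vectors are made up, and returning a `novel` marker is one assumed way to carry the NOVEL_SITUATION signal.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Hypothetical in-memory stand-in for vectorMemory.findSimilarTraces.
function findSimilarInMemory(pastTraces, queryVec, { topK = 3, minScore = 0.7 } = {}) {
  return pastTraces
    .map(t => ({ ...t, score: cosine(t.vector, queryVec) }))
    .filter(t => t.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Share of neighbours with no human override and no flagged status.
function historicalPillar(similar) {
  if (similar.length === 0) return { historical: 0.6, novel: true };
  const ok = similar.filter(t => !t.humanOverride && t.status !== 'flagged').length;
  return { historical: ok / similar.length, novel: false };
}

// Made-up history: three near neighbours (one flagged) and one unrelated trace.
const history = [
  { vector: [1, 0],     status: 'success', humanOverride: false },
  { vector: [0.9, 0.1], status: 'flagged', humanOverride: false },
  { vector: [0.8, 0.2], status: 'success', humanOverride: false },
  { vector: [0, 1],     status: 'success', humanOverride: false },
];
console.log(historicalPillar(findSimilarInMemory(history, [1, 0])));        // 2 of 3 succeeded
console.log(historicalPillar(findSimilarInMemory(history.slice(0, 1), [0, 1]))); // no neighbours
```

With only three neighbours, the pillar can take the values 0, 1/3, 2/3, or 1 when precedent exists, plus 0.6 for the novel case.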

04 Triangulation, flags, status

The three pillars combine in a 40/30/30 weighted sum. A score under 0.6 raises LOW_CONFIDENCE; variance under 0.3 raises HIGH_AMBIGUITY. The suggested status follows: a score under 0.4 is escalated; a score under 0.7, or any raised flag, is flagged; everything else is success. The Policy Engine reads this status next: the HTTP code is its decision, not ours.

// Triangulation
const score = base * 0.4 + variance * 0.3 + historical * 0.3;

// Flags
const flags = [];
if (score < 0.6)    flags.push('LOW_CONFIDENCE');
if (variance < 0.3) flags.push('HIGH_AMBIGUITY');

// suggestedStatus (the Policy Engine reads this next)
const suggestedStatus =
    score < 0.4                       ? 'escalated'
  : (score < 0.7 || flags.length > 0) ? 'flagged'
  :                                     'success';
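
As a worked check, the sketch below feeds the pillar values from the sample engine output (0.55, 0.25, 0.55) through the same weights and thresholds. The function name `triangulate` is illustrative, and rounding the score to two decimals is an assumption made only to match how the sample output is printed.

```javascript
// Illustrative wrapper around the triangulation, flag, and status rules above.
function triangulate({ base, variance, historical }) {
  const score = base * 0.4 + variance * 0.3 + historical * 0.3;

  const flags = [];
  if (score < 0.6)    flags.push('LOW_CONFIDENCE');
  if (variance < 0.3) flags.push('HIGH_AMBIGUITY');

  const suggestedStatus =
      score < 0.4                       ? 'escalated'
    : (score < 0.7 || flags.length > 0) ? 'flagged'
    :                                     'success';

  // Assumption: round to two decimals for display, as in the sample output.
  return { confidenceScore: Math.round(score * 100) / 100, flags, suggestedStatus };
}

// 0.55 * 0.4 + 0.25 * 0.3 + 0.55 * 0.3 = 0.22 + 0.075 + 0.165 = 0.46
console.log(triangulate({ base: 0.55, variance: 0.25, historical: 0.55 }));
// A decisive, well-precedented trace for contrast (made-up values).
console.log(triangulate({ base: 0.95, variance: 1.0, historical: 1.0 }));
```

The first call reproduces the sample: a score of 0.46, both flags, and a suggested status of flagged.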

Then we calibrate it

The score isn't the verdict. The calibrated score is.

A model says 90 %. Out of 100 such decisions, how many were correct? Maybe 60. We measure this per organisation, on your reviews.

01 What we measure

Brier · ECE · Reliability diagram

Brier (Glenn Brier, 1950) is the mean squared error between predicted confidence and outcome. ECE (expected calibration error) bins decisions by confidence and averages the gap between mean confidence and observed accuracy in each bin, weighted by bin size. The reliability diagram is the canonical visualisation (Niculescu-Mizil & Caruana, ICML 2005), with 95 % Wilson CIs (Edwin Wilson, Harvard, 1927) on every point. We read a Brier score below 0.10 as excellent and an ECE below 0.03 as well-calibrated.
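
The three metrics can be sketched in a few lines using their standard textbook definitions. The function names, the choice of 10 equal-width bins, and the sample data (100 decisions at 0.9 confidence, 60 correct) are assumptions for illustration, not the engine's implementation.

```javascript
// Brier score: mean squared error between predicted confidence and 0/1 outcome.
function brierScore(preds, outcomes) {
  const total = preds.reduce((acc, p, i) => acc + (p - outcomes[i]) ** 2, 0);
  return total / preds.length;
}

// ECE with equal-width bins: size-weighted mean of |accuracy - mean confidence|.
function expectedCalibrationError(preds, outcomes, bins = 10) {
  const stats = Array.from({ length: bins }, () => ({ n: 0, conf: 0, hits: 0 }));
  preds.forEach((p, i) => {
    const b = stats[Math.min(bins - 1, Math.floor(p * bins))];
    b.n += 1;
    b.conf += p;
    b.hits += outcomes[i];
  });
  return stats.reduce((ece, b) =>
    b.n === 0 ? ece : ece + (b.n / preds.length) * Math.abs(b.hits / b.n - b.conf / b.n), 0);
}

// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95 %).
function wilsonInterval(hits, n, z = 1.96) {
  const pHat = hits / n;
  const denom = 1 + (z * z) / n;
  const centre = (pHat + (z * z) / (2 * n)) / denom;
  const half = (z * Math.sqrt((pHat * (1 - pHat)) / n + (z * z) / (4 * n * n))) / denom;
  return [centre - half, centre + half];
}

// Made-up data: 100 decisions at 0.9 confidence, of which 60 were correct.
const samplePreds = Array(100).fill(0.9);
const sampleOutcomes = Array.from({ length: 100 }, (_, i) => (i < 60 ? 1 : 0));
console.log(brierScore(samplePreds, sampleOutcomes));               // ≈ 0.33
console.log(expectedCalibrationError(samplePreds, sampleOutcomes)); // ≈ 0.30
console.log(wilsonInterval(60, 100));                               // ≈ [0.502, 0.691]
```

On this data the model is overconfident by 30 points, and both numbers sit far above the 0.10 and 0.03 thresholds.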

02 How we correct it

Per-org isotonic calibration map

Pool Adjacent Violators (Ayer-Brunk-Ewing-Reid-Silverman, 1955) builds an order-preserving correction curve from your reviewed decisions. We blend it with a cross-org prior via hierarchical shrinkage w_org = n_org / (n_org + 500) — at 50 reviews you pull 91 % from the prior; at 5 000 reviews, 91 % from your own data. Refit nightly at 03:30 UTC.
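
A minimal sketch of the two ingredients follows: a textbook Pool Adjacent Violators fit and the shrinkage weight from the formula above. How the org curve and the prior are actually combined is not specified here, so the `blend` function assumes a simple convex combination of calibrated values; all names and sample data are hypothetical.

```javascript
// Pool Adjacent Violators: fit a non-decreasing step function to
// (score, outcome) pairs, where outcome is 1 for correct and 0 for wrong.
function poolAdjacentViolators(points) {
  const sorted = [...points].sort((a, b) => a.score - b.score);
  const blocks = [];
  for (const { score, outcome } of sorted) {
    blocks.push({ sum: outcome, n: 1, upTo: score });
    // Merge backwards while the previous block's mean exceeds the last one's.
    while (blocks.length > 1) {
      const last = blocks[blocks.length - 1];
      const prev = blocks[blocks.length - 2];
      if (prev.sum / prev.n <= last.sum / last.n) break;
      blocks.pop();
      prev.sum += last.sum;
      prev.n += last.n;
      prev.upTo = last.upTo;
    }
  }
  return blocks.map(b => ({ upTo: b.upTo, calibrated: b.sum / b.n }));
}

// Shrinkage weight from the text: w_org = n_org / (n_org + 500).
function orgWeight(nOrg, k = 500) {
  return nOrg / (nOrg + k);
}

// Assumed blend: convex combination of org-level and prior calibrated values.
function blend(orgValue, priorValue, nOrg) {
  const w = orgWeight(nOrg);
  return w * orgValue + (1 - w) * priorValue;
}

// Made-up reviews: the 0.2 / 0.3 pair violates monotonicity and gets pooled.
const reviews = [
  { score: 0.1, outcome: 0 }, { score: 0.2, outcome: 1 },
  { score: 0.3, outcome: 0 }, { score: 0.4, outcome: 1 },
  { score: 0.5, outcome: 1 },
];
console.log(poolAdjacentViolators(reviews));
console.log(orgWeight(50).toFixed(3), orgWeight(5000).toFixed(3)); // 0.091 0.909
```

The weights confirm the figures in the text: at 50 reviews about 9 % of the blend comes from your own data, and at 5 000 reviews about 91 % does.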

03 What a regulator can verify offline

Conformal coverage badge

Vovk conformal prediction (Algorithmic Learning in a Random World, 2005) is the only confidence construct in the engine that comes with a theorem rather than a heuristic: distribution-free, finite-sample, decidable. We publish an embeddable SVG badge at /badge.svg?alpha=0.1, publicly cacheable, showing your live empirical coverage against the target α.
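
For readers who want to see what "empirical coverage at α" means in practice, here is a sketch of the standard split-conformal recipe. It is an assumption that the engine uses this particular variant, and the nonconformity scores are taken as given numbers because their definition is not stated here; the function names and data are hypothetical.

```javascript
// Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
// calibration score. Under exchangeability, a fresh score falls at or below
// this threshold with probability at least 1 - alpha.
function conformalThreshold(calScores, alpha) {
  const n = calScores.length;
  const sorted = [...calScores].sort((a, b) => a - b);
  const k = Math.ceil((n + 1) * (1 - alpha));
  return k > n ? Infinity : sorted[k - 1];
}

// Empirical coverage: the share of held-out scores at or below the threshold.
function empiricalCoverage(testScores, threshold) {
  return testScores.filter(s => s <= threshold).length / testScores.length;
}

// Made-up nonconformity scores (integers keep the arithmetic exact).
const calScores = Array.from({ length: 20 }, (_, i) => i + 1); // 1..20
const q = conformalThreshold(calScores, 0.1);                  // 19th smallest
const heldOut = [2, 4, 6, 8, 10, 12, 14, 16, 18, 30];
console.log(q, empiricalCoverage(heldOut, q));                 // 19 0.9
```

With too few calibration points the threshold becomes infinite, which is the honest answer: the guarantee cannot be met at that α without covering everything.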

04 What we surface as failures

Drift detection · Bias stratification

Three drift triggers run nightly at 02:30 UTC: Kolmogorov-Smirnov on score distribution (D > 0.10), ECE jump month-over-month (≥ 0.03), and Brier-score regression (≥ 15 %). Bias stratification computes per-protected-group Brier + ECE on customer-supplied opaque labels — Adjudon never infers group membership from input context. EU AI Act Art. 10(2)(g) + 15(4) evidence.
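
The three triggers can be sketched as follows. The two-sample Kolmogorov-Smirnov statistic is the standard definition; reading the 15 % Brier threshold as a relative increase and the ECE threshold as an absolute month-over-month difference are assumptions, as are the function and field names.

```javascript
// Two-sample KS statistic: largest gap between the two empirical CDFs.
function ksStatistic(a, b) {
  const x = [...a].sort((p, q) => p - q);
  const y = [...b].sort((p, q) => p - q);
  let i = 0, j = 0, d = 0;
  while (i < x.length && j < y.length) {
    const v = Math.min(x[i], y[j]);
    while (i < x.length && x[i] <= v) i++;
    while (j < y.length && y[j] <= v) j++;
    d = Math.max(d, Math.abs(i / x.length - j / y.length));
  }
  return d;
}

// Thresholds from the text; field names are hypothetical.
function driftTriggers({ ksD, ecePrev, eceNow, brierPrev, brierNow }) {
  return {
    scoreDistribution: ksD > 0.10,
    eceJump: eceNow - ecePrev >= 0.03,
    // Assumption: "15 %" means a relative increase over the previous value.
    brierRegression: brierPrev > 0 && (brierNow - brierPrev) / brierPrev >= 0.15,
  };
}

// Made-up score samples from two periods with clearly shifted distributions.
const lastMonth = [0.1, 0.2, 0.3, 0.4];
const thisMonth = [0.6, 0.7, 0.8, 0.9];
const ksD = ksStatistic(lastMonth, thisMonth); // fully separated samples: 1
console.log(driftTriggers({ ksD, ecePrev: 0.03, eceNow: 0.08, brierPrev: 0.10, brierNow: 0.12 }));
```

Each trigger is evaluated independently, so a single nightly run can report any combination of the three.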

Honest scope

We do not currently offer a financially-backed Brier-score guarantee. Such a guarantee would require third-party reinsurance underwriting which is not in place. We document this here so procurement does not arrive expecting a refund mechanism that does not exist. The full white paper (target FTML 2026) and conformal-coverage badge are the evidence layer customers can verify offline.

Frankfurt-only. End to end.

The trace lands at our Frankfurt eu-central-1 edge. The PII scrubber and all three pillars run in-region — Pillar 3 embedding generation is served by a self-hosted TEI sidecar in the same Frankfurt Fly.io region (replaces the prior OpenAI Embeddings call, retired 2026-05-11). The vector search runs against MongoDB Atlas Frankfurt. No customer data leaves the EU.

# | Step | Region | External call
01 | Trace ingest + PII scrub | Frankfurt eu-central-1 | -
02 | Pillars 1 + 2 — Base + Variance | Frankfurt eu-central-1 | -
03 | Pillar 3 — embedding generation | Frankfurt eu-central-1 | TEI sidecar (Fly.io, in-region) · all-MiniLM-L6-v2
04 | Pillar 3 — vector neighbour lookup | MongoDB Atlas Frankfurt | -
05 | Triangulate + flags + suggestedStatus | Frankfurt eu-central-1 | -

Article 13 + 14, satisfied per trace.

Article 13 demands the deployer can read how an output was produced. Every trace persists the final score and any raised flags; the three-pillar weights and thresholds are public, so the algorithm itself is auditable. Article 14 demands human-oversight feasibility — the suggestedStatus mechanism auto-routes flagged and escalated traces to the Review Queue.

Article | Obligation | Confidence Engine artefact
Article 13 | Deployer must be able to read how outputs are produced. | confidenceScore (Float 0–1) + tags (flags) persisted per trace; weights + thresholds documented at docs.adjudon.com/concepts/traces-and-confidence.
Article 14 | Human oversight on AI decisions must be feasible. | suggestedStatus = 'flagged' | 'escalated' auto-routes the trace into the Review Queue.
Article 13(3)(b)(ii) | Instructions for use must declare accuracy, foreseeable impacts, and methodology. | Article 13 IFU generator produces JSON + Markdown with declared accuracy, reliability diagram, and methodology references — downloadable from /dashboard/calibration.
Article 10(2)(g) | Examination in view of possible biases. | Bias-stratified report computes per-protected-group Brier + ECE on customer-supplied opaque labels. Max-between-group ECE gap surfaced as the bias signal.
Article 15(4) | High-risk systems must be resilient to errors and drift. | Three drift triggers nightly: Kolmogorov-Smirnov, ECE jump m-o-m, Brier regression. Persists CalibrationDriftAlert + dispatches calibration.drift_detected webhook.

The model says 0.55. We don't take its word — and we don't take ours either.

Three pillars at compute-time, then calibration, coverage, drift, bias. Send a test trace — first call lands at the engineer who built the engine, not an SDR.