03 · HALLUCINATION LAB · EVAL RUN ACTIVE

DEMO DATA · anonymised, synthetic scores

Where AI claims meet evidence

Hallucinations are measured, not assumed. Every finding is reproduced under a fixed seed, scored, and tracked across model versions until it is mitigated or accepted as a known limit.
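A minimal sketch of what "reproduced under a fixed seed, scored" could look like in practice. The function names, the case IDs, and the seed value are hypothetical and not taken from the lab's actual harness; the sketch only shows that a private seeded RNG yields the same sample on every run, so a finding can be re-scored against later model versions.

```python
import random


def sample_cases(case_ids, k, seed):
    """Draw the same k cases on every run by seeding a private RNG."""
    rng = random.Random(seed)  # private RNG, does not touch global state
    return rng.sample(case_ids, k)


def hallucination_rate(labels):
    """Fraction of graded answers labelled as hallucinated (True)."""
    return sum(labels) / len(labels) if labels else 0.0


# Hypothetical case IDs and seed, for illustration only.
case_ids = [f"case-{i:03d}" for i in range(100)]
run_1 = sample_cases(case_ids, 10, seed=1234)
run_2 = sample_cases(case_ids, 10, seed=1234)

# Grader labels would come from a human or automated grader; these are made up.
example_labels = [True, False, False, False]
example_rate = hallucination_rate(example_labels)
```

Using `random.Random(seed)` rather than `random.seed(seed)` keeps the sample stable even if other code in the process draws random numbers.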

MODEL SCOREBOARD · GROUNDED QA SUITE · seed pinned · sample

Sample model scoreboard: grounded accuracy, hallucination rate, calibration, and verdict. Values are illustrative.

MODEL      GROUNDED ACC   HALLUC RATE   CALIBRATION   VERDICT
Model A    94.1%          2.1%          0.88          PASS
Model B    91.7%          4.8%          0.81          PASS
Model C    87.3%          9.4%          0.69          WATCH
Baseline   78.0%          18.4%         0.51          FAIL

LOGGED FINDINGS

HL-031 · MITIGATED

Fabricated API methods on unfamiliar SDKs

BASELINE RATE · 18.4%

Grounding generation with retrieved type stubs sharply reduced fabricated methods. The result was verified across several SDKs.
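One way a fabricated method could be detected against a type stub is sketched below. This is an illustrative check, not the lab's actual pipeline: the `Client` stub, the `client` receiver name, and both helper functions are hypothetical. It parses a stub with Python's `ast` module, collects the methods the class declares, and flags any method called on the receiver that the stub does not contain.

```python
import ast

# Hypothetical type stub for an unfamiliar SDK.
STUB = '''
class Client:
    def connect(self, url: str) -> None: ...
    def send(self, payload: bytes) -> int: ...
'''


def stub_methods(stub_source, class_name):
    """Return the method names that `class_name` declares in the stub."""
    tree = ast.parse(stub_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {n.name for n in node.body if isinstance(n, ast.FunctionDef)}
    return set()


def fabricated_calls(generated_code, receiver, known):
    """Return methods called on `receiver` that are absent from `known`."""
    missing = set()
    for node in ast.walk(ast.parse(generated_code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == receiver
            and node.func.attr not in known
        ):
            missing.add(node.func.attr)
    return missing


known = stub_methods(STUB, "Client")
generated = "client.connect('https://example.invalid')\nclient.send_async(b'hi')\n"
flagged = fabricated_calls(generated, "client", known)
```

Here `flagged` contains only `send_async`, the one call the stub does not declare. The check is purely syntactic and only covers direct calls on a named receiver, so it is a lower bound on fabrication rather than a full verifier.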

HL-030 · OPEN

Confident wrong dates in document summarization

BASELINE RATE · 7.9%

The model gives no reliable internal signal for these errors. The only mitigation so far is pinning every claim to a source span.
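Source-pinning can be sketched as attaching a character span and a verbatim quote to each claim, then rejecting any claim whose quote is not found at that span. The `PinnedClaim` type, the `is_pinned` helper, and the sample sentence are assumptions made for illustration, not the lab's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedClaim:
    text: str   # the claim made in the summary
    start: int  # start offset of the supporting span in the source
    end: int    # end offset (exclusive)
    quote: str  # verbatim text the claim says is at that span


def is_pinned(claim, source):
    """True only if the quoted span exists in the source at the stated offsets."""
    in_bounds = 0 <= claim.start < claim.end <= len(source)
    return in_bounds and source[claim.start:claim.end] == claim.quote


# Made-up source text for the example.
source = "The contract was signed on 12 March 2021 in Oslo."
quote = "12 March 2021"
start = source.index(quote)

good = PinnedClaim("Signed on 12 March 2021.", start, start + len(quote), quote)
bad = PinnedClaim("Signed on 21 March 2012.", start, start + len(quote), "21 March 2012")

good_ok = is_pinned(good, source)
bad_ok = is_pinned(bad, source)
```

The check confirms that the quoted evidence exists where stated; it does not confirm that the claim text agrees with the quote, which would need a separate comparison step.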

HL-029 · MONITORING

Silent unit-conversion errors in numeric reasoning

BASELINE RATE · 11.2%

Offloading conversions to a tool reduces these errors but does not eliminate them. Tracked across model versions.
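As an illustration of tool-use offload, the sketch below shows a deterministic length converter that a model could call instead of doing the arithmetic itself. The `convert_length` function and its unit table are hypothetical; the conversion factors are the standard definitions (1 in = 0.0254 m, 1 ft = 0.3048 m).

```python
# Factors to convert each supported unit to metres.
TO_METRES = {
    "mm": 0.001,
    "cm": 0.01,
    "m": 1.0,
    "km": 1000.0,
    "in": 0.0254,
    "ft": 0.3048,
}


def convert_length(value, from_unit, to_unit):
    """Convert a length via metres; unknown units fail loudly, not silently."""
    if from_unit not in TO_METRES or to_unit not in TO_METRES:
        raise ValueError(f"unsupported unit: {from_unit!r} -> {to_unit!r}")
    return value * TO_METRES[from_unit] / TO_METRES[to_unit]


km_in_m = convert_length(1, "km", "m")
inches_in_ft = convert_length(12, "in", "ft")
```

A tool like this removes arithmetic slips, but it cannot help if the model passes the wrong unit or never calls the tool, which is one plausible reason the error rate falls without reaching zero.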

HL-028 · MITIGATED

Invented citations in literature synthesis

BASELINE RATE · 5.3%

Mitigated by a retrieval-or-refuse policy plus a citation-existence check at the gate.
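A rough sketch of how a retrieval-or-refuse policy and a citation-existence check might combine at an output gate. The `gate` function, the refusal message, and the ID scheme are all hypothetical; a real citation index would typically be a bibliographic database rather than an in-memory set.

```python
REFUSAL = "No supporting source was retrieved, so no answer is given."


def gate(answer, cited_ids, retrieved_ids, index_ids):
    """Release the answer only if retrieval succeeded and every citation exists."""
    # Retrieval-or-refuse: with nothing retrieved, refuse rather than guess.
    if not retrieved_ids:
        return REFUSAL
    # Citation-existence check: every cited ID must be present in the index.
    missing = sorted(set(cited_ids) - set(index_ids))
    if missing:
        return "Blocked: unverifiable citations: " + ", ".join(missing)
    return answer


# Made-up IDs for illustration.
index_ids = {"ref-001", "ref-002", "ref-003"}

released = gate("Summary text.", ["ref-001"], ["ref-001", "ref-002"], index_ids)
blocked = gate("Summary text.", ["ref-001", "ref-999"], ["ref-001"], index_ids)
refused = gate("Summary text.", ["ref-001"], [], index_ids)
```

The existence check only establishes that a cited work is real. Whether the work supports the sentence citing it is a separate question that this gate does not address.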