03 · HALLUCINATION LAB · EVAL RUN ACTIVE
DEMO DATA · anonymised, synthetic scores
Where AI claims meet evidence
Hallucinations are measured, not assumed. Every finding is reproduced under a fixed seed, scored, and tracked across model versions until it is mitigated or accepted as a known limit.
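As a minimal sketch of what reproducing a finding under a fixed seed can look like, the snippet below draws an evaluation sample from a private random generator. The function name `sample_eval_items` and the seed value are assumptions for illustration, not part of the lab's tooling.

```python
import random

def sample_eval_items(items, k, seed):
    """Draw k items deterministically: the same seed always yields the same sample."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(items, k)

# Hypothetical question ids; two runs with the same seed pick the same items.
items = [f"q{i}" for i in range(100)]
first = sample_eval_items(items, 5, seed=1234)
second = sample_eval_items(items, 5, seed=1234)
assert first == second
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the sample stable even if other code consumes the global generator in between.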
MODEL SCOREBOARD · GROUNDED QA SUITE · seed pinned · sample
| MODEL | GROUNDED ACC | HALLUC RATE | CALIBRATION | VERDICT |
|---|---|---|---|---|
| Model A | 94.1% | 2.1% | 0.88 | PASS |
| Model B | 91.7% | 4.8% | 0.81 | PASS |
| Model C | 87.3% | 9.4% | 0.69 | WATCH |
| Baseline | 78.0% | 18.4% | 0.51 | FAIL |
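
One way scoreboard columns like these could be derived from per-answer labels is sketched below. The three-way label scheme (`grounded`, `hallucinated`, `abstained`) and the verdict cut-offs are assumptions chosen only to be consistent with the sample rows above; the page does not state the actual scoring rules, and the calibration column is not modelled here.

```python
from collections import Counter

def score_run(labels):
    """Return (grounded accuracy, hallucination rate) as percentages.

    Each label is "grounded", "hallucinated", or "abstained" (assumed scheme).
    """
    if not labels:
        raise ValueError("no labels to score")
    counts = Counter(labels)
    total = len(labels)
    grounded_acc = 100.0 * counts["grounded"] / total
    halluc_rate = 100.0 * counts["hallucinated"] / total
    return grounded_acc, halluc_rate

def verdict(halluc_rate, pass_below=5.0, fail_at=15.0):
    """Map a hallucination rate to PASS / WATCH / FAIL using hypothetical cut-offs."""
    if halluc_rate < pass_below:
        return "PASS"
    if halluc_rate >= fail_at:
        return "FAIL"
    return "WATCH"

labels = ["grounded"] * 8 + ["hallucinated"] + ["abstained"]
acc, rate = score_run(labels)
print(acc, rate, verdict(rate))  # 80.0 10.0 WATCH
```

Because abstentions count toward neither column, accuracy and hallucination rate need not sum to 100%, which matches the shape of the table.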
LOGGED FINDINGS
HL-031 · MITIGATED
Fabricated API methods on unfamiliar SDKs
BASELINE RATE 18.4%
Grounding generation in retrieved type stubs cut fabrication sharply. Verified across several SDKs.
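
A related check can be sketched with the standard `ast` module: parse a type stub, collect the methods it declares, and flag any method the generated code calls that the stub does not contain. This is a post-hoc verification under stated assumptions, not the lab's grounding pipeline; the `Client` class, its methods, and the helper names are hypothetical.

```python
import ast

def stub_methods(stub_source, class_name):
    """Collect method names declared on `class_name` in a type stub."""
    for node in ast.walk(ast.parse(stub_source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
    raise LookupError(f"class {class_name!r} not found in stub")

def fabricated_calls(generated_code, receiver, known_methods):
    """Return sorted method names called on `receiver` that the stub does not declare."""
    unknown = set()
    for node in ast.walk(ast.parse(generated_code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == receiver
            and node.func.attr not in known_methods
        ):
            unknown.add(node.func.attr)
    return sorted(unknown)

# Hypothetical stub and model output for illustration.
STUB = '''
class Client:
    def get_item(self, item_id: str) -> dict: ...
    def list_items(self) -> list: ...
'''

GENERATED = '''
client.get_item("a1")
client.bulk_delete(["a1", "a2"])
'''

print(fabricated_calls(GENERATED, "client", stub_methods(STUB, "Client")))  # ['bulk_delete']
```

The sketch only matches direct calls on a named variable; chained or aliased receivers would need type inference to resolve.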
HL-030 · OPEN
Confident wrong dates in document summarization
BASELINE RATE 7.9%
The model exposes no reliable internal signal for these errors; the only working mitigation is pinning every claim to a source span.
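
Pinning a claim to a span can be sketched as storing character offsets alongside the quoted text and rejecting any claim whose span does not match the source. The `PinnedClaim` record and its field names are assumptions for illustration. The check confirms only that the quoted span exists verbatim, not that the claim follows from it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PinnedClaim:
    text: str   # the claim as it appears in the summary
    start: int  # character offset where the supporting span begins
    end: int    # character offset one past the end of the span
    quote: str  # text the claim says it is quoting from the source

def unsupported_claims(source, claims):
    """Return claims whose span is out of range or does not match the quoted text."""
    bad = []
    for claim in claims:
        in_range = 0 <= claim.start < claim.end <= len(source)
        if not in_range or source[claim.start:claim.end] != claim.quote:
            bad.append(claim)
    return bad

# Hypothetical source sentence and two claims, one correctly pinned.
source = "The contract was signed on 12 March 2021 in Oslo."
start = source.index("12 March 2021")
good = PinnedClaim("Signed 12 March 2021.", start, start + 13, "12 March 2021")
wrong = PinnedClaim("Signed 21 March 2021.", start, start + 13, "21 March 2021")
assert unsupported_claims(source, [good, wrong]) == [wrong]
```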
HL-029 · MONITORING
Silent unit-conversion errors in numeric reasoning
BASELINE RATE 11.2%
Offloading conversions to a tool reduces these errors but does not eliminate them. Tracking across model versions.
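
A conversion tool of the kind a model might call instead of doing the arithmetic itself could look like the sketch below. The unit table and the `convert` name are assumptions for illustration; the key property is that unknown units and mixed dimensions raise an error rather than silently returning a number.

```python
# Factors to a base unit per dimension (metres for length, kilograms for mass).
_TO_BASE = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "cm": ("length", 0.01),
    "mi": ("length", 1609.344),
    "kg": ("mass", 1.0),
    "g": ("mass", 0.001),
    "lb": ("mass", 0.45359237),
}

def convert(value, from_unit, to_unit):
    """Convert between units, refusing unknown units and mixed dimensions."""
    try:
        from_dim, from_factor = _TO_BASE[from_unit]
        to_dim, to_factor = _TO_BASE[to_unit]
    except KeyError as exc:
        raise ValueError(f"unknown unit: {exc.args[0]}") from None
    if from_dim != to_dim:
        raise ValueError(f"cannot convert {from_dim} to {to_dim}")
    return value * from_factor / to_factor

print(convert(5, "km", "m"))  # 5000.0
```

A tool like this removes arithmetic slips but not every error: the model can still pass the wrong unit or skip the call, which is consistent with the finding remaining under monitoring.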
HL-028 · MITIGATED
Invented citations in literature synthesis
BASELINE RATE 5.3%
Retrieval-or-refuse policy plus a citation-existence check at the gate.
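
The two-step gate can be sketched as follows: refuse when retrieval returned nothing, then block any draft citing an identifier that is absent from a trusted index. The `gate` function, the identifier format, and the index itself are hypothetical; a real check would query a bibliographic source rather than an in-memory set.

```python
def gate(cited_ids, retrieved_ids, known_ids):
    """Return (allowed, reason) for a drafted answer.

    cited_ids: identifiers the draft cites
    retrieved_ids: identifiers returned by retrieval for this query
    known_ids: identifiers present in a trusted index (assumed to exist)
    """
    if not retrieved_ids:
        # Retrieval-or-refuse: with no evidence, the answer is withheld.
        return False, "refuse: nothing retrieved"
    missing = sorted(set(cited_ids) - set(known_ids))
    if missing:
        # Citation-existence check: every cited id must resolve in the index.
        return False, "blocked: unknown citations " + ", ".join(missing)
    return True, "ok"

# Hypothetical identifiers for illustration.
index = {"ref-001", "ref-002", "ref-003"}
print(gate(["ref-001"], ["ref-001", "ref-002"], index))  # (True, 'ok')
print(gate(["ref-001", "ref-999"], ["ref-001"], index))  # (False, 'blocked: unknown citations ref-999')
```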