06 · PROOF WALL

Evidence, not claims

Every public repository, read as a case file: the problem it solves, the approach taken, how it was validated, and what it taught. Shipped repos link to source; planned ones show the intended method, not invented results.

hallucination-hunter · ACTIVE

AI EVALUATION

PROBLEM

No simple, runnable way to measure whether an LLM answer is actually grounded in its sources, or just plausible.

APPROACH

Pluggable detectors and model backends scored against a labelled dataset, exposed through a CLI and a Python API.
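As a rough illustration of the approach (not the repository's actual API), a detector can be treated as any callable that judges an answer against its source, and scoring reduces to comparing those judgements with the human labels. Every name below (`Example`, `overlap_detector`, `accuracy`) and the 0.8 threshold are assumptions made for this sketch, and the token-overlap heuristic is a deliberately naive stand-in for a real detector.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Example:
    """One labelled row: an answer, its source text, and a human label (hypothetical shape)."""
    answer: str
    source: str
    grounded: bool

# A detector is any callable returning True when it judges the answer grounded.
Detector = Callable[[str, str], bool]

def overlap_detector(answer: str, source: str, threshold: float = 0.8) -> bool:
    """Toy detector: fraction of answer tokens that also appear in the source."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return False
    source_tokens = set(source.lower().split())
    hits = sum(tok in source_tokens for tok in answer_tokens)
    return hits / len(answer_tokens) >= threshold

def accuracy(detector: Detector, dataset: Iterable[Example]) -> float:
    """Share of examples where the detector agrees with the human label."""
    examples = list(dataset)
    if not examples:
        raise ValueError("dataset is empty")
    correct = sum(detector(ex.answer, ex.source) == ex.grounded for ex in examples)
    return correct / len(examples)

# Two illustrative rows, invented for this sketch (not taken from the real dataset).
SOURCE = "the cache expires after ten minutes by default"
DATASET = [
    Example("the cache expires after ten minutes", SOURCE, True),
    Example("the cache never expires", SOURCE, False),
]
```

Because a detector is just a callable, swapping in a different one (for example, a model-backed judge) would not change the scoring code.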

VALIDATION

160-example labelled dataset · 136 tests · green CI · reproducible runs. v0.1.0, MIT.

FINDINGS

Groundedness works as a release gate, not a vibe check — borderline answers surface before they ship.

LESSON

Measure groundedness as a gate. Confidence is not evidence.

Python · pytest · GITHUB →

ai-delivery-engineering · ACTIVE

SYSTEM

PROBLEM

AI-assisted builds drift from intent without a spine: no gates, no acceptance criteria, no evidence trail.

APPROACH

A stage-gated lifecycle driven by executable acceptance specs, with reusable templates and checklists per stage.
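One way to picture a stage gate is as a list of executable checks per stage, where work cannot move past the first stage with a failing check. The sketch below is an assumption-laden illustration, not the repository's implementation: the function name and the stage names in the usage note are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# An acceptance check is any zero-argument callable that returns True on pass.
Check = Callable[[], bool]

def first_blocked_stage(stages: Dict[str, List[Check]]) -> Optional[str]:
    """Return the first stage (in insertion order) with a failing check, or None if all pass."""
    for name, checks in stages.items():
        if not all(check() for check in checks):
            return name
    return None
```

For example, with hypothetical stages `spec`, `build`, and `release`, a single failing check under `build` would make the function return `"build"`, and `release` would never be evaluated.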

VALIDATION

Built with parallel AI agents, then put through an independent max-effort audit (graded B+); findings fixed and re-verified. markdownlint + link-check CI.

FINDINGS

Auto-reviewers will confidently 'fix' things that are already correct — adversarial review plus a human pass caught it.

LESSON

Spec-first beats prompt-first, measurably. Let a skeptic try to break it.

Markdown · CI · GITHUB →

spec-lint · PLANNED

TOOLING

PROBLEM

Acceptance specs rot and drift into ambiguity, which is a leading indicator of rework.

APPROACH

Static checks over acceptance specs that reject criteria a test could not prove.
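A minimal sketch of what such a static check might look like, assuming the simplest possible rule: flag wording that no test could prove. The term list and the `lint_criterion` name are illustrative assumptions, and the sketch is in Python for brevity even though the planned repo is listed as TypeScript.

```python
import re
from typing import List

# Hypothetical starter list of words that make a criterion unprovable by a test.
VAGUE_TERMS = ("fast", "quickly", "user-friendly", "intuitive", "robust", "appropriate")

def lint_criterion(text: str) -> List[str]:
    """Return one problem string per vague term found as a whole word in the criterion."""
    lowered = text.lower()
    problems = []
    for term in VAGUE_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            problems.append(f"vague term: {term!r}")
    return problems
```

A criterion such as "The page should load fast" would be flagged, while "The search endpoint returns results within 200 ms for 95% of requests" would pass this particular rule. A real linter would need more than a word list.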

VALIDATION

Planned: a fixture suite of ambiguous vs. testable specs, scored against expert labels.

FINDINGS

Concept stage — no results to report yet.

LESSON

Ambiguity caught early is rework avoided later.

TypeScript · REPO PENDING

context-probe · PLANNED

AI EVALUATION

PROBLEM

Long-context recall degrades near the window limit without the model signalling lower confidence.

APPROACH

Place adversarial needles across window-fill levels and chart the degradation curve.
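The placement step could be sketched as follows, under two loud assumptions: whitespace-separated words stand in for tokens, and the needle is a plain fact rather than an adversarial one. `build_probe` and its parameters are hypothetical names chosen for this illustration.

```python
def build_probe(needle: str, filler: str, window_words: int,
                fill_ratio: float, depth: float) -> str:
    """Build a context of about window_words * fill_ratio words with the needle at a relative depth.

    fill_ratio: fraction of the window to fill, in (0, 1].
    depth: where the needle sits within the padding, 0.0 = start, 1.0 = end.
    """
    if not 0.0 < fill_ratio <= 1.0 or not 0.0 <= depth <= 1.0:
        raise ValueError("fill_ratio must be in (0, 1] and depth in [0, 1]")
    needle_words = needle.split()
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")
    budget = int(window_words * fill_ratio) - len(needle_words)
    if budget < 0:
        raise ValueError("needle does not fit at this fill level")
    # Cycle through the filler text until the padding budget is used up.
    padding = [filler_words[i % len(filler_words)] for i in range(budget)]
    cut = int(len(padding) * depth)
    return " ".join(padding[:cut] + needle_words + padding[cut:])

# A sweep grid over fill levels and depths; each cell would become one model query.
GRID = [(ratio, depth) for ratio in (0.25, 0.5, 0.75, 1.0) for depth in (0.0, 0.5, 1.0)]
```

Recording recall per grid cell is what would produce the degradation curve. The querying and scoring steps are left out here because they depend on the model backend.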

VALIDATION

Planned: a reproducible sweep across fill ratios with a documented failure curve.

FINDINGS

Concept stage — no results to report yet.

LESSON

Silent failure is the dangerous failure.

Python · REPO PENDING

contract-tests · PLANNED

TOOLING

PROBLEM

Tool-call schemas shift between model versions and break integrations quietly.

APPROACH

Pin the expected contract for each integration and alert on any drift.
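Drift detection can be illustrated as a recursive diff between a pinned schema and the one observed at runtime. This is a sketch only: it assumes both contracts are nested dicts with string keys, and `diff_contract` is a hypothetical helper. Python is used for brevity even though the planned repo is listed as TypeScript.

```python
from typing import Any, Dict, List

def diff_contract(pinned: Dict[str, Any], observed: Dict[str, Any], path: str = "") -> List[str]:
    """List every added, removed, or changed field between two nested schema dicts."""
    drift: List[str] = []
    for key in sorted(pinned.keys() | observed.keys()):
        where = f"{path}.{key}" if path else key
        if key not in observed:
            drift.append(f"removed: {where}")
        elif key not in pinned:
            drift.append(f"added: {where}")
        elif isinstance(pinned[key], dict) and isinstance(observed[key], dict):
            # Recurse into nested objects so the report names the exact field.
            drift.extend(diff_contract(pinned[key], observed[key], where))
        elif pinned[key] != observed[key]:
            drift.append(f"changed: {where} ({pinned[key]!r} -> {observed[key]!r})")
    return drift
```

An empty result means the observed contract still matches the pin; any non-empty result is what an alert or a failing CI job would be built on.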

VALIDATION

Planned: contract fixtures run in CI on every integration.

FINDINGS

Concept stage — no results to report yet.

LESSON

Pin the contract, not the prompt.

TypeScript · REPO PENDING

rollback-rehearsal · PLANNED

RELIABILITY

PROBLEM

Rollbacks are assumed to work until the first time they are actually needed.

APPROACH

Scheduled rollback drills in staging that measure and record recovery time.
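The measuring part of a drill might look like the sketch below: trigger the rollback, poll a health check, and record the elapsed time. `run_drill`, its parameters, and the default timeout are assumptions for illustration. Scheduling and staging specifics are out of scope, and the sketch is in Python for brevity even though the planned repo is listed as Go.

```python
import time
from typing import Callable

def run_drill(rollback: Callable[[], None], is_healthy: Callable[[], bool],
              timeout_s: float = 300.0, poll_s: float = 1.0) -> float:
    """Trigger a rollback, poll until the service reports healthy, and return seconds elapsed."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
    rollback()
    while not is_healthy():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError(f"service not healthy within {timeout_s} s of rollback")
        time.sleep(poll_s)
    return time.monotonic() - start
```

Appending each returned duration to a log would give the series from which a mean recovery time can be computed. A timeout is treated as a failed drill rather than a data point.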

VALIDATION

Planned: timed drills with a recorded mean recovery target.

FINDINGS

Concept stage — no results to report yet.

LESSON

An untested rollback is not a rollback.

Go · REPO PENDING