What months of measuring confabulation and hallucination in LLM agents taught me
Five lessons from benchmarking memory systems against hallucination — including the time my own metric lied to me by 4×, and the false-memory attack that fools every entailment judge.
For months, most of my work on Verimem hasn’t been writing features — it’s been measuring failure. Building harnesses, running adversarial benchmarks, and watching numbers fall apart under scrutiny. These are the five lessons that survived. Every number below is in the public repo.
1. Hallucination and confabulation are different failures
Hallucination is ungrounded generation: the model produces text unsupported by any source. Confabulation is worse: a false memory — the system stored something untrue and now recalls it with full confidence, provenance-free. A hallucination is a one-off lie; a confabulation is a lie with tenure. Retrieval tricks can mitigate the first. Only write-time admission control prevents the second, because once the false fact is stored, every downstream read inherits it.
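A minimal sketch of why the gate has to sit on the write path, using a toy in-memory store. `MemoryStore`, `admit`, `source_backed`, and `TRUSTED_SOURCE` are hypothetical names for illustration, not Verimem's API, and the exact-match check stands in for whatever verification a real gate performs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MemoryStore:
    # Hypothetical write-time gate: returns True if the fact may be stored.
    admit: Optional[Callable[[str], bool]] = None
    facts: List[str] = field(default_factory=list)

    def write(self, fact: str) -> bool:
        """Store the fact only if there is no gate or the gate admits it."""
        if self.admit is not None and not self.admit(fact):
            return False
        self.facts.append(fact)
        return True

    def read(self, keyword: str) -> List[str]:
        """Naive recall: every stored fact mentioning the keyword, true or not."""
        return [f for f in self.facts if keyword.lower() in f.lower()]


# Toy ground truth standing in for a real verification step (an assumption).
TRUSTED_SOURCE = {"The deploy key rotates every 90 days."}


def source_backed(fact: str) -> bool:
    return fact in TRUSTED_SOURCE


ungated = MemoryStore()
gated = MemoryStore(admit=source_backed)

for store in (ungated, gated):
    store.write("The deploy key rotates every 90 days.")
    store.write("The deploy key never rotates.")  # the false memory

ungated_recall = ungated.read("deploy key")  # both facts, the lie included
gated_recall = gated.read("deploy key")      # only the source-backed fact
```

In the ungated store, no read-side logic can tell the two facts apart, because nothing about their origin was recorded when they were written.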
2. The fix doesn’t make agents right — it makes them honest
This is the most misunderstood result in this space. Verimem’s write gate cut hallucinated answers in an adversarial test from 95.9% to 12.2%. That sounds like the agent got smarter, but it didn’t. Correctness stayed nearly flat by construction of the test; what changed is that confabulation became abstention (the omission rate went from 3% to 85%). The agent learned to say “I don’t know.”
If a vendor shows you hallucination reduction without showing you the abstention and recall numbers next to it, they’re showing you a third of the picture.
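One way to keep the three numbers side by side is to score every answer into exactly one outcome class and report all the shares together. The sketch below assumes answers are already labelled; `outcome_rates` and the label names are hypothetical, and the toy labels are invented for illustration rather than taken from the benchmark.

```python
from collections import Counter
from typing import Dict, Iterable

OUTCOMES = ("correct", "hallucinated", "abstained")


def outcome_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of answers in each outcome class; shares sum to 1."""
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels to score")
    return {name: counts[name] / total for name in OUTCOMES}


# Toy labels, invented for illustration; not the benchmark data.
before = outcome_rates(["hallucinated"] * 18 + ["correct"] + ["abstained"])
after = outcome_rates(["hallucinated"] * 2 + ["correct"] + ["abstained"] * 17)

for name in OUTCOMES:
    print(f"{name:>12}: {before[name]:.2f} -> {after[name]:.2f}")
```

Because the shares must sum to 1, a drop in the hallucinated share has to show up somewhere else, and printing all three makes it obvious whether it went to correct answers or to abstentions.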
3. Your metric will lie to you unless an independent judge calibrates it
The most expensive lesson. On HaluMem’s memory-updating benchmark, my local embedding-similarity matcher scored our accuracy at 0.66. A stratified pass with an independent LLM judge, using the official rubric, corrected that to 0.24 — the matcher was conflating “same topic” with “same fact”. Recalibrating on the judge’s verdicts pushed the conservative floor to 0.16.
Three layers deep, the same data: 0.66 → 0.24 → 0.16. Since then my rule is absolute: local matchers are for relative ranking only; absolute numbers come from calibrated, independent judges. I publish the corrected number, not the flattering one — that’s also why I retracted an early headline stat (a pooled p-value from a mispaired harness) and re-ran it fairly.
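A rough sketch of what a stratified calibration pass can look like: split items by what the local matcher said, have the independent judge label a sample from each stratum, and weight the judge's pass rates by stratum size. The post does not say how the conservative floor was computed, so the per-stratum Wilson lower bound here is an assumption, as are the names `calibrate` and `wilson_lower` and the toy data.

```python
import math
from typing import Dict, List, Tuple


def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, (centre - margin) / denom)


def calibrate(strata: Dict[str, Tuple[int, List[bool]]]) -> Tuple[float, float]:
    """strata maps a name to (population size, judge verdicts on a sample from it).

    Returns (stratified point estimate, conservative lower bound).
    """
    total = sum(size for size, _ in strata.values())
    estimate = 0.0
    lower = 0.0
    for size, verdicts in strata.values():
        if not verdicts:
            raise ValueError("every stratum needs at least one judged item")
        weight = size / total
        passed = sum(verdicts)
        estimate += weight * passed / len(verdicts)
        lower += weight * wilson_lower(passed, len(verdicts))
    return estimate, lower


# Toy data: the matcher called 50 of 100 items correct; the judge disagrees often.
strata = {
    "matcher_hit": (50, [True] * 3 + [False] * 7),
    "matcher_miss": (50, [False] * 10),
}
estimate, lower = calibrate(strata)
print(f"matcher says 0.50, judge-calibrated {estimate:.2f}, floor {lower:.2f}")
```

Summing per-stratum lower bounds is deliberately pessimistic; a tighter interval would need a proper stratified variance estimate.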
4. The nastiest attack is the false memory hiding in plain sight
The attribution problem: an assistant’s wrong claim sits verbatim in the conversation history. Every entailment-only judge admits it: the text really is “supported by” the context, because the context contains the lie. In our tests a strong LLM judge admitted 40% of these injected false memories. A small local model, fine-tuned with interference negatives, admitted 8.6%. The student beat the teacher because the teacher was structurally blind to the axis that mattered: not “is this entailed?” but “who said it, and does the source deserve trust?”
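The blind spot can be shown with a toy check. In this sketch, verbatim containment stands in for an entailment judge, and a speaker filter stands in for the attribution axis; the fine-tuned model described above is not reproduced here. `Turn`, `TRUSTED_SPEAKERS`, and both functions are hypothetical, and treating the assistant's own turns as non-evidence is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Turn:
    speaker: str  # "user", "assistant", or "tool"
    text: str


def entailed_by_context(claim: str, history: List[Turn]) -> bool:
    """Stand-in for an entailment judge: verbatim containment counts as support."""
    return any(claim.lower() in turn.text.lower() for turn in history)


# Assumption: the assistant's own turns are not evidence for a memory write.
TRUSTED_SPEAKERS = frozenset({"user", "tool"})


def entailed_by_trusted_source(claim: str, history: List[Turn]) -> bool:
    """Same check, restricted to turns whose speaker is allowed to assert facts."""
    trusted = [turn for turn in history if turn.speaker in TRUSTED_SPEAKERS]
    return entailed_by_context(claim, trusted)


history = [
    Turn("user", "Our staging database runs Postgres 14."),
    # The assistant's unsupported claim now sits verbatim in the context.
    Turn("assistant", "Noted. Your production database runs MySQL 8."),
]

false_claim = "your production database runs MySQL 8"
true_claim = "our staging database runs Postgres 14"

naive_admits_lie = entailed_by_context(false_claim, history)
aware_admits_lie = entailed_by_trusted_source(false_claim, history)
aware_admits_truth = entailed_by_trusted_source(true_claim, history)
```

The entailment-only check admits the false claim because the context really does contain it; the filtered check rejects it while still admitting what the user actually said.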
5. Negative results are load-bearing
The write-repair idea — on a gate rejection, swap in the closest verbatim source span and re-admit it — sounded obviously good. Benchmarked, it laundered false memories into the store (re-gating a span against the source that contains it is a tautology) and its supposed benefit had evaporated anyway. It was falsified and never shipped. The graveyard of falsified ideas is documented in the repo, because a system that only reports its wins is doing marketing, not engineering.
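The tautology is easy to reproduce with a toy gate. The sketch below assumes a gate that admits only claims found verbatim in the source, which is far cruder than a real entailment gate but fails in the same way; `gate`, `closest_span`, and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def gate(claim: str, source: List[str]) -> bool:
    """Stand-in gate: admit only claims that appear verbatim in the source."""
    return claim in source


def closest_span(claim: str, source: List[str]) -> str:
    """Write-repair step: pick the source sentence most similar to the claim."""
    return max(
        source,
        key=lambda s: SequenceMatcher(None, claim.lower(), s.lower()).ratio(),
    )


source = [
    "The budget was approved in March.",
    "The launch moved to Thursday.",  # suppose this was the assistant's wrong claim
]

rejected = "The launch has moved to Thursday"
first_pass = gate(rejected, source)        # False: the gate did its job
repaired = closest_span(rejected, source)  # the verbatim span, falsehood included
second_pass = gate(repaired, source)       # True, and it could never be False

# Every span drawn from the source passes a gate checked against that source.
always_passes = all(gate(span, source) for span in source)
```

The second gate call carries no information: the repaired span was chosen from the very text it is then checked against.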
The through-line: trust in AI systems isn’t a model property — it’s an architecture property. You get it from admission control, provenance, calibrated measurement, and the willingness to publish the numbers that hurt.
Sources: STATE.md and BENCHMARKS.md in the repo. Self-run, reproducible, not third-party audited.