Why I built Verimem — a memory layer with a notary at the door

There were already five popular memory layers for AI agents. I built a sixth anyway, because all five share the same missing piece: none of them checks whether a memory is true before storing it.

When I started building agents that work for days across sessions, I hit the wall everyone hits: LLMs have no memory. The ecosystem’s answer is a family of “memory layers” — mem0, Zep, Letta, Cognee, MemOS. I studied them, ran them, reverse-engineered their design choices. They differ in retrieval strategy, in graph structure, in pricing. They share one thing: they store whatever their extractor emits.

That’s not a detail. An agent’s memory is upstream of everything the agent does. If a false “fact” enters the store, every future retrieval is poisoned, and the agent doesn’t just repeat the error — it builds on it, with the calm confidence of someone quoting their own notes. In psychology that’s called confabulation. In production it’s called an incident.

The bet

Verimem is my bet that the right place to fight this is at write time. Every add() goes through an admission gate that asks one question: does the cited source actually entail this fact? Admitted facts get stored with provenance and a grounding score. Unsupported claims — the “it works”, the “task completed”, the numbers nobody measured — get downgraded. Contradicted ones get refused.

On SNLI, the source⊢fact entailment check reaches AUROC 0.971, and that number is judge-independent — no LLM grading its own homework.

The second half of the bet: provenance on every read. A retrieval doesn’t return naked text; it returns the fact’s status and its write-time grounding score, so downstream code can trust-condition instead of trusting. update() never destroys the old fact — it supersedes it, leaving an auditable history() chain. And facts carry bi-temporal validity: a memory that knows when it stopped being true.

What the gate is worth, honestly

In an adversarial downstream test, the gate cut hallucinated answers from 95.9% to 12.2% (replicated on two seeds, pooled McNemar p≈6e-44). The honest mechanism: it doesn’t make the agent more often right — it converts confabulation into abstention (omission 3%→85%). “I don’t know” instead of a confident invention. For an agent acting on real infrastructure, that trade is the product.

The honest cost: the first-generation gate over-rejected 30–39% of clean facts at the strict threshold. A distilled local gate (v2) now admits 92.4% of clean facts at zero API cost — but I spent months with that bill, and I document it, because every anti-hallucination mechanism has a recall bill.

What Verimem is not

It’s brand-new and has zero adoption beyond my own daily use. It’s a single-node SQLite system, not a distributed store. Its benchmarks are self-run and reproducible from the repo, not third-party audited. On raw extraction/QA accuracy against mature systems, it is currently behind — its axis is trust, not leaderboard position. All of this is written on the site and in the repo, because the product’s entire thesis is that unsupported claims shouldn’t survive — starting with mine.

It’s open source (MIT), it runs locally, and it plugs into Claude Code, Cursor, Cline or Zed via MCP in two minutes: verimem.com.

Numbers: github.com/aureliocpr-ctrl/verimem, STATE.md and BENCHMARKS.md.