February 28, 2026 · Hugh Pyle

Benchmarking Keep with LoCoMo

Photo: Charing Cross benchmark by GrindtXX

keep is a skills practice wrapped around an implementation of "memory for AI agents".

The practice is this: repeated reflection on means and outcomes, so that skillful action improves over time. But the raw implementation of memory is its foundation. Without working memory, you can't iterate.

Similarly, without benchmarks, you can't tell what works. Today we're publishing results for the LoCoMo benchmark.

BLUF: keep LoCoMo scores: 76.2% overall

This run used local models for embeddings and analysis (nomic-embed-text and llama3.2:3b), and gpt-4o-mini for the query and judge.


If you recall everything all at once, that becomes unmanageable very quickly (even with the enormous context windows in current models!). So the job of a memory system is to capture what happened, and retrieve it later in an actionable way. Usability -- by the agent using the memory system -- is key.
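To make the capture-and-retrieve job concrete, here's a minimal sketch of that contract. Everything below is hypothetical and illustrative, not keep's actual API: a toy store with a `capture` method and a `retrieve` method, where naive keyword overlap stands in for real embedding and full-text search.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a memory system's contract: capture what
# happened, retrieve it later in a usable form. Names are illustrative.
@dataclass
class MemoryStore:
    docs: list = field(default_factory=list)

    def capture(self, text: str) -> int:
        """Store what happened; return a document id."""
        self.docs.append({"id": len(self.docs), "text": text})
        return len(self.docs) - 1

    def retrieve(self, query: str, k: int = 3) -> list[dict]:
        """Naive keyword overlap stands in for embeddings + FTS."""
        q = set(query.lower().split())
        scored = [(len(q & set(d["text"].lower().split())), d) for d in self.docs]
        return [d for score, d in sorted(scored, key=lambda s: -s[0])[:k] if score > 0]

store = MemoryStore()
store.capture("Gina lost her job and started an online clothing store.")
store.capture("OK, see ya!")
hits = store.retrieve("Why did Gina start her own clothing store?")
```

The point of the sketch is the shape of the interface, not the scoring: the agent only ever sees what `retrieve` hands back, which is why the quality of that slice matters so much.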

Retrieval is more difficult than it seems. And this is where benchmarks can be useful. Two recent benchmarks for AI memory are worth mentioning here: LoCoMo and LongMemEval. They include tests of varying scale and difficulty, but share a similar flavor.

Keep is intended for "lightweight agentic memory", which is broader than just conversations. Its goals overlap substantially with RAG -- we want to track conversations and commitments, but also URLs and documents and artifacts as they are encountered or produced.

These chat-type conversation benchmarks focus on short plaintext messages, and occasionally image descriptions. For evaluating the core store and retrieval functions, they are a good place to start.


Conversations are messy. They're full of fluff ("OK, see ya!"), indirect and looping references, and context that unfolds in multiple places over time.

"Why did Gina decide to start her own clothing store?"

The agent runs a query, gets a set of data back, and tries to answer the question. The job is to get the right amount of relevant information into the query result.

- Gina (0.50)
    - conv1-session14@p4  [2023-06-16]  Gina shared her own entrepreneurial journey, losing her job and starting an online clothing store, which she highly recommends.
    (...more...)

The benchmark answer is "She always loved fashion trends and finding unique pieces and she lost her job so decided it was time to start her own business". Just from the first line of the result, we can get most of the way there.

In normal usage, the agent will run a query, look at what it returned, and iterate ("agentic RAG"): fetching documents, or performing additional queries, until it's satisfied. For this benchmark we took the simpler single-pass approach: just ask the question, and use the results to write down an answer.
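The single-pass approach can be sketched in a few lines. This is a hedged illustration, not keep's benchmark harness: `search` and `generate` are hypothetical stand-ins for keep's query call and the LLM call (gpt-4o-mini in the run above).

```python
# Single-pass sketch: one retrieval, one generation, no iteration.
# `search` and `generate` are hypothetical stand-ins, injected as callables.
def answer_single_pass(question: str, search, generate) -> str:
    results = search(question)                      # one query against memory
    context = "\n".join(f"- {r}" for r in results)  # stuff results into the prompt
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)                         # single zero-shot pass

# Usage with stubs in place of real search and a real model:
fake_search = lambda q: ["[2023-06-16] Gina lost her job and started an online clothing store."]
fake_generate = lambda p: "She lost her job and started an online clothing store."
answer = answer_single_pass(
    "Why did Gina decide to start her own clothing store?", fake_search, fake_generate
)
```

Passing the search and generate functions in as arguments keeps the sketch honest: the same loop could be rerun with an agentic, multi-query strategy without changing the prompt assembly.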

Search and context assembly use a combination of embeddings, FTS, and structured traversal. Keep includes a "deep retrieval" query mode, where it follows edge-tags to find related documents beyond the top-k, assembling a rich context window for a single zero-shot generation pass.
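One way to picture edge-following beyond the top-k is a small breadth-first traversal. The document shape and `edges` field below are assumptions for illustration, not keep's actual schema: start from the top-k hits, walk their edge-tags outward, and stop once the context budget is full.

```python
from collections import deque

# Sketch of deep retrieval: breadth-first walk from the top-k hits,
# following edge-tags to related documents. Schema is hypothetical.
def deep_retrieve(top_k_ids, docs_by_id, max_docs=10):
    seen = set(top_k_ids)
    queue = deque(top_k_ids)
    context = []
    while queue and len(context) < max_docs:
        doc = docs_by_id[queue.popleft()]
        context.append(doc)
        for linked_id in doc.get("edges", []):   # follow edge-tags outward
            if linked_id not in seen:
                seen.add(linked_id)
                queue.append(linked_id)
    return context

docs = {
    1: {"id": 1, "text": "Gina lost her job.", "edges": [3]},
    2: {"id": 2, "text": "Weather chat.", "edges": []},
    3: {"id": 3, "text": "Gina loves fashion trends.", "edges": []},
}
ctx = deep_retrieve([1], docs)   # doc 3 arrives via the edge from doc 1
```

Even in this toy form, the payoff is visible: a question whose answer spans two documents ("lost her job" plus "loves fashion trends") gets both into the context window from a single top-k seed.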

There's much more to be done. Deeper agentic cycles. More comprehensive inference, with benchmarks such as CRAG and BEAM. Measuring the benefits of local processing, away from synthetic accuracy percentages, towards the costs in joules and milliseconds. Later -- we're just getting started.


Does it work today? Yes. Here are the full results, along with some published results from other memory systems:

keep and other LoCoMo scores

This is keep, open-source code, local models, on a Mac mini, with gpt-4o-mini for the query and judge.

Try it yourself. Let me know where it's useful.