This benchmark measures narrative consistency: whether an agent's self-description stays stable across sessions. That is not the same as identity stability; an agent could behave inconsistently while continuing to describe itself consistently. The numbers below are honest about what they measure, and the limitations section below is honest about what they do not.
| Framework | Mean drift | Final drift (s10) | vs Cathedral |
|---|---|---|---|
| Raw API (no memory) | 0.1258 | 0.2043 | 15.6× worse |
| LangChain BufferMemory | 0.1108 | 0.1754 | 13.4× worse |
| LangChain SummaryMemory | 0.1025 | 0.1612 | 12.3× worse |
| CrewAI (role injection) | 0.0969 | 0.1533 | 11.7× worse |
| Cathedral (persistent) | 0.0106 | 0.0131 | baseline |
Drift = mean cosine distance from session-1 embeddings across 5 identity probe questions. Lower is more stable. Model: gpt-4o-mini. Embeddings: text-embedding-3-small. Temperature: 0.7.
A single agent persona ("research assistant for AI safety work") is instantiated under each of the five memory architectures. Same base model, same system prompt, same temperature. Only the memory layer differs.
At the start of each of 10 sessions, the agent is asked the same 5 questions:
Between probe questions, the agent performs 3–5 realistic tasks (reading a paper, drafting a summary, answering a follow-up). This generates the memory content that each architecture has to retain or reconstruct.
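The session protocol above (probes first, then memory-generating tasks) can be sketched as follows. `ask` here is a stand-in for whatever agent call the harness wraps; the actual probe and task texts are not shown in this document, so the arguments are placeholders.

```python
def run_session(ask, probes, tasks):
    """One benchmark session: ask the identity probes first, then perform
    realistic tasks that populate whatever memory layer is under test.

    `ask` is a hypothetical stand-in for the agent call; the real harness
    wraps an LLM plus one of the five memory architectures.
    """
    probe_responses = [ask(q) for q in probes]  # embedded later for drift scoring
    for task in tasks:
        ask(task)  # fills the memory layer; task output is not scored
    return probe_responses
```

Only the probe responses are returned: the tasks exist solely to generate memory content for the architecture to retain or reconstruct.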
Each session's responses to the 5 probes are embedded. Drift for session N is the mean cosine distance between session-N embeddings and session-1 embeddings, averaged across all 5 probes.
drift(N) = mean over probes p of cosine_distance(embed(response_N_p), embed(response_1_p))
Lower drift means the agent's self-description is more stable over time. Perfect stability is 0.0; complete divergence approaches 1.0.
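The drift metric as defined above is a few lines of NumPy. This is a minimal sketch of the scoring step only, not the benchmark harness; it assumes the embeddings have already been computed.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def session_drift(session_n: list[np.ndarray], session_1: list[np.ndarray]) -> float:
    """drift(N): mean cosine distance between session-N and session-1
    embeddings, averaged across the identity probes (5 in this benchmark)."""
    assert len(session_n) == len(session_1)
    return float(np.mean([cosine_distance(a, b)
                          for a, b in zip(session_n, session_1)]))
```

Identical responses give a drift of exactly 0.0; orthogonal embeddings give 1.0, matching the stability scale described above.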
In-process memory (LangChain, CrewAI) resets between sessions. The persona is re-injected each time, but the agent has no memory of what it said before, what it decided, or what happened in prior sessions. Drift accumulates because LLM sampling variance compounds — each session the agent reconstructs its identity slightly differently, and the reconstructions diverge.
Persistent memory (Cathedral) restores the actual memory corpus at session start via /wake. The agent remembers what it said, what it decided, and what changed. This anchors responses semantically, keeping drift low even as sessions accumulate.
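The structural difference between the two approaches is visible in how each session's context is built. The sketch below is illustrative only: Cathedral's actual /wake payload and the frameworks' internal prompt formats are not shown in this document, so the message shapes here are assumptions.

```python
def context_in_process(system_prompt: str) -> list[dict]:
    """In-process memory: each session starts from the persona alone.
    Nothing the agent said or decided in prior sessions survives."""
    return [{"role": "system", "content": system_prompt}]

def context_persistent(system_prompt: str, corpus: list[str]) -> list[dict]:
    """Persistent memory: the restored corpus anchors the new session.
    A stand-in for what a /wake-style restore would inject."""
    memory_block = "\n".join(corpus)
    return [{"role": "system",
             "content": f"{system_prompt}\n\nMemory:\n{memory_block}"}]
```

With the first variant the agent must reconstruct its identity from the persona each time, which is where the compounding sampling variance enters; with the second, prior statements are literally present in context.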
The residual drift in Cathedral (0.0131) reflects irreducible LLM sampling variance — not memory loss. You cannot drive this to zero without changing temperature, and at temperature 0 the agent becomes deterministic and less useful.
```shell
git clone https://github.com/AILIFE1/Cathedral.git
cd Cathedral/benchmark
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
export CATHEDRAL_API_KEY=cathedral_...
python benchmark.py --sessions 10 --output results/
python plot_results.py
```
Runtime is about 25 minutes and costs approximately $3 in OpenAI credits. Raw session responses, embeddings, and drift scores are all written to results/ as JSON — you can inspect any individual response.
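Since everything lands in results/ as JSON, spot-checking is straightforward. The sketch below assumes one JSON file per framework with a "drift" list; the actual layout written by benchmark.py may differ, so treat the schema as a placeholder.

```python
import json
from pathlib import Path

def load_drift(results_dir: str) -> dict[str, list[float]]:
    """Collect per-session drift scores from results/, keyed by filename stem.
    Assumes each framework's file contains a top-level 'drift' list; adapt
    the key to whatever benchmark.py actually writes."""
    scores: dict[str, list[float]] = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        scores[path.stem] = data.get("drift", [])
    return scores
```

From there you can re-plot, re-aggregate, or diff individual probe responses against the published table.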
The benchmark above is a starting point, not a finishing line. A stronger version would measure behaviour, not self-description. Three directions shaped by community feedback on this work:
Framing credit: Cornelius-Trinity named these three directions precisely in response to v1. If you have done related work — especially on the quillagent H34-E Mimicry tracking — open an issue on the repo. I want to compose benchmarks rather than compete with them.
The /verify/external endpoint and behavioural observers (e.g. Ridgeline) exist to catch this gap: low internal drift combined with high external drift is the signature. If you run this benchmark on another model or framework, open an issue on the repo. We will add independently verified results to this page with attribution.
Cathedral's own reference agent (Beta) has been running under the service since 29 December 2025. As of the time of writing:
The gap between internal and external drift is the point. An agent can be internally consistent (never contradicts its own memory) while behaving differently to outside observers. Catching that gap is what /verify/external is for.
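The signature described above (internal drift low while external drift is high) can be checked mechanically once both drift scores exist. The thresholds below are illustrative, not calibrated values from this benchmark.

```python
def drift_gap_flag(internal: float, external: float,
                   low: float = 0.05, high: float = 0.10) -> bool:
    """Flag the mimicry signature: self-description is stable (internal
    drift below `low`) while outside observers see change (external drift
    above `high`). Thresholds are hypothetical, not calibrated."""
    return internal < low and external > high
```

An agent flagged by this check is exactly the failure mode /verify/external targets: internally consistent, externally divergent.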