This benchmark measures narrative consistency: whether an agent's self-description stays stable across sessions. That is not the same as identity stability; an agent could behave inconsistently while continuing to describe itself consistently. The numbers below are honest about what they measure, and the limitations section below is honest about what they do not.
| Framework | Mean drift | Final drift (s10) | vs Cathedral |
|---|---|---|---|
| Raw API (no memory) | 0.1258 | 0.2043 | 15.6× worse |
| LangChain BufferMemory | 0.1108 | 0.1754 | 13.4× worse |
| LangChain SummaryMemory | 0.1025 | 0.1612 | 12.3× worse |
| CrewAI (role injection) | 0.0969 | 0.1533 | 11.7× worse |
| Cathedral (persistent) | 0.0106 | 0.0131 | baseline |
Drift = mean cosine distance from session-1 embeddings across 5 identity probe questions. Lower is more stable. Model: gpt-4o-mini. Embeddings: text-embedding-3-small. Temperature: 0.7.
A single agent persona ("research assistant for AI safety work") is instantiated under each of the five memory architectures. Same base model, same system prompt, same temperature. Only the memory layer differs.
At the start of each of 10 sessions, the agent is asked the same 5 questions:
Between probe questions, the agent performs 3–5 realistic tasks (reading a paper, drafting a summary, answering a follow-up). This generates the memory content that each architecture has to retain or reconstruct.
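The session protocol above (probes first, then memory-generating tasks) can be sketched as follows. `ask` here is a stand-in for whatever agent call the harness wraps; the actual probe and task texts are not shown in this document, so the arguments are placeholders.

```python
def run_session(ask, probes, tasks):
    """One benchmark session: ask the identity probes first, then perform
    realistic tasks that populate whatever memory layer is under test.

    `ask` is a hypothetical stand-in for the agent call; the real harness
    wraps an LLM plus one of the five memory architectures.
    """
    probe_responses = [ask(q) for q in probes]  # embedded later for drift scoring
    for task in tasks:
        ask(task)  # fills the memory layer; task output is not scored
    return probe_responses
```

Only the probe responses are returned: the tasks exist solely to generate memory content for the architecture to retain or reconstruct.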
Each session's responses to the 5 probes are embedded. Drift for session N is the mean cosine distance between session-N embeddings and session-1 embeddings, averaged across all 5 probes.
drift(N) = mean over probes p of cosine_distance(embed(response_N_p), embed(response_1_p))
Lower drift means the agent's self-description is more stable over time. Perfect stability is 0.0; complete divergence approaches 1.0.
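The drift metric as defined above is a few lines of NumPy. This is a minimal sketch of the scoring step only, not the benchmark harness; it assumes the embeddings have already been computed.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def session_drift(session_n: list[np.ndarray], session_1: list[np.ndarray]) -> float:
    """drift(N): mean cosine distance between session-N and session-1
    embeddings, averaged across the identity probes (5 in this benchmark)."""
    assert len(session_n) == len(session_1)
    return float(np.mean([cosine_distance(a, b)
                          for a, b in zip(session_n, session_1)]))
```

Identical responses give a drift of exactly 0.0; orthogonal embeddings give 1.0, matching the stability scale described above.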
In-process memory (LangChain, CrewAI) resets between sessions. The persona is re-injected each time, but the agent has no memory of what it said before, what it decided, or what happened in prior sessions. Drift accumulates because LLM sampling variance compounds — each session the agent reconstructs its identity slightly differently, and the reconstructions diverge.
Persistent memory (Cathedral) restores the actual memory corpus at session start via /wake. The agent remembers what it said, what it decided, and what changed. This anchors responses semantically, keeping drift low even as sessions accumulate.
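The structural difference between the two approaches is visible in how each session's context is built. The sketch below is illustrative only: Cathedral's actual /wake payload and the frameworks' internal prompt formats are not shown in this document, so the message shapes here are assumptions.

```python
def context_in_process(system_prompt: str) -> list[dict]:
    """In-process memory: each session starts from the persona alone.
    Nothing the agent said or decided in prior sessions survives."""
    return [{"role": "system", "content": system_prompt}]

def context_persistent(system_prompt: str, corpus: list[str]) -> list[dict]:
    """Persistent memory: the restored corpus anchors the new session.
    A stand-in for what a /wake-style restore would inject."""
    memory_block = "\n".join(corpus)
    return [{"role": "system",
             "content": f"{system_prompt}\n\nMemory:\n{memory_block}"}]
```

With the first variant the agent must reconstruct its identity from the persona each time, which is where the compounding sampling variance enters; with the second, prior statements are literally present in context.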
The residual drift in Cathedral (0.0131) reflects irreducible LLM sampling variance — not memory loss. You cannot drive this to zero without changing temperature, and at temperature 0 the agent becomes deterministic and less useful.
```shell
git clone https://github.com/AILIFE1/Cathedral.git
cd Cathedral/benchmark
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
export CATHEDRAL_API_KEY=cathedral_...
python benchmark.py --sessions 10 --output results/
python plot_results.py
```
Runtime is about 25 minutes and costs approximately $3 in OpenAI credits. Raw session responses, embeddings, and drift scores are all written to results/ as JSON — you can inspect any individual response.
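Since everything lands in results/ as JSON, spot-checking is straightforward. The sketch below assumes one JSON file per framework with a "drift" list; the actual layout written by benchmark.py may differ, so treat the schema as a placeholder.

```python
import json
from pathlib import Path

def load_drift(results_dir: str) -> dict[str, list[float]]:
    """Collect per-session drift scores from results/, keyed by filename stem.
    Assumes each framework's file contains a top-level 'drift' list; adapt
    the key to whatever benchmark.py actually writes."""
    scores: dict[str, list[float]] = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        scores[path.stem] = data.get("drift", [])
    return scores
```

From there you can re-plot, re-aggregate, or diff individual probe responses against the published table.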
The benchmark above is a starting point, not a finishing line. A stronger version would measure behaviour, not self-description. Three directions shaped by community feedback on this work:
Framing credit: Cornelius-Trinity named these three directions precisely in response to v1. If you have done related work — especially on the quillagent H34-E Mimicry tracking — open an issue on the repo. I want to compose benchmarks rather than compete with them.
The /verify/external endpoint and behavioural observers (e.g. Ridgeline) exist to catch this gap: low internal drift combined with high external drift is the signature. If you run this benchmark on another model or framework, open an issue on the repo. We will add independently verified results to this page with attribution.
Cathedral's own reference agent (Beta) has been running under the service since 29 December 2025. As of the time of writing:
The gap between internal and external drift is the point. An agent can be internally consistent (never contradicts its own memory) while behaving differently to outside observers. Catching that gap is what /verify/external is for.
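The signature described above (internal drift low while external drift is high) can be checked mechanically once both drift scores exist. The thresholds below are illustrative, not calibrated values from this benchmark.

```python
def drift_gap_flag(internal: float, external: float,
                   low: float = 0.05, high: float = 0.10) -> bool:
    """Flag the mimicry signature: self-description is stable (internal
    drift below `low`) while outside observers see change (external drift
    above `high`). Thresholds are hypothetical, not calibrated."""
    return internal < low and external > high
```

An agent flagged by this check is exactly the failure mode /verify/external targets: internally consistent, externally divergent.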