
In 2023 a research paper called Lost in the Middle showed large language models lose accuracy as relevant information moves toward the middle of their context window. Two years on, frontier context windows have grown by an order of magnitude. The middle problem is still there. The model is bigger, the context is bigger, the system still misses what it needs.
What an AI system can do is decided by the model. What it actually does, in production, is decided by what it can see.
Model choice gets the attention. Every vendor publishes benchmarks, every release gets a blog post, every CTO has an opinion. Memory architecture gets almost none. It produces no benchmark and no press release. It is also where most of the leverage is.
What “memory” actually means
The word does too much work. Inside a production AI system, memory is not one thing. It is a stack of things, each with a different job, and the architecture decides which jobs exist at all.
Working memory is what the model is looking at right now. The prompt, the system instructions, the recent turns of conversation. Useful, expensive, easy to overrun.
Session memory is what a single user is doing across a single session. The questions they asked. The choices they made. What the system already told them. Most systems carry some version of this. Many carry it badly.
Domain context is the body of knowledge the system needs to operate in your world. Your product, your data shape, your codebase, your conventions, your customers. The model generates from generic training, not your specifics. Without this layer, every answer is a guess by something that has never seen your business.
Durable knowledge is the long-form structured material the system can reference deliberately. Decision records. Policies. Specifications. The places teams write things down for the future to find.
Temporal facts are the things that were true at a particular time, and are still true and might stop being true. Who reports to whom this week. Which version is in production. What price changed last month. Memory that does not know it is time-bound is memory that lies confidently.
A good architecture names these layers. Decides which ones exist. Decides how they are written, how they are read, what scopes them and what evicts them. A bad architecture stuffs the working memory until it overflows and calls that the design.
The two traps
When a team’s AI feature is failing, the visible symptoms look like model problems. Responses are inconsistent. Costs are creeping. Latency is rising. The system tells different users different things about the same product.
Two instincts are wrong.
The first is to upgrade the model. Sometimes that helps. More often it does not. A bigger model paired with no better memory architecture gives you a slightly more articulate version of the same wrong answer. The cost increase is real. The behaviour change is not. The team has spent a budget cycle raising the floor without raising the ceiling. The ceiling is set by what the system can see, not by how clever the system is at generating from what it sees.
The second is to reach for a memory product. Drop in mem0 (or one of its competitors), point a vector store at it and call the system memory-aware. The marketing makes this look like the architectural decision. It is not. It is one decision, a choice of retrieval backend, dressed up as the whole stack.
mem0 can be a useful component inside a memory architecture. It is not a memory architecture. The questions it does not answer are the ones that matter most. What scopes a memory. What evicts it. How it enters the system. What distinguishes trusted state from untrusted state. Those decisions are upstream of the retrieval choice, and they belong to the system designer, not the vendor.
Teams that skip them end up with retrieval that works in demos and degrades in production. The system surfaces stale facts confidently. Memory leaks between contexts that should be separate. When the retrieval fails, the team upgrades the model. Neither move addressed the architecture.
The opposite move often works. Switch to a cheaper model paired with deliberate memory architecture and you beat either shortcut. Costs drop. Behaviour improves. The team stops paying for capability it did not need and starts paying for the architecture it should have built first.
Frontier models and memory products are both useful. Neither substitutes for the architectural work.
The questions a memory architecture answers
A working memory architecture answers four questions before the system is ever called.
What does the model see in a single call, and why. The working-memory budget is finite. Filling it with the wrong things is the cheapest way to make an expensive system stupid.
What survives across calls, and at what scope. Session-scoped, user-scoped, project-scoped, organisation-scoped. The wrong scope leaks knowledge between contexts that should be separate, or fails to retain knowledge that should persist.
How information enters the system and how it leaves. Memory is not just retrieval. It is also writing. The teams that get this wrong often get retrieval right and ingestion wrong, then wonder why the system keeps surfacing stale facts as current truth.
Where the boundary sits between trusted state and untrusted state. The model can read the memory. So can a tool response. So can a document an attacker controlled. A memory architecture that does not distinguish between these is a system that can be talked into believing things, and then defending them. This is the structural-guardrails argument applied to memory state.
None of this is exotic engineering. It is the same kind of architectural thinking that goes into any production system. The difference is that with AI, the architecture is often built last and least, on the assumption that the model will compensate. It will not.
The diagnostic
Founders evaluating a CTO for AI work can use the first sprint as the test. Ask the candidate what the memory architecture looks like.
If they answer with a stack of named layers, with rules about what gets written where, with a clear distinction between what the model sees in a single call and what survives across calls, with a position on scoping and eviction, you have someone who has shipped this before. Move on.
If they answer with a model name, a context window size and an opinion about embeddings, you have someone who has read about AI but not built with it. The architectural thinking is not yet there. That gap will surface again every time the system has to handle something the model cannot solve on its own.
The same test works from the other side. If you are three months into a build and the team conversation is still about model choice, raise the memory question. Not as criticism. As a redirect. The conversation often moves once the question is on the table.
What follows
The next piece puts an architecture on the table. After that, the series continues into failure modes, trust, organisational memory and governance. Specifics as those pieces get written.
The teams whose AI features fail loudest are not undone by the model. They are undone by what the model cannot see, cannot retrieve and cannot remember. The next time an AI feature feels like a model problem, ask what the system was looking at when it failed.
Leave a Reply