More Than a Model: A Memory Architecture for an AI Coding Agent

The first file my coding agent reads, every session, is a markdown file called CLAUDE.md. It is not configuration. It is not a prompt. It is the working-memory layer of a five-layer memory architecture, and it is the only one of those layers I write by hand.

The other four are tools. Each one stores a different kind of thing, on a different cadence, for a different consumer. Together they make the difference between an agent that remembers what it is doing and a chat session with files attached.

Last week’s piece said memory is the AI variable most teams are missing. Naming the variable is not the same as building it. This piece puts the wiring on the table.

What an agent actually sees

What an AI system can do is decided by the model. What it does in production is decided by what it can see. That was the closing line of the previous piece. It is the opening claim of this one.

An AI coding agent has to know more than the file it is editing. By that I mean an agent that ships code, not one that completes a function inside an IDE. It has to know which conventions this project uses, which decisions were made last sprint, which mistakes were corrected three weeks ago, what the architecture says about the module it is about to touch and what the team has agreed in a retro that no one in the current session attended.

The default architecture gives the agent one layer for all of that: the prompt, with a vector store bolted on for retrieval. The system gets called memory-aware. The result behaves like a clever amnesiac: confident, articulate, surprising in flashes and unable to hold a thread across two sessions.

The previous piece named five layers: working memory, session memory, domain context, durable knowledge and temporal facts. Each maps onto a different tool here, because each layer has a different cadence and a different trust model. They cannot share storage. They cannot share write rules. Putting them in one place is where the architecture collapses.

The five layers, with the wiring

The agent harness is Claude Code. Everything below runs inside that runtime: hooks, MCP servers, skills, local files. The architecture itself is harness-independent, but the names of the moving parts are not.

Here is what I run. None of it is exotic. All of it is deliberate.

Working memory: CLAUDE.md

There are two CLAUDE.md files in play. A global one at ~/.claude/CLAUDE.md declares the cross-project stack: model assignments, the four-other-layers map, the non-negotiables (UUID v7 primary keys, conventional commits with gitmoji, no MongoDB anywhere). A project-local CLAUDE.md extends it with the project slug, the active MCP servers, the GitHub org, the Linear team and any project-specific constraints.

The agent reads both before doing anything else. This is the only layer I write by hand. The cost of getting it wrong is high. The project slug declared here determines the namespace every other layer scopes to.
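
A minimal sketch of why the slug matters, under an assumption that is mine for illustration: the project-local CLAUDE.md carries a line of the form project_slug: name. Claude Code defines no such key; the pattern and both function names are hypothetical. The only point is that a missing slug should fail loudly rather than let the agent guess.

```python
import re
from typing import Optional

# Hypothetical convention, not a Claude Code feature:
# the project-local CLAUDE.md contains a line like "project_slug: billing-api".
SLUG_LINE = re.compile(
    r"^project_slug:[ \t]*([a-z0-9][a-z0-9-]*)[ \t]*$", re.MULTILINE
)


def extract_project_slug(claude_md_text: str) -> Optional[str]:
    """Return the declared slug, or None if the file does not declare one."""
    match = SLUG_LINE.search(claude_md_text)
    return match.group(1) if match else None


def resolve_namespace(project_claude_md: str) -> str:
    """Derive the namespace every other layer scopes to; never guess it."""
    slug = extract_project_slug(project_claude_md)
    if slug is None:
        raise ValueError("CLAUDE.md declares no project slug; refusing to guess")
    return slug
```

Raising here is a design choice: a guessed namespace would silently scope every later read and write to the wrong place.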

Failure mode if missing: the agent has no anchor. It guesses the project name, guesses the conventions and falls back to whatever generic patterns it learned at training time. The system feels generic because, in this layer, it is.

Session memory: memsearch, with hand-curated reinforcement

Without a dedicated session memory layer, the model holds the current turn, the previous turn, perhaps the last twenty. Everything before that is gone.

The primary tool is memsearch, an open-source plugin from Zilliz. It installs as a Claude Code plugin in two commands. After every conversation turn, a Stop hook fires, an LLM summarises the turn into a few lines and appends them to a daily markdown file at .memsearch/memory/.md with a session anchor. A file watcher re-indexes the change into Milvus, a local vector database that acts as a shadow index over the Markdown. The Markdown is the source of truth. Milvus is a rebuildable cache.

Recall is the part that matters for the architecture. When the agent senses the question needs history, it invokes a skill called memory-recall. The skill’s frontmatter has one line that does the work: context: fork. The recall runs in a forked context. It searches, expands the relevant chunks and returns a curated summary to the main agent. The main context never sees the raw haystack, never absorbs the irrelevant chunks, never wastes its working-memory budget on retrieval noise.
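
The shape of that isolation can be sketched in a few lines. This is not memsearch's code. Chunk, recall_in_fork and the keyword-overlap scoring are stand-ins I made up; the real skill searches a vector index. What the sketch preserves is the contract: the haystack goes in, and only a short curated summary comes out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    anchor: str  # session anchor the summary was written under
    text: str    # the few-line summary captured for that turn


def _terms(text: str) -> set[str]:
    """Lowercased word set used for the toy relevance score."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def recall_in_fork(query: str, haystack: list[Chunk], limit: int = 3) -> str:
    """Search inside an isolated scope and hand back only a short summary.

    Keyword overlap stands in for vector search here; it is not how
    memsearch ranks results.
    """
    wanted = _terms(query)
    scored = [(len(wanted & _terms(chunk.text)), chunk) for chunk in haystack]
    relevant = [(score, chunk) for score, chunk in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    # The caller receives this string and nothing else from the haystack.
    return "\n".join(f"- [{c.anchor}] {c.text}" for _, c in relevant[:limit])
```

The main agent would call recall_in_fork and see at most limit lines, however large the daily logs have grown.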

Capture happens on a hook. Recall happens in isolation.

Alongside memsearch I keep a second, smaller layer: hand-curated feedback notes at ~/.claude/projects//memory/. Short markdown files. “Always use gitmoji in commits.” “Review before PR.” “Use subagents for MCP work.” “Tests go in __tests__/, not test/.” memsearch captures the trace of what happened. These notes capture what must not happen again. The difference is that memsearch is automatic and probabilistic; the notes are deliberate and non-negotiable. A learning that costs something stays in the notes. A learning that captures itself stays in memsearch.

Failure mode if missing: the agent repeats every mistake the team has already corrected. You can tell a stack has no session memory because the same review comments come back PR after PR.

Domain context: code-memory MCP and OpenWolf

For a coding agent the domain is the code. This layer answers two different questions and uses two different tools to answer them.

The first question is “where in this codebase is the pattern for X?” That is semantic search, and it runs against the code-memory MCP. The system-architect agent runs index_codebase once at the start of a pipeline run. Downstream agents call search_code, search_docs (for docstrings and comments) and search_history (for past commits and PR rationale). The instruction is hard: search before deciding. If a pattern already exists, follow it. Inventing a parallel pattern is what produces two error base classes and three factories in the same repo.

The second question is “do I even need to open this file?” That is the job of OpenWolf, a multi-file system that sits inside the project at .wolf/. anatomy.md handles file navigation. It is an auto-maintained index of every tracked file with a two- or three-line description and a token estimate. A pre-tool hook injects it before any Read. If the description is enough, the agent does not open the file. If it is not, the agent opens the file and a post-tool hook updates the entry.
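
The pre-read and post-read behaviour can be sketched as a small in-memory index. The class names, method names and the four-characters-per-token estimate are my assumptions, not OpenWolf's implementation or file format.

```python
from dataclasses import dataclass


@dataclass
class AnatomyEntry:
    description: str     # two or three lines describing the file
    token_estimate: int  # rough cost of actually opening it


class Anatomy:
    """In-memory stand-in for the anatomy.md index."""

    def __init__(self) -> None:
        self.entries: dict[str, AnatomyEntry] = {}

    def pre_read(self, path: str) -> str:
        """What a pre-tool hook could inject before a Read of `path`."""
        entry = self.entries.get(path)
        if entry is None:
            return f"{path}: not indexed yet; open it to find out."
        return f"{path} (~{entry.token_estimate} tokens): {entry.description}"

    def post_read(self, path: str, content: str, description: str) -> None:
        """What a post-tool hook could record after the file was opened."""
        # Assumption: roughly four characters per token, a common rule of thumb.
        estimate = max(1, len(content) // 4)
        self.entries[path] = AnatomyEntry(description, estimate)
```

The agent still makes the call on whether the injected description is enough. The index only makes sure that decision is taken before the tokens are spent.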

OpenWolf does more than that. cerebrum.md is a structured learning ledger with four sections: User Preferences, Key Learnings, Do-Not-Repeat (dated) and Decision Log. The agent must update it whenever it learns something useful. memory.md is a chronological action log appended after every significant action, one line each: time, description, files touched, outcome, token estimate. buglog.json is a structured bug ledger; every error, every fix, every recurrence gets logged before the fix is attempted. The combined effect: anatomy controls what is read, cerebrum controls what is followed, memory traces what is done, buglog catches what fails.

That is more than a domain-context layer. OpenWolf overlaps session memory (cerebrum is the project’s permanent learning surface, parallel to memsearch’s daily logs) and adds a failure-memory layer the five-layer model does not name. Different stack, same intent: make sure the agent is operating with current, accurate, project-specific context every time it touches the code.

Failure mode if missing: the agent reads everything to find anything. Token cost balloons. The actual answer arrives slower and worse because the working-memory budget is full of irrelevant code. Worse: the agent repeats fixes that the project has already discarded.

Durable knowledge: the Obsidian vault

Decisions live in Obsidian with their reasoning intact. PRDs. Architecture decision records. Retros. Scenarios. Decomposition documents. Each gets written by a specific pipeline agent: the system-architect writes ADRs and architecture.md; the bdd-scenario-writer writes scenarios.md; the retrospective-agent writes the weekly retro.

If the temporal-facts layer (next) tells the agent that the project currently uses Result types for error handling, Obsidian tells it why, when the decision was made, what the alternatives were and what would have to be true to change it.

Failure mode if missing: the agent has facts without reasoning. It follows conventions without understanding them. When it has to make a new decision in the same domain, it has nothing to anchor to.

Temporal facts: MemoryGraph MCP

Durable knowledge looks like it covers this layer. It does not.

Take a concrete failure. In February the team writes ADR-7: auth service uses Clerk, with three paragraphs explaining why. Six months later the team migrates to WorkOS. ADR-12 supersedes ADR-7. Both ADRs still exist in the Obsidian vault. The agent searches “auth service”. The older ADR comes back first. The agent starts producing code against Clerk patterns in a codebase that no longer has Clerk.

Temporal facts solve this. MemoryGraph is a local, SQLite-backed knowledge graph that holds the current state of every architectural decision, every named entity (service, module, ticket) and every relationship between them. When the agent needs to know what the auth service uses right now, it does not search documents. It queries the graph. The graph answers with what is currently true.

Bi-temporal is the property that makes this work for an agentic pipeline. Every fact carries four temporal fields: valid_from and valid_until describe when the fact was true in the project; recorded_at and invalidated_by describe when it was written to the graph and (if ever) superseded. The agent can ask “what does the auth service use today?” and get the current answer. It can also ask “what did the auth service use on the first of March?” and get the answer that was current on that date. ADRs cannot answer the second question without the agent reasoning across superseded documents and inferring which was authoritative when.
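
Both questions reduce to one lookup over validity intervals. A minimal sketch, using the four field names above; the Fact class, the value_as_of helper, the field types and the dates are illustrative assumptions, not MemoryGraph's schema or API.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: date               # when the fact became true in the project
    valid_until: Optional[date]    # None while the fact is still true
    recorded_at: datetime          # when it was written to the graph
    invalidated_by: Optional[str]  # assumed here: a reference to what superseded it


def value_as_of(
    facts: list[Fact], subject: str, predicate: str, on: date
) -> Optional[str]:
    """Return the value that was valid on `on`, using a half-open interval."""
    for fact in facts:
        if fact.subject != subject or fact.predicate != predicate:
            continue
        started = fact.valid_from <= on
        not_ended = fact.valid_until is None or on < fact.valid_until
        if started and not_ended:
            return fact.value
    return None


# Illustrative data only: the year and the exact days are made up.
FACTS = [
    Fact("auth-service", "uses", "Clerk", date(2025, 2, 10), date(2025, 8, 10),
         datetime(2025, 2, 10, 9, 0), "ADR-12"),
    Fact("auth-service", "uses", "WorkOS", date(2025, 8, 10), None,
         datetime(2025, 8, 10, 9, 0), None),
]
```

Asked about the first of March, value_as_of returns Clerk. Asked about any date after the migration, it returns WorkOS. Neither answer requires ranking two ADRs against each other.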

Every write is scoped to the project namespace declared in CLAUDE.md. Never to the global default.

Failure mode if missing: the agent treats Obsidian as the live source. Older ADRs get cited as current. Decisions that have been superseded get re-applied. The system hallucinates with footnotes.

A sixth thing that isn’t a layer

A per-feature scratchpad sits alongside the stack: the memory-bank MCP. Each project gets active-context.md, progress.md, decisions.md and blockers.md. Every pipeline agent reads active-context.md on start and appends to progress.md on completion.
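
The read-on-start, append-on-completion contract is small enough to sketch with plain file access. The real memory-bank is reached through MCP tools; the two helper names and the progress line format below are my own, for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def read_active_context(bank: Path) -> str:
    """What a pipeline agent reads on start; empty if nothing is active yet."""
    context_file = bank / "active-context.md"
    if not context_file.exists():
        return ""
    return context_file.read_text(encoding="utf-8")


def append_progress(bank: Path, agent: str, outcome: str) -> None:
    """What a pipeline agent appends on completion; earlier lines are kept."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with (bank / "progress.md").open("a", encoding="utf-8") as handle:
        handle.write(f"- {stamp} {agent}: {outcome}\n")
```

Opening progress.md in append mode is the whole point: the file is an audit trail, so no agent gets to rewrite what an earlier agent recorded.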

It is not a memory layer in the sense the previous piece used. It is short-term operational state for a multi-agent pipeline: the audit trail of which agent ran when, which decision got made about a config value too small for an ADR, which blocker is currently active.

The layer model is the architecture. memory-bank is the connective tissue between agents working on the same feature in the same week.

The two rules the architecture enforces, not the prompt

Two rules emerge that no model upgrade will give you.

MemoryGraph first, Obsidian for full context. This is a cadence rule. MemoryGraph holds the current state with history. It is cheap to query and updated continuously by pipeline agents as decisions are made. Obsidian holds the formal narrative. It is expensive to write, slow to change and authoritative on reasoning. The agent’s default move is to ask MemoryGraph what is currently true, then go to Obsidian only when it needs the full story. Reversing this is how stacks end up citing stale ADRs.
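
As a sketch, the cadence rule is a read order. The resolve function and the two lookup callables are hypothetical stand-ins for whatever the graph and vault tools actually expose.

```python
from typing import Callable, Optional

# A lookup takes a key such as "auth-service" and returns text or None.
Lookup = Callable[[str], Optional[str]]


def resolve(
    key: str, graph: Lookup, vault: Lookup, need_reasoning: bool = False
) -> dict:
    """Ask the graph what is true now; open the vault only for the full story."""
    answer = {"current": graph(key), "reasoning": None}
    if need_reasoning:
        # The expensive, slow-changing source is consulted on request only.
        answer["reasoning"] = vault(key)
    return answer
```

Note what the function never does: it never asks the vault what is current. The vault can only add reasoning to a fact the graph has already supplied.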

Namespace boundary is structural, not prompt-level. The project slug declared in CLAUDE.md scopes every read and every write to MemoryGraph. “Never read or write the global namespace” is not an instruction the agent could choose to ignore. It is a constraint at the architecture layer. The namespace is supplied by the runtime context, not by the agent’s discretion.

That second rule is the structural-guardrails argument applied to memory state. A prompt that says “do not use the global namespace” is an instruction the system can violate. A namespace bound at the architecture layer is a constraint the system cannot cross. Only the second kind is real enforcement, and memory is exactly the surface where the difference matters most. The whole point of memory is to persist, which means a single bad write contaminates every future read.
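
A minimal sketch of what "bound at the architecture layer" can mean in code. ScopedGraph, the dict backend and the "global" constant are hypothetical; MemoryGraph's real interface will differ.

```python
from typing import Optional

GLOBAL_NAMESPACE = "global"  # hypothetical name for the shared default


class ScopedGraph:
    """A graph handle whose namespace is fixed when the runtime creates it."""

    def __init__(self, backend: dict, namespace: str) -> None:
        if not namespace or namespace == GLOBAL_NAMESPACE:
            raise ValueError("a project namespace is required")
        self._backend = backend
        self._namespace = namespace

    # Neither method accepts a namespace, so a tool call cannot name another one.
    def write(self, key: str, value: str) -> None:
        self._backend[(self._namespace, key)] = value

    def read(self, key: str) -> Optional[str]:
        return self._backend.get((self._namespace, key))
```

Python attributes are private by convention only, so in practice the boundary has to sit where the agent cannot reach: the runtime builds the handle from the slug and exposes nothing but read and write as tools.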

What this layer is replacing

A senior engineer used to do this work. Knew which file to open. Knew which decision was current and which had been quietly walked back. Remembered the conventions the team had agreed three months ago even though they were not written down. Could tell when an architectural pattern was being misapplied because they had been in the room when the original decision was made.

That work was real. Most of it was invisible. None of it was in the codebase.

Article 2 in this series called this kind of work compensating middleware. The informal judgement that holds an imperfect system stable. The argument then was that AI removes the buffer and forces systems to operate as they actually are. The memory architecture in this piece is one half of the answer to that. It is the deliberate, designed replacement for what the senior engineer’s working memory used to do informally.

The replacement is not optional. If a coding agent is going to make decisions that ship to production, the work the senior engineer used to do has to live somewhere. If it lives in the prompt, it is gone the moment the session ends. If it lives in a vector store with no scope and no write rule, it leaks between contexts and rots into noise. If it lives in five layers with defined tools, scopes and cadences, the agent has something to operate on.

Benchmarks measure model capability against fixed inputs. Memory architecture is what makes the system perform in a place benchmarks do not reach: across sessions, across projects, across the boundary between what the team decided yesterday and what it has to do today.

What I have not measured yet

I have not yet instrumented this stack end-to-end against a model-only baseline. Token cost per task, latency, accuracy across a fixed task set: the numbers would close the loop on the argument. That work is queued.

I am putting the architecture on the table without the numbers because the architecture is the missing piece. The numbers will sharpen the case. They will not change it. A working memory architecture beats a model-only stack on the metrics that matter once the demo is over. Instrumentation will quantify the gap. It will not reveal it.

What this means for a team building one

If you are building an AI coding agent and you are not sure where to start, start with CLAUDE.md. Write the working-memory layer first, by hand, and put the project slug in it. Then decide what goes in each of the other four layers. What is durable. What is temporal. What is per-session. What is codebase-specific. Pick tools that scope cleanly to each. Resist the urge to collapse two layers into one because the tooling is convenient. Each layer exists because it has a different cadence and a different trust model. Collapsing them costs you the architecture.

Then write the rules the architecture enforces. Where the namespace is bound. Where reads default. What gets written to MemoryGraph versus what gets written to Obsidian. These are not prompt instructions. They are constraints the system cannot cross. The whole point of the structural-guardrails argument is that this is the only kind of rule the system reliably obeys.

The previous piece said memory is the variable. This piece shows what the variable looks like once it has been built. The next time an AI coding agent forgets something it should know, the question is not which model is loaded. The question is which layer was supposed to hold that fact, and why it didn’t.

Tools

The full stack referenced in this piece. Each tool is open source or freely available; links go to the project repo.

  • memsearch (Zilliz). Session memory. Claude Code plugin. Stop-hook capture, Markdown source of truth, Milvus shadow index, fork-context recall via the memory-recall skill.
  • code-memory. Domain context, part one. Local vector search over the codebase. index_codebase runs once; search_code, search_docs and search_history cover the day-to-day reads.
  • OpenWolf. Domain context, part two. The .wolf/ multi-file system used here for the codebase file index (anatomy.md), learning memory (cerebrum.md), action log (memory.md) and failure memory (buglog.json). Hooks enforce the pre-read, post-write and bug-logging behaviour.
  • Obsidian plus mcpvault. Durable knowledge. Obsidian holds the vault; mcpvault is the MCP shim that exposes it to the agent. PRDs, ADRs, retros, scenarios and decomposition documents are written by named pipeline agents.
  • MemoryGraph. Temporal facts. Bi-temporal knowledge graph; SQLite by default; no API key required. Four temporal fields per fact: valid_from, valid_until, recorded_at, invalidated_by.
  • memory-bank. Per-feature pipeline scratchpad. Four files: active-context.md, progress.md, decisions.md and blockers.md. Read on agent start, appended on completion. The sixth thing that is not a layer.
