Make your coding agents self-evolving
Coding agents work well when they have the right context, and badly when they do not. With a clear picture of the conventions and the gotchas, modern models can carry out multi-step changes that were out of reach a year ago. Without it, the same models trip on small things.
Anthropic's post on context engineering describes this as a discipline of its own: managing what enters the model's window through compaction, structured note-taking, and just-in-time retrieval. The framing is useful because it treats context as a real engineering problem instead of something that just falls out of how you prompt the model.
AGENTS.md, skills, and their problems
Two conventions have settled in for shipping persistent project context to agents. One is a top-level AGENTS.md (or CLAUDE.md) that the agent reads at the start of every session. The other is skills: small bundles of markdown plus optional code that the agent loads on demand when their description matches the task.
Both are nice to work with in a team setting. They are plain markdown files in version control, so they diff cleanly in code review and a person can edit them by hand. There is no embedding model, no vector index, and no retrieval layer between what the team writes and what the agent reads. Skills add hierarchical loading on top of that: only descriptions are in the prompt by default, and the full body is pulled in when the agent decides a skill applies. The result is something that scales fairly well as the library of skills grows, while still being just files on disk.
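To make the skill convention concrete, a skill is typically a directory containing a SKILL.md whose frontmatter carries the name and description that sit in the prompt by default, with the body below loaded only when the agent decides the skill applies. A minimal sketch (the skill name and contents here are invented):

```markdown
---
name: cache-invalidation
description: How and when to invalidate the write-through cache after bulk updates.
---

After any bulk update, call `invalidate()` explicitly: writes do not
invalidate the cache on their own. See `cache/README.md` for the full
invariant.
```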
The hard part is producing these files in the first place, and two problems tend to show up in practice. The first is content quality. Recent work suggests hand-written context files help less than people assume. A 2025 evaluation on AgentBench found that developer-written context files improved task success by only about 4% on average while raising inference costs by close to 20%. The authors' implication is that these files are best kept to non-inferable details, since most other content overlaps with what the agent can read from the code. A separate study of 2,303 agent manifests in real repositories suggests this rarely happens in practice, though: the dominant content is build and run instructions, implementation details, and architecture, much of which the agent could derive on its own.
The useful contents of an AGENTS.md are the things the agent could not have figured out from the code itself. This matches what context-engineering practitioners have argued for some time.
The second problem is keeping the files up-to-date. The information that belongs in AGENTS.md or in a skill is the kind of thing you learn the hard way: a debugging session that uncovered a non-obvious invariant, a code review where a teammate caught a recurring mistake, a near-miss in production. By the time anyone might write it down, the session is over and the lesson has started to fade. Across a team it is worse, because different people learn different lessons and there is no shared way to pool them.
Self-evolving context
Dreamer is an approach to address both of these problems, with a reference implementation that runs as a self-hostable framework. Agents submit small notes during their sessions, while the lesson is still fresh, and a background process consolidates those notes across the team into the AGENTS.md and skills that the next session reads. The shape is borrowed from a feature Anthropic recently shipped to Claude Code called Auto Dream, where a background sub-agent reviews local memory files, prunes contradictions, and consolidates notes during downtime. Dreamer extends the same idea in three ways. Any coding agent that speaks MCP can submit memories, so a team running a mix of Claude Code, Cursor, and other agents can plug all of them into the same server. Submissions pool across a whole team and not a single workstation, which is what makes the framework useful as a way to gather lessons across people. The output is a versioned context bundle that everyone reads, so the consolidated knowledge ends up in the same git repository as the code it describes.
The framework has three concepts and one workflow that ties them together. Short-term memory is the raw material: small episodes that an agent submits when it encounters something worth remembering. Long-term memory is the distilled knowledge that builds up across many of those episodes over time. Context is the only thing the agents actually read, and it is just an AGENTS.md plus a set of skills generated from long-term memory. Agents do not query long-term memory directly. The whole system is arranged so that the only artifact reaching an agent is the context bundle, which keeps the agent side narrow and the long-term memory side free to evolve.
The flow is easy to describe in one sentence: a coding agent submits short-term memories through the Dreamer MCP server, the dream phase folds them into long-term memory and regenerates the AGENTS.md and skills bundle, and the agent reads that bundle at the start of the next session.
The cycle has two halves. Submission of short-term memory is continuous and happens per-agent, in the middle of whatever the agent was working on. Production of long-term memory and context is periodic, batched, and centralised on the server, and that second half is what we call the dreaming phase. The two halves run on very different timescales: a single agent might submit a handful of memories over the course of a session, while a dream might run every few days and consume what every agent in the team has submitted since the last one.
Structuring short-term memory
Submission goes through an MCP server with one main tool, which the agent calls whenever it judges that something is worth keeping. That judgement is more important than the mechanism. If the agent dumps every tool call and every file it looked at, the consolidation phase ends up with the same bloat problem hand-written AGENTS.md files already have. The rule is that an episode is only worth submitting when it is genuinely new, meaning that it is not already covered by an existing skill and cannot be inferred from the code. In the default setup the agent applies this filter itself, and is prompted to submit memories of one of three types.
Observations are insights the agent decides on its own are worth keeping. Most of them come from the user: a piece of business context, a domain rule, an unwritten convention that nobody has written down because everyone already knows it. Observations are worth recording precisely because the agent could not have derived them from the code on its own.
Failures are sessions where the agent reached a wrong conclusion or took a wrong action, and they are worth recording even when the agent did not think the situation was novel beforehand. The mistake itself is the signal, since it shows that the current context did not cover the case. A failure is also a two-part record: the failure itself, and the fix that has been found. If only the failure is submitted, the consolidation phase has nothing to learn from.
Code snippets are fragments of code with a short note on why they matter, and they are the weakest of the three types because the code is already in the repository and the agent can read it. They are useful when the commentary attached to them adds something the consolidation phase can turn into a skill, but bare snippets without context tend to add noise rather than information. The full set of memory types is configurable, so a team that wants a "deploy incident" or "design decision" type can declare one and the MCP tool will start accepting it.
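A minimal in-memory sketch of the server side of this submission tool, including a team-declared extra type. All names here are illustrative, not Dreamer's actual API:

```python
# Default memory types described above; teams can extend this set.
DEFAULT_TYPES = {"observation", "failure", "code_snippet"}

class STMStore:
    """Toy stand-in for the SQLite-backed short-term memory store."""

    def __init__(self, accepted_types=DEFAULT_TYPES):
        self.accepted_types = set(accepted_types)
        self.episodes: list[dict] = []

    def submit(self, type: str, content: str, agent: str) -> dict:
        # The MCP tool rejects types that were never declared in the config.
        if type not in self.accepted_types:
            raise ValueError(f"unknown memory type: {type!r}")
        episode = {"type": type, "content": content, "agent": agent}
        self.episodes.append(episode)
        return episode

# A team-specific "deploy_incident" type declared alongside the defaults:
store = STMStore(accepted_types=DEFAULT_TYPES | {"deploy_incident"})
store.submit(
    "failure",
    "Assumed the cache was invalidated on write; it is not. "
    "Fix: call invalidate() explicitly after bulk updates.",
    agent="claude-code/alice",
)
```

Note that the failure example carries both halves of the record, the wrong assumption and the fix, which is what gives the consolidation phase something to learn from.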
Dreaming
The dream consists of two sub-phases. The first one folds the new batch into long-term memory. The framework does not prescribe what long-term memory should look like. The default is a directory of interlinked markdown files, but a graph database, a vector store, or a relational schema would all work, since the dream engine only sees long-term memory through a workspace abstraction. In concrete terms, the engine receives a workspace pointing at the current long-term memory and a serialised view of the new batch, and produces a mutated workspace as output. The second sub-phase then walks over that updated long-term memory and evolves the context (the AGENTS.md and the skills directory) from it. Once both sub-phases have run, a post-dream hook can, for example, commit the result to git.
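The workspace abstraction can be pictured as a small protocol. This sketch is an assumption for illustration, not Dreamer's actual interface; the point is that the engine only ever sees read/write/list operations, whatever the backing store:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Workspace(Protocol):
    """How the dream engine sees long-term memory, regardless of backend."""
    def read(self, path: str) -> str: ...
    def write(self, path: str, content: str) -> None: ...
    def paths(self) -> list[str]: ...

class DictWorkspace:
    """Toy in-memory implementation; a real one might wrap a git checkout."""

    def __init__(self) -> None:
        self._files: dict[str, str] = {}

    def read(self, path: str) -> str:
        return self._files[path]

    def write(self, path: str, content: str) -> None:
        self._files[path] = content

    def paths(self) -> list[str]:
        return sorted(self._files)
```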
The full sequence of a dream looks like this:
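In code, the loop might look like the following sketch. Every name here is illustrative rather than Dreamer's actual API; it exists to show the ordering of the two sub-phases and the hook:

```python
import json

def serialise(batch: list[dict]) -> str:
    # Illustrative: the engine receives some serialised view of the new episodes.
    return json.dumps(batch, indent=2)

def dream(stm_store, ltm_store, context_store, engine, post_dream_hooks=()):
    batch = stm_store.drain()        # everything submitted since the last dream
    ltm = ltm_store.workspace()      # mutable view of long-term memory

    # Sub-phase 1: fold the new batch into long-term memory.
    engine.consolidate(ltm, serialise(batch))

    # Sub-phase 2: regenerate AGENTS.md and the skills directory from it.
    context = context_store.workspace()
    engine.evolve_context(ltm, context)

    # Post-dream hooks, e.g. commit the bundle to git and open a PR.
    for hook in post_dream_hooks:
        hook(ltm, context)
```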
Two design choices in this loop are worth pointing out, since neither one is forced by the problem itself. The first is going through long-term memory at all. A direct path from short-term memory to context would be shorter and would probably work for a small team, but it would couple the agent-facing output to whatever representation the consolidation step happens to use internally. With long-term memory in the middle, the internal representation can be as sophisticated as the team needs (a knowledge graph with explicit relations between observations, a vector store with semantic clustering, a relational schema with provenance tracking) while the consumable output stays standardised. There is no settled answer yet for what a long-term memory store should look like, as the field is evolving rapidly and each approach has its own tradeoffs. Routing everything through context lets the team swap the underlying store later without touching anything on the agent side.
The second is using AGENTS.md and skills as the agent-facing format. The format suits a team-scale knowledge base in several practical ways: it is plain markdown that anyone can review or hand-edit, skills partition the knowledge into units that can be added or revised independently, every modern coding agent CLI already understands the format, hierarchical loading covers knowledge bases too large to fit in a single prompt, and the whole bundle ships through the same git and CI pipelines the team already uses for code.
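Concretely, the generated bundle might be laid out like this (the skill names are invented for illustration):

```
context/
├── AGENTS.md                 # always-loaded, non-inferable notes
└── skills/
    ├── cache-invalidation/
    │   └── SKILL.md          # description in prompt; body loaded on demand
    └── release-process/
        └── SKILL.md
```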
Implementation and configuration
The reference implementation's defaults are picked so that a team can install Dreamer and have it running in a matter of minutes, without first having to make choices that require knowing the framework. Short-term memory goes into a SQLite database, and long-term memory and the generated context both live as markdown files on disk under memory/ and context/. The dream engine is built on the Claude Agent SDK and runs the consolidation as a coding-agent session against a checkout of memory/. Authentication is a small token scheme with a CLI for issuing and revoking tokens, and an optional post-dream hook can be used to commit the result to git and open a PR through the GitHub API. None of these defaults are fixed, however. Every component sits behind a Python Protocol, and the running server is wired up from a YAML config that names the class and parameters for each slot.
```yaml
# Example config
stm_store:
  class: dreamer.contrib.stm.sqlite.SQLiteSTMStore
  params:
    path: ./data/stm.db
ltm_store:
  class: dreamer.contrib.ltm.markdown.MarkdownLTMStore
  params:
    root: ./workspace/memory
context_store:
  class: dreamer.contrib.context.markdown.MarkdownContextStore
  params:
    root: ./workspace/context
dream_engine:
  class: dreamer.contrib.dream.claude_agent.ClaudeAgentDreamEngine
triggers:
  - class: dreamer.contrib.triggers.cron.CronTrigger
    params:
      schedule: "0 */6 * * *"
```
Replacing any of these defaults is a matter of installing a different package and pointing the config at it. The same shape of change covers a graph-backed long-term memory store, an OIDC auth backend, or a Slack notification in place of the git commit.
Summary
Dreamer moves the writing-down of AGENTS.md and skills inside the agent loop. Agents submit short memories when they encounter something the existing context did not cover. A scheduled dream consolidates those memories across the team into long-term memory, then regenerates the context bundle from that long-term memory. What the next session sees is a versioned AGENTS.md plus skills directory that updates incrementally, gets reviewed through normal pull requests, and starts every fresh agent off with whatever the team's agents have collectively learned since the last dream.
The reference implementation is open source and available at github.com/luml-ai/dreamer.


