Every AI coding assistant with persistent memory faces the same challenge: automatically loading the right context at the right time, without user intervention.
3-5 memories is the sweet spot. Beyond that, attention degradation sets in, especially for memories buried in the middle of the context window.
From Zep's 50-experiment study: optimize for precision over recall in memory routing. A false negative (missing a memory) is recoverable — the model can always ask for more context. A false positive (loading irrelevant memory) wastes tokens AND dilutes attention, and is not recoverable.
Relevance is not always keyword-obvious. A task saying "fix the build" could need deploy docs, project-specific conventions, background server management rules, or all three. Only the semantic content of the task reveals which.
| Aspect | Details |
|---|---|
| How | Define explicit rules: "if task mentions Svelte, load svelte5-pitfalls.md" |
| Latency | <5ms |
| Pros | Zero latency, deterministic, debuggable |
| Cons | Brittle, misses indirect references, doesn't scale past ~20 files |
| Coverage | ~80% for small memory sets with clear project boundaries |
This is essentially what your current CLAUDE.md does with its "Before X, read Y" instructions.
| Aspect | Details |
|---|---|
| How | Embed memory files into vector space, find top-K by cosine similarity to task |
| Latency | 50-500ms depending on implementation |
| Pros | Handles indirect references, no manual rule maintenance, scales to thousands |
| Cons | Requires embedding model + vector store, can miss implication-based relevance |
| Local models | nomic-embed-text via Ollama: sub-50ms on CPU for ~20 files |
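
The top-K step can be sketched in a few lines of shell. This is a minimal illustration, assuming the memory files have already been embedded and stored in a plain-text index (one file name followed by its vector per line); the index format, the `rank_memories` name, and the three-dimensional toy vectors are all hypothetical. Real vectors would come from an embedding model such as nomic-embed-text.

```bash
#!/bin/bash
# rank_memories QUERY_VECTOR K < index
# Index lines (hypothetical format): "<file> <x1> <x2> ... <xn>"
# QUERY_VECTOR: "x1 x2 ... xn", same dimension as the index vectors.
# Prints "<cosine similarity> <file>" for the K most similar files.
rank_memories() {
  local query="$1" k="$2"
  awk -v q="$query" '
    BEGIN {
      n = split(q, qv, " ")
      for (i = 1; i <= n; i++) qn += qv[i] * qv[i]   # squared norm of the query
    }
    {
      dot = 0; dn = 0
      for (i = 1; i <= n; i++) {
        dot += qv[i] * $(i + 1)
        dn  += $(i + 1) * $(i + 1)                   # squared norm of this row
      }
      if (qn > 0 && dn > 0) printf "%.3f %s\n", dot / sqrt(qn * dn), $1
    }
  ' | LC_ALL=C sort -rn | head -n "$k"
}

# Toy index: the values are made up for illustration only.
rank_memories "0.8 0.2 0.0" 2 <<'EOF'
svelte5-pitfalls.md 0.9 0.1 0.0
hetzner-deploy.md 0.1 0.9 0.2
cv.md 0.0 0.1 0.9
EOF
```

For a corpus of ~20 files a linear scan like this is enough; a vector store only becomes necessary as the corpus grows.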
| Aspect | Details |
|---|---|
| How | Stage 1: vector/keyword search returns 10-20 candidates. Stage 2: LLM re-ranks to top 3-5 |
| Latency | 200ms-2s |
| Pros | Best accuracy, handles both semantic and implication-based relevance |
| Cons | Higher latency, more complex, LLM step costs tokens |
| Aspect | Details |
|---|---|
| How | Load all one-line summaries (cheap), let LLM decide which full files to load |
| Latency | Adds one round-trip (LLM reads summaries, decides, reads files) |
| Pros | Low initial token cost, LLM makes the routing decision with full reasoning |
| Cons | Summary quality matters — bad summary = skipped relevant file |
This is already how Claude Code works with MEMORY.md — but the summaries need to be better.
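
A summary index of this kind can be generated rather than maintained by hand. The sketch below assumes, purely for illustration, that each memory file's first non-blank line serves as its one-line summary; the `build_summary_index` name and the sample files are hypothetical.

```bash
#!/bin/bash
# build_summary_index DIR
# Prints one "- <file>: <summary>" line per Markdown file in DIR.
# Assumption: the first non-blank line of each file is its summary.
build_summary_index() {
  local dir="$1" f summary
  for f in "$dir"/*.md; do
    [ -f "$f" ] || continue
    # Take the first non-blank line and strip leading Markdown heading marks.
    summary=$(awk 'NF { sub(/^#+[ \t]*/, ""); print; exit }' "$f")
    printf -- '- %s: %s\n' "$(basename "$f")" "$summary"
  done
}

# Demo with throwaway sample files.
demo_dir=$(mktemp -d)
printf '# Svelte 5 runes pitfalls\n\nDetails...\n' > "$demo_dir/svelte5-pitfalls.md"
printf 'Hetzner deploy checklist\n' > "$demo_dir/hetzner-deploy.md"
build_summary_index "$demo_dir"
rm -rf "$demo_dir"
```

Because the model routes on these lines alone, a vague first line effectively hides the file, which is why summary quality matters so much for this approach.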
| Aspect | Details |
|---|---|
| How | Begin with zero memory. Retrieve and inject as conversation clarifies the task |
| Used by | Cursor's "Dynamic Context Discovery" (Jan 2026) |
| Pros | Minimizes wasted tokens, adapts as task evolves |
| Cons | Risk of writing code before discovering the pitfalls file |
| Aspect | Details |
|---|---|
| How | Tag memories with topics/projects, classify the task, match |
| Latency | <10ms if pre-tagged |
| Pros | Fast, deterministic, more flexible than keyword matching |
| Cons | Tags need maintenance, cross-cutting concerns need special handling |
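
Tag matching can be sketched as follows. The tag table, the tag names, and the keyword-based `classify_task` stand-in are all hypothetical; a real classifier could be a small model or a richer rule set.

```bash
#!/bin/bash
# Hypothetical tag table: "<memory file>:<comma-separated tags>"
TAG_TABLE='svelte5-pitfalls.md:svelte,frontend
hetzner-deploy.md:deploy,infra
local-apps.md:media,local'

# Classify a task description into tags, one per line.
# This keyword lookup is only a stand-in for a real classifier.
classify_task() {
  local task
  task=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$task" in *svelte*|*component*) echo frontend ;; esac
  case "$task" in *deploy*|*server*|*build*) echo infra ;; esac
}

# Print every memory file that shares at least one tag with the task.
match_by_tags() {
  local task_tags file tags tag
  task_tags=$(classify_task "$1")
  while IFS=: read -r file tags; do
    for tag in $task_tags; do
      case ",$tags," in
        *",$tag,"*) echo "$file"; break ;;
      esac
    done
  done <<< "$TAG_TABLE"
}

match_by_tags "Fix the Svelte build"
```

Here "build" maps to the `infra` tag, so the deploy notes are loaded even though the task never mentions deployment. Plain keyword matching on file names would miss that.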
| Tool | Context Selection |
|---|---|
| Cursor | AST-based chunking + vector embeddings in Turbopuffer. Merkle trees for incremental re-indexing. Dynamic context discovery in agent mode (2026). |
| GitHub Copilot | Workspace indexing + LSP intelligence + semantic search. Server-side. Copilot Spaces for curated context. |
| ChatGPT | RAG over saved memories with recency/frequency/relevance scoring. Black box. |
| Cline/Roo Code | All-or-nothing: loads entire Memory Bank at session start. Works because it's project-scoped and small. |
| Claude Code | Memory selection agent picks up to 5 files by filename + one-line description match. No semantic search. |
The simplest practical approach — a shell script that runs as a SessionStart hook:
```bash
#!/bin/bash
# memory-router.sh - keyword-based memory routing
# Reads the hook's JSON payload on stdin, prints a hook response on stdout.
shopt -s nocasematch  # case-insensitive [[ =~ ]] matches

INPUT=$(cat)
# The prompt field name is an assumption: try .prompt first, then .user_prompt.
QUERY=$(echo "$INPUT" | jq -r '.prompt // .user_prompt // empty')
MEMORIES=""

# Append every existing copy of the named memory file, separated by blank lines.
load_memory() {
  local f
  for f in ~/.claude/projects/*/memory/"$1"; do
    [[ -f "$f" ]] && MEMORIES+="$(cat "$f")"$'\n\n'
  done
}

# Project detection
[[ "$QUERY" =~ svelte ]] && load_memory svelte5-pitfalls.md
[[ "$QUERY" =~ hetzner|deploy ]] && load_memory hetzner-deploy.md
[[ "$QUERY" =~ video|manga|gallery ]] && load_memory local-apps.md
[[ "$QUERY" =~ cv|resume ]] && load_memory cv.md

# Output as hook response
jq -n --arg ctx "$MEMORIES" '{
  "hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": $ctx
  }
}'
```
Give Claude an explicit search_memories tool it can call on demand:
| MCP Server | Backend | Privacy |
|---|---|---|
| memorious-mcp | ChromaDB (local) | Full |
| mcp-server-qdrant | Qdrant (local/cloud) | Configurable |
| code-memory | Tree-sitter + sentence-transformers | Full |
| vector-memory-mcp | LanceDB + local embeddings | Full |
Use a fast, small model (Haiku or local Phi-3) as a dedicated router. Feed it the task description + memory file titles/summaries, ask which files are relevant. Latency: ~200-500ms API, ~1-2s local.
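
The router's input can be as simple as a prompt assembled from the task and the summary list. The sketch below only builds that prompt; the wording, the `build_router_prompt` name, and the reply format are assumptions, and the call to the model itself is left out because it depends on the chosen API or local runtime.

```bash
#!/bin/bash
# build_router_prompt TASK SUMMARIES
# Prints a routing prompt for a small model. The prompt wording and the
# expected reply format (one filename per line, or NONE) are assumptions.
build_router_prompt() {
  local task="$1" summaries="$2"
  cat <<EOF
You are a memory router. Pick at most 5 memory files relevant to the task.
Prefer skipping a file over including a doubtful one.
Reply with one filename per line, or NONE.

Task: $task

Memory files:
$summaries
EOF
}

build_router_prompt "fix the build" \
"- hetzner-deploy.md: Hetzner deploy checklist
- svelte5-pitfalls.md: Svelte 5 runes pitfalls"
```

The instruction to skip doubtful files follows the precision-over-recall guidance: a missed memory can still be requested later, while an irrelevant one costs tokens and attention.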
| Project | What It Does |
|---|---|
| ContextStream | Automatic memory injection via hooks. Captures decisions, lessons, patterns. Injects relevant context per message. |
| Claude-Mem | Tool usage capture, compress into ~500-token observations via Claude SDK, store in SQLite + ChromaDB. Progressive disclosure. ~10x token efficiency. |
| Continuous-Claude-v3 | State via ledgers and handoffs, isolated context windows for MCP execution. |
1. Keyword rules: catch obvious project names, file paths, and tool names, and load "must-have" memories (always-on rules).
2. Embedding search: embed the task description and find the top-10 candidate memories with a local model (nomic-embed-text via Ollama).
3. LLM re-rank: a small model scores the candidates against the task and returns the top 3-5 with confidence scores.
4. Injection: high-confidence memories are injected at session start; medium-confidence ones are noted as "available if needed."
5. On-demand retrieval: as the conversation progresses, the model can request additional memories via an MCP tool call.
Total latency budget: <600ms. Imperceptible to the user. Claude Code's SessionStart hook timeout (default 5s) is more than enough.
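
The split between injected and merely advertised memories could look like the sketch below. It assumes the re-ranking step emits one `<score> <file>` line per candidate; the thresholds of 0.8 and 0.5 and the `bucket_memories` name are hypothetical, and the cap of 5 follows the 3-5 memory guideline.

```bash
#!/bin/bash
# bucket_memories HIGH MEDIUM [MAX] < "<score> <file>" lines
# Prints "inject <file>" for scores >= HIGH (at most MAX files, default 5)
# and "available <file>" for remaining scores >= MEDIUM.
# Anything below MEDIUM is dropped.
bucket_memories() {
  local high="$1" medium="$2" max="${3:-5}"
  LC_ALL=C sort -rn | awk -v hi="$high" -v mid="$medium" -v max="$max" '
    $1 >= hi && injected < max { injected++; print "inject " $2; next }
    $1 >= mid                  { print "available " $2 }
  '
}

# Made-up scores for illustration; the thresholds would need tuning.
bucket_memories 0.8 0.5 <<'EOF'
0.61 hetzner-deploy.md
0.92 svelte5-pitfalls.md
0.30 cv.md
EOF
```

High-scoring files beyond the cap fall through to "available" rather than being dropped, so the model can still request them on demand.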
| Approach | Privacy | Speed | Quality |
|---|---|---|---|
| Local embeddings (Ollama) | Full | Fast | Good |
| Local small LLM router | Full | Medium | Better |
| API embeddings (OpenAI/Voyage) | Cloud | Fast | Best |
| API LLM router (Haiku) | Cloud | Medium | Best |
The quality gap between nomic-embed-text running locally and OpenAI's text-embedding-3-small is marginal for a corpus of ~20 short markdown files. The privacy benefit is absolute.