The Memory Routing Problem

Automatically loading the right context at the right time, without user intervention.

The Problem Statement

Every AI coding assistant with persistent memory faces the same challenge:

  1. User has N memory files covering different projects/topics
  2. A new session starts with a task
  3. The harness needs to decide which memories are relevant WITHOUT the user telling it
  4. Many current approaches dump everything into context (wasteful, noisy)
  5. The user shouldn't have to be the "memory manager" — that's the harness's job

Key Research Finding: Indiscriminate memory loading performs worse than no memory at all. In tests with three different agent architectures, storing and retrieving every available memory ("add-all") degraded performance compared to using zero memory. Bad or irrelevant memories create a "propagating error feedback loop" where the model attends to noise instead of signal. (Harvard D3 Institute, Adaptive Memory Admission Control, 2026)

What the Research Says

The "Lost in the Middle" Effect

Loading 3-5 memories per task is the sweet spot. Beyond that, the model's attention degrades, especially for memories buried in the middle of the context window.

Precision Over Recall

From Zep's 50-experiment study: optimize for precision over recall in memory routing. A false negative (missing a memory) is recoverable — the model can always ask for more context. A false positive (loading irrelevant memory) wastes tokens AND dilutes attention, and is not recoverable.

The Relevance Paradox

Relevance is not always keyword-obvious. A task saying "fix the build" could need deploy docs, project-specific conventions, background server management rules, or all three. Only the semantic content of the task reveals which.

Approaches to Solving It

1. Keyword / Trigger Matching

Aspect | Details
How | Define explicit rules: "if task mentions Svelte, load svelte5-pitfalls.md"
Latency | <5ms
Pros | Zero latency, deterministic, debuggable
Cons | Brittle, misses indirect references, doesn't scale past ~20 files
Coverage | ~80% for small memory sets with clear project boundaries

This is essentially what your current CLAUDE.md does with its "Before X, read Y" instructions.

2. Semantic Similarity (Embedding-Based)

Aspect | Details
How | Embed memory files into vector space, find top-K by cosine similarity to task
Latency | 50-500ms depending on implementation
Pros | Handles indirect references, no manual rule maintenance, scales to thousands
Cons | Requires embedding model + vector store, can miss implication-based relevance
Local models | nomic-embed-text via Ollama: sub-50ms on CPU for ~20 files
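
The top-K step can be sketched in a few lines of shell and awk. This is a minimal sketch, not a vector store: it assumes the embeddings were computed beforehand (for example by the local model above) and written to a hypothetical index with one line per file, `<file> <v1> <v2> ...`. The function name and the toy 3-dimensional vectors are illustrative only.

```shell
#!/bin/sh
# top_k_by_cosine K "q1 q2 ..." < index
# Each index line is "<file> <v1> <v2> ..." (hypothetical format; the vectors
# are assumed to have been produced by the embedding model beforehand).
top_k_by_cosine() (
  k=$1
  query=$2
  LC_ALL=C awk -v q="$query" '
    BEGIN {
      n = split(q, qv, " ")
      for (i = 1; i <= n; i++) qnorm += qv[i] * qv[i]
    }
    NF == n + 1 {                      # skip rows with the wrong dimension
      dot = 0; dnorm = 0
      for (i = 1; i <= n; i++) {
        dot   += qv[i] * $(i + 1)
        dnorm += $(i + 1) * $(i + 1)
      }
      if (qnorm > 0 && dnorm > 0)
        printf "%.4f %s\n", dot / sqrt(qnorm * dnorm), $1
    }' | LC_ALL=C sort -rn | head -n "$k"
)

# Toy vectors, for illustration only: prints the two closest files with scores.
printf '%s\n' \
  'svelte5-pitfalls.md 0.9 0.1 0.0' \
  'hetzner-deploy.md 0.1 0.9 0.2' \
  'cv.md 0.0 0.1 0.9' |
  top_k_by_cosine 2 "1 0 0"
```

For ~20 files a flat scan like this is likely sufficient; a dedicated vector store mostly pays off at larger corpus sizes.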

3. Two-Stage Retrieval (Fast Retrieve + LLM Re-rank)

Aspect | Details
How | Stage 1: vector/keyword search returns 10-20 candidates. Stage 2: LLM re-ranks to top 3-5
Latency | 200ms-2s
Pros | Best accuracy, handles both semantic and implication-based relevance
Cons | Higher latency, more complex, LLM step costs tokens
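
The shape of the two stages can be sketched as a pipeline with a pluggable scorer. In this sketch, stage 1 is whatever cheap retriever writes candidate file names to stdin, and stage 2 calls a scorer once per candidate. `overlap_scorer` is a deliberately crude stand-in for the LLM call (it only counts shared words), and `svelte-build-notes.md` is a hypothetical file name; both are assumptions for illustration.

```shell
#!/bin/sh
# two_stage TASK KEEP SCORER < candidates
# Stage 1 output (candidate file names, one per line) arrives on stdin.
# Stage 2 runs SCORER on each candidate and keeps the KEEP best.
two_stage() (
  task=$1
  keep=$2
  scorer=$3
  while IFS= read -r candidate; do
    [ -n "$candidate" ] || continue
    score=$("$scorer" "$task" "$candidate")
    printf '%s %s\n' "$score" "$candidate"
  done | LC_ALL=C sort -rn | head -n "$keep" | cut -d' ' -f2-
)

# Stand-in for the LLM re-ranker: counts task words found in the file name.
overlap_scorer() (
  set -f                      # no globbing while splitting the task into words
  hits=0
  for word in $1; do
    case "$2" in *"$word"*) hits=$((hits + 1)) ;; esac
  done
  echo "$hits"
)

printf '%s\n' hetzner-deploy.md svelte5-pitfalls.md svelte-build-notes.md |
  two_stage "fix svelte build" 2 overlap_scorer
```

Swapping `overlap_scorer` for a function that calls a model keeps the rest of the pipeline unchanged, which is the main reason to separate the stages.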

4. Hierarchical Summaries (Summary-First, Drill-Down)

Aspect | Details
How | Load all one-line summaries (cheap), let LLM decide which full files to load
Latency | Adds one round-trip (LLM reads summaries, decides, reads files)
Pros | Low initial token cost, LLM makes the routing decision with full reasoning
Cons | Summary quality matters — bad summary = skipped relevant file

This is already how Claude Code works with MEMORY.md — but the summaries need to be better.
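
One way to keep a summary index from drifting is to derive it from the memory files themselves. The sketch below assumes a convention that is not stated in the source: the first non-empty line of each file (minus any leading `#`) is its one-line summary. The function name is hypothetical.

```shell
#!/bin/sh
# build_summary_index DIR
# Prints "<file>: <summary>" for every .md file in DIR, where the summary is
# assumed to be the first non-empty line with any leading "#" stripped.
build_summary_index() (
  dir=$1
  for f in "$dir"/*.md; do
    [ -f "$f" ] || continue            # directory had no .md files
    summary=$(awk 'NF { sub(/^#+[ \t]*/, ""); print; exit }' "$f")
    printf '%s: %s\n' "$(basename "$f")" "$summary"
  done
)
```

Running `build_summary_index path/to/memory` yields a compact listing that can be loaded in place of the full files; the quality of the first line then becomes the thing to maintain.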

5. Dynamic Context Windows (Start Minimal, Expand)

Aspect | Details
How | Begin with zero memory. Retrieve and inject as conversation clarifies the task
Used by | Cursor's "Dynamic Context Discovery" (Jan 2026)
Pros | Minimizes wasted tokens, adapts as task evolves
Cons | Risk of writing code before discovering the pitfalls file
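
A small piece of bookkeeping makes incremental expansion practical: remembering what has already been injected so each turn only adds new files. The sketch below is one possible shape, assuming a per-session state file (a hypothetical convention) and that some retriever supplies the currently relevant file names on stdin.

```shell
#!/bin/sh
# newly_relevant STATE_FILE < relevant-file-names
# Prints only the names not injected earlier in this session and records them
# in STATE_FILE (a hypothetical per-session list, one name per line).
newly_relevant() (
  loaded=$1
  touch "$loaded"
  fresh=$(awk -v loaded="$loaded" '
    BEGIN { while ((getline line < loaded) > 0) seen[line] = 1 }
    !($0 in seen) && !dup[$0]++')
  if [ -n "$fresh" ]; then
    printf '%s\n' "$fresh" | tee -a "$loaded"
  fi
)

state=$(mktemp)
printf '%s\n' svelte5-pitfalls.md | newly_relevant "$state"                    # turn 1
printf '%s\n' svelte5-pitfalls.md hetzner-deploy.md | newly_relevant "$state"  # turn 2
rm -f "$state"
```

The second call emits only the file that was not already in context, so repeated retrieval does not re-spend tokens on the same memory.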

6. Pre-Classification (Tag + Match)

Aspect | Details
How | Tag memories with topics/projects, classify the task, match
Latency | <10ms if pre-tagged
Pros | Fast, deterministic, more flexible than keyword matching
Cons | Tags need maintenance, cross-cutting concerns need special handling
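
Tag-and-match can be sketched as two small functions: one that maps a task to tags, and one that selects files sharing at least one tag. The tag vocabulary (`frontend`, `ops`, `personal`), the keyword rules, and the `<file> <tag,tag,...>` index format below are all assumptions made for the example.

```shell
#!/bin/sh
# classify_task TASK: prints the space-separated tags inferred from the task.
# The rules and tag names are illustrative, not a fixed taxonomy.
classify_task() (
  task=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  tags=""
  case "$task" in *svelte*|*component*)      tags="$tags frontend" ;; esac
  case "$task" in *deploy*|*build*|*server*) tags="$tags ops" ;; esac
  case "$task" in *resume*)                  tags="$tags personal" ;; esac
  printf '%s\n' "${tags# }"
)

# match_tags "tag1 tag2" < index
# Index lines are "<file> <tag,tag,...>"; prints files sharing any wanted tag.
match_tags() (
  awk -v want="$1" '
    BEGIN { n = split(want, w, " "); for (i = 1; i <= n; i++) wanted[w[i]] = 1 }
    {
      m = split($2, t, ",")
      for (i = 1; i <= m; i++) if (t[i] in wanted) { print $1; next }
    }'
)

printf '%s\n' \
  'svelte5-pitfalls.md frontend,svelte' \
  'hetzner-deploy.md ops,hetzner' \
  'cv.md personal' |
  match_tags "$(classify_task "Fix the Svelte build")"
```

Because one task can map to several tags, a cross-cutting request such as "fix the build" pulls in both the frontend and the ops memories without a dedicated rule per file.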

What Current Tools Do

Tool | Context Selection
Cursor | AST-based chunking + vector embeddings in Turbopuffer. Merkle trees for incremental re-indexing. Dynamic context discovery in agent mode (2026).
GitHub Copilot | Workspace indexing + LSP intelligence + semantic search. Server-side. Copilot Spaces for curated context.
ChatGPT | RAG over saved memories with recency/frequency/relevance scoring. Black box.
Cline/Roo Code | All-or-nothing: loads entire Memory Bank at session start. Works because it's project-scoped and small.
Claude Code | Memory selection agent picks up to 5 files by filename + one-line description match. No semantic search.

Build-Your-Own Solutions

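Option A: SessionStart Hook with Keyword Matching

The simplest practical approach is a shell script registered as a Claude Code hook. The script needs the task text from the hook payload: UserPromptSubmit events carry it as .prompt, whereas a SessionStart payload may not include the prompt at all, so register the script under whichever event actually supplies it.

#!/bin/bash
# memory-router.sh - keyword-based memory routing
# Reads the hook's JSON payload on stdin, prints matching memories as JSON.
shopt -s nocasematch   # "svelte", "Svelte" and "SVELTE" all match

INPUT=$(cat)
# .prompt is the UserPromptSubmit field; .user_prompt is kept as a fallback.
QUERY=$(printf '%s' "$INPUT" | jq -r '.prompt // .user_prompt // empty')
EVENT=$(printf '%s' "$INPUT" | jq -r '.hook_event_name // "SessionStart"')
MEMORIES=""

# Append one memory file (from any project directory), blank-line separated.
load() {
  local content
  content=$(cat ~/.claude/projects/*/memory/"$1" 2>/dev/null)
  [[ -n "$content" ]] && MEMORIES+="${content}"$'\n\n'
}

# Project detection
[[ "$QUERY" =~ svelte ]] && load svelte5-pitfalls.md
[[ "$QUERY" =~ hetzner|deploy ]] && load hetzner-deploy.md
[[ "$QUERY" =~ video|manga|gallery ]] && load local-apps.md
[[ "$QUERY" =~ cv|resume ]] && load cv.md

# Output as hook response, echoing back the event that invoked the script
jq -n --arg ctx "$MEMORIES" --arg event "$EVENT" '{
  "hookSpecificOutput": {
    "hookEventName": $event,
    "additionalContext": $ctx
  }
}'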

Option B: MCP Server with Semantic Search

Give Claude an explicit search_memories tool it can call on demand:

MCP Server | Backend | Privacy
memorious-mcp | ChromaDB (local) | Full
mcp-server-qdrant | Qdrant (local/cloud) | Configurable
code-memory | Tree-sitter + sentence-transformers | Full
vector-memory-mcp | LanceDB + local embeddings | Full

Option C: Small Model as Router

Use a fast, small model (Haiku or local Phi-3) as a dedicated router. Feed it the task description + memory file titles/summaries, ask which files are relevant. Latency: ~200-500ms API, ~1-2s local.
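
The router's input and output handling can be sketched without committing to a particular model or API. The sketch below only builds the prompt and validates the reply; the model call itself is left out. It assumes a summary index with lines of the form `<file>: <summary>` and file names that contain no colon. The prompt wording, function names, and the cap of 5 are illustrative choices.

```shell
#!/bin/sh
# build_router_prompt TASK INDEX_FILE
# INDEX_FILE lines are "<file>: <summary>" (assumed format).
build_router_prompt() (
  printf 'Task: %s\n\n' "$1"
  printf 'Memory files (name: summary):\n'
  sed 's/^/- /' "$2"
  printf '\nReply with at most 5 file names, one per line, most relevant first.\n'
  printf 'Reply NONE if no file is relevant.\n'
)

# parse_router_reply INDEX_FILE < model-reply
# Keeps only names that exist in the index (guards against invented files),
# drops duplicates, and caps the result at 5.
parse_router_reply() (
  awk -v index_file="$1" '
    BEGIN {
      while ((getline line < index_file) > 0) {
        split(line, part, ":")
        known[part[1]] = 1
      }
    }
    { gsub(/^[ \t-]+|[ \t]+$/, "") }   # strip list bullets and stray spaces
    ($0 in known) && !seen[$0]++ && ++kept <= 5'
)
```

Validating the reply against the index matters more with small models, which are more likely to return a plausible but nonexistent file name.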

Existing Implementations

Project | What It Does
ContextStream | Automatic memory injection via hooks. Captures decisions, lessons, patterns. Injects relevant context per message.
Claude-Mem | Tool usage capture, compress into ~500-token observations via Claude SDK, store in SQLite + ChromaDB. Progressive disclosure. ~10x token efficiency.
Continuous-Claude-v3 | State via ledgers and handoffs, isolated context windows for MCP execution.

The Ideal Architecture

Stage 0: Fast keyword/regex match (<5ms)
Catches obvious project names, file paths, tool names. Loads "must-have" memories (always-on rules).

Stage 1: Embedding search (<100ms)
Embed task description, find top-10 candidate memories. Local model (nomic-embed-text via Ollama).

Stage 2: LLM re-rank / filter (<500ms)
Small model scores candidates against task. Returns top 3-5 with confidence scores.

Stage 3: Inject into context
High-confidence memories injected at session start. Medium-confidence noted as "available if needed."

Stage 4: Dynamic expansion (ongoing)
As conversation progresses, model can request additional memories via MCP tool call.

Total latency budget: roughly 600ms across the keyword, embedding, and re-rank steps. That is imperceptible to the user and leaves ample headroom within Claude Code's hook timeout.
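
The split between injected and merely advertised memories can be sketched as a filter over the re-ranker's output. The thresholds (0.8 and 0.5), the cap of 5, and the `<score> <file>` input format are assumptions chosen for the example; the source does not specify cut-offs.

```shell
#!/bin/sh
# band_by_confidence [HIGH] [MEDIUM] [CAP] < "<score> <file>" lines
# Prints "inject <file>" for scores >= HIGH, up to CAP files, and
# "available <file>" for the rest at or above MEDIUM. Lower scores are dropped.
band_by_confidence() (
  hi=${1:-0.8}
  mid=${2:-0.5}
  cap=${3:-5}
  LC_ALL=C sort -rn | awk -v hi="$hi" -v mid="$mid" -v cap="$cap" '
    $1 + 0 >= hi + 0 && injected < cap { injected++; print "inject", $2; next }
    $1 + 0 >= mid + 0                  { print "available", $2 }'
)

printf '%s\n' \
  '0.62 hetzner-deploy.md' \
  '0.91 svelte5-pitfalls.md' \
  '0.20 cv.md' |
  band_by_confidence
```

High-scoring files beyond the cap fall through to "available" rather than being dropped, which keeps the injected set small without hiding anything the model might later request.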

Privacy, Speed, and Quality Trade-offs

Approach | Privacy | Speed | Quality
Local embeddings (Ollama) | Full | Fast | Good
Local small LLM router | Full | Medium | Better
API embeddings (OpenAI/Voyage) | Cloud | Fast | Best
API LLM router (Haiku) | Cloud | Medium | Best

For your setup: With ~20 memory files, clear project boundaries, and an RTX 3060, local embeddings are the right default. The quality gap between nomic-embed-text locally and OpenAI's text-embedding-3-small is marginal for a corpus of ~20 short markdown files. The privacy benefit is absolute.

Practical recommendation: Keyword matching gets you 80% of the way for <30 files with clear project boundaries. Semantic search closes the remaining 20%. Build the pipeline incrementally: Phase 1 (CLAUDE.md triggers) → Phase 2 (keyword hook) → Phase 3 (semantic search) → Phase 4 (full pipeline). See Recommendations.