The Memory Routing Problem

Automatically loading the right context at the right time, without user intervention.

The Problem Statement
What the Research Says
Approaches to Solving It
What Current Tools Do
Build-Your-Own Solutions
The Ideal Architecture
Precision vs. Recall Trade-offs

The Problem Statement

Every AI coding assistant with persistent memory faces the same challenge:

User has N memory files covering different projects/topics
A new session starts with a task
The harness needs to decide which memories are relevant WITHOUT the user telling it
Many current approaches simply dump everything into context (wasteful, noisy)
The user shouldn't have to be the "memory manager" — that's the harness's job

Key Research Finding: Indiscriminate memory loading performs worse than no memory at all. In tests with three different agent architectures, storing and retrieving every available memory ("add-all") degraded performance compared to using zero memory. Bad or irrelevant memories create a "propagating error feedback loop" where the model attends to noise instead of signal. (Harvard D3 Institute, Adaptive Memory Admission Control, 2026)

What the Research Says

The "Lost in the Middle" Effect

Loading 3-5 memories is the sweet spot. Beyond that, attention degrades, especially for memories buried in the middle of the context window, where models use information less reliably than at the start or end.

Precision Over Recall

From Zep's 50-experiment study: optimize for precision over recall in memory routing. A false negative (missing a memory) is recoverable — the model can always ask for more context. A false positive (loading irrelevant memory) wastes tokens AND dilutes attention, and is not recoverable.
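To make the trade-off concrete, here is a minimal sketch that computes precision (the share of loaded memories that were relevant) and recall (the share of relevant memories that were loaded) for a single routing decision. The function name `precision_recall` and the choice of which files count as relevant are assumptions made for illustration.

```bash
# precision_recall "LOADED FILES" "RELEVANT FILES"
# Both arguments are space-separated lists of file names.
precision_recall() {
  awk -v loaded="$1" -v relevant="$2" 'BEGIN {
    n_loaded = split(loaded, L, " ")
    n_relevant = split(relevant, R, " ")
    for (i = 1; i <= n_relevant; i++) is_relevant[R[i]] = 1
    hits = 0
    for (i = 1; i <= n_loaded; i++) if (L[i] in is_relevant) hits++
    printf "precision=%.2f recall=%.2f\n",
      (n_loaded ? hits / n_loaded : 0), (n_relevant ? hits / n_relevant : 0)
  }'
}

# Hypothetical task where two memories are actually relevant.
RELEVANT="hetzner-deploy.md server-rules.md"

# "Add-all" routing loads four files, one of which is relevant.
precision_recall "svelte5-pitfalls.md hetzner-deploy.md local-apps.md cv.md" "$RELEVANT"
# precision=0.25 recall=0.50

# A precise router loads only the one file it is sure about.
precision_recall "hetzner-deploy.md" "$RELEVANT"
# precision=1.00 recall=0.50
```

Both routers miss the same file, but the precise one spends no tokens on irrelevant memories, which is the outcome the guidance above favors.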

The Relevance Paradox

Relevance is not always keyword-obvious. A task saying "fix the build" could need deploy docs, project-specific conventions, background server management rules, or all three. Only the semantic content of the task reveals which.

Approaches to Solving It

1. Keyword / Trigger Matching

Aspect	Details
How	Define explicit rules: "if task mentions Svelte, load svelte5-pitfalls.md"
Latency	<5ms
Pros	Near-zero latency, deterministic, debuggable
Cons	Brittle, misses indirect references, doesn't scale past ~20 files
Coverage	~80% for small memory sets with clear project boundaries

This is essentially what your current CLAUDE.md does with its "Before X, read Y" instructions.
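The rule-based approach can be sketched as a small lookup table. This is an illustrative sketch, not a prescribed format: the `RULES` array, the `pattern=file` encoding, and the function name `route_by_keyword` are assumptions, and the file names are borrowed from examples elsewhere in this document.

```bash
# Hypothetical rule table: "regex=memory file". Patterns are lowercase
# because the task text is lowercased before matching.
RULES=(
  "svelte=svelte5-pitfalls.md"
  "hetzner|deploy=hetzner-deploy.md"
  "video|manga|gallery=local-apps.md"
)

# Print the memory file of every rule whose pattern matches the task.
route_by_keyword() {
  local task rule pattern file
  task=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for rule in "${RULES[@]}"; do
    pattern=${rule%%=*}   # text before the first "="
    file=${rule#*=}       # text after the first "="
    if [[ "$task" =~ $pattern ]]; then
      printf '%s\n' "$file"
    fi
  done
}

route_by_keyword "Deploy the Svelte app to Hetzner"
# svelte5-pitfalls.md
# hetzner-deploy.md

route_by_keyword "fix the build"
# (no output: no rule mentions "build")
```

The second call shows the brittleness listed under Cons: a task that refers to a memory only indirectly matches nothing.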

2. Semantic Similarity (Embedding-Based)

Aspect	Details
How	Embed memory files into vector space, find top-K by cosine similarity to task
Latency	50-500ms depending on implementation
Pros	Handles indirect references, no manual rule maintenance, scales to thousands
Cons	Requires embedding model + vector store, can miss implication-based relevance
Local models	`nomic-embed-text` via Ollama: sub-50ms on CPU for ~20 files
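The ranking step of embedding-based routing is plain cosine similarity followed by a top-K cut. The sketch below shows only that step. The three-dimensional vectors are hand-written stand-ins (real embeddings come from a model and have hundreds of dimensions), and the names `MEMORY_VECTORS` and `top_k` are assumptions. The call to the embedding model is deliberately left out.

```bash
# Stand-in embeddings, one memory per line: "file x y z".
MEMORY_VECTORS='svelte5-pitfalls.md 0.9 0.1 0.0
hetzner-deploy.md 0.1 0.9 0.2
cv.md 0.0 0.1 0.9'

# top_k K Q1 Q2 Q3 ...
# Print "score file" for the K memories most similar to the query vector.
top_k() {
  local k=$1
  shift
  printf '%s\n' "$MEMORY_VECTORS" |
    LC_ALL=C awk -v query="$*" '
      BEGIN { n = split(query, q, " ") }
      {
        dot = 0; norm_m = 0; norm_q = 0
        for (i = 1; i <= n; i++) {
          m = $(i + 1)            # i-th component of the memory vector
          dot += q[i] * m
          norm_m += m * m
          norm_q += q[i] * q[i]
        }
        # Cosine similarity; assumes neither vector is all zeros.
        printf "%.3f %s\n", dot / (sqrt(norm_m) * sqrt(norm_q)), $1
      }' |
    LC_ALL=C sort -rn | head -n "$k"
}

# A query vector that points mostly along the "deploy" direction.
top_k 2 0.2 0.8 0.1
```

With these toy vectors the deploy memory ranks first and the Svelte memory second. With roughly 20 files a linear scan like this is enough, and a vector store only becomes necessary at larger scale.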

3. Two-Stage Retrieval (Fast Retrieve + LLM Re-rank)

Aspect	Details
How	Stage 1: vector/keyword search returns 10-20 candidates. Stage 2: LLM re-ranks to top 3-5
Latency	200ms-2s
Pros	Best accuracy, handles both semantic and implication-based relevance
Cons	Higher latency, more complex, LLM step costs tokens

4. Hierarchical Summaries (Summary-First, Drill-Down)

Aspect	Details
How	Load all one-line summaries (cheap), let LLM decide which full files to load
Latency	Adds one round-trip (LLM reads summaries, decides, reads files)
Pros	Low initial token cost, LLM makes the routing decision with full reasoning
Cons	Summary quality matters — bad summary = skipped relevant file

This is already how Claude Code works with MEMORY.md — but the summaries need to be better.
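Since summary quality drives this approach, it helps to generate the summary index from the files themselves rather than maintaining it by hand. The following is a minimal sketch under one loud assumption: each memory file opens with a line that summarizes it. The function name `build_summary_index` and the output format are illustrative and are not the format Claude Code uses.

```bash
# Print one line per memory file: "filename: first non-empty line".
# Leading markdown heading marks ("# ") are stripped from the summary.
build_summary_index() {
  local dir=$1 f summary
  for f in "$dir"/*.md; do
    [[ -f "$f" ]] || continue   # skip the literal pattern when nothing matches
    summary=$(grep -m 1 -v '^[[:space:]]*$' "$f" | sed 's/^#* *//')
    printf '%s: %s\n' "$(basename "$f")" "$summary"
  done
}

# Demo with two throwaway files.
demo_dir=$(mktemp -d)
printf '# Svelte 5 pitfalls: runes, stores, SSR gotchas\n\nDetails...\n' \
  > "$demo_dir/svelte5-pitfalls.md"
printf '\nHetzner deploy steps and firewall rules\n' \
  > "$demo_dir/hetzner-deploy.md"
build_summary_index "$demo_dir"
rm -r "$demo_dir"
```

The resulting lines are what the model reads first. Only the files it picks from this index are loaded in full.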

5. Dynamic Context Windows (Start Minimal, Expand)

Aspect	Details
How	Begin with zero memory. Retrieve and inject as conversation clarifies the task
Used by	Cursor's "Dynamic Context Discovery" (Jan 2026)
Pros	Minimizes wasted tokens, adapts as task evolves
Cons	Risk of writing code before discovering the pitfalls file

6. Pre-Classification (Tag + Match)

Aspect	Details
How	Tag memories with topics/projects, classify the task, match
Latency	<10ms if pre-tagged
Pros	Fast, deterministic, more flexible than keyword matching
Cons	Tags need maintenance, cross-cutting concerns need special handling
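Tag matching can be sketched in two steps: classify the task into tags, then load every memory that shares a tag. Everything named here is hypothetical: the `TAGS` table, the tag vocabulary, the file `server-rules.md`, and the keyword-based `classify_task`, which stands in for whatever classifier (rules or a small model) you actually use.

```bash
# Hypothetical tag index: "file:tag tag ...".
TAGS=(
  "svelte5-pitfalls.md:frontend svelte"
  "hetzner-deploy.md:deploy infra"
  "server-rules.md:infra build"
)

# Stand-in classifier: map task wording to a space-terminated tag list.
classify_task() {
  local task tags=""
  task=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  [[ "$task" =~ build|compile ]] && tags+="build "
  [[ "$task" =~ deploy|server ]] && tags+="deploy infra "
  [[ "$task" =~ svelte|component ]] && tags+="frontend svelte "
  printf '%s' "$tags"
}

# Print every file that shares at least one tag with the task.
match_tags() {
  local task_tags entry file file_tags tag
  task_tags=" $(classify_task "$1")"
  for entry in "${TAGS[@]}"; do
    file=${entry%%:*}
    file_tags=${entry#*:}
    for tag in $file_tags; do
      if [[ "$task_tags" == *" $tag "* ]]; then
        printf '%s\n' "$file"
        break   # one shared tag is enough, avoid printing the file twice
      fi
    done
  done
}

match_tags "fix the build"
# server-rules.md
```

Unlike a rule keyed on a project name, the tag layer lets "fix the build" reach a memory whose file name never mentions builds.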

What Current Tools Do

Tool	Context Selection
Cursor	AST-based chunking + vector embeddings in Turbopuffer. Merkle trees for incremental re-indexing. Dynamic context discovery in agent mode (2026).
GitHub Copilot	Workspace indexing + LSP intelligence + semantic search. Server-side. Copilot Spaces for curated context.
ChatGPT	RAG over saved memories with recency/frequency/relevance scoring. Black box.
Cline/Roo Code	All-or-nothing: loads entire Memory Bank at session start. Works because it's project-scoped and small.
Claude Code	Memory selection agent picks up to 5 files by filename + one-line description match. No semantic search.

Build-Your-Own Solutions

Option A: SessionStart Hook with Keyword Matching

The simplest practical approach — a shell script that runs as a SessionStart hook:

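#!/bin/bash
# memory-router.sh - keyword-based memory routing
# Reads the hook's JSON input on stdin and prints a hook response on stdout.
INPUT=$(cat)
# Assumes the task text arrives in .user_prompt; adjust the field name if
# your hook event delivers it under a different key.
QUERY=$(printf '%s' "$INPUT" | jq -r '.user_prompt // empty' | tr '[:upper:]' '[:lower:]')
MEMORIES=""

# Append each existing copy of a memory file, followed by a blank line,
# so files do not run together and a missing file is skipped quietly.
add_memory() {
  local f
  for f in ~/.claude/projects/*/memory/"$1"; do
    if [[ -f "$f" ]]; then
      MEMORIES+="$(cat "$f")"$'\n\n'
    fi
  done
}

# Project detection (QUERY is lowercased, so patterns are lowercase)
[[ "$QUERY" =~ svelte ]] && add_memory svelte5-pitfalls.md
[[ "$QUERY" =~ hetzner|deploy ]] && add_memory hetzner-deploy.md
[[ "$QUERY" =~ video|manga|gallery ]] && add_memory local-apps.md
# Match "cv" and "resume" only as whole words so "opencv" does not trigger
CV_PATTERN='(^|[^a-z])(cv|resume)([^a-z]|$)'
[[ "$QUERY" =~ $CV_PATTERN ]] && add_memory cv.md

# Output as hook response
jq -n --arg ctx "$MEMORIES" '{
  "hookSpecificOutput": {
    "hookEventName": "SessionStart",
    "additionalContext": $ctx
  }
}'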

Option B: MCP Server with Semantic Search

Give Claude an explicit search_memories tool it can call on demand:

MCP Server	Backend	Privacy
`memorious-mcp`	ChromaDB (local)	Full
`mcp-server-qdrant`	Qdrant (local/cloud)	Configurable
`code-memory`	Tree-sitter + sentence-transformers	Full
`vector-memory-mcp`	LanceDB + local embeddings	Full

Option C: Small Model as Router

Use a fast, small model (Haiku or local Phi-3) as a dedicated router. Feed it the task description + memory file titles/summaries, ask which files are relevant. Latency: ~200-500ms API, ~1-2s local.
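The two pieces of a router that do not depend on any particular model are the prompt you send and the validation of what comes back. The sketch below covers only those. The prompt wording, the function names `build_router_prompt` and `parse_router_reply`, and the one-name-per-line reply format are all assumptions. The model call itself is omitted because it depends on your provider or local runtime.

```bash
# build_router_prompt TASK SUMMARY_INDEX
# Assemble the text sent to the router model.
build_router_prompt() {
  cat <<EOF
You route memory files for a coding assistant.
Task: $1

Available memory files (name: summary):
$2

Reply with at most 5 file names, one per line, that are relevant to the task.
Reply with NONE if no file is relevant.
EOF
}

# parse_router_reply REPLY KNOWN_FILES
# Keep only reply lines that exactly name a known file (KNOWN_FILES is
# newline-separated), so made-up names never reach the loader.
parse_router_reply() {
  local line
  while IFS= read -r line; do
    line=${line//[[:space:]]/}   # drop stray spaces around the name
    if [[ -n "$line" ]] && grep -qxF -- "$line" <<< "$2"; then
      printf '%s\n' "$line"
    fi
  done <<< "$1"
}

KNOWN=$'hetzner-deploy.md\ncv.md'
build_router_prompt "fix the build" "hetzner-deploy.md: deploy steps"
parse_router_reply $'hetzner-deploy.md\nmade-up.md' "$KNOWN"
```

Filtering the reply against the known file list is a cheap guard: a small model that invents a plausible file name results in a skipped line rather than a failed read.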

Existing Implementations

Project	What It Does
ContextStream	Automatic memory injection via hooks. Captures decisions, lessons, patterns. Injects relevant context per message.
Claude-Mem	Captures tool usage, compresses it into ~500-token observations via the Claude SDK, and stores them in SQLite + ChromaDB. Progressive disclosure. ~10x token efficiency.
Continuous-Claude-v3	State via ledgers and handoffs, isolated context windows for MCP execution.

The Ideal Architecture

1. Fast keyword/regex match (<5ms)

Catches obvious project names, file paths, tool names. Loads "must-have" memories (always-on rules).

2. Embedding search (<100ms)

Embed task description, find top-10 candidate memories. Local model (nomic-embed-text via Ollama).

3. LLM re-rank / filter (<500ms)

Small model scores candidates against task. Returns top 3-5 with confidence scores.

4. Inject into context

High-confidence memories injected at session start. Medium-confidence noted as "available if needed."

5. Dynamic expansion (ongoing)

As conversation progresses, model can request additional memories via MCP tool call.

Total latency budget: about 600ms across the three retrieval stages (5 + 100 + 500), which is barely noticeable at session start. Claude Code's SessionStart hook timeout (default 5s) is more than enough.
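The injection step, where re-ranked candidates are split into "inject now" and "available if needed", can be sketched as a threshold filter. The thresholds 0.75 and 0.50, the `score file` input format, and the name `split_by_confidence` are assumptions chosen for illustration. Real cut-offs would need tuning against your own memory set.

```bash
# Assumed confidence thresholds.
HIGH=0.75
MEDIUM=0.50

# Read "score file" lines on stdin and label each candidate:
#   inject     score >= HIGH, load into context at session start
#   available  score >= MEDIUM, mention by name only
# Anything below MEDIUM is dropped.
split_by_confidence() {
  LC_ALL=C awk -v high="$HIGH" -v medium="$MEDIUM" '
    $1 + 0 >= high + 0   { print "inject " $2; next }
    $1 + 0 >= medium + 0 { print "available " $2 }'
}

printf '%s\n' \
  '0.91 hetzner-deploy.md' \
  '0.62 server-rules.md' \
  '0.20 cv.md' | split_by_confidence
# inject hetzner-deploy.md
# available server-rules.md
```

Dropping the low-confidence tail rather than listing it keeps the injected context small, in line with the precision-first guidance earlier in this document.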

Precision vs. Recall Trade-offs

Approach	Privacy	Speed	Quality
Local embeddings (Ollama)	Full	Fast	Good
Local small LLM router	Full	Medium	Better
API embeddings (OpenAI/Voyage)	Cloud	Fast	Best
API LLM router (Haiku)	Cloud	Medium	Best

For your setup: With ~20 memory files, clear project boundaries, and an RTX 3060, local embeddings are the right default. The quality gap between nomic-embed-text locally and OpenAI's text-embedding-3-small is marginal for a corpus of ~20 short markdown files. The privacy benefit is absolute.

Practical recommendation: Keyword matching gets you 80% of the way for <30 files with clear project boundaries. Semantic search closes the remaining 20%. Build the pipeline incrementally: Phase 1 (CLAUDE.md triggers) → Phase 2 (keyword hook) → Phase 3 (semantic search) → Phase 4 (full pipeline). See Recommendations.
