The Memory Routing Problem

Automatically loading the right context at the right time, without user intervention.

The Problem

You have N memory files covering different projects and topics. A new session starts. The harness needs to decide which memories are relevant without you telling it. Current approaches just dump everything into context or rely on you to say "read X.md".

Loading everything is worse than loading nothing Research from Harvard's D3 Institute and Adaptive Memory Admission Control papers: indiscriminate memory loading degrades performance compared to zero memory. Irrelevant memories create a "propagating error feedback loop" where the model attends to noise.

Why the Built-in System Falls Short

Approaches

ApproachLatencyCoverageComplexity
Keyword / trigger matching
CLAUDE.md: "Before X, read Y"
<5ms~80% for <30 filesLow
Semantic similarity
Embed memories, cosine similarity to task
50-500ms~95%Medium
Two-stage retrieval
Fast vector search → LLM re-rank
200ms-2s~98%High
Hierarchical summaries
Load index, LLM decides which to expand
1 round-tripDepends on summary qualityLow
Dynamic expansion
Start minimal, retrieve as needed
OngoingHigh (but delayed)Medium

The Practical Sweet Spot

For most developers with <50 memory files, a layered approach covers 95%+ of cases:

1
Always-on rules in CLAUDE.md 0ms

Inline your 5-6 most universal behavioral rules directly. Always loaded, survives compaction.

2
Path-scoped rules 0ms, harness-enforced

.claude/rules/*.md with paths: frontmatter. Auto-triggered when Claude touches matching files. Most reliable mechanism — the harness enforces it, not the LLM.

3
SessionStart hook <5ms

Shell script that injects behavioral rules + dynamic context (git state, project detection) at every session start.

4
Sonnet memory selection built-in

The built-in selector picks up to 5 topic files per query. Write good one-line descriptions in MEMORY.md to help it.

5
MCP semantic search (optional) <100ms

For 50+ files: local embeddings via Ollama + ChromaDB/LanceDB. Claude calls search_memories() on demand.

Precision Over Recall

Optimize for precision A false negative (missing a memory) is recoverable — the model can ask for more context or you can point it to the file. A false positive (loading irrelevant memory) wastes tokens AND dilutes attention, and is not recoverable. This is why "load everything" fails.