Local LLM Setup Guide
Optimal models and configs for RTX 3060 12GB + Xeon E5-1650v3 + 32GB DDR4 — April 2026
The Tier Strategy (data-driven)
Each tier targets a different use case. One tier runs at a time, switched via llm switch <tier>.
| Tier | Model | TG Speed | Quality | Use Case |
|---|---|---|---|---|
| Interactive | Qwen3.5-9B IQ4_NL (5.37GB) | 41 t/s @34K | 15/25 | Code generation, Svelte 5, tool calling, quick tasks |
| Workhorse | Qwen3.5-35B-A3B IQ4_NL (17.8GB) | 21 t/s (regex) / 15 t/s (-cmoe) | 17/25 | Code review, bug finding, deeper analysis. Dual mode. |
| Planner | Qwen3.5-122B-A10B IQ3_XXS (~44.7GB) | ~10 t/s (est.) | TBD | Deep reasoning, planning (deferred — needs NVMe swap) |
| API Fallback | Haiku 4.5 ($0.80/M) | ~100 t/s | 22/25 | Quality baseline. Only model that found the 30s timing race. |
--parallel 5, but speed without quality is useless. API fallback (Haiku at 22/25, Groq at 500+ t/s) replaces this tier.
Key Insight: MoE vs Dense
The defining constraint of this hardware is 12GB VRAM + 34 GB/s DDR4 bandwidth. MoE models win because only active params traverse memory per token:
- Qwen3.5-35B-A3B (MoE, 3B active): 21 t/s with regex offload, 17/25 quality — best local model
- Qwen3.5-27B (dense, 27B active): 19 t/s at IQ2_M, 14/25 quality — MoE beats dense even at lower quant
- Qwen3.5-9B (dense, 9B active): 41 t/s at IQ4_NL, 15/25 quality — speed king for interactive work
Key Insight: Quant Matters More Than Model Size
At small scales, quantization quality dominates over parameter count:
- 4B Q8 and 9B IQ4_NL have nearly identical speed (~42 t/s) — 9B wins on quality, 4B has no reason to exist
- 27B dense at IQ2_M scores LOWER than 35B MoE at IQ4_NL (14 vs 17/25). MoE preserves knowledge under aggressive quant.
- 0.8B Q8 (5/25) beats 2B IQ2 (garbage) on real code analysis
Critical Flags
These llama.cpp flags make or break performance on this hardware:
| Flag | What it does | Why it matters |
|---|---|---|
--reasoning off | Disable thinking tokens globally | Mandatory. Without it, thinking tokens consume entire generation budget — responses are empty. |
-cmoe | Keep all MoE expert weights on CPU | 35B: 6.7GB VRAM, handles 262K context. 15 t/s TG. |
-ot 'regex=CPU' | Selectively place tensors on CPU/GPU | 35B: 39% faster than -cmoe (21 vs 15 t/s). OOMs >188K. |
--no-mmap | Force model into RAM (no memory-mapping) | mmap causes slow/stalled loading on this hardware. |
--cache-type-k q8_0 | Quantize KV cache | All models fit 200K+ context in 12GB VRAM with Q8 KV. |
--jinja | Enable Jinja chat templates | Required for tool calling. |
Harness Comparison
Three agent frameworks to test as the coding harness around local models:
| Harness | Strengths | Risk |
|---|---|---|
| Pi Coding Agent | Most extensible, 735+ ecosystem repos, SKILL.md workflows, multi-agent | Dev created company — future uncertain |
| OpenCode | Needs testing — alternative to Pi | Less ecosystem, unclear maturity |
| Gastown | 20-30 agent orchestration, git-backed hooks, highest ceiling | Even SOTA LLMs struggle with it — ambitious but worth trying |
Resolved: MoE Bugs
--reasoning off
The <think> tag reinsertion and cache corruption issues are avoided by using --reasoning off globally, with per-request override via "chat_template_kwargs": {"enable_thinking": false} in the API body. Both 9B and 35B-A3B run reliably in production with this flag.