Local LLM Setup Guide

Optimal models and configs for RTX 3060 12GB + Xeon E5-1650v3 + 32GB DDR4 — April 2026

12 GB
VRAM (RTX 3060, 360 GB/s bandwidth)
32 GB
DDR4 2133 MT/s (~34 GB/s bandwidth)
6C/12T
Xeon E5-1650v3 (Haswell, no AVX-512)
~44 GB
Total usable memory (VRAM + RAM)
Updated with real benchmark data (April 2026) All numbers below are measured on this hardware using the polecat quality suite and real context tests. Estimates have been replaced with actual measurements.

The Tier Strategy (data-driven)

Each tier targets a different use case. One tier runs at a time, switched via llm switch <tier>.

Tier Model TG Speed Quality Use Case
Interactive Qwen3.5-9B IQ4_NL (5.37GB) 41 t/s @34K 15/25 Code generation, Svelte 5, tool calling, quick tasks
Workhorse Qwen3.5-35B-A3B IQ4_NL (17.8GB) 21 t/s (regex) / 15 t/s (-cmoe) 17/25 Code review, bug finding, deeper analysis. Dual mode.
Planner Qwen3.5-122B-A10B IQ3_XXS (~44.7GB) ~10 t/s (est.) TBD Deep reasoning, planning (deferred — needs NVMe swap)
API Fallback Haiku 4.5 ($0.80/M) ~100 t/s 22/25 Quality baseline. Only model that found the 30s timing race.
Speed tier (0.8B-4B) retired 0.8B-4B models scored 5-10/25 on the polecat suite — they fabricate bugs, use wrong frameworks, and are not viable for agentic coding. 0.8B achieves 343 t/s aggregate with --parallel 5, but speed without quality is useless. API fallback (Haiku at 22/25, Groq at 500+ t/s) replaces this tier.

Key Insight: MoE vs Dense

The defining constraint of this hardware is 12GB VRAM + 34 GB/s DDR4 bandwidth. MoE models win because only active params traverse memory per token:

Key Insight: Quant Matters More Than Model Size

At small scales, quantization quality dominates over parameter count:

Critical Flags

These llama.cpp flags make or break performance on this hardware:

FlagWhat it doesWhy it matters
--reasoning offDisable thinking tokens globallyMandatory. Without it, thinking tokens consume entire generation budget — responses are empty.
-cmoeKeep all MoE expert weights on CPU35B: 6.7GB VRAM, handles 262K context. 15 t/s TG.
-ot 'regex=CPU'Selectively place tensors on CPU/GPU35B: 39% faster than -cmoe (21 vs 15 t/s). OOMs >188K.
--no-mmapForce model into RAM (no memory-mapping)mmap causes slow/stalled loading on this hardware.
--cache-type-k q8_0Quantize KV cacheAll models fit 200K+ context in 12GB VRAM with Q8 KV.
--jinjaEnable Jinja chat templatesRequired for tool calling.
Cache hits are 15x faster Prompt cache is on by default in llama-server. Same-prefix follow-up queries skip prompt eval: cold 34K context = 5.4s, cached follow-up = 346ms. Load a project once, then rapid-fire questions.

Harness Comparison

Three agent frameworks to test as the coding harness around local models:

HarnessStrengthsRisk
Pi Coding Agent Most extensible, 735+ ecosystem repos, SKILL.md workflows, multi-agent Dev created company — future uncertain
OpenCode Needs testing — alternative to Pi Less ecosystem, unclear maturity
Gastown 20-30 agent orchestration, git-backed hooks, highest ceiling Even SOTA LLMs struggle with it — ambitious but worth trying

Resolved: MoE Bugs

MoE bugs resolved via --reasoning off

The <think> tag reinsertion and cache corruption issues are avoided by using --reasoning off globally, with per-request override via "chat_template_kwargs": {"enable_thinking": false} in the API body. Both 9B and 35B-A3B run reliably in production with this flag.