Local LLM Setup Guide

Optimal models and configs for RTX 3060 12GB + Xeon E5-1650v3 + 32GB DDR4 — April 2026

12 GB

VRAM (RTX 3060, 360 GB/s bandwidth)

32 GB

DDR4 2133 MT/s (~34 GB/s bandwidth)

6C/12T

Xeon E5-1650v3 (Haswell, no AVX-512)

~44 GB

Total usable memory (VRAM + RAM)

Updated with real benchmark data (April 2026) All numbers below are measured on this hardware using the polecat quality suite and real context tests. Estimates have been replaced with actual measurements.

The Tier Strategy (data-driven)

Each tier targets a different use case. One tier runs at a time, switched via llm switch <tier>.

Tier	Model	TG Speed	Quality	Use Case
Interactive	Qwen3.5-9B IQ4_NL (5.37GB)	41 t/s @34K	15/25	Code generation, Svelte 5, tool calling, quick tasks
Workhorse	Qwen3.5-35B-A3B IQ4_NL (17.8GB)	21 t/s (regex) / 15 t/s (-cmoe)	17/25	Code review, bug finding, deeper analysis. Dual mode.
Planner	Qwen3.5-122B-A10B IQ3_XXS (~44.7GB)	~10 t/s (est.)	TBD	Deep reasoning, planning (deferred — needs NVMe swap)
API Fallback	Haiku 4.5 ($0.80/M)	~100 t/s	22/25	Quality baseline. Only model that found the 30s timing race.

Speed tier (0.8B-4B) retired 0.8B-4B models scored 5-10/25 on the polecat suite — they fabricate bugs, use wrong frameworks, and are not viable for agentic coding. 0.8B achieves 343 t/s aggregate with --parallel 5, but speed without quality is useless. API fallback (Haiku at 22/25, Groq at 500+ t/s) replaces this tier.

Key Insight: MoE vs Dense

The defining constraint of this hardware is 12GB VRAM + 34 GB/s DDR4 bandwidth. MoE models win because only active params traverse memory per token:

Qwen3.5-35B-A3B (MoE, 3B active): 21 t/s with regex offload, 17/25 quality — best local model
Qwen3.5-27B (dense, 27B active): 19 t/s at IQ2_M, 14/25 quality — MoE beats dense even at lower quant
Qwen3.5-9B (dense, 9B active): 41 t/s at IQ4_NL, 15/25 quality — speed king for interactive work

Key Insight: Quant Matters More Than Model Size

At small scales, quantization quality dominates over parameter count:

4B Q8 and 9B IQ4_NL have nearly identical speed (~42 t/s) — 9B wins on quality, 4B has no reason to exist
27B dense at IQ2_M scores LOWER than 35B MoE at IQ4_NL (14 vs 17/25). MoE preserves knowledge under aggressive quant.
0.8B Q8 (5/25) beats 2B IQ2 (garbage) on real code analysis

Critical Flags

These llama.cpp flags make or break performance on this hardware:

Flag	What it does	Why it matters
`--reasoning off`	Disable thinking tokens globally	Mandatory. Without it, thinking tokens consume entire generation budget — responses are empty.
`-cmoe`	Keep all MoE expert weights on CPU	35B: 6.7GB VRAM, handles 262K context. 15 t/s TG.
`-ot 'regex=CPU'`	Selectively place tensors on CPU/GPU	35B: 39% faster than -cmoe (21 vs 15 t/s). OOMs >188K.
`--no-mmap`	Force model into RAM (no memory-mapping)	mmap causes slow/stalled loading on this hardware.
`--cache-type-k q8_0`	Quantize KV cache	All models fit 200K+ context in 12GB VRAM with Q8 KV.
`--jinja`	Enable Jinja chat templates	Required for tool calling.

Cache hits are 15x faster Prompt cache is on by default in llama-server. Same-prefix follow-up queries skip prompt eval: cold 34K context = 5.4s, cached follow-up = 346ms. Load a project once, then rapid-fire questions.

Harness Comparison

Three agent frameworks to test as the coding harness around local models:

Harness	Strengths	Risk
Pi Coding Agent	Most extensible, 735+ ecosystem repos, SKILL.md workflows, multi-agent	Dev created company — future uncertain
OpenCode	Needs testing — alternative to Pi	Less ecosystem, unclear maturity
Gastown	20-30 agent orchestration, git-backed hooks, highest ceiling	Even SOTA LLMs struggle with it — ambitious but worth trying

Resolved: MoE Bugs

MoE bugs resolved via --reasoning off

The <think> tag reinsertion and cache corruption issues are avoided by using --reasoning off globally, with per-request override via "chat_template_kwargs": {"enable_thinking": false} in the API body. Both 9B and 35B-A3B run reliably in production with this flag.

10 t/s Tier →