Interactive & Workhorse Tiers

Real benchmarks from polecat quality suite and context scaling tests. All numbers measured on this hardware.

Updated with real benchmark data (April 2026) Previous estimates replaced with measured numbers. Winner is 9B IQ4_NL (not Q5_K_M). Quality scored against polecat suite (25-point scale). Haiku 4.5 API is the 22/25 quality baseline.

Benchmarked Models (measured)

ModelQuantSizeTG t/s @34KPP t/s @34KQualityVRAM peak @200K
0.8B Q8_0 764MB 148 6,913 5/25 ~2.8GB
2B Q8_0 1.86GB 96 4,972 9/25 ~3.9GB
4B Q8_0 4.48GB 43 2,060 10/25 11.1GB
9B IQ4_NL 5.37GB 41 1,518 15/25 11.5GB
27B UD-IQ2_M 9.49GB 19 ~300 14/25 ~11.2GB
35B-A3B IQ4_NL 17.8GB 21 (regex) 420 (regex) 17/25 11.9GB
Haiku 4.5 API N/A ~100 N/A 22/25 N/A ($0.80/M)

Workhorse: 35B-A3B Dual Mode (17/25 quality)

Best local model: 35B-A3B IQ4_NL at 17/25 quality

Found response.ok bug, correct Svelte 5 patterns, solid retry code. MoE preserves knowledge under aggressive quant — 27B dense at IQ2_M only scores 14/25.

Regex vs -cmoe (measured)

ModePP t/s @34KTG t/s @34KVRAMMax ContextWhen to use
Regex (experts 0-35 CPU, 36-39 GPU) 420 21 11.5GB ~188K (OOMs above) Single-project loads (<150K)
-cmoe (all experts CPU) 309 15 6.7GB 262K Multi-project loads (>150K)

Regex is 39% faster but OOMs above ~188K context. Strategy: regex for single-project, -cmoe for multi-project.

Interactive: Qwen3.5-9B IQ4_NL (15/25 quality)

The interactive tier winner. At 5.37GB, fits in 12GB VRAM with Q8 KV cache and 262K context (11.5GB peak at 200K). Perfect Svelte 5 runes (5/5) but cannot analyze bugs or diagnose complex issues. Minimum viable model for code generation.

5.37 GB
Model size (IQ4_NL quant)
41 t/s
TG @34K context (measured)
15/25
Polecat quality score
262K
Max context (11.5GB peak @200K)

Context Scaling (9B IQ4_NL + Q8 KV + 262K context)

ContextPP t/sTG t/sVRAM
34K1,5184110,807MB
93K1,1783011,073MB
200K8722111,454MB
241K77819

PP and TG degrade ~2x from 34K to 200K. All fits in 12GB with no spilling.

# Interactive tier (port 8100)
llama-server -m qwen3.5-9b-iq4nl.gguf \
  -ngl 99 --no-mmap --jinja --reasoning off \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 262144

Heretic vs Base 9B (tested: tie 14/20)

Heretic tested, base wins on accuracy

Opus evaluation: 14/20 tie. Heretic has better presentation (code quotes, markdown, line numbers) but misdiagnoses bugs — fabricates a race that cancel-on-acquire prevents. Base 9B is plainer but correctly identifies real timing windows. Both get line numbers wrong by 80-100+ lines. Neither understands ownership patterns.

Verdict: base 9B edges ahead for code review — correctness > presentation.

The HERETIC variant (fine-tuned with Claude 4.6 distilled data, abliterated) has 6/100 refusal rate and Claude-style reasoning, but the quality gap vs base is not significant enough to justify switching.

Why Not 4B or Smaller?

0.8B-4B retired from agentic work

4B Q8 and 9B IQ4_NL have nearly identical TG speed (~42-43 t/s). But 4B scores only 10/25 vs 9B's 15/25. The 4B has no reason to exist for code tasks. Smaller models (0.8B at 5/25, 2B at 9/25) fabricate bugs and use wrong frameworks.

Tool Calling Configuration

--reasoning off is mandatory. Without it, thinking tokens consume entire generation budget and responses are empty. Per-request override: "chat_template_kwargs": {"enable_thinking": false}.

# Workhorse tier: fast mode (port 8010)
llama-server -m qwen3.5-35b-a3b-iq4nl.gguf -ngl 99 \
  -ot "blk\.([0-2][0-9]|3[0-5])\.ffn_.*_exps\.weight=CPU" \
  --no-mmap --jinja --reasoning off \
  --cache-type-k q8_0 --cache-type-v q8_0 -c 188000

# Workhorse tier: full context mode (port 8010)
llama-server -m qwen3.5-35b-a3b-iq4nl.gguf -ngl 99 -cmoe \
  --no-mmap --jinja --reasoning off \
  --cache-type-k q8_0 --cache-type-v q8_0 -c 262144
Flags tested (none improve speed)

Official Qwen sampling params for tool calling:

temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0

Sources