Interactive & Workhorse Tiers

Real benchmarks from polecat quality suite and context scaling tests. All numbers measured on this hardware.

Updated with real benchmark data (April 2026) Previous estimates replaced with measured numbers. Winner is 9B IQ4_NL (not Q5_K_M). Quality scored against polecat suite (25-point scale). Haiku 4.5 API is the 22/25 quality baseline.

Benchmarked Models (measured)

Model	Quant	Size	TG t/s @34K	PP t/s @34K	Quality	VRAM peak @200K
0.8B	Q8_0	764MB	148	6,913	5/25	~2.8GB
2B	Q8_0	1.86GB	96	4,972	9/25	~3.9GB
4B	Q8_0	4.48GB	43	2,060	10/25	11.1GB
9B	IQ4_NL	5.37GB	41	1,518	15/25	11.5GB
27B	UD-IQ2_M	9.49GB	19	~300	14/25	~11.2GB
35B-A3B	IQ4_NL	17.8GB	21 (regex)	420 (regex)	17/25	11.9GB
Haiku 4.5	API	N/A	~100	N/A	22/25	N/A ($0.80/M)

Workhorse: 35B-A3B Dual Mode (17/25 quality)

Best local model: 35B-A3B IQ4_NL at 17/25 quality

Found response.ok bug, correct Svelte 5 patterns, solid retry code. MoE preserves knowledge under aggressive quant — 27B dense at IQ2_M only scores 14/25.

Regex vs -cmoe (measured)

Mode	PP t/s @34K	TG t/s @34K	VRAM	Max Context	When to use
Regex (experts 0-35 CPU, 36-39 GPU)	420	21	11.5GB	~188K (OOMs above)	Single-project loads (<150K)
-cmoe (all experts CPU)	309	15	6.7GB	262K	Multi-project loads (>150K)

Regex is 39% faster but OOMs above ~188K context. Strategy: regex for single-project, -cmoe for multi-project.

Interactive: Qwen3.5-9B IQ4_NL (15/25 quality)

The interactive tier winner. At 5.37GB, fits in 12GB VRAM with Q8 KV cache and 262K context (11.5GB peak at 200K). Perfect Svelte 5 runes (5/5) but cannot analyze bugs or diagnose complex issues. Minimum viable model for code generation.

5.37 GB

Model size (IQ4_NL quant)

41 t/s

TG @34K context (measured)

15/25

Polecat quality score

262K

Max context (11.5GB peak @200K)

Context Scaling (9B IQ4_NL + Q8 KV + 262K context)

Context	PP t/s	TG t/s	VRAM
34K	1,518	41	10,807MB
93K	1,178	30	11,073MB
200K	872	21	11,454MB
241K	778	19	—

PP and TG degrade ~2x from 34K to 200K. All fits in 12GB with no spilling.

# Interactive tier (port 8100)
llama-server -m qwen3.5-9b-iq4nl.gguf \
  -ngl 99 --no-mmap --jinja --reasoning off \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 262144

Heretic vs Base 9B (tested: tie 14/20)

Heretic tested, base wins on accuracy

Opus evaluation: 14/20 tie. Heretic has better presentation (code quotes, markdown, line numbers) but misdiagnoses bugs — fabricates a race that cancel-on-acquire prevents. Base 9B is plainer but correctly identifies real timing windows. Both get line numbers wrong by 80-100+ lines. Neither understands ownership patterns.

Verdict: base 9B edges ahead for code review — correctness > presentation.

The HERETIC variant (fine-tuned with Claude 4.6 distilled data, abliterated) has 6/100 refusal rate and Claude-style reasoning, but the quality gap vs base is not significant enough to justify switching.

Why Not 4B or Smaller?

0.8B-4B retired from agentic work

4B Q8 and 9B IQ4_NL have nearly identical TG speed (~42-43 t/s). But 4B scores only 10/25 vs 9B's 15/25. The 4B has no reason to exist for code tasks. Smaller models (0.8B at 5/25, 2B at 9/25) fabricate bugs and use wrong frameworks.

Tool Calling Configuration

--reasoning off is mandatory. Without it, thinking tokens consume entire generation budget and responses are empty. Per-request override: "chat_template_kwargs": {"enable_thinking": false}.

# Workhorse tier: fast mode (port 8010)
llama-server -m qwen3.5-35b-a3b-iq4nl.gguf -ngl 99 \
  -ot "blk\.([0-2][0-9]|3[0-5])\.ffn_.*_exps\.weight=CPU" \
  --no-mmap --jinja --reasoning off \
  --cache-type-k q8_0 --cache-type-v q8_0 -c 188000

# Workhorse tier: full context mode (port 8010)
llama-server -m qwen3.5-35b-a3b-iq4nl.gguf -ngl 99 -cmoe \
  --no-mmap --jinja --reasoning off \
  --cache-type-k q8_0 --cache-type-v q8_0 -c 262144

Flags tested (none improve speed)

Flash attention: no measurable difference (auto-enabled)
KV cache quant (q8_0, q4_0): slightly HURTS speed (~5%)
--direct-io: no effect on inference speed
--clear-idle: no effect on single requests
--spec-self 1: segfaults on Qwen3.5 (incompatible with Gated DeltaNet)
--spec-type ngram-mod: slightly slower (~2%)
--reasoning-budget 0: bug #21487, ignored. Use --reasoning off instead.

Official Qwen sampling params for tool calling:

temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0

Sources

← 10 t/s Tier 1000 t/s Tier →