Interactive & Workhorse Tiers
Real benchmarks from polecat quality suite and context scaling tests. All numbers measured on this hardware.
Benchmarked Models (measured)
| Model | Quant | Size | TG t/s @34K | PP t/s @34K | Quality | VRAM peak @200K |
|---|---|---|---|---|---|---|
| 0.8B | Q8_0 | 764MB | 148 | 6,913 | 5/25 | ~2.8GB |
| 2B | Q8_0 | 1.86GB | 96 | 4,972 | 9/25 | ~3.9GB |
| 4B | Q8_0 | 4.48GB | 43 | 2,060 | 10/25 | 11.1GB |
| 9B | IQ4_NL | 5.37GB | 41 | 1,518 | 15/25 | 11.5GB |
| 27B | UD-IQ2_M | 9.49GB | 19 | ~300 | 14/25 | ~11.2GB |
| 35B-A3B | IQ4_NL | 17.8GB | 21 (regex) | 420 (regex) | 17/25 | 11.9GB |
| Haiku 4.5 | API | N/A | ~100 | N/A | 22/25 | N/A ($0.80/M) |
Workhorse: 35B-A3B Dual Mode (17/25 quality)
Found response.ok bug, correct Svelte 5 patterns, solid retry code. MoE preserves knowledge under aggressive quant — 27B dense at IQ2_M only scores 14/25.
Regex vs -cmoe (measured)
| Mode | PP t/s @34K | TG t/s @34K | VRAM | Max Context | When to use |
|---|---|---|---|---|---|
| Regex (experts 0-35 CPU, 36-39 GPU) | 420 | 21 | 11.5GB | ~188K (OOMs above) | Single-project loads (<150K) |
| -cmoe (all experts CPU) | 309 | 15 | 6.7GB | 262K | Multi-project loads (>150K) |
Regex is 39% faster but OOMs above ~188K context. Strategy: regex for single-project, -cmoe for multi-project.
Interactive: Qwen3.5-9B IQ4_NL (15/25 quality)
The interactive tier winner. At 5.37GB, fits in 12GB VRAM with Q8 KV cache and 262K context (11.5GB peak at 200K). Perfect Svelte 5 runes (5/5) but cannot analyze bugs or diagnose complex issues. Minimum viable model for code generation.
Context Scaling (9B IQ4_NL + Q8 KV + 262K context)
| Context | PP t/s | TG t/s | VRAM |
|---|---|---|---|
| 34K | 1,518 | 41 | 10,807MB |
| 93K | 1,178 | 30 | 11,073MB |
| 200K | 872 | 21 | 11,454MB |
| 241K | 778 | 19 | — |
PP and TG degrade ~2x from 34K to 200K. All fits in 12GB with no spilling.
# Interactive tier (port 8100)
llama-server -m qwen3.5-9b-iq4nl.gguf \
-ngl 99 --no-mmap --jinja --reasoning off \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 262144
Heretic vs Base 9B (tested: tie 14/20)
Opus evaluation: 14/20 tie. Heretic has better presentation (code quotes, markdown, line numbers) but misdiagnoses bugs — fabricates a race that cancel-on-acquire prevents. Base 9B is plainer but correctly identifies real timing windows. Both get line numbers wrong by 80-100+ lines. Neither understands ownership patterns.
Verdict: base 9B edges ahead for code review — correctness > presentation.
The HERETIC variant (fine-tuned with Claude 4.6 distilled data, abliterated) has 6/100 refusal rate and Claude-style reasoning, but the quality gap vs base is not significant enough to justify switching.
Why Not 4B or Smaller?
4B Q8 and 9B IQ4_NL have nearly identical TG speed (~42-43 t/s). But 4B scores only 10/25 vs 9B's 15/25. The 4B has no reason to exist for code tasks. Smaller models (0.8B at 5/25, 2B at 9/25) fabricate bugs and use wrong frameworks.
Tool Calling Configuration
--reasoning off is mandatory. Without it, thinking tokens consume entire generation budget and responses are empty. Per-request override: "chat_template_kwargs": {"enable_thinking": false}.
# Workhorse tier: fast mode (port 8010)
llama-server -m qwen3.5-35b-a3b-iq4nl.gguf -ngl 99 \
-ot "blk\.([0-2][0-9]|3[0-5])\.ffn_.*_exps\.weight=CPU" \
--no-mmap --jinja --reasoning off \
--cache-type-k q8_0 --cache-type-v q8_0 -c 188000
# Workhorse tier: full context mode (port 8010)
llama-server -m qwen3.5-35b-a3b-iq4nl.gguf -ngl 99 -cmoe \
--no-mmap --jinja --reasoning off \
--cache-type-k q8_0 --cache-type-v q8_0 -c 262144
- Flash attention: no measurable difference (auto-enabled)
- KV cache quant (q8_0, q4_0): slightly HURTS speed (~5%)
- --direct-io: no effect on inference speed
- --clear-idle: no effect on single requests
- --spec-self 1: segfaults on Qwen3.5 (incompatible with Gated DeltaNet)
- --spec-type ngram-mod: slightly slower (~2%)
- --reasoning-budget 0: bug #21487, ignored. Use --reasoning off instead.
Official Qwen sampling params for tool calling:
temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0