Decisions & Tradeoffs
Data-driven decisions from real benchmarks. Updated April 2026 with measured numbers.
Previous estimates replaced with measured numbers from polecat quality suite and context scaling tests. Key findings: MoE bugs resolved via --reasoning off, quant matters more than model size, regex vs -cmoe tradeoff quantified.
Decision 1: MoE Bug Status — RESOLVED
--reasoning off prevents thinking token issues. Both 9B and 35B-A3B run reliably.
| Bug | Impact | Status | Resolution |
|---|---|---|---|
| Cache reuse / think tag corruption | Was breaking multi-turn and tool calling | Resolved | --reasoning off globally, per-request override via API body |
--reasoning-budget regression |
Budget value gets ignored | Open bug | Use --reasoning off instead. #21487 / PR #21594 |
Decision 2: Tier Winners (data-driven)
| Tier | Winner | Speed | Quality | Why |
|---|---|---|---|---|
| Interactive | 9B IQ4_NL (5.37GB) | 41 t/s @34K | 15/25 | Same speed as 4B Q8 (43 t/s) but +5 quality points. Minimum for code gen. |
| Workhorse | 35B-A3B IQ4_NL (17.8GB) | 21 t/s (regex) | 17/25 | Found response.ok bug, correct Svelte 5. Best local model. |
| Speed | RETIRED | — | — | 0.8B-4B: 5-10/25 quality. Not viable for agentic work. |
| API Fallback | Haiku 4.5 ($0.80/M) | ~100 t/s | 22/25 | Only model that found the 30s timing race. |
Decision 3: Regex vs -cmoe (35B-A3B)
| Mode | TG t/s @34K | PP t/s @34K | VRAM | Max Context |
|---|---|---|---|---|
| Regex (layers 0-35 experts CPU, 36-39 GPU) | 21 | 420 | 11.5GB | ~188K (OOMs above) |
| -cmoe (all experts CPU) | 15 | 309 | 6.7GB | 262K |
Decision 4: Heretic vs Base 9B — Base Wins
Opus evaluation: 14/20 tie. Heretic has better presentation (markdown, line numbers) but misdiagnoses bugs (fabricates races). Base 9B is plainer but more accurate. Correctness > presentation for code review.
Decision 5: Quant Matters More Than Model Size
| Comparison | Result |
|---|---|
| 4B Q8 vs 9B IQ4_NL | Same speed (~42 t/s). 9B wins quality 15 vs 10. 4B has no reason to exist. |
| 27B IQ2_M vs 35B-A3B IQ4_NL | MoE at IQ4 beats dense at IQ2 (17 vs 14/25). MoE preserves knowledge. |
| 0.8B Q8 vs 2B IQ2 | Q8 is usable (5/25), IQ2 produces repeating garbage. |
Decision 6: Harness Selection
| Option | Pros | Cons | Next Step |
|---|---|---|---|
| Pi | Proven with Qwen 3.5, 735+ repos, extensible | Company risk, may enshittify | Connect to llm switcher |
| OpenCode | Alternative to Pi | Unknown maturity | Investigate: does it support llama.cpp server? |
| Gastown | Highest ceiling, multi-agent orchestration | Even SOTA LLMs struggle | Install, try with cloud model as Mayor + local Polecats |
Decision 7: RAM Upgrade
Adding 32GB DDR4 ($30-40 used) unlocks:
- Qwen3.5-122B-A10B at better quant — currently deferred (needs NVMe swap with 32GB)
- 35B-A3B Q8_0 (36.9GB) with -cmoe — near-lossless quality
ROI calculation: $35 for a potential quality improvement on the planner tier. Best hardware upgrade available.
Decision 8: Engine Strategy — DECIDED
ExLlamaV2 archived. ExLlamaV3 slower on Ampere (confirmed). TensorRT-LLM FP8 unavailable on RTX 3060 (cc 8.6 needs 8.9+). vLLM/SGLang not practical for 12GB single-user.
Project Sizes (real tokenizer counts)
| Project | Tokens | Fits in 262K? |
|---|---|---|
| km-explorer | 33,886 | Easy |
| cashback-v3 | 35,171 | Easy |
| gallery-reader | 46,685 | Easy |
| manga-reader | 56,192 | Easy |
| video-platform | 97,925 | Yes |
| trader | 154,047 | Yes |
| Total | 424K | No (need subsets) |
Open Questions
- Reasoning ON for 35B-A3B: How to use thinking tokens effectively with large context? What max_tokens budget?
- PR #20700 (MTP): Built-in multi-token prediction for Qwen3.5 — could 1.5-2x generation speed
- Sub-4B use cases: Failed at code review but may excel at autocomplete, structured extraction, JSON formatting, draft generation
- Gastown + local models: Can a local model reliably coordinate as Mayor?
Priority Order (updated)
Build per-project test suitesDONERun context scaling on 9B, 35BDONETest regex vs -cmoeDONE (regex 39% faster, OOMs >188K)Test heretic vs base 9BDONE (tie 14/20, base more accurate)- Next: Run more heretic quality tests (different tasks)
- Next: Run full per-project test suite on base 9B and 35B
- Next: Set up Pi harness connecting to llm switcher
- Next: Test PR #20700 (MTP) — fork, cherry-pick, rebuild, benchmark
- Later: Test reasoning ON for 35B-A3B with appropriate budget