Decisions & Tradeoffs

Data-driven decisions from real benchmarks. Updated April 2026 with measured numbers.

Major update: all decisions now backed by real benchmark data

Previous estimates replaced with measured numbers from polecat quality suite and context scaling tests. Key findings: MoE bugs resolved via --reasoning off, quant matters more than model size, regex vs -cmoe tradeoff quantified.

Decision 1: MoE Bug Status — RESOLVED

No longer blocking. --reasoning off prevents thinking token issues. Both 9B and 35B-A3B run reliably.

Bug	Impact	Status	Resolution
Cache reuse / think tag corruption	Was breaking multi-turn and tool calling	Resolved	`--reasoning off` globally, per-request override via API body
`--reasoning-budget` regression	Budget value gets ignored	Open bug	Use `--reasoning off` instead. #21487 / PR #21594

Decision 2: Tier Winners (data-driven)

Tier	Winner	Speed	Quality	Why
Interactive	9B IQ4_NL (5.37GB)	41 t/s @34K	15/25	Same speed as 4B Q8 (43 t/s) but +5 quality points. Minimum for code gen.
Workhorse	35B-A3B IQ4_NL (17.8GB)	21 t/s (regex)	17/25	Found response.ok bug, correct Svelte 5. Best local model.
Speed	RETIRED	—	—	0.8B-4B: 5-10/25 quality. Not viable for agentic work.
API Fallback	Haiku 4.5 ($0.80/M)	~100 t/s	22/25	Only model that found the 30s timing race.

Decision 3: Regex vs -cmoe (35B-A3B)

Mode	TG t/s @34K	PP t/s @34K	VRAM	Max Context
Regex (layers 0-35 experts CPU, 36-39 GPU)	21	420	11.5GB	~188K (OOMs above)
-cmoe (all experts CPU)	15	309	6.7GB	262K

Strategy: regex for single-project loads (<150K context), -cmoe for multi-project loads (>150K). Regex is 39% faster but OOMs above ~188K.

Decision 4: Heretic vs Base 9B — Base Wins

Opus evaluation: 14/20 tie. Heretic has better presentation (markdown, line numbers) but misdiagnoses bugs (fabricates races). Base 9B is plainer but more accurate. Correctness > presentation for code review.

Decision 5: Quant Matters More Than Model Size

Comparison	Result
4B Q8 vs 9B IQ4_NL	Same speed (~42 t/s). 9B wins quality 15 vs 10. 4B has no reason to exist.
27B IQ2_M vs 35B-A3B IQ4_NL	MoE at IQ4 beats dense at IQ2 (17 vs 14/25). MoE preserves knowledge.
0.8B Q8 vs 2B IQ2	Q8 is usable (5/25), IQ2 produces repeating garbage.

Decision 6: Harness Selection

Option	Pros	Cons	Next Step
Pi	Proven with Qwen 3.5, 735+ repos, extensible	Company risk, may enshittify	Connect to llm switcher
OpenCode	Alternative to Pi	Unknown maturity	Investigate: does it support llama.cpp server?
Gastown	Highest ceiling, multi-agent orchestration	Even SOTA LLMs struggle	Install, try with cloud model as Mayor + local Polecats

Decision 7: RAM Upgrade

Adding 32GB DDR4 ($30-40 used) unlocks:

Qwen3.5-122B-A10B at better quant — currently deferred (needs NVMe swap with 32GB)
35B-A3B Q8_0 (36.9GB) with -cmoe — near-lossless quality

ROI calculation: $35 for a potential quality improvement on the planner tier. Best hardware upgrade available.

Decision 8: Engine Strategy — DECIDED

llama.cpp is the only engine. GGUF is the only format.

ExLlamaV2 archived. ExLlamaV3 slower on Ampere (confirmed). TensorRT-LLM FP8 unavailable on RTX 3060 (cc 8.6 needs 8.9+). vLLM/SGLang not practical for 12GB single-user.

Project Sizes (real tokenizer counts)

Project	Tokens	Fits in 262K?
km-explorer	33,886	Easy
cashback-v3	35,171	Easy
gallery-reader	46,685	Easy
manga-reader	56,192	Easy
video-platform	97,925	Yes
trader	154,047	Yes
Total	424K	No (need subsets)

Open Questions

Reasoning ON for 35B-A3B: How to use thinking tokens effectively with large context? What max_tokens budget?
PR #20700 (MTP): Built-in multi-token prediction for Qwen3.5 — could 1.5-2x generation speed
Sub-4B use cases: Failed at code review but may excel at autocomplete, structured extraction, JSON formatting, draft generation
Gastown + local models: Can a local model reliably coordinate as Mayor?

Priority Order (updated)

~~Build per-project test suites~~ DONE
~~Run context scaling on 9B, 35B~~ DONE
~~Test regex vs -cmoe~~ DONE (regex 39% faster, OOMs >188K)
~~Test heretic vs base 9B~~ DONE (tie 14/20, base more accurate)
Next: Run more heretic quality tests (different tasks)
Next: Run full per-project test suite on base 9B and 35B
Next: Set up Pi harness connecting to llm switcher
Next: Test PR #20700 (MTP) — fork, cherry-pick, rebuild, benchmark
Later: Test reasoning ON for 35B-A3B with appropriate budget

Sources

← Command Reference Back to Overview →