Decisions & Tradeoffs

Data-driven decisions from real benchmarks. Updated April 2026 with measured numbers.

Major update: all decisions now backed by real benchmark data

Previous estimates replaced with measured numbers from polecat quality suite and context scaling tests. Key findings: MoE bugs resolved via --reasoning off, quant matters more than model size, regex vs -cmoe tradeoff quantified.

Decision 1: MoE Bug Status — RESOLVED

No longer blocking. --reasoning off prevents thinking token issues. Both 9B and 35B-A3B run reliably.
BugImpactStatusResolution
Cache reuse / think tag corruption Was breaking multi-turn and tool calling Resolved --reasoning off globally, per-request override via API body
--reasoning-budget regression Budget value gets ignored Open bug Use --reasoning off instead. #21487 / PR #21594

Decision 2: Tier Winners (data-driven)

TierWinnerSpeedQualityWhy
Interactive 9B IQ4_NL (5.37GB) 41 t/s @34K 15/25 Same speed as 4B Q8 (43 t/s) but +5 quality points. Minimum for code gen.
Workhorse 35B-A3B IQ4_NL (17.8GB) 21 t/s (regex) 17/25 Found response.ok bug, correct Svelte 5. Best local model.
Speed RETIRED 0.8B-4B: 5-10/25 quality. Not viable for agentic work.
API Fallback Haiku 4.5 ($0.80/M) ~100 t/s 22/25 Only model that found the 30s timing race.

Decision 3: Regex vs -cmoe (35B-A3B)

ModeTG t/s @34KPP t/s @34KVRAMMax Context
Regex (layers 0-35 experts CPU, 36-39 GPU) 21 420 11.5GB ~188K (OOMs above)
-cmoe (all experts CPU) 15 309 6.7GB 262K
Strategy: regex for single-project loads (<150K context), -cmoe for multi-project loads (>150K). Regex is 39% faster but OOMs above ~188K.

Decision 4: Heretic vs Base 9B — Base Wins

Opus evaluation: 14/20 tie. Heretic has better presentation (markdown, line numbers) but misdiagnoses bugs (fabricates races). Base 9B is plainer but more accurate. Correctness > presentation for code review.

Decision 5: Quant Matters More Than Model Size

ComparisonResult
4B Q8 vs 9B IQ4_NLSame speed (~42 t/s). 9B wins quality 15 vs 10. 4B has no reason to exist.
27B IQ2_M vs 35B-A3B IQ4_NLMoE at IQ4 beats dense at IQ2 (17 vs 14/25). MoE preserves knowledge.
0.8B Q8 vs 2B IQ2Q8 is usable (5/25), IQ2 produces repeating garbage.

Decision 6: Harness Selection

OptionProsConsNext Step
Pi Proven with Qwen 3.5, 735+ repos, extensible Company risk, may enshittify Connect to llm switcher
OpenCode Alternative to Pi Unknown maturity Investigate: does it support llama.cpp server?
Gastown Highest ceiling, multi-agent orchestration Even SOTA LLMs struggle Install, try with cloud model as Mayor + local Polecats

Decision 7: RAM Upgrade

Adding 32GB DDR4 ($30-40 used) unlocks:

ROI calculation: $35 for a potential quality improvement on the planner tier. Best hardware upgrade available.

Decision 8: Engine Strategy — DECIDED

llama.cpp is the only engine. GGUF is the only format.

ExLlamaV2 archived. ExLlamaV3 slower on Ampere (confirmed). TensorRT-LLM FP8 unavailable on RTX 3060 (cc 8.6 needs 8.9+). vLLM/SGLang not practical for 12GB single-user.

Project Sizes (real tokenizer counts)

ProjectTokensFits in 262K?
km-explorer33,886Easy
cashback-v335,171Easy
gallery-reader46,685Easy
manga-reader56,192Easy
video-platform97,925Yes
trader154,047Yes
Total424KNo (need subsets)

Open Questions

Priority Order (updated)

  1. Build per-project test suites DONE
  2. Run context scaling on 9B, 35B DONE
  3. Test regex vs -cmoe DONE (regex 39% faster, OOMs >188K)
  4. Test heretic vs base 9B DONE (tie 14/20, base more accurate)
  5. Next: Run more heretic quality tests (different tasks)
  6. Next: Run full per-project test suite on base 9B and 35B
  7. Next: Set up Pi harness connecting to llm switcher
  8. Next: Test PR #20700 (MTP) — fork, cherry-pick, rebuild, benchmark
  9. Later: Test reasoning ON for 35B-A3B with appropriate budget

Sources