10 t/s — The Planner Tier
Large MoE models for deep reasoning, planning, and complex code generation
Winner: Qwen3.5-122B-A10B
This is the best model for your hardware at this tier. The MoE architecture (10B active params out of 122B total) means you get frontier-class quality while only computing ~10B params per token. Dense 70B models (Llama 3.3, DeepSeek R1 distills) would be 3-5 t/s because all 70B params are active.
| Quantization | File Size | Fits? | Quality |
|---|---|---|---|
| UD-IQ3_XXS | 44.7 GB | Best fit | Unsloth Dynamic — allocates bits where they matter |
| Q3_K_S | 52.5 GB | No | Better quality per bit, but doesn't fit in 44GB |
| Q3_K_M | 56.4 GB | No (needs 64GB RAM) | Significant quality jump — upgrade path |
| i1-IQ3_XXS (mradermacher) | 47.2 GB | Tight | imatrix quant, slightly too large |
ffn_up_expsandffn_gate_exps: 3-bit is fine (bulk of model, least sensitive)ffn_down_exps: slightly more sensitive, gets more bitsattn_*layers: "especially sensitive" for hybrid architectures — gets highest bitsssm_out: quantizing dramatically increases loss with minimal space savings — preserved
IQ3_XXS vs Q3_K_S Quality
From Artefact2's perplexity study:
| Format | Bits/weight | ln(PPL) | Note |
|---|---|---|---|
| IQ3_XXS | 3.21 | 0.0589 | Your current quant |
| Q3_K_S | 3.49 | 0.0511 | Better but 52.5GB — doesn't fit |
| Q3_K_M | 3.89 | 0.0258 | Big jump — 56.4GB, needs 64GB RAM |
Alternatives Ranked
| Model | Total/Active | Architecture | RAM Needed | Expected t/s | Verdict |
|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | 122B / 10B | MoE + Gated Delta | ~45GB | ~10 | Best |
| Llama 4 Scout 109B | 109B / 17B | MoE (16E) | ~37-42GB | ~6-8 | Viable — more active params = slower |
| Llama-3.3-70B | 70B / 70B | Dense | ~34-43GB | ~3-5 | Too slow — all 70B active over DDR4 |
| DeepSeek-R1-Distill-70B | 70B / 70B | Dense | ~34-43GB | ~3-5 | Too slow |
| DeepSeek-V3 (671B) | 671B | MoE | 131GB+ | N/A | Impossible |
| Qwen3-235B-A22B | 235B / 22B | MoE | 85.7GB+ | N/A | Impossible |
Tensor Offloading: The 2.7x Speedup
From this Reddit post (tested on your exact PC before the reinstall):
- Layer offloading (
-ngl): puts ALL tensors in a layer on GPU or CPU. Binary choice. - Tensor offloading (
-ot): puts specific tensors (attention, FFN, experts) on GPU/CPU independently. - Result: attention on GPU (small, latency-sensitive) + FFN experts on CPU (large, compute-light for MoE) = 2.7x faster
Before (59/65 layers offloaded): 3.95 t/s
After (all layers, experts on CPU): 10.61 t/s
For Qwen3-30B-A3B (your old tested command)
# Old: 13/48 layers, 6.5 t/s
llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 13 -c 40960 -fa -t 5 \
-b 256 -ub 256 --temp 0.7 --top-k 40 --top-p 0.95 --min-p 0.05 \
--repeat-penalty 1.1
# New: all layers, specific expert tensors on CPU, 10 t/s
llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 48 \
-ot "blk\.(0?[2-9]|1[2-9]|2[1-9]|3[1-9]|4[1-7])\.ffn_.*_exps\.=CPU" \
-c 40960 -fa -t 5 -b 256 -ub 256 --temp 0.7 --top-k 40 --top-p 0.95 \
--min-p 0.05 --repeat-penalty 1.1
For Qwen3.5-122B-A10B (recommended new commands)
# Option 1: -cmoe (simplest, purpose-built for MoE)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
--no-mmap --no-mmproj --jinja \
-ngl 99 -cmoe \
--cache-type-k q8_0 --cache-type-v q8_0 \
-b 2048 -ub 2048 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
-fitc 131072 -fitt 256 \
--cache-ram 0 --parallel 1 \
--reasoning-budget 1024 \
--reasoning-budget-message "... thinking budget exceeded, let's answer now."
# Option 2: -ncmoe N (partial, first N layers' experts on CPU)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
--no-mmap --no-mmproj --jinja \
-ngl 99 -ncmoe 33 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-b 2048 -ub 2048 -t 6 \
--cache-ram 0 --parallel 1
Qwen3.5 may not use _exps tensor naming like Qwen3. Verify with:
llama-cli --model your-model.gguf --verbose 2>&1 | grep "blk\." | head -50
Use -cmoe / -ncmoe instead of -ot regex for MoE models — they are purpose-built and don't require knowing tensor names.
Your Current Command (reference)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
--no-mmap --no-mmproj --jinja \
-b 2048 -ub 2048 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
-fitc 131072 --fit on -fitt 256 \
--cache-ram 0 --parallel 1 \
--reasoning-budget 1024 \
--reasoning-budget-message "... thinking budget exceeded, let's answer now."
Key additions to try: -cmoe and --cache-type-k q8_0 --cache-type-v q8_0 for better VRAM utilization.
Upgrade Path
- Qwen3.5-122B-A10B Q3_K_M (56.4GB) — significant quality jump
- Qwen3.5-122B-A10B UD-Q3_K_XL (57GB) — Unsloth Dynamic version
- Llama 4 Scout UD-Q3_K_XL (49GB) — at a good quantization level