10 t/s — The Planner Tier

Large MoE models for deep reasoning, planning, and complex code generation

Total parameters: 122B (10B active per token)
Generation speed: ~10 t/s (CPU+GPU hybrid)
MMLU-Pro: 86.7 (frontier-class)
Model size: 44.7 GB (IQ3_XXS)

Winner: Qwen3.5-122B-A10B

Qwen3.5-122B-A10B is the best model for your hardware at this tier. Its MoE architecture activates only ~10B of its 122B parameters per token, so you get frontier-class quality while computing, and reading from memory, only ~10B params per token. Dense 70B models (Llama 3.3, DeepSeek R1 distills) would run at about 3-5 t/s because all 70B params are active for every token.
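The speed gap follows from how much weight data has to be read for each generated token. Here is a rough sketch of that arithmetic, assuming about 3 bits per weight for both models (a round illustrative figure, not a measured one) and ignoring KV cache, shared tensors, and where the weights are placed:

```python
def gb_read_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    """Approximate GB of weights touched to generate one token."""
    return active_params_billion * bits_per_weight / 8

BPW = 3.0  # assumption: both models quantized to roughly 3 bits per weight

moe = gb_read_per_token(10, BPW)    # Qwen3.5-122B-A10B: ~10B active params
dense = gb_read_per_token(70, BPW)  # dense 70B: every weight is active

print(f"MoE:   {moe:.2f} GB per token")
print(f"Dense: {dense:.2f} GB per token ({dense / moe:.0f}x more)")
```

Under these assumptions the dense model touches about 7x more weight data per token. Real throughput also depends on how much of that data sits in VRAM versus system RAM, so treat the ratio as an indication rather than a prediction.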

Quantization | File Size | Fits? | Quality
UD-IQ3_XXS | 44.7 GB | Best fit | Unsloth Dynamic: allocates bits where they matter
Q3_K_S | 52.5 GB | No | Better quality per bit, but doesn't fit in 44GB
Q3_K_M | 56.4 GB | No (needs 64GB RAM) | Significant quality jump; upgrade path
i1-IQ3_XXS (mradermacher) | 47.2 GB | Tight | imatrix quant, slightly too large

Why UD-IQ3_XXS is optimal

Unsloth Dynamic quantization allocates bits per tensor based on sensitivity analysis.

IQ3_XXS vs Q3_K_S Quality

From Artefact2's perplexity study:

Format | Bits/weight | ln(PPL) | Note
IQ3_XXS | 3.21 | 0.0589 | Your current quant
Q3_K_S | 3.49 | 0.0511 | Better, but 52.5GB doesn't fit
Q3_K_M | 3.89 | 0.0258 | Big jump; 56.4GB, needs 64GB RAM
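
The Bits/weight column is the study's figure for each format. As a cross-check, the effective bits per weight of the 122B files can be estimated from the file sizes listed earlier. This is a minimal sketch, assuming exactly 122e9 parameters and 1 GB = 1e9 bytes (both are assumptions; file metadata and non-weight tensors are ignored):

```python
TOTAL_PARAMS = 122e9  # assumption: exactly 122B parameters

# File sizes from the quantization table above.
file_sizes_gb = {
    "UD-IQ3_XXS": 44.7,
    "i1-IQ3_XXS": 47.2,
    "Q3_K_S": 52.5,
    "Q3_K_M": 56.4,
}

def effective_bpw(size_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average bits per weight implied by a file size (1 GB = 1e9 bytes assumed)."""
    return size_gb * 1e9 * 8 / params

bpw = {name: effective_bpw(size) for name, size in file_sizes_gb.items()}
for name, value in bpw.items():
    print(f"{name:12s} {value:.2f} bits/weight")
```

The estimates will not match the study's column exactly, since each quant recipe mixes tensor types differently from model to model. The ordering of the four files is the useful part.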

Alternatives Ranked

Model | Total/Active | Architecture | RAM Needed | Expected t/s | Verdict
Qwen3.5-122B-A10B | 122B / 10B | MoE + Gated Delta | ~45GB | ~10 | Best
Llama 4 Scout 109B | 109B / 17B | MoE (16E) | ~37-42GB | ~6-8 | Viable; more active params = slower
Llama-3.3-70B | 70B / 70B | Dense | ~34-43GB | ~3-5 | Too slow; all 70B active over DDR4
DeepSeek-R1-Distill-70B | 70B / 70B | Dense | ~34-43GB | ~3-5 | Too slow
DeepSeek-V3 (671B) | 671B | MoE | 131GB+ | N/A | Impossible
Qwen3-235B-A22B | 235B / 22B | MoE | 85.7GB+ | N/A | Impossible

Tensor Offloading: The 2.7x Speedup

From this Reddit post (tested on your exact PC before the reinstall):

Layer offloading vs tensor offloading

Before (59/65 layers offloaded): 3.95 t/s
After (all layers, experts on CPU): 10.61 t/s (10.61 / 3.95 ≈ 2.7x)

For Qwen3-30B-A3B (your old tested command)

# Old: 13/48 layers, 6.5 t/s
llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 13 -c 40960 -fa -t 5 \
  -b 256 -ub 256 --temp 0.7 --top-k 40 --top-p 0.95 --min-p 0.05 \
  --repeat-penalty 1.1

# New: all layers, specific expert tensors on CPU, 10 t/s
llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 48 \
  -ot "blk\.(0?[2-9]|1[2-9]|2[1-9]|3[1-9]|4[1-7])\.ffn_.*_exps\.=CPU" \
  -c 40960 -fa -t 5 -b 256 -ub 256 --temp 0.7 --top-k 40 --top-p 0.95 \
  --min-p 0.05 --repeat-penalty 1.1
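
The -ot regex is dense, so it helps to enumerate which layers it actually matches. The sketch below applies the same pattern to one expert tensor name per layer. The name layout blk.N.ffn_gate_exps.weight is an assumption about the GGUF and should be checked against the real file:

```python
import re

# Pattern copied from the -ot flag above, without the "=CPU" suffix.
PATTERN = re.compile(r"blk\.(0?[2-9]|1[2-9]|2[1-9]|3[1-9]|4[1-7])\.ffn_.*_exps\.")

N_LAYERS = 48  # from "-ngl 48" in the command

def on_cpu(layer: int) -> bool:
    # Assumed tensor name layout; verify the real names in the GGUF.
    name = f"blk.{layer}.ffn_gate_exps.weight"
    return PATTERN.search(name) is not None

cpu_layers = [i for i in range(N_LAYERS) if on_cpu(i)]
gpu_layers = [i for i in range(N_LAYERS) if not on_cpu(i)]

print(f"{len(cpu_layers)} layers keep experts on CPU")
print(f"experts stay on GPU for layers: {gpu_layers}")
```

With these names, the pattern sends the experts of 41 layers to the CPU and leaves those of layers 0, 1, 10, 11, 20, 30 and 40 on the GPU. Widening or narrowing the digit ranges is how you trade VRAM use against speed.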

For Qwen3.5-122B-A10B (recommended new commands)

# Option 1: -cmoe (simplest, purpose-built for MoE)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -ngl 99 -cmoe \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 2048 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
  -fitc 131072 -fitt 256 \
  --cache-ram 0 --parallel 1 \
  --reasoning-budget 1024 \
  --reasoning-budget-message "... thinking budget exceeded, let's answer now."

# Option 2: -ncmoe N (partial, first N layers' experts on CPU)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -ngl 99 -ncmoe 33 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 2048 -t 6 \
  --cache-ram 0 --parallel 1
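
Since -ncmoe N keeps the experts of the first N layers on the CPU, a starting value for N can be estimated from how many layers' worth of experts fit in the VRAM left after everything else. The sketch below uses a hypothetical helper and hypothetical sizes that are not measurements of any model. In practice N is usually tuned by trying values and watching VRAM use:

```python
import math

def min_cpu_moe_layers(n_layers: int, expert_gb_per_layer: float,
                       other_gb: float, vram_gb: float) -> int:
    """Smallest N such that the experts of the remaining layers, plus all
    non-expert data (other_gb), fit in VRAM. First N layers' experts go to CPU."""
    free_for_experts = vram_gb - other_gb
    if free_for_experts <= 0:
        return n_layers  # no room for any experts: same effect as -cmoe
    gpu_layers = min(n_layers, math.floor(free_for_experts / expert_gb_per_layer))
    return n_layers - gpu_layers

# Hypothetical numbers for illustration only.
n = min_cpu_moe_layers(n_layers=48, expert_gb_per_layer=0.9, other_gb=6.0, vram_gb=12.0)
print(f"-ncmoe {n}")
```

If the result equals the layer count, -cmoe is the simpler choice. A result of 0 means all experts fit on the GPU.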

Check tensor names first

Qwen3.5 may not use _exps tensor naming like Qwen3. Verify with:

llama-cli --model your-model.gguf --verbose 2>&1 | grep "blk\." | head -50

Use -cmoe / -ncmoe instead of -ot regex for MoE models — they are purpose-built and don't require knowing tensor names.

Your Current Command (reference)

llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -b 2048 -ub 2048 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
  -fitc 131072 --fit on -fitt 256 \
  --cache-ram 0 --parallel 1 \
  --reasoning-budget 1024 \
  --reasoning-budget-message "... thinking budget exceeded, let's answer now."

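Key additions to try: -cmoe, which keeps expert weights in system RAM so the remaining layers can all sit on the GPU, and --cache-type-k q8_0 --cache-type-v q8_0, which shrinks the KV cache to roughly half its f16 size for better VRAM utilization.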
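
The KV cache saving can be estimated directly. q8_0 stores each block of 32 values as 32 int8 numbers plus one fp16 scale, which is 34 bytes, or 8.5 bits per value against 16 for f16. The model shape below is hypothetical and chosen only to make the arithmetic concrete; it does not describe Qwen3.5's real dimensions. The context length matches -fitc 131072:

```python
def kv_cache_gib(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bits_per_elem: float) -> float:
    """Size of the K and V caches in GiB for one sequence."""
    elems = 2 * n_attn_layers * n_kv_heads * head_dim * ctx  # 2 = K and V
    return elems * bits_per_elem / 8 / 2**30

F16_BITS = 16.0
Q8_0_BITS = 34 * 8 / 32  # 32 int8 values + one fp16 scale per block = 8.5 bits

# Hypothetical model shape, for illustration only.
shape = dict(n_attn_layers=12, n_kv_heads=2, head_dim=256, ctx=131072)

f16 = kv_cache_gib(**shape, bits_per_elem=F16_BITS)
q8 = kv_cache_gib(**shape, bits_per_elem=Q8_0_BITS)
print(f"f16:  {f16:.2f} GiB")
print(f"q8_0: {q8:.2f} GiB ({q8 / f16:.0%} of f16)")
```

Whatever the real dimensions are, the ratio stays at 8.5/16, so q8_0 needs about 53% of the f16 cache.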

Upgrade Path

Add 32GB RAM (~$30-40 used DDR4) to make room for Q3_K_M (56.4GB, needs 64GB RAM).

Sources