10 t/s — The Planner Tier

Large MoE models for deep reasoning, planning, and complex code generation

Total parameters: 122B (10B active per token)
Generation speed: ~10 t/s (CPU+GPU hybrid)
MMLU-Pro: 86.7 (frontier-class)
Model size: 44.7 GB (IQ3_XXS)

Winner: Qwen3.5-122B-A10B

This is the best model for your hardware at this tier. The MoE architecture (10B active params out of 122B total) gives you frontier-class quality while computing only ~10B params per token. Dense 70B models (Llama 3.3, DeepSeek R1 distills) would run at 3-5 t/s because all 70B params are active for every token.
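The speed gap follows from memory bandwidth: during generation every active weight is read once per token, so tokens per second are bounded by bandwidth divided by the bytes of active weights. A minimal sketch of that estimate, assuming dual-channel DDR4-3200 (51.2 GB/s theoretical peak, an assumption rather than a measurement of this machine), 3.21 bits per weight, and all weights read from system RAM. The helper name est_tps is hypothetical.

```shell
# est_tps BANDWIDTH_GBPS ACTIVE_PARAMS_BILLIONS BITS_PER_WEIGHT
# Upper bound on tokens/s if every active weight is read once per token.
est_tps() {
  awk -v bw="$1" -v p="$2" -v bpw="$3" \
    'BEGIN { printf "%.1f\n", bw / (p * bpw / 8) }'
}

est_tps 51.2 10 3.21   # MoE, 10B active: prints 12.8
est_tps 51.2 70 3.21   # dense 70B:       prints 1.8
```

These are ceilings for weights held in system RAM. Moving part of a model into VRAM raises them, which is consistent with hybrid runs of dense 70B models landing in the 3-5 t/s range.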

Quantization	File Size	Fits?	Quality
UD-IQ3_XXS	44.7 GB	Best fit	Unsloth Dynamic — allocates bits where they matter
Q3_K_S	52.5 GB	No	Slightly better quality, but doesn't fit in 44GB
Q3_K_M	56.4 GB	No (needs 64GB RAM)	Significant quality jump — upgrade path
i1-IQ3_XXS (mradermacher)	47.2 GB	Tight	imatrix quant, slightly too large

Why UD-IQ3_XXS is optimal

Unsloth Dynamic quantization allocates bits per tensor based on sensitivity analysis:

ffn_up_exps and ffn_gate_exps: 3-bit is fine (bulk of model, least sensitive)
ffn_down_exps: slightly more sensitive, gets more bits
attn_* layers: "especially sensitive" for hybrid architectures, so they get the most bits
ssm_out: quantizing it dramatically increases loss with minimal space savings, so it is left unquantized

IQ3_XXS vs Q3_K_S Quality

From Artefact2's perplexity study:

Format	Bits/weight	ln(PPL(Q)/PPL(base))	Note
IQ3_XXS	3.21	0.0589	Your current quant
Q3_K_S	3.49	0.0511	Better but 52.5GB — doesn't fit
Q3_K_M	3.89	0.0258	Big jump — 56.4GB, needs 64GB RAM
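To read these numbers as percentages, exponentiate them. A small sketch, assuming the third column is the log of the ratio between the quantized model's perplexity and the unquantized baseline's (an assumption; a raw log-perplexity would not be this close to zero). The helper name ppl_increase_pct is hypothetical.

```shell
# ppl_increase_pct LN_RATIO
# Converts ln(PPL_quant / PPL_base) into a percentage perplexity increase.
ppl_increase_pct() {
  awk -v x="$1" 'BEGIN { printf "%.2f\n", (exp(x) - 1) * 100 }'
}

ppl_increase_pct 0.0589   # IQ3_XXS: prints 6.07
ppl_increase_pct 0.0511   # Q3_K_S:  prints 5.24
ppl_increase_pct 0.0258   # Q3_K_M:  prints 2.61
```

Under that reading, Q3_K_S removes less than one percentage point of the perplexity penalty relative to IQ3_XXS, while Q3_K_M cuts it by more than half.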

Alternatives Ranked

Model	Total/Active	Architecture	RAM Needed	Expected t/s	Verdict
Qwen3.5-122B-A10B	122B / 10B	MoE + Gated Delta	~45GB	~10	Best
Llama 4 Scout 109B	109B / 17B	MoE (16E)	~37-42GB	~6-8	Viable — more active params = slower
Llama-3.3-70B	70B / 70B	Dense	~34-43GB	~3-5	Too slow — all 70B active over DDR4
DeepSeek-R1-Distill-70B	70B / 70B	Dense	~34-43GB	~3-5	Too slow
DeepSeek-V3 (671B)	671B / 37B	MoE	131GB+	N/A	Impossible
Qwen3-235B-A22B	235B / 22B	MoE	85.7GB+	N/A	Impossible

Tensor Offloading: The 2.7x Speedup

From this Reddit post (tested on your exact PC before the reinstall):

Layer offloading vs tensor offloading

Layer offloading (-ngl): puts ALL tensors in a layer on GPU or CPU. Binary choice.
Tensor offloading (-ot): puts specific tensors (attention, FFN, experts) on GPU/CPU independently.
Result: attention on GPU (small, latency-sensitive) + FFN experts on CPU (large, compute-light for MoE) = 2.7x faster

Before (59/65 layers offloaded): 3.95 t/s
After (all layers, experts on CPU): 10.61 t/s
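The 2.7x figure is the ratio of these two measurements. A one-function check (the name speedup is hypothetical):

```shell
# speedup BEFORE_TPS AFTER_TPS
# Prints how many times faster the second measurement is than the first.
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2fx\n", after / before }'
}

speedup 3.95 10.61   # prints 2.69x
```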

For Qwen3-30B-A3B (your old tested command)

# Old: 13/48 layers, 6.5 t/s
llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 13 -c 40960 -fa -t 5 \
  -b 256 -ub 256 --temp 0.7 --top-k 40 --top-p 0.95 --min-p 0.05 \
  --repeat-penalty 1.1

# New: all layers, specific expert tensors on CPU, 10 t/s
llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 48 \
  -ot "blk\.(0?[2-9]|1[2-9]|2[1-9]|3[1-9]|4[1-7])\.ffn_.*_exps\.=CPU" \
  -c 40960 -fa -t 5 -b 256 -ub 256 --temp 0.7 --top-k 40 --top-p 0.95 \
  --min-p 0.05 --repeat-penalty 1.1
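The -ot regex is hard to read at a glance. The sketch below lists which of the 48 blocks it leaves alone, meaning their expert tensors stay on the GPU. It assumes expert tensors are named like blk.<N>.ffn_up_exps.weight and that grep -E treats this pattern the same way llama.cpp's regex engine does; both are assumptions, and the function name gpu_expert_blocks is hypothetical.

```shell
# Pattern copied from the -ot argument above (the part before "=CPU").
pattern='blk\.(0?[2-9]|1[2-9]|2[1-9]|3[1-9]|4[1-7])\.ffn_.*_exps\.'

# gpu_expert_blocks N_LAYERS
# Prints the block indices whose expert tensors the pattern does NOT match,
# i.e. the blocks whose experts stay on the GPU.
gpu_expert_blocks() {
  n="$1"; i=0; out=""
  while [ "$i" -lt "$n" ]; do
    # Assumed tensor name layout: blk.<i>.ffn_up_exps.weight
    if ! printf 'blk.%d.ffn_up_exps.weight\n' "$i" | grep -Eq "$pattern"; then
      out="$out$i "
    fi
    i=$((i + 1))
  done
  echo "${out% }"
}

gpu_expert_blocks 48   # prints: 0 1 10 11 20 30 40
```

So the pattern sends the experts of 41 blocks to the CPU and keeps 7 blocks' experts on the GPU. To keep more experts in VRAM, narrow the digit ranges and rerun the check before launching the model.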

For Qwen3.5-122B-A10B (recommended new commands)

# Option 1: -cmoe (simplest, purpose-built for MoE)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -ngl 99 -cmoe \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 2048 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
  -fitc 131072 -fitt 256 \
  --cache-ram 0 --parallel 1 \
  --reasoning-budget 1024 \
  --reasoning-budget-message "... thinking budget exceeded, let's answer now."

# Option 2: -ncmoe N (partial, first N layers' experts on CPU)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -ngl 99 -ncmoe 33 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 2048 -t 6 \
  --cache-ram 0 --parallel 1

Check tensor names first

Qwen3.5 may not use the same _exps tensor naming as Qwen3. Verify with:

llama-cli --model your-model.gguf --verbose 2>&1 | grep "blk\." | head -50
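Fifty raw log lines can be hard to scan. A small sketch that reduces them to the distinct per-block tensor kinds, so it is obvious whether any names end in _exps. It assumes tensor names appear in the log in the form blk.<N>.<kind>; the function name tensor_kinds is hypothetical.

```shell
# Reads log lines on stdin and prints each distinct tensor kind that
# follows a "blk.<N>." prefix, one per line, sorted.
tensor_kinds() {
  sed -nE 's/.*blk\.[0-9]+\.([A-Za-z0-9_]+).*/\1/p' | sort -u
}

# Example with made-up log lines; pipe the real command's output in instead.
printf '%s\n' \
  'tensor blk.0.attn_q.weight' \
  'tensor blk.0.ffn_up_exps.weight' \
  'tensor blk.1.ffn_up_exps.weight' | tensor_kinds
```

With real output, replace the printf with the llama-cli command above and drop the grep and head stages, since tensor_kinds already filters on the blk. prefix.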

Use -cmoe / -ncmoe instead of -ot regex for MoE models — they are purpose-built and don't require knowing tensor names.

Your Current Command (reference)

llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -b 2048 -ub 2048 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
  -fitc 131072 --fit on -fitt 256 \
  --cache-ram 0 --parallel 1 \
  --reasoning-budget 1024 \
  --reasoning-budget-message "... thinking budget exceeded, let's answer now."

Key additions to try: -cmoe and --cache-type-k q8_0 --cache-type-v q8_0 for better VRAM utilization.

Upgrade Path

Add 32GB RAM (~$30-40 used DDR4)

Qwen3.5-122B-A10B Q3_K_M (56.4GB) — significant quality jump
Qwen3.5-122B-A10B UD-Q3_K_XL (57GB) — Unsloth Dynamic version
Llama 4 Scout UD-Q3_K_XL (49GB) — at a good quantization level

