Command Reference

Copy-paste commands for every tier and use case

Build llama.cpp for RTX 3060

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build --config Release -j$(nproc)

10 t/s Tier — Planner

Your current command (reference baseline)

llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -b 2048 -ub 2048 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
  -fitc 131072 --fit on -fitt 256 \
  --cache-ram 0 --parallel 1 \
  --reasoning-budget 1024 \
  --reasoning-budget-message "... thinking budget exceeded, let's answer now."

Optimized: add -cmoe and KV cache quantization

llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -ngl 99 -cmoe \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 2048 -t 6 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
  -fitc 131072 -fitt 256 \
  --cache-ram 0 --parallel 1 \
  --reasoning-budget 1024 \
  --reasoning-budget-message "... thinking budget exceeded, let's answer now."

Partial MoE offload (experiment)

llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
  --no-mmap --no-mmproj --jinja \
  -ngl 99 -ncmoe 33 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 2048 -ub 2048 -t 6 \
  --cache-ram 0 --parallel 1

100 t/s Tier — Interactive

Qwen3.5-9B Q5_K_M (safe bet)

llama-server -hf unsloth/Qwen3.5-9B-GGUF:Q5_K_M \
  -ngl 99 -fa --jinja \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 8192 --port 8080

Qwen3.5-35B-A3B UD-IQ2_M (highest quality, IF bugs are fixed)

llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_M \
  -ngl 99 -fa --jinja \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 4096 --port 8080

Gemma 4 26B-A4B Q2_K (vision + text)

llama-server -hf bartowski/google_gemma-4-26b-a4b-it-GGUF:Q2_K \
  -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 4096 --port 8080

Qwen3.5-4B Q8_0 (speed demon)

llama-server -hf unsloth/Qwen3.5-4B-GGUF:Q8_0 \
  -ngl 99 -fa --jinja \
  -c 16384 --port 8080

HERETIC uncensored variant

llama-server -hf mradermacher/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED-i1-GGUF:i1-Q4_K_M \
  -ngl 99 -fa --jinja \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 8192 --port 8080

1000 t/s Tier — Speed

Speculative decoding: 4B target + 0.6B draft

llama-speculative-simple \
  -m qwen3.5-4b-q4_k_m.gguf \
  -md qwen3-0.6b-q4_k_m.gguf \
  -ngl 99 -ngld 99 -fa -c 4096 \
  --draft-max 16 --draft-min 5 --draft-p-min 0.9 \
  --sampling-seq k --top-k 1 --temp 0.0

Maximum raw speed (0.6B standalone)

llama-cli -m qwen3-0.6b-q4_k_m.gguf -ngl 99 -fa -c 2048

Tensor Offloading

Your old tested command (Qwen3-30B-A3B, 6.5 → 10 t/s)

llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 48 \
  -ot "blk\.(0?[2-9]|1[2-9]|2[1-9]|3[1-9]|4[1-7])\.ffn_.*_exps\.=CPU" \
  -c 40960 -fa -t 5 -b 256 -ub 256 \
  --temp 0.7 --top-k 40 --top-p 0.95 --min-p 0.05 \
  --repeat-penalty 1.1

Generic MoE: all experts on CPU

-ot ".*_exps\.weight=CPU"

Check tensor names for any model

llama-cli --model your-model.gguf --verbose 2>&1 | grep "blk\." | head -50

Vision / Multimodal

Qwen 2.5 VL 3B (working vision today)

llama-server -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF \
  -ngl 99 -fa -c 4096 --port 8080

Disable vision to save VRAM

--no-mmproj

Speech-to-Text

Parakeet (English)

pip install nemo_toolkit[asr]

python -c "
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v2')
output = model.transcribe(['audio.wav'], timestamps=True)
print(output[0].text)
"

Whisper (Japanese + 98 other languages)

pip install openai-whisper

whisper audio.wav --model large-v3 --language ja

llama.cpp Flag Reference

FlagShortDescriptionDefault
--n-gpu-layers-nglLayers to offload to GPU (99 or auto)auto
--ctx-size-cContext window sizefrom model
--batch-size-bLogical batch size for prompt processing2048
--ubatch-size-ubPhysical batch size (higher = faster prefill)512
--threads-tCPU threads for generationauto
--flash-attn-faFlash attentionauto
--cache-type-k-ctkKV cache key type (f16, q8_0, q4_0)f16
--cache-type-v-ctvKV cache value typef16
--cache-ram-cramRAM for prompt cache (0 = disable)unlimited
--no-mmapForce model into RAM (no memory mapping)off
--no-mmprojSkip vision projectoroff
--jinjaEnable Jinja chat template engineon (recent)
--override-tensor-otPlace specific tensors on CPU/GPU (regex)
--cpu-moe-cmoeAll MoE expert weights on CPUoff
--n-cpu-moe-ncmoeFirst N layers' MoE experts on CPU
--fitAuto-fit model to available memoryon
--fit-target-fittVRAM margin per device (MB)512
--fit-ctx-fitcMinimum context for fit algorithm4096
--reasoning-budgetThinking token budget (-1=unlimited, 0=off)-1
--parallelConcurrent request slots1