Command Reference
Copy-paste commands for every tier and use case
Build llama.cpp for RTX 3060
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build --config Release -j$(nproc)
10 t/s Tier — Planner
Your current command (reference baseline)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
--no-mmap --no-mmproj --jinja \
-b 2048 -ub 2048 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
-fitc 131072 --fit on -fitt 256 \
--cache-ram 0 --parallel 1 \
--reasoning-budget 1024 \
--reasoning-budget-message "... thinking budget exceeded, let's answer now."
Optimized: add -cmoe and KV cache quantization
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
--no-mmap --no-mmproj --jinja \
-ngl 99 -cmoe \
--cache-type-k q8_0 --cache-type-v q8_0 \
-b 2048 -ub 2048 -t 6 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.01 \
-fitc 131072 -fitt 256 \
--cache-ram 0 --parallel 1 \
--reasoning-budget 1024 \
--reasoning-budget-message "... thinking budget exceeded, let's answer now."
Partial MoE offload (experiment)
llama-server -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ3_XXS \
--no-mmap --no-mmproj --jinja \
-ngl 99 -ncmoe 33 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-b 2048 -ub 2048 -t 6 \
--cache-ram 0 --parallel 1
100 t/s Tier — Interactive
Qwen3.5-9B Q5_K_M (safe bet)
llama-server -hf unsloth/Qwen3.5-9B-GGUF:Q5_K_M \
-ngl 99 -fa --jinja \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 8192 --port 8080
Qwen3.5-35B-A3B UD-IQ2_M (highest quality, IF bugs are fixed)
llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_M \
-ngl 99 -fa --jinja \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 4096 --port 8080
Gemma 4 26B-A4B Q2_K (vision + text)
llama-server -hf bartowski/google_gemma-4-26b-a4b-it-GGUF:Q2_K \
-ngl 99 -fa \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 4096 --port 8080
Qwen3.5-4B Q8_0 (speed demon)
llama-server -hf unsloth/Qwen3.5-4B-GGUF:Q8_0 \
-ngl 99 -fa --jinja \
-c 16384 --port 8080
HERETIC uncensored variant
llama-server -hf mradermacher/Qwen3.5-9B-Claude-4.6-HighIQ-THINKING-HERETIC-UNCENSORED-i1-GGUF:i1-Q4_K_M \
-ngl 99 -fa --jinja \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 8192 --port 8080
1000 t/s Tier — Speed
Speculative decoding: 4B target + 0.6B draft
llama-speculative-simple \
-m qwen3.5-4b-q4_k_m.gguf \
-md qwen3-0.6b-q4_k_m.gguf \
-ngl 99 -ngld 99 -fa -c 4096 \
--draft-max 16 --draft-min 5 --draft-p-min 0.9 \
--sampling-seq k --top-k 1 --temp 0.0
Maximum raw speed (0.6B standalone)
llama-cli -m qwen3-0.6b-q4_k_m.gguf -ngl 99 -fa -c 2048
Tensor Offloading
Your old tested command (Qwen3-30B-A3B, 6.5 → 10 t/s)
llama-cli -m Qwen3-30B-A3B-UD-Q8_K_XL.gguf -ngl 48 \
-ot "blk\.(0?[2-9]|1[2-9]|2[1-9]|3[1-9]|4[1-7])\.ffn_.*_exps\.=CPU" \
-c 40960 -fa -t 5 -b 256 -ub 256 \
--temp 0.7 --top-k 40 --top-p 0.95 --min-p 0.05 \
--repeat-penalty 1.1
Generic MoE: all experts on CPU
-ot ".*_exps\.weight=CPU"
Check tensor names for any model
llama-cli --model your-model.gguf --verbose 2>&1 | grep "blk\." | head -50
Vision / Multimodal
Qwen 2.5 VL 3B (working vision today)
llama-server -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF \
-ngl 99 -fa -c 4096 --port 8080
Disable vision to save VRAM
--no-mmproj
Speech-to-Text
Parakeet (English)
pip install nemo_toolkit[asr]
python -c "
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v2')
output = model.transcribe(['audio.wav'], timestamps=True)
print(output[0].text)
"
Whisper (Japanese + 98 other languages)
pip install openai-whisper
whisper audio.wav --model large-v3 --language ja
llama.cpp Flag Reference
| Flag | Short | Description | Default |
|---|---|---|---|
--n-gpu-layers | -ngl | Layers to offload to GPU (99 or auto) | auto |
--ctx-size | -c | Context window size | from model |
--batch-size | -b | Logical batch size for prompt processing | 2048 |
--ubatch-size | -ub | Physical batch size (higher = faster prefill) | 512 |
--threads | -t | CPU threads for generation | auto |
--flash-attn | -fa | Flash attention | auto |
--cache-type-k | -ctk | KV cache key type (f16, q8_0, q4_0) | f16 |
--cache-type-v | -ctv | KV cache value type | f16 |
--cache-ram | -cram | RAM for prompt cache (0 = disable) | unlimited |
--no-mmap | Force model into RAM (no memory mapping) | off | |
--no-mmproj | Skip vision projector | off | |
--jinja | Enable Jinja chat template engine | on (recent) | |
--override-tensor | -ot | Place specific tensors on CPU/GPU (regex) | |
--cpu-moe | -cmoe | All MoE expert weights on CPU | off |
--n-cpu-moe | -ncmoe | First N layers' MoE experts on CPU | |
--fit | Auto-fit model to available memory | on | |
--fit-target | -fitt | VRAM margin per device (MB) | 512 |
--fit-ctx | -fitc | Minimum context for fit algorithm | 4096 |
--reasoning-budget | Thinking token budget (-1=unlimited, 0=off) | -1 | |
--parallel | Concurrent request slots | 1 |