llama.cpp PRs & Optimizations

Recent merged PRs relevant to RTX 3060 12GB + Qwen 3.5 + Gemma 4. Build b8736 (2026-04-09).

Critical discovery: --reasoning off is mandatory

Without --reasoning off, thinking tokens consume the entire generation budget on small models, producing empty responses. This single flag is more important than any PR optimization. Per-request override: "chat_template_kwargs": {"enable_thinking": false} in the API body. The --reasoning-budget flag has a regression bug (#21487) and is ignored — use --reasoning off instead.

High-Impact PRs (apply now)

PRWhatImpactHow to Use
#21038 Activation rotation for better quantization Quality++ Walsh-Hadamard rotation on Q/K/V before quantization. Improves ALL existing quant types. Huge for aggressive quants (Q4, Q3, Q2) on 12GB. Automatic in latest builds. Free upgrade.
#18471 Self-speculative decoding (no draft model) Segfaults --spec-self 1 segfaults on Qwen3.5 (incompatible with Gated DeltaNet). Not usable. --spec-self 1 (broken on Qwen3.5)
#19164 N-gram speculative decoding ~2% slower Tested on this hardware. Slightly slower than baseline, not faster. May help other architectures. --spec-type ngram-mod
#20993 Clear idle slots (--clear-idle) No effect measured Tested: no effect on single requests. May help with --parallel N when slots go idle. --clear-idle (for --parallel N servers)
#20905 MoE GEMV kernel optimization 9-22% faster MoE Dedicated warp-level kernel for MoE models at batch sizes 2-4. Automatic. Benefits 35B-A3B, 122B-A10B, Gemma 4 26B-A4B.
#16653 Auto-fit system (--fit) Essential Auto-determines optimal layer placement in 12GB. Reduces context, moves MoE experts to CPU. --fit (on by default). -fitt 256 for margin.
#18934 CUDA graphs for --cpu-moe Speed++ CUDA graphs now work WITH -cmoe. Both optimizations combined. Automatic when using -cmoe.
#16391 Host-memory prompt caching Latency-- Uses system RAM as KV cache extension. Minimizes prompt reprocessing across requests. --cache-ram 8192 (uses 8GB RAM for prompt cache)

Qwen 3.5 Specific PRs

PRWhatStatusImpact
#19504 GATED_DELTA_NET op (core Qwen3.5 hybrid attention) Merged CPU/CUDA implementation of GDN layer. Makes Qwen3.5 work in llama.cpp.
#20340 Chunked fused GDN path Merged Better throughput for GDN layers vs basic vector implementation.
#20506 NVFP4 support wired for Qwen3.5 Merged Extreme 4-bit floating point quantization for Qwen3.5 specifically.
#20700 MTP for dense Qwen3.5 with FastMTP vocab trimming Open Built-in multi-token prediction. FastMTP trims vocab 248K → 32K (3.7x faster drafting). No draft model needed. Watch this one.
#20653 Control vectors for Qwen3.5 Merged Control vectors (steering) now work with Qwen3.5 models.
#20301 Dynamic head_dim for SWA Merged Different head dims for full-attention vs SWA layers. Required by Qwen3.5 hybrid arch.
#20087 Hybrid model cache checkpoints Merged --checkpoint-every-nb for hybrid recurrent/attention models like Qwen3.5.

Gemma 4 PRs (stabilizing)

PRWhatStatus
#21309 Core Gemma 4 support (vision + MoE, no audio) Merged
#21326 + #21343 Template parser + tokenizer fixes (fixed <unused24> spam) Merged
#21418 Gemma 4 specialized tool-call parser Merged
#21390 Logit softcapping fix Merged
#21513 KV cache attention rotation for heterogeneous iSWA Merged
#21625 Multimodal padding token fix Merged
#21402 Vision CUDA crash (SIGABRT) on 26B/31B Open bug
#21325 Audio support not implemented Open
#21323 OOM with partial GPU offload on 26B Open bug

CUDA Kernel Optimizations (Ampere / sm_86)

PRWhatImpact
#21168 128-bit reads for Q4_0/Q4_1 MMQ kernels Faster quantized matrix multiply — direct t/s improvement for Q4 models
#21159 Optimized stream-K flash attention fixup kernel Better FA throughput on Ampere
#20998 FA support for head_dim 512 Needed for some newer architectures
#20635 Better thread utilization for small K dimensions Helps MoE expert layers specifically
#20525 Native BF16 flash attention for vec kernel Cleaner compute path for BF16 single-token decode
#17795 Fewer GPU synchronizations between tokens Direct t/s improvement by reducing sync stalls
#20595 Avoid creating CUDA context during device init Reduces initial VRAM overhead
#18343 Optimized CUDA cumsum (2.5x vs CUB) Faster MoE routing on all CUDA GPUs
#11894 Async data loading for FA Simultaneous loading+compute. Requires Ampere+ (sm_80). +13% FA throughput.

KV Cache & Memory

PRWhatImpact
#21277 Don't quantize SWA KV cache SWA layers keep full-precision KV (quantizing hurts quality). Reverted then re-applied — may be unstable with Gemma 4.
#21586 Extended cache quantization checks Better validation prevents silent KV corruption.
#21224 Unified KV cache for hybrid models Fix for Qwen3.5-style hybrid attention/recurrent using unified cache.
#18986 MLA: V tensor as view of K Halves KV cache for DeepSeek-style MLA models.
#18166 --direct-io for model loading Bypasses page cache — saves system RAM during model load.

TurboQuant / Rotation Status

TurboQuant PRs were closed for AI-generated code policy

#21062 (CUDA) and #21010 (Vulkan) were both closed. However, the activation rotation PR (#21038) was merged by ggerganov "in anticipation of TurboQuant PRs" — making regular quants with rotation potentially match TurboQuant quality. A CPU-only TurboQuant PR (#21089) is still open.

Speculative Decoding Roadmap

MethodPRStatusVRAM CostSpeedup
Self-speculative (token history) #18471 Merged Zero ~2.5x (code editing)
N-gram hash prediction #19164 Merged 16MB Good for reasoning
Draft model (traditional) #10455 Merged Draft model size 2-4x
EAGLE-3 #18039 Open Small encoder 2-3x
Qwen3.5 native MTP #20700 Open Zero (built-in heads) ~2x (with FastMTP)
dFlash block diffusion Discussion No PR yet Small drafter ~5x (datacenter only so far)

Tool Calling & Reasoning

PRWhatImpact
#18675 Autoparser refactoring New models auto-supported without custom parsers. But broke --reasoning-budget.
#16932 Generalized XML-style tool-call parsing with streaming Covers Qwen3-Coder, GLM, MiniMax, Kimi-K2, MiMo.
#12379 Streaming tool calls + thoughts with --jinja OpenAI format streaming + thinking model support.
#18655 Built-in MCP client + agentic loop in webui llama.cpp's own webui now has agentic capabilities.
#21036 Reasoning content preserved across turns Fix for reasoning_content field in multi-turn.
#21487 --reasoning-budget regression Open bug Budget value gets ignored. Fix: PR #21594.

Quantization Formats

PRWhatImpact
#19769 NVFP4 quantization type Extreme 4-bit NVIDIA floating point. Wired for Qwen3.5 via #20506.
#20539 Mixed-precision per-tensor NVFP4/FP8 from ModelOpt Per-tensor mixed precision from NVIDIA's quantization toolkit.
#21273 Q1_0 1-bit quantization (CPU) Experimental extreme compression. CPU only for now.

TensorRT-LLM Comparison

TensorRT-LLM is NOT strictly better, even settled on one model
FeatureTensorRT-LLMllama.cppWinner on 3060
Raw throughput (GPU-only) 30-50% faster Baseline TRT-LLM (but gap smaller without FP8)
FP8 quantization Yes (cc 8.9+ only) N/A Not available on 3060 (cc 8.6)
CPU offloading Weight streaming only (PCIe-bound) Full layer/tensor offloading llama.cpp
KV cache quant INT8 only on 3060 Q4_0, Q8_0 llama.cpp (more options)
Self-speculative decoding No Yes (zero VRAM) llama.cpp
N-gram speculative Yes Yes Tie
EAGLE-3 Yes Open PR TRT-LLM (for now)
Tool calling Via Triton only Native --jinja llama.cpp
-cmoe / -ot tensor offloading No Yes llama.cpp
Activation rotation No Yes (#21038) llama.cpp
--fit auto-distribution No Yes llama.cpp
Host prompt caching KV offload only Full --cache-ram llama.cpp
Engine rebuild on model change 10-30 min per change Instant (load GGUF) llama.cpp
License Apache 2.0 (open source since Mar 2025) MIT Both open

Verdict: TRT-LLM's biggest advantage (FP8) is unavailable on RTX 3060 (cc 8.6, needs 8.9+). On INT4/INT8, the gap narrows to ~30-50%. Meanwhile, llama.cpp has self-speculative decoding, activation rotation, tensor offloading, auto-fit, and instant model switching — features TRT-LLM lacks entirely. TRT-LLM only makes sense on RTX 4090+ or datacenter hardware.

Flags to Add to Your Setup

Based on these PRs, flags to test beyond the current config:

FlagPRWhat it doesTry on
--spec-self 1#18471Self-speculative decoding, zero VRAMAll tiers
--spec-type ngram-mod#19164N-gram hash spec decoding, 16MBPlanner (reasoning)
--clear-idle#20993Free idle slot KV from VRAMSpeed tier (--parallel)
--cache-ram 8192#163918GB RAM prompt cacheInteractive (agentic loops)
--direct-io#18166Skip page cache on model loadAll tiers (saves RAM)
--checkpoint-every-nb N#20087Incremental checkpoint for hybrid modelsPlanner (Qwen3.5-122B)

Sources