llama.cpp PRs & Optimizations

Recent merged PRs relevant to RTX 3060 12GB + Qwen 3.5 + Gemma 4. Build b8736 (2026-04-09).

Critical discovery: --reasoning off is mandatory

Without --reasoning off, thinking tokens consume the entire generation budget on small models, producing empty responses. This single flag is more important than any PR optimization. Per-request override: "chat_template_kwargs": {"enable_thinking": false} in the API body. The --reasoning-budget flag has a regression bug (#21487) and is ignored — use --reasoning off instead.

High-Impact PRs (apply now)

PR	What	Impact	How to Use
#21038	Activation rotation for better quantization	Quality++ Walsh-Hadamard rotation on Q/K/V before quantization. Improves ALL existing quant types. Huge for aggressive quants (Q4, Q3, Q2) on 12GB.	Automatic in latest builds. Free upgrade.
#18471	Self-speculative decoding (no draft model)	Segfaults `--spec-self 1` segfaults on Qwen3.5 (incompatible with Gated DeltaNet). Not usable.	`--spec-self 1` (broken on Qwen3.5)
#19164	N-gram speculative decoding	~2% slower Tested on this hardware. Slightly slower than baseline, not faster. May help other architectures.	`--spec-type ngram-mod`
#20993	Clear idle slots (`--clear-idle`)	No effect measured Tested: no effect on single requests. May help with --parallel N when slots go idle.	`--clear-idle` (for --parallel N servers)
#20905	MoE GEMV kernel optimization	9-22% faster MoE Dedicated warp-level kernel for MoE models at batch sizes 2-4.	Automatic. Benefits 35B-A3B, 122B-A10B, Gemma 4 26B-A4B.
#16653	Auto-fit system (`--fit`)	Essential Auto-determines optimal layer placement in 12GB. Reduces context, moves MoE experts to CPU.	`--fit` (on by default). `-fitt 256` for margin.
#18934	CUDA graphs for --cpu-moe	Speed++ CUDA graphs now work WITH -cmoe. Both optimizations combined.	Automatic when using `-cmoe`.
#16391	Host-memory prompt caching	Latency-- Uses system RAM as KV cache extension. Minimizes prompt reprocessing across requests.	`--cache-ram 8192` (uses 8GB RAM for prompt cache)

Qwen 3.5 Specific PRs

PR	What	Status	Impact
#19504	GATED_DELTA_NET op (core Qwen3.5 hybrid attention)	Merged	CPU/CUDA implementation of GDN layer. Makes Qwen3.5 work in llama.cpp.
#20340	Chunked fused GDN path	Merged	Better throughput for GDN layers vs basic vector implementation.
#20506	NVFP4 support wired for Qwen3.5	Merged	Extreme 4-bit floating point quantization for Qwen3.5 specifically.
#20700	MTP for dense Qwen3.5 with FastMTP vocab trimming	Open	Built-in multi-token prediction. FastMTP trims vocab 248K → 32K (3.7x faster drafting). No draft model needed. Watch this one.
#20653	Control vectors for Qwen3.5	Merged	Control vectors (steering) now work with Qwen3.5 models.
#20301	Dynamic head_dim for SWA	Merged	Different head dims for full-attention vs SWA layers. Required by Qwen3.5 hybrid arch.
#20087	Hybrid model cache checkpoints	Merged	`--checkpoint-every-nb` for hybrid recurrent/attention models like Qwen3.5.

Gemma 4 PRs (stabilizing)

PR	What	Status
#21309	Core Gemma 4 support (vision + MoE, no audio)	Merged
#21326 + #21343	Template parser + tokenizer fixes (fixed <unused24> spam)	Merged
#21418	Gemma 4 specialized tool-call parser	Merged
#21390	Logit softcapping fix	Merged
#21513	KV cache attention rotation for heterogeneous iSWA	Merged
#21625	Multimodal padding token fix	Merged
#21402	Vision CUDA crash (SIGABRT) on 26B/31B	Open bug
#21325	Audio support not implemented	Open
#21323	OOM with partial GPU offload on 26B	Open bug

CUDA Kernel Optimizations (Ampere / sm_86)

PR	What	Impact
#21168	128-bit reads for Q4_0/Q4_1 MMQ kernels	Faster quantized matrix multiply — direct t/s improvement for Q4 models
#21159	Optimized stream-K flash attention fixup kernel	Better FA throughput on Ampere
#20998	FA support for head_dim 512	Needed for some newer architectures
#20635	Better thread utilization for small K dimensions	Helps MoE expert layers specifically
#20525	Native BF16 flash attention for vec kernel	Cleaner compute path for BF16 single-token decode
#17795	Fewer GPU synchronizations between tokens	Direct t/s improvement by reducing sync stalls
#20595	Avoid creating CUDA context during device init	Reduces initial VRAM overhead
#18343	Optimized CUDA cumsum (2.5x vs CUB)	Faster MoE routing on all CUDA GPUs
#11894	Async data loading for FA	Simultaneous loading+compute. Requires Ampere+ (sm_80). +13% FA throughput.

KV Cache & Memory

PR	What	Impact
#21277	Don't quantize SWA KV cache	SWA layers keep full-precision KV (quantizing hurts quality). Reverted then re-applied — may be unstable with Gemma 4.
#21586	Extended cache quantization checks	Better validation prevents silent KV corruption.
#21224	Unified KV cache for hybrid models	Fix for Qwen3.5-style hybrid attention/recurrent using unified cache.
#18986	MLA: V tensor as view of K	Halves KV cache for DeepSeek-style MLA models.
#18166	`--direct-io` for model loading	Bypasses page cache — saves system RAM during model load.

TurboQuant / Rotation Status

TurboQuant PRs were closed for AI-generated code policy

#21062 (CUDA) and #21010 (Vulkan) were both closed. However, the activation rotation PR (#21038) was merged by ggerganov "in anticipation of TurboQuant PRs" — making regular quants with rotation potentially match TurboQuant quality. A CPU-only TurboQuant PR (#21089) is still open.

Speculative Decoding Roadmap

Method	PR	Status	VRAM Cost	Speedup
Self-speculative (token history)	#18471	Merged	Zero	~2.5x (code editing)
N-gram hash prediction	#19164	Merged	16MB	Good for reasoning
Draft model (traditional)	#10455	Merged	Draft model size	2-4x
EAGLE-3	#18039	Open	Small encoder	2-3x
Qwen3.5 native MTP	#20700	Open	Zero (built-in heads)	~2x (with FastMTP)
dFlash block diffusion	Discussion	No PR yet	Small drafter	~5x (datacenter only so far)

Tool Calling & Reasoning

PR	What	Impact
#18675	Autoparser refactoring	New models auto-supported without custom parsers. But broke `--reasoning-budget`.
#16932	Generalized XML-style tool-call parsing with streaming	Covers Qwen3-Coder, GLM, MiniMax, Kimi-K2, MiMo.
#12379	Streaming tool calls + thoughts with --jinja	OpenAI format streaming + thinking model support.
#18655	Built-in MCP client + agentic loop in webui	llama.cpp's own webui now has agentic capabilities.
#21036	Reasoning content preserved across turns	Fix for `reasoning_content` field in multi-turn.
#21487	`--reasoning-budget` regression	Open bug Budget value gets ignored. Fix: PR #21594.

Quantization Formats

PR	What	Impact
#19769	NVFP4 quantization type	Extreme 4-bit NVIDIA floating point. Wired for Qwen3.5 via #20506.
#20539	Mixed-precision per-tensor NVFP4/FP8 from ModelOpt	Per-tensor mixed precision from NVIDIA's quantization toolkit.
#21273	Q1_0 1-bit quantization (CPU)	Experimental extreme compression. CPU only for now.

TensorRT-LLM Comparison

TensorRT-LLM is NOT strictly better, even settled on one model

Feature	TensorRT-LLM	llama.cpp	Winner on 3060
Raw throughput (GPU-only)	30-50% faster	Baseline	TRT-LLM (but gap smaller without FP8)
FP8 quantization	Yes (cc 8.9+ only)	N/A	Not available on 3060 (cc 8.6)
CPU offloading	Weight streaming only (PCIe-bound)	Full layer/tensor offloading	llama.cpp
KV cache quant	INT8 only on 3060	Q4_0, Q8_0	llama.cpp (more options)
Self-speculative decoding	No	Yes (zero VRAM)	llama.cpp
N-gram speculative	Yes	Yes	Tie
EAGLE-3	Yes	Open PR	TRT-LLM (for now)
Tool calling	Via Triton only	Native --jinja	llama.cpp
-cmoe / -ot tensor offloading	No	Yes	llama.cpp
Activation rotation	No	Yes (#21038)	llama.cpp
--fit auto-distribution	No	Yes	llama.cpp
Host prompt caching	KV offload only	Full --cache-ram	llama.cpp
Engine rebuild on model change	10-30 min per change	Instant (load GGUF)	llama.cpp
License	Apache 2.0 (open source since Mar 2025)	MIT	Both open

Verdict: TRT-LLM's biggest advantage (FP8) is unavailable on RTX 3060 (cc 8.6, needs 8.9+). On INT4/INT8, the gap narrows to ~30-50%. Meanwhile, llama.cpp has self-speculative decoding, activation rotation, tensor offloading, auto-fit, and instant model switching — features TRT-LLM lacks entirely. TRT-LLM only makes sense on RTX 4090+ or datacenter hardware.

Flags to Add to Your Setup

Based on these PRs, flags to test beyond the current config:

Flag	PR	What it does	Try on
`--spec-self 1`	#18471	Self-speculative decoding, zero VRAM	All tiers
`--spec-type ngram-mod`	#19164	N-gram hash spec decoding, 16MB	Planner (reasoning)
`--clear-idle`	#20993	Free idle slot KV from VRAM	Speed tier (--parallel)
`--cache-ram 8192`	#16391	8GB RAM prompt cache	Interactive (agentic loops)
`--direct-io`	#18166	Skip page cache on model load	All tiers (saves RAM)
`--checkpoint-every-nb N`	#20087	Incremental checkpoint for hybrid models	Planner (Qwen3.5-122B)

Sources

← Inference Engines Harness & Agents →