Recent merged PRs relevant to RTX 3060 12GB + Qwen 3.5 + Gemma 4. Build b8736 (2026-04-09).
Critical discovery: --reasoning off is mandatory
Without --reasoning off, thinking tokens consume the entire generation budget on small models, producing empty responses. This single flag is more important than any PR optimization. Per-request override: "chat_template_kwargs": {"enable_thinking": false} in the API body. The --reasoning-budget flag has a regression bug (#21487) and is ignored — use --reasoning off instead.
| PR | What | Impact | How to Use |
| #21038 |
Activation rotation for better quantization |
Quality++ Walsh-Hadamard rotation on Q/K/V before quantization. Improves ALL existing quant types. Huge for aggressive quants (Q4, Q3, Q2) on 12GB. |
Automatic in latest builds. Free upgrade. |
| #18471 |
Self-speculative decoding (no draft model) |
Segfaults --spec-self 1 segfaults on Qwen3.5 (incompatible with Gated DeltaNet). Not usable. |
--spec-self 1 (broken on Qwen3.5) |
| #19164 |
N-gram speculative decoding |
~2% slower Tested on this hardware. Slightly slower than baseline, not faster. May help other architectures. |
--spec-type ngram-mod |
| #20993 |
Clear idle slots (--clear-idle) |
No effect measured Tested: no effect on single requests. May help with --parallel N when slots go idle. |
--clear-idle (for --parallel N servers) |
| #20905 |
MoE GEMV kernel optimization |
9-22% faster MoE Dedicated warp-level kernel for MoE models at batch sizes 2-4. |
Automatic. Benefits 35B-A3B, 122B-A10B, Gemma 4 26B-A4B. |
| #16653 |
Auto-fit system (--fit) |
Essential Auto-determines optimal layer placement in 12GB. Reduces context, moves MoE experts to CPU. |
--fit (on by default). -fitt 256 for margin. |
| #18934 |
CUDA graphs for --cpu-moe |
Speed++ CUDA graphs now work WITH -cmoe. Both optimizations combined. |
Automatic when using -cmoe. |
| #16391 |
Host-memory prompt caching |
Latency-- Uses system RAM as KV cache extension. Minimizes prompt reprocessing across requests. |
--cache-ram 8192 (uses 8GB RAM for prompt cache) |
| PR | What | Status | Impact |
| #19504 |
GATED_DELTA_NET op (core Qwen3.5 hybrid attention) |
Merged |
CPU/CUDA implementation of GDN layer. Makes Qwen3.5 work in llama.cpp. |
| #20340 |
Chunked fused GDN path |
Merged |
Better throughput for GDN layers vs basic vector implementation. |
| #20506 |
NVFP4 support wired for Qwen3.5 |
Merged |
Extreme 4-bit floating point quantization for Qwen3.5 specifically. |
| #20700 |
MTP for dense Qwen3.5 with FastMTP vocab trimming |
Open |
Built-in multi-token prediction. FastMTP trims vocab 248K → 32K (3.7x faster drafting). No draft model needed. Watch this one. |
| #20653 |
Control vectors for Qwen3.5 |
Merged |
Control vectors (steering) now work with Qwen3.5 models. |
| #20301 |
Dynamic head_dim for SWA |
Merged |
Different head dims for full-attention vs SWA layers. Required by Qwen3.5 hybrid arch. |
| #20087 |
Hybrid model cache checkpoints |
Merged |
--checkpoint-every-nb for hybrid recurrent/attention models like Qwen3.5. |
| PR | What | Impact |
| #21168 |
128-bit reads for Q4_0/Q4_1 MMQ kernels |
Faster quantized matrix multiply — direct t/s improvement for Q4 models |
| #21159 |
Optimized stream-K flash attention fixup kernel |
Better FA throughput on Ampere |
| #20998 |
FA support for head_dim 512 |
Needed for some newer architectures |
| #20635 |
Better thread utilization for small K dimensions |
Helps MoE expert layers specifically |
| #20525 |
Native BF16 flash attention for vec kernel |
Cleaner compute path for BF16 single-token decode |
| #17795 |
Fewer GPU synchronizations between tokens |
Direct t/s improvement by reducing sync stalls |
| #20595 |
Avoid creating CUDA context during device init |
Reduces initial VRAM overhead |
| #18343 |
Optimized CUDA cumsum (2.5x vs CUB) |
Faster MoE routing on all CUDA GPUs |
| #11894 |
Async data loading for FA |
Simultaneous loading+compute. Requires Ampere+ (sm_80). +13% FA throughput. |
| PR | What | Impact |
| #21277 |
Don't quantize SWA KV cache |
SWA layers keep full-precision KV (quantizing hurts quality). Reverted then re-applied — may be unstable with Gemma 4. |
| #21586 |
Extended cache quantization checks |
Better validation prevents silent KV corruption. |
| #21224 |
Unified KV cache for hybrid models |
Fix for Qwen3.5-style hybrid attention/recurrent using unified cache. |
| #18986 |
MLA: V tensor as view of K |
Halves KV cache for DeepSeek-style MLA models. |
| #18166 |
--direct-io for model loading |
Bypasses page cache — saves system RAM during model load. |
TurboQuant PRs were closed for AI-generated code policy
#21062 (CUDA) and #21010 (Vulkan) were both closed. However, the activation rotation PR (#21038) was merged by ggerganov "in anticipation of TurboQuant PRs" — making regular quants with rotation potentially match TurboQuant quality. A CPU-only TurboQuant PR (#21089) is still open.
| PR | What | Impact |
| #18675 |
Autoparser refactoring |
New models auto-supported without custom parsers. But broke --reasoning-budget. |
| #16932 |
Generalized XML-style tool-call parsing with streaming |
Covers Qwen3-Coder, GLM, MiniMax, Kimi-K2, MiMo. |
| #12379 |
Streaming tool calls + thoughts with --jinja |
OpenAI format streaming + thinking model support. |
| #18655 |
Built-in MCP client + agentic loop in webui |
llama.cpp's own webui now has agentic capabilities. |
| #21036 |
Reasoning content preserved across turns |
Fix for reasoning_content field in multi-turn. |
| #21487 |
--reasoning-budget regression |
Open bug Budget value gets ignored. Fix: PR #21594. |
| Feature | TensorRT-LLM | llama.cpp | Winner on 3060 |
| Raw throughput (GPU-only) |
30-50% faster |
Baseline |
TRT-LLM (but gap smaller without FP8) |
| FP8 quantization |
Yes (cc 8.9+ only) |
N/A |
Not available on 3060 (cc 8.6) |
| CPU offloading |
Weight streaming only (PCIe-bound) |
Full layer/tensor offloading |
llama.cpp |
| KV cache quant |
INT8 only on 3060 |
Q4_0, Q8_0 |
llama.cpp (more options) |
| Self-speculative decoding |
No |
Yes (zero VRAM) |
llama.cpp |
| N-gram speculative |
Yes |
Yes |
Tie |
| EAGLE-3 |
Yes |
Open PR |
TRT-LLM (for now) |
| Tool calling |
Via Triton only |
Native --jinja |
llama.cpp |
| -cmoe / -ot tensor offloading |
No |
Yes |
llama.cpp |
| Activation rotation |
No |
Yes (#21038) |
llama.cpp |
| --fit auto-distribution |
No |
Yes |
llama.cpp |
| Host prompt caching |
KV offload only |
Full --cache-ram |
llama.cpp |
| Engine rebuild on model change |
10-30 min per change |
Instant (load GGUF) |
llama.cpp |
| License |
Apache 2.0 (open source since Mar 2025) |
MIT |
Both open |