Inference Engines
llama.cpp is the only engine. Build b8736 (2026-04-09), CUDA 13.0, sm_86.
Decision Matrix
| Criteria | llama.cpp | ExLlamaV2 | vLLM | SGLang |
|---|---|---|---|---|
| Works on 3060 12GB | Excellent | GPU-only | Problematic | Tight |
| CPU+GPU hybrid | Best | No | No | No |
| Speed (full GPU) | Good | Best | Good | Good |
| Speed (partial offload) | Best | N/A | N/A | N/A |
| Quant quality | Good (GGUF) | Best (EXL2) | Good (AWQ) | Good (AWQ) |
| Multi-user serving | OK | OK (TabbyAPI) | Best | Very Good |
| Agentic prefix caching | Manual | No | Manual | Automatic |
| Tool calling | Yes (--jinja) | Via TabbyAPI | Yes | Yes |
| Speculative decoding | Yes | No | Yes | Yes |
| Platform | All | NVIDIA only | Linux only | Linux only |
llama.cpp is the only engine in use here, and GGUF is the only format. ExLlamaV2 is archived, and ExLlamaV3 is slower on Ampere (confirmed by turboderp, issue #144). TensorRT-LLM's FP8 path is unavailable on the RTX 3060: FP8 needs compute capability 8.9+, and the 3060 is 8.6. vLLM and SGLang are not practical for a 12GB single-user setup.
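The compute capability gates mentioned in this document (FP8 at 8.9+, SGLang at 8.0+, vLLM at 7.0+) can be checked with a small comparison. This is a minimal sketch; `cc_at_least` is a hypothetical helper introduced for illustration, and the thresholds are the ones quoted in this document, not queried from any tool.

```shell
#!/bin/sh
# cc_at_least is a hypothetical helper: succeeds when compute capability $1 >= $2.
# Both arguments must be "major.minor" strings such as 8.6.
cc_at_least() {
  have_major="${1%%.*}"; have_minor="${1#*.}"
  need_major="${2%%.*}"; need_minor="${2#*.}"
  if [ "$have_major" -ne "$need_major" ]; then
    [ "$have_major" -gt "$need_major" ]
  else
    [ "$have_minor" -ge "$need_minor" ]
  fi
}

gpu_cc="8.6"   # RTX 3060, as stated in this document

# Minimums as quoted in this document, one "name min_cc" pair per line.
while read -r name min_cc; do
  if cc_at_least "$gpu_cc" "$min_cc"; then
    echo "$name: compute capability OK ($gpu_cc >= $min_cc)"
  else
    echo "$name: unavailable ($gpu_cc < $min_cc)"
  fi
done <<EOF
TensorRT-LLM-FP8 8.9
SGLang 8.0
vLLM 7.0
EOF
```

Passing the capability check only means the engine can start; it says nothing about whether a model fits in 12GB.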
1. llama.cpp Primary
Why It Wins for 3060 12GB
- Only engine with real CPU+GPU hybrid inference (essential when models barely fit)
- Best GGUF quantization ecosystem (IQ2, IQ3, IQ4 — the low-bit quants that make 35B MoE fit)
- Tensor-level offloading (-ot), MoE-specific offloading (-cmoe)
- Flash attention, KV cache quantization, speculative decoding
- Widest model support, single binary, weekly updates
Build for RTX 3060
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build --config Release -j$(nproc)
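The architecture value "86" above is the GPU's compute capability (8.6) with the dot removed. The following sketch derives it instead of hard-coding it. It assumes an NVIDIA driver recent enough to support the `compute_cap` query field in `nvidia-smi`, and it falls back to the RTX 3060 value when detection fails. `cc_to_arch` is a hypothetical helper, and the script only prints the configure command rather than running it.

```shell
#!/bin/sh
# cc_to_arch is a hypothetical helper: turns "8.6" into "86",
# the form CMAKE_CUDA_ARCHITECTURES expects.
cc_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Query the first GPU if nvidia-smi is present (assumes the compute_cap field is supported).
gpu_cc=""
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_cc="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)" || true
fi

# Fall back to the RTX 3060 value if detection failed or returned something unexpected.
case "$gpu_cc" in
  [0-9]*.[0-9]*) ;;
  *) gpu_cc="8.6" ;;
esac

# Print the configure command for review; run it manually once it looks right.
echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"$(cc_to_arch "$gpu_cc")\""
```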
GGUF Quantization Formats
| Format | ~bpw | Family | Size (7B) | Quality |
|---|---|---|---|---|
| IQ2_XXS | 2.06 | I-quant | ~1.8G | Very low |
| IQ3_XXS | 3.06 | I-quant | ~2.6G | Low |
| Q3_K_S | 3.50 | K-quant | ~3.0G | Low-Med |
| IQ4_XS | 4.46 | I-quant | ~4.2G | Good |
| Q4_K_M | 4.89 | K-quant | ~4.6G | Most popular default |
| Q5_K_M / Q5_K_L | 5.75 | K-quant | ~5.3G | Near-lossless. Bartowski's favorite. |
| Q8_0 | 8.50 | Legacy | ~7.2G | Near-lossless |
I-quants use non-linear, codebook-based quantization and lean on an importance matrix (imatrix): better quality per bit, but only with a good imatrix. K-quants use hierarchical super-blocks with per-block scales: more predictable and broadly compatible.
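The bpw column gives a quick way to estimate file size for any model: parameters times bits per weight, divided by 8 bits per byte. A rough sketch, where `estimate_gb` is a hypothetical helper (not part of llama.cpp) and 7 is a nominal parameter count in billions:

```shell
#!/bin/sh
# estimate_gb is a hypothetical helper, not part of llama.cpp.
# Size in GB (10^9 bytes) ~= parameters (billions) * bits per weight / 8.
estimate_gb() {
  awk -v params_b="$1" -v bpw="$2" 'BEGIN { printf "%.2f\n", params_b * bpw / 8 }'
}

# bpw values taken from the table above, one "format bpw" pair per line.
while read -r name bpw; do
  printf '%-8s ~%s GB\n' "$name" "$(estimate_gb 7 "$bpw")"
done <<EOF
IQ2_XXS 2.06
IQ4_XS 4.46
Q4_K_M 4.89
Q8_0 8.50
EOF
```

Treat the result as a first-pass estimate only. Real files deviate from it, and the sizes listed in the table do not match it exactly, since nominal parameter counts are rounded and GGUF files carry metadata on top of the weights.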
Key Optimizations
| Flag | Effect | Recommendation |
|---|---|---|
| -fa on | Flash attention | Always on |
| -ctk q8_0 -ctv q8_0 | Halve KV cache VRAM | Always on (negligible quality loss) |
| -ctk q4_0 -ctv q4_0 | 1/4 KV cache VRAM | Only if VRAM tight (degrades tool calling) |
| -ngl 99 | All layers to GPU | Combine with -cmoe for MoE |
| -cmoe | Expert weights on CPU | Use for all MoE models |
| -t 6 | CPU threads | Match 6 physical cores |
| --no-mmap | Force into RAM | More predictable memory |
| -ub 2048 | Ubatch for prefill | Higher = faster prefill |
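Put together, the recommended flags might be assembled as in the sketch below. `build_server_cmd` is a hypothetical helper and the model paths are placeholders. It assumes the binary sits at `build/bin/llama-server` after the CMake build, and it only prints the command so it can be reviewed before running. `--jinja` is included for tool calling, per the decision matrix.

```shell
#!/bin/sh
# build_server_cmd is a hypothetical helper: it prints a llama-server command
# line assembled from the flags in the table above; it does not run it.
# $1 = path to a GGUF file (placeholder)
# $2 = "moe" to keep expert weights on the CPU (optional, defaults to dense)
build_server_cmd() {
  model_path="$1"
  kind="${2:-dense}"
  # Positional parameters are reused as the argument list being built.
  set -- build/bin/llama-server -m "$model_path" \
    -ngl 99 -fa on -ctk q8_0 -ctv q8_0 -t 6 -ub 2048 --no-mmap --jinja
  if [ "$kind" = "moe" ]; then
    set -- "$@" -cmoe
  fi
  echo "$*"
}

build_server_cmd models/example-moe.gguf moe
build_server_cmd models/example-dense.gguf
```

The q4_0 cache types are deliberately left out, since the table reserves them for cases where VRAM is tight.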
2. ExLlamaV2 Archived
github.com/turboderp-org/exllamav2
ExLlamaV2 is no longer maintained. ExLlamaV3 was confirmed slower on Ampere (RTX 3060, sm_86) by turboderp in issue #144. Since all our models need CPU offloading (35B-A3B at 17.8GB, 122B at ~44.7GB), and ExLlama is GPU-only, it was never viable for this setup anyway.
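The offloading requirement follows from simple arithmetic: whatever does not fit in usable VRAM has to live in system RAM. A minimal sketch, where `spill_gb` is a hypothetical helper and the 2GB reserve for KV cache, CUDA context, and display is an assumed figure rather than a measurement:

```shell
#!/bin/sh
# spill_gb is a hypothetical helper: GB of model weights that cannot sit in VRAM.
# $1 = model size in GB
# $2 = total VRAM in GB
# $3 = GB reserved for KV cache, CUDA context and display (assumed, tune per workload)
spill_gb() {
  awk -v model="$1" -v vram="$2" -v reserve="$3" 'BEGIN {
    spill = model - (vram - reserve)
    if (spill < 0) spill = 0
    printf "%.1f\n", spill
  }'
}

spill_gb 17.8 12 2    # 35B-A3B at 17.8GB
spill_gb 44.7 12 2    # 122B at ~44.7GB
```

Any nonzero result rules out a GPU-only engine for that model on this card.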
Historical Speed Advantage (for reference)
| Metric | ExLlamaV2 (EXL2) | llama.cpp (GGUF) |
|---|---|---|
| Prompt processing | ~14,000 t/s | ~7,500 t/s |
| Token generation | ~64 t/s | ~52-56 t/s |
| Speedup | ~1.9x prompt, ~1.15x gen | baseline |
These numbers are moot since our models require CPU+GPU hybrid inference.
3. vLLM Not Recommended
- RTX 3060 has CC 8.6 (meets minimum CC 7.0) but V1 engine has memory issues on 12GB cards
- No CPU offloading — model must fit entirely in VRAM
- GGUF performance is poor (93 t/s vs llama.cpp's native speed)
- Designed for multi-tenant serving with ample VRAM
- Only advantage: PagedAttention + continuous batching for multiple concurrent users
4. SGLang Not Practical
- RadixAttention for automatic prefix caching (ideal for agentic loops)
- Requires CC >= 8.0 (3060 qualifies at 8.6)
- No CPU offloading — same VRAM constraint as vLLM
- dFlash block diffusion support (cutting-edge, datacenter GPUs only)
- Only worth considering if you build agentic pipelines AND can fit model in 12GB
5. Other Engines
Ollama (llama.cpp wrapper)
Two commands get you to running inference. Use it for zero-friction model testing, then switch to raw llama.cpp for tuned performance.
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama3
TensorRT-LLM
20-40% faster than vLLM but requires engine rebuilds for every model change. 2-4 week learning curve. Skip for consumer hardware.
MLC-LLM
Edge/mobile/browser deployment (WebGPU, iOS, Android). Not relevant for desktop.
KTransformers
Closest competitor to llama.cpp for CPU-GPU hybrid with MoE models. Requires Intel AMX (your Haswell Xeon doesn't have it) and targets 24GB+ VRAM + 128GB+ RAM. Not practical for your setup.