Inference Engines

llama.cpp is the only engine used in this setup: build b8736 (2026-04-09), CUDA 13.0, sm_86.

Decision Matrix

Criteria	llama.cpp	ExLlamaV2	vLLM	SGLang
Works on 3060 12GB	Excellent	GPU-only	Problematic	Tight
CPU+GPU hybrid	Best	No	No	No
Speed (full GPU)	Good	Best	Good	Good
Speed (partial offload)	Best	N/A	N/A	N/A
Quant quality	Good (GGUF)	Best (EXL2)	Good (AWQ)	Good (AWQ)
Multi-user serving	OK	OK (TabbyAPI)	Best	Very Good
Agentic prefix caching	Manual	No	Manual	Automatic
Tool calling	Yes (--jinja)	Via TabbyAPI	Yes	Yes
Speculative decoding	Yes	No	Yes	Yes
Platform	All	NVIDIA only	Linux only	Linux only

Recommendation: llama.cpp only

llama.cpp is the only engine and GGUF is the only format. ExLlamaV2 is archived, and ExLlamaV3 is slower on Ampere (confirmed by turboderp, issue #144). TensorRT-LLM's FP8 path is unavailable on the RTX 3060, which has compute capability 8.6 where FP8 needs 8.9+. vLLM and SGLang are not practical for a 12GB single-user setup.

1. llama.cpp (Primary)

github.com/ggml-org/llama.cpp

Why It Wins for 3060 12GB

Only engine with real CPU+GPU hybrid inference (essential when models barely fit)
Best GGUF quantization ecosystem (IQ2, IQ3, IQ4 — the low-bit quants that make 35B MoE fit)
Tensor-level offloading (-ot), MoE-specific offloading (-cmoe)
Flash attention, KV cache quantization, speculative decoding
Widest model support, single binary, weekly updates
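
To make the KV cache quantization point concrete, the arithmetic below estimates cache size for three cache types. It is a rough sketch, not llama.cpp's own accounting: the model dimensions (32 layers, 8 KV heads, head size 128, 32768-token context) are hypothetical, kv_cache_mib is an illustrative helper name, and the block sizes assume ggml's layout of 34 bytes per 32 values for q8_0 and 18 bytes per 32 values for q4_0.

```shell
#!/bin/sh
# Rough KV cache size estimate in MiB. Illustrative only; real usage also
# includes compute buffers and depends on the model architecture.
kv_cache_mib() {
  # $1 = layers, $2 = KV heads, $3 = head dim, $4 = context length, $5 = cache type
  elems=$(( 2 * $1 * $2 * $3 * $4 ))          # factor 2: one K and one V tensor
  case "$5" in
    f16)  bytes=$(( elems * 2 )) ;;           # 2 bytes per value
    q8_0) bytes=$(( elems / 32 * 34 )) ;;     # 32 int8 values + fp16 scale
    q4_0) bytes=$(( elems / 32 * 18 )) ;;     # 32 4-bit values + fp16 scale
    *)    echo "unknown cache type: $5" >&2; return 1 ;;
  esac
  echo $(( bytes / 1048576 ))
}

# Hypothetical model: 32 layers, 8 KV heads, head dim 128, 32768-token context.
for t in f16 q8_0 q4_0; do
  echo "$t: $(kv_cache_mib 32 8 128 32768 "$t") MiB"
done
# f16: 4096 MiB
# q8_0: 2176 MiB
# q4_0: 1152 MiB
```

Under these assumptions q8_0 comes to about 53% of the f16 cache and q4_0 to about 28%, which is where the usual "half" and "quarter" figures come from.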

Build for RTX 3060

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build --config Release -j$(nproc)
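
The value 86 in the build above is the RTX 3060's compute capability (8.6) with the dot removed. For a different card, a small helper can derive it. This is a sketch that assumes a driver recent enough for nvidia-smi to support the compute_cap query field; cc_to_arch is a hypothetical helper name, and the script only prints the configure command rather than running it.

```shell
#!/bin/sh
# Convert a compute capability such as "8.6" into the form CMake expects ("86").
cc_to_arch() {
  echo "$1" | tr -d '.'
}

# Query the first GPU if nvidia-smi is present; otherwise fall back to 8.6 (RTX 3060).
if command -v nvidia-smi >/dev/null 2>&1; then
  cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
fi
cc=${cc:-8.6}

# Print the configure command for review.
echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"$(cc_to_arch "$cc")\""
```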

GGUF Quantization Formats

Format	~bpw	Family	Size (7B)	Quality
IQ2_XXS	2.06	I-quant	~1.8G	Very low
IQ3_XXS	3.06	I-quant	~2.6G	Low
Q3_K_S	3.50	K-quant	~3.0G	Low-Med
IQ4_XS	4.46	I-quant	~4.2G	Good
Q4_K_M	4.89	K-quant	~4.6G	Most popular default
Q5_K_M / Q5_K_L	5.75	K-quant	~5.3G	Near-lossless. Bartowski's favorite.
Q8_0	8.50	Legacy	~7.2G	Near-lossless

I-quants use non-linear, lookup-table-based quantization: better quality per bit, but the low-bit variants need a good importance matrix (imatrix). K-quants use hierarchical super-blocks: more predictable and broadly compatible.
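
As a rule of thumb, file size follows from bits per weight: size ≈ parameters × bpw / 8 bytes. The sketch below applies that with integer shell arithmetic. est_size_mb is a hypothetical helper, and the results are only approximate, because a nominal "7B" model rarely has exactly 7 billion parameters and GGUF files mix tensor types and carry metadata.

```shell
#!/bin/sh
# Estimate model file size in decimal MB from parameter count and bits per weight.
# $1 = parameters in millions, $2 = bpw multiplied by 100 (4.89 -> 489)
est_size_mb() {
  echo $(( $1 * $2 / 800 ))    # params_M * 1e6 * (bpw / 8) bytes, expressed in MB
}

# Nominal 7B model at three of the bpw values from the table.
echo "IQ2_XXS: $(est_size_mb 7000 206) MB"   # IQ2_XXS: 1802 MB
echo "Q4_K_M:  $(est_size_mb 7000 489) MB"   # Q4_K_M:  4278 MB
echo "Q8_0:    $(est_size_mb 7000 850) MB"   # Q8_0:    7437 MB
```

The estimates land close to, but not exactly on, the sizes in the table, which is expected given the caveats above. The same formula is handy for checking whether a larger model at a given quant will need CPU offloading on a 12GB card.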

Key Optimizations

Flag	Effect	Recommendation
`-fa on`	Flash attention	Always on
`-ctk q8_0 -ctv q8_0`	Halve KV cache VRAM	Always on (negligible quality loss)
`-ctk q4_0 -ctv q4_0`	1/4 KV cache VRAM	Only if VRAM tight — degrades tool calling
`-ngl 99`	All layers to GPU	Combine with `-cmoe` for MoE
`-cmoe`	Expert weights on CPU	Use for all MoE models
`-t 6`	CPU threads	Match 6 physical cores
`--no-mmap`	Load the whole model into RAM instead of memory-mapping it	More predictable memory use
`-ub 2048`	Ubatch for prefill	Higher = faster prefill
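
Put together, the flags in the table might combine into a single launch command like the one below. This is a sketch for a MoE model on the 3060 rather than a tuned configuration: the model path is a hypothetical placeholder, llama-server (built into build/bin by the CMake steps) is assumed to be the binary in use, and the script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical model path; point this at a real GGUF file.
MODEL="${MODEL:-./models/model.gguf}"

# Assemble the recommended flags from the table above.
llama_flags() {
  echo "-m $MODEL -ngl 99 -cmoe -fa on -ctk q8_0 -ctv q8_0 -t 6 --no-mmap -ub 2048"
}

# Print the full command; remove the leading echo to actually start the server.
echo ./build/bin/llama-server $(llama_flags)
```

For a dense model that fits entirely in VRAM, dropping `-cmoe` from the list would be the obvious variation.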

2. ExLlamaV2 (Archived)

github.com/turboderp-org/exllamav2

ExLlamaV2 archived, ExLlamaV3 slower on Ampere

ExLlamaV2 is no longer maintained. ExLlamaV3 was confirmed slower on Ampere (RTX 3060, sm_86) by turboderp in issue #144. Since all our models need CPU offloading (35B-A3B at 17.8GB, 122B at ~44.7GB), and ExLlama is GPU-only, it was never viable for this setup anyway.

Historical Speed Advantage (for reference)

Metric	ExLlamaV2 (EXL2)	llama.cpp (GGUF)
Prompt processing	~14,000 t/s	~7,500 t/s
Token generation	~64 t/s	~52-56 t/s
Speedup	~1.9x prompt, ~1.15x gen	baseline

These numbers are moot since our models require CPU+GPU hybrid inference.

3. vLLM (Not Recommended)

github.com/vllm-project/vllm

The RTX 3060's CC 8.6 meets the CC 7.0 minimum, but the V1 engine has memory issues on 12GB cards
No CPU offloading — model must fit entirely in VRAM
GGUF performance is poor (93 t/s vs llama.cpp's native speed)
Designed for multi-tenant serving with ample VRAM
Only advantage: PagedAttention + continuous batching for multiple concurrent users

4. SGLang (Not Practical)

github.com/sgl-project/sglang

RadixAttention for automatic prefix caching (ideal for agentic loops)
Requires CC >= 8.0 (3060 qualifies at 8.6)
No CPU offloading — same VRAM constraint as vLLM
dFlash block diffusion support (cutting-edge, datacenter GPUs only)
Only worth considering if you build agentic pipelines AND can fit the model in 12GB
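
The compute-capability thresholds quoted in this document (7.0 for vLLM, 8.0 for SGLang, 8.9 for FP8) can be checked mechanically. The sketch below does only that; cc_at_least is a hypothetical helper, and passing the check says nothing about whether a model fits in 12GB of VRAM, which is the real blocker here.

```shell
#!/bin/sh
# Return success if compute capability $1 is at least $2 (e.g. 8.6 vs 8.9).
# Assumes single-digit minor versions.
cc_at_least() {
  have=$(echo "$1" | tr -d '.')
  need=$(echo "$2" | tr -d '.')
  [ "$have" -ge "$need" ]
}

cc=8.6   # RTX 3060
for entry in "vLLM:7.0" "SGLang:8.0" "FP8:8.9"; do
  name=${entry%%:*}
  min=${entry##*:}
  if cc_at_least "$cc" "$min"; then
    echo "$name: ok (needs $min)"
  else
    echo "$name: unavailable (needs $min)"
  fi
done
# vLLM: ok (needs 7.0)
# SGLang: ok (needs 8.0)
# FP8: unavailable (needs 8.9)
```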

5. Other Engines

Ollama (llama.cpp wrapper)

Two commands to get inference running. Use it for zero-friction model testing; switch to raw llama.cpp for tuned performance.

curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama3

TensorRT-LLM

20-40% faster than vLLM but requires engine rebuilds for every model change. 2-4 week learning curve. Skip for consumer hardware.

MLC-LLM

Edge/mobile/browser deployment (WebGPU, iOS, Android). Not relevant for desktop.

KTransformers

Closest competitor to llama.cpp for CPU-GPU hybrid with MoE models. Requires Intel AMX (your Haswell Xeon doesn't have it) and targets 24GB+ VRAM + 128GB+ RAM. Not practical for your setup.

