Inference Engines

llama.cpp is the only engine in use. Current build: b8736 (2026-04-09), CUDA 13.0, sm_86.

Decision Matrix

| Criteria | llama.cpp | ExLlamaV2 | vLLM | SGLang |
|---|---|---|---|---|
| Works on 3060 12GB | Excellent | GPU-only | Problematic | Tight |
| CPU+GPU hybrid | Best | No | No | No |
| Speed (full GPU) | Good | Best | Good | Good |
| Speed (partial offload) | Best | N/A | N/A | N/A |
| Quant quality | Good (GGUF) | Best (EXL2) | Good (AWQ) | Good (AWQ) |
| Multi-user serving | OK | OK (TabbyAPI) | Best | Very Good |
| Agentic prefix caching | Manual | No | Manual | Automatic |
| Tool calling | Yes (--jinja) | Via TabbyAPI | Yes | Yes |
| Speculative decoding | Yes | No | Yes | Yes |
| Platform | All | NVIDIA only | Linux only | Linux only |

Recommendation: llama.cpp only

llama.cpp is the only engine, and GGUF is the only format. ExLlamaV2 is archived, and ExLlamaV3 is slower on Ampere (confirmed by turboderp in issue #144). TensorRT-LLM's FP8 path is unavailable on the RTX 3060: FP8 needs compute capability 8.9 or higher, and the 3060 is 8.6. vLLM and SGLang are not practical for single-user inference on 12GB.

1. llama.cpp (Primary)

github.com/ggml-org/llama.cpp

Why It Wins for 3060 12GB

Build for RTX 3060

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build --config Release -j$(nproc)
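
The hard-coded "86" above is the RTX 3060's compute capability (8.6) with the dot removed. The following is a minimal sketch of deriving that value instead of typing it, assuming a driver recent enough that nvidia-smi supports the compute_cap query field. The helper names cap_to_arch and detect_cap are hypothetical, and the script only prints the configure command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: turn a compute capability such as "8.6" into the
# value CMAKE_CUDA_ARCHITECTURES expects ("86").
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Hypothetical helper: ask the driver for the first GPU's compute capability.
# Falls back to 8.6 (RTX 3060) when nvidia-smi is missing or returns nothing.
detect_cap() {
    cap=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
            | head -n 1 | tr -d ' ')"
    fi
    printf '%s\n' "${cap:-8.6}"
}

arch="$(cap_to_arch "$(detect_cap)")"

# Print the configure command instead of executing it, so the sketch is
# safe to run on a machine without CUDA or cmake installed.
echo "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"${arch}\""
```

The same capability value explains the FP8 limitation noted earlier: anything below 8.9 lacks FP8 support.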

GGUF Quantization Formats

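| Format | ~bpw | Family | Size (7B) | Quality |
|---|---|---|---|---|
| IQ2_XXS | 2.06 | I-quant | ~1.8 GB | Very low |
| IQ3_XXS | 3.06 | I-quant | ~2.6 GB | Low |
| Q3_K_S | 3.50 | K-quant | ~3.0 GB | Low-Med |
| IQ4_XS | 4.46 | I-quant | ~4.2 GB | Good |
| Q4_K_M | 4.89 | K-quant | ~4.6 GB | Most popular default |
| Q5_K_M / Q5_K_L | 5.75 | K-quant | ~5.3 GB | Near-lossless. Bartowski's favorite. |
| Q8_0 | 8.50 | Legacy | ~7.2 GB | Near-lossless |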

I-quants use importance-matrix reconstruction: they give better quality per bit but depend on a good imatrix. K-quants use hierarchical super-blocks, which makes them more predictable and broadly compatible.
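
The bits-per-weight column can be turned into a rough file-size estimate with the rule of thumb size ≈ parameters × bpw / 8. The sketch below applies that rule. The function name estimate_gb and the 7.0 billion parameter count are assumptions for illustration. Treat the result as a ballpark only: real GGUF files keep some tensors at higher precision, and nominal "7B" models vary in true parameter count, so the output will not reproduce the table exactly.

```shell
#!/bin/sh
# Hypothetical helper: estimate_gb PARAMS_IN_BILLIONS BPW
# Prints the approximate size in GB (10^9 bytes): params * bpw / 8.
estimate_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Assumed parameter count for a nominal 7B model (an illustration value,
# not taken from any specific model).
PARAMS_B=7.0

# bpw values copied from the table above.
while read -r name bpw; do
    printf '%-8s ~%s GB\n' "$name" "$(estimate_gb "$PARAMS_B" "$bpw")"
done <<EOF
IQ2_XXS 2.06
IQ4_XS 4.46
Q4_K_M 4.89
Q8_0 8.50
EOF
```

The same function gives a quick first check on whether a larger model at a given quant has any chance of fitting in 12GB before downloading it.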

Key Optimizations

| Flag | Effect | Recommendation |
|---|---|---|
| -fa on | Flash attention | Always on |
| -ctk q8_0 -ctv q8_0 | Halve KV cache VRAM | Always on (negligible quality loss) |
| -ctk q4_0 -ctv q4_0 | 1/4 KV cache VRAM | Only if VRAM is tight (degrades tool calling) |
| -ngl 99 | All layers to GPU | Combine with -cmoe for MoE |
| -cmoe | Expert weights on CPU | Use for all MoE models |
| -t 6 | CPU threads | Match 6 physical cores |
| --no-mmap | Force into RAM | More predictable memory |
| -ub 2048 | Ubatch for prefill | Higher = faster prefill |
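
Put together, the recommended flags might be combined into a single llama-server launch as sketched below. This assumes an MoE model (hence -cmoe) and the build/bin output directory of the CMake build. The model path, the run_llama_server wrapper, and the DRY_RUN switch are hypothetical. --jinja is included for tool calling, as listed in the decision matrix. By default the script only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical paths; override via environment variables.
LLAMA_SERVER="${LLAMA_SERVER:-./build/bin/llama-server}"
MODEL="${MODEL:-$HOME/models/model.gguf}"

# Hypothetical wrapper: assemble the recommended flags for one model file.
# With DRY_RUN=1 (the default) it prints the command; with DRY_RUN=0 it runs it.
run_llama_server() {
    model="$1"
    # -ngl 99 + -cmoe : all layers on GPU, MoE expert weights on CPU
    # -fa on          : flash attention
    # -ctk/-ctv q8_0  : 8-bit KV cache
    # -t 6            : one thread per physical core
    # -ub 2048        : larger ubatch for prefill
    # --no-mmap       : load the model fully into RAM
    # --jinja         : chat-template handling needed for tool calling
    set -- -m "$model" -ngl 99 -cmoe -fa on -ctk q8_0 -ctv q8_0 \
        -t 6 -ub 2048 --no-mmap --jinja
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$LLAMA_SERVER $*"
    else
        "$LLAMA_SERVER" "$@"
    fi
}

run_llama_server "$MODEL"
```

For a dense model that fits entirely in VRAM, dropping -cmoe from the list is the only change this sketch would need.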

2. ExLlamaV2 (Archived)

github.com/turboderp-org/exllamav2

ExLlamaV2 archived, ExLlamaV3 slower on Ampere

ExLlamaV2 is no longer maintained. ExLlamaV3 was confirmed slower on Ampere (RTX 3060, sm_86) by turboderp in issue #144. Because all our models need CPU offloading (35B-A3B at 17.8GB, 122B at ~44.7GB) and ExLlama is GPU-only, it was never viable for this setup anyway.

Historical Speed Advantage (for reference)

| Metric | ExLlamaV2 (EXL2) | llama.cpp (GGUF) |
|---|---|---|
| Prompt processing | ~14,000 t/s | ~7,500 t/s |
| Token generation | ~64 t/s | ~52-56 t/s |
| Speedup | ~1.9x prompt, ~1.15x gen | baseline |

These numbers are moot since our models require CPU+GPU hybrid inference.

3. vLLM (Not Recommended)

github.com/vllm-project/vllm

4. SGLang (Not Practical)

github.com/sgl-project/sglang

5. Other Engines

Ollama (llama.cpp wrapper)

Two commands get you to running inference. Use it for zero-friction model testing, then switch to raw llama.cpp for tuned performance.

curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama3

TensorRT-LLM

20-40% faster than vLLM, but it requires an engine rebuild for every model change and has a 2-4 week learning curve. Skip it for consumer hardware.

MLC-LLM

Edge/mobile/browser deployment (WebGPU, iOS, Android). Not relevant for desktop.

KTransformers

Closest competitor to llama.cpp for CPU-GPU hybrid with MoE models. Requires Intel AMX (your Haswell Xeon doesn't have it) and targets 24GB+ VRAM + 128GB+ RAM. Not practical for your setup.

Sources