1000 t/s — The Speed Tier
Pushing the limits of generation speed on RTX 3060 12GB. Target: ~300M parameter class.
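The memory bandwidth ceiling (360 GB/s) caps single-stream autoregressive generation at roughly 1000 t/s in theory, and ~650 t/s in practice, for a 0.6B model at Q4. For useful-quality models, the realistic range with speculative decoding is ~200-300 t/s today; 400-500 t/s is plausible only if block-diffusion drafting (dFlash) reaches llama.cpp.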
The Physics
For autoregressive decoding (batch=1), every generated token requires reading all model weights from VRAM once, so throughput ≈ bandwidth / model_size_bytes:
| Model | Quant | Size on Disk | Theoretical Max | Realistic (~65%) |
|---|---|---|---|---|
| 0.6B (Qwen3-0.6B) | Q4_K_M | ~0.35 GB | ~1000 t/s | ~650 t/s |
| 1.5B (Qwen3.5-1.5B) | Q4_K_M | ~0.9 GB | ~400 t/s | ~260 t/s |
| 3-4B (Qwen3.5-4B) | Q4_K_M | ~2.2 GB | ~163 t/s | ~105 t/s |
| 7-8B | Q4_K_M | ~4.5 GB | ~80 t/s | ~52 t/s |
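The table can be reproduced with a few lines of arithmetic. The sketch below is a rough estimate only: it assumes decoding is purely bandwidth-bound, takes the 360 GB/s and ~65% figures from this page, and ignores KV-cache reads and kernel launch overhead. The function names are illustrative.

```python
# Bandwidth-bound decode estimate: each generated token reads all weights once.
BANDWIDTH_GBPS = 360.0   # RTX 3060 12GB memory bandwidth, as stated above
EFFICIENCY = 0.65        # the "realistic" fraction used in the table


def max_tokens_per_second(model_size_gb: float,
                          bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Theoretical batch=1 ceiling: bandwidth divided by bytes read per token."""
    return bandwidth_gbps / model_size_gb


def realistic_tokens_per_second(model_size_gb: float,
                                efficiency: float = EFFICIENCY) -> float:
    """Ceiling scaled by an assumed achievable fraction of peak bandwidth."""
    return max_tokens_per_second(model_size_gb) * efficiency


# Model sizes are the Q4_K_M figures from the table.
MODELS = [("Qwen3-0.6B", 0.35), ("Qwen3.5-1.5B", 0.9),
          ("Qwen3.5-4B", 2.2), ("7-8B", 4.5)]

for name, size_gb in MODELS:
    print(f"{name:>13}: max {max_tokens_per_second(size_gb):5.0f} t/s, "
          f"realistic {realistic_tokens_per_second(size_gb):5.0f} t/s")
```

Swapping in a different bandwidth figure gives the same estimate for another GPU, under the same assumptions.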
Candidate Small Models
Sub-1B: Maximum Speed (~500-650 t/s)
Qwen3-0.6B — 0.6B params, decent for its size. Expect ~500-650 t/s in Q4 on 3060. Limited capability but useful as a speculative decoding draft model.
1-2B: Best Speed/Quality Tradeoff (~200-260 t/s)
Qwen3.5-1.5B — Hybrid architecture (Gated DeltaNet + sparse MoE), 262K native context extensible to 1M. Competitive with much larger models.
3-4B: Best Quality That Fits Fast (~80-150 t/s)
| Model | Params | Highlights |
|---|---|---|
| Qwen3.5-4B | 4B (MoE hybrid) | MMLU-Pro 79.1, IFEval 89.8, 262K context, native MTP for speculative decoding, 201 languages, built-in vision |
| Phi-4-mini | 3.8B (dense) | GSM8K 88.6, MATH 64.0, 128K context. Good reasoning for size. |
| Gemma 3n E2B | 2B effective (6B raw) | MatFormer selective activation, multimodal (text+image+audio), 32K context. ~4-8GB VRAM. |
Speculative Decoding
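The most practical path to high speed: a small draft model proposes N tokens, and the large target model verifies all of them in one forward pass. Accepted tokens cost far less than a full target pass each, so when the acceptance rate is high the result is a 2-4x speedup.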
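The draft-then-verify loop can be illustrated with toy models. This is a minimal sketch of the greedy variant, not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the verification step is written as a Python loop where a GPU would score all positions in one batched pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic toy rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except when len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def speculative_generate(prompt: List[int], n_new: int, k: int,
                         draft: NextToken, target: NextToken) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks every proposed position (one batched pass on a GPU).
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all matched: free bonus token
        out.extend(accepted)
    return out[: len(prompt) + n_new], target_passes


def greedy_generate(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


tokens, passes = speculative_generate([1, 2, 3], 32, 4, draft_next, target_next)
print(f"{passes} target passes for 32 tokens; "
      f"matches plain greedy: {tokens == greedy_generate([1, 2, 3], 32, target_next)}")
```

The output is identical to plain greedy decoding with the target; only the number of target passes changes, which is where the speedup comes from.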
Recommended Pairings
| Target | Draft | Combined VRAM | Expected t/s |
|---|---|---|---|
| Qwen3.5-4B Q4 | Qwen3-0.6B Q4 | ~2.5 GB | 150-250 |
| Qwen3.5-4B Q4 | Qwen3.5-1.5B Q4 | ~3.1 GB | 120-170 |
| Qwen3-8B Q4 | Qwen3-0.6B Q4 | ~5 GB | 80-130 |
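For a rough sanity check on figures like these, the sketch below uses the standard simplified model of speculative decoding, in which each draft token is accepted independently with probability `alpha`. Both that independence and the bandwidth-bound cost ratio (draft size divided by target size) are assumptions; measured acceptance rates depend heavily on the prompt and the model pair.

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft+verify cycle with gamma draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain decoding.
    cost_ratio = time of one draft pass / time of one target pass."""
    return expected_tokens_per_cycle(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Assumption: pass time scales with model size, using the Q4 sizes from this page.
COST_RATIO = 0.35 / 2.2   # Qwen3-0.6B draft vs Qwen3.5-4B target

for alpha in (0.6, 0.8, 0.9):
    speedup = estimated_speedup(alpha, gamma=5, cost_ratio=COST_RATIO)
    print(f"acceptance {alpha:.1f}: ~{speedup:.2f}x over the target alone")
```

Multiplying the result by the target's standalone t/s gives an estimate to compare against the table; a larger draft model raises `cost_ratio`, which is why the 1.5B draft pairing is listed as slower.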
# Speculative decoding: 4B target + 0.6B draft
llama-speculative-simple \
-m qwen3.5-4b-q4_k_m.gguf \
-md qwen3-0.6b-q4_k_m.gguf \
-ngl 99 -ngld 99 -fa -c 4096 \
--draft-max 16 --draft-min 5 --draft-p-min 0.9 \
--sampling-seq k --top-k 1 --temp 0.0
Qwen3.5 Native MTP (Multi-Token Prediction)
Qwen3.5 models have built-in speculative decoding via MTP heads — no separate draft model needed. Simpler setup, ~1.5-2x speedup:
# Via vLLM (if it works on 3060)
vllm serve Qwen/Qwen3.5-4B \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
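To check whether MTP actually helps on a given card, it is worth measuring throughput directly. The sketch below assumes the vLLM server started by the command above is listening on its default port 8000 and serving its OpenAI-compatible completions endpoint; the helper names are illustrative, and the measured rate includes prefill and HTTP overhead.

```python
import json
import time
import urllib.request

# Assumptions: default vLLM port, model name as passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen/Qwen3.5-4B"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a greedy (temperature 0) completion request."""
    payload = {"model": MODEL, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def measure_tokens_per_second(prompt: str, max_tokens: int = 256) -> float:
    """End-to-end rate, so it slightly understates pure decode speed."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return body["usage"]["completion_tokens"] / elapsed


# Usage, once the server is running:
# print(measure_tokens_per_second("Explain speculative decoding in one paragraph."))
```

Running it once with the `--speculative-config` flag and once without gives a like-for-like comparison on the same prompt.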
dFlash: Block Diffusion
The most promising path to breaking the bandwidth ceiling. dFlash replaces autoregressive draft generation with block diffusion — generating all draft tokens in parallel in a single forward pass.
- Tested only on H200 and B200 GPUs
- Requires SGLang (datacenter-oriented, no CPU offloading)
- Not available in llama.cpp
- dFlash drafter adds VRAM overhead, tight on 12GB
Watch for llama.cpp integration. If dFlash lands there, a 4B target could theoretically reach 400-500 t/s on 3060.
The Reddit discussion confirms community interest, but there are no consumer GPU benchmarks yet.
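Why a single-pass drafter matters can be seen from a simple cycle-time model. Everything numeric here is an assumption for illustration: the per-pass times are derived from the "realistic" column earlier on this page (1000/105 ≈ 9.5 ms for the 4B target, 1000/650 ≈ 1.5 ms for a 0.6B-sized drafter), and the same acceptance is assumed for both drafters, which a real block-diffusion drafter may not match.

```python
def cycle_ms(target_ms: float, draft_ms: float, draft_passes: int) -> float:
    """One speculative cycle: some draft passes, then one target verify pass."""
    return draft_passes * draft_ms + target_ms


def tokens_per_second(tokens_per_cycle: float, target_ms: float,
                      draft_ms: float, draft_passes: int) -> float:
    return 1000.0 * tokens_per_cycle / cycle_ms(target_ms, draft_ms, draft_passes)


# Hypothetical values, not measurements.
TARGET_MS = 9.5    # one verify pass of a 4B Q4 target
DRAFT_MS = 1.5     # one pass of a small drafter
GAMMA = 8          # draft tokens proposed per cycle
ACCEPTED = 5.0     # assumed average tokens kept per cycle, same for both drafters

autoregressive = tokens_per_second(ACCEPTED, TARGET_MS, DRAFT_MS, draft_passes=GAMMA)
block = tokens_per_second(ACCEPTED, TARGET_MS, DRAFT_MS, draft_passes=1)

print(f"autoregressive draft: ~{autoregressive:.0f} t/s")
print(f"single-pass block draft: ~{block:.0f} t/s")
```

With an autoregressive drafter, drafting time grows with the number of proposed tokens; with a single-pass drafter it stays constant, so longer drafts become nearly free.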
RotorQuant & TurboQuant
KV-cache compression techniques (not weight quantization). They apply rotations before quantizing the KV cache, allowing higher compression with less quality loss.
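The effect of rotating before quantizing can be shown on a toy vector. The sketch below uses a plain Hadamard rotation and symmetric absmax 4-bit quantization; it is a generic rotate-then-quantize illustration, not the actual RotorQuant or TurboQuant algorithm, and the test vector is made up.

```python
import math
from typing import List


def hadamard(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two.
    It is its own inverse, so the same call undoes the rotation."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_dequantize(x: List[float], bits: int = 4) -> List[float]:
    """Symmetric absmax quantization to signed integers, then back to float."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in x) or 1.0
    step = amax / qmax
    return [round(v / step) * step for v in x]


def rms_error(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


# Toy KV vector: modest values plus one large outlier channel.
vec = [0.25 * ((i * 7) % 5 - 2) for i in range(16)]
vec[3] = 8.0

plain = quantize_dequantize(vec)                        # quantize directly
rotated = hadamard(quantize_dequantize(hadamard(vec)))  # rotate, quantize, rotate back

print(f"RMS error without rotation: {rms_error(vec, plain):.3f}")
print(f"RMS error with rotation:    {rms_error(vec, rotated):.3f}")
```

Without rotation, the outlier sets the quantization step and every other channel rounds to zero; the rotation spreads the outlier's energy across all channels so the step shrinks and the small values survive.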
| Metric | RotorQuant | TurboQuant |
|---|---|---|
| Perplexity (Llama 3.1 8B) | 6.91 | 7.07 |
| Decode speed | 119 t/s | 93 t/s |
| Prefill speed | 3,822 t/s | 722 t/s |
| Parameters | 128 | 16,384 |
RotorQuant decodes 28% faster (119 vs 93 t/s) and prefills 5.3x faster (3,822 vs 722 t/s), with lower perplexity (6.91 vs 7.07). Both techniques help most in long-context scenarios, where a smaller KV cache frees VRAM; neither directly increases single-token speed at short contexts.
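To see how much VRAM is at stake at long context, the KV-cache size can be estimated from the attention shape. The sketch below uses Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128); the bit widths are illustrative and ignore the per-block scales and metadata that a real compressed cache also stores.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits_per_value: float) -> float:
    """K and V each hold n_layers * n_kv_heads * head_dim values per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bits_per_value / 8


# Llama 3.1 8B attention shape (grouped-query attention with 8 KV heads).
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for bits in (16, 8, 4, 3):
    size = kv_cache_bytes(context_len=32768, bits_per_value=bits, **LLAMA_31_8B)
    print(f"{bits:>2}-bit KV cache at 32K context: {size / 2**30:.2f} GiB")
```

At 16 bits the cache alone takes about a third of a 12GB card at 32K context, which is why compression matters for long contexts even though it does not change the per-token weight reads.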
Practical Recipes
Recipe A: Fastest useful generation (~200-300 t/s)
# Qwen3.5-4B + Qwen3-0.6B speculative decoding
llama-speculative-simple \
-m qwen3.5-4b-q4_k_m.gguf \
-md qwen3-0.6b-q4_k_m.gguf \
-ngl 99 -ngld 99 -fa -c 4096 \
--draft-max 16 --draft-min 5 --draft-p-min 0.9
Recipe B: Maximum raw speed (~500-650 t/s, lower quality)
# Qwen3-0.6B Q4_K_M — smallest useful model
llama-cli -m qwen3-0.6b-q4_k_m.gguf -ngl 99 -fa -c 2048
Recipe C: Voice pipeline (Parakeet + small LLM)
# Parakeet (2GB) + Qwen3.5-4B (2.2GB) = 4.2GB, leaves 7.8GB free
# See multimodal.html for Parakeet setup
Realistic Speed Ceiling Summary
| Setup | t/s | Quality | Status |
|---|---|---|---|
| Qwen3-0.6B Q4 (raw) | 500-650 | Low | Available now |
| Qwen3.5-4B + spec decode | 200-300 | Good | Available now |
| Qwen3.5-4B + dFlash | 400-500? | Good | Waiting for llama.cpp |
| True 1000 t/s (useful quality) | 1000 | Medium+ | Not achievable today |
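The last row follows from inverting the bandwidth formula. The sketch below is a rough estimate under the same assumptions as the rest of this page (360 GB/s, ~65% efficiency); the bits-per-weight figure is inferred from the 0.35 GB listed for the 0.6B model and is an approximation.

```python
BANDWIDTH_GBPS = 360.0
EFFICIENCY = 0.65


def max_model_size_gb(target_tps: float,
                      bandwidth_gbps: float = BANDWIDTH_GBPS,
                      efficiency: float = EFFICIENCY) -> float:
    """Largest model (in GB) that can reach target_tps when bandwidth-bound."""
    return bandwidth_gbps * efficiency / target_tps


def approx_params_millions(size_gb: float, bits_per_weight: float) -> float:
    """Parameter count implied by a file size at a given average bit width."""
    return size_gb * 8.0 / bits_per_weight * 1000.0


def required_speedup(target_tps: float, baseline_tps: float) -> float:
    return target_tps / baseline_tps


# ~4.7 bits/weight, implied by 0.35 GB for 0.6B parameters at Q4_K_M.
BITS_PER_WEIGHT = 0.35 * 8 / 0.6

size = max_model_size_gb(1000.0)
print(f"1000 t/s needs a model of at most ~{size:.2f} GB, "
      f"roughly {approx_params_millions(size, BITS_PER_WEIGHT):.0f}M params at Q4")
print(f"A 4B target at ~105 t/s would need a "
      f"{required_speedup(1000.0, 105.0):.1f}x speedup")
```

Both routes point the same way: either the model shrinks to a few hundred million parameters, or speculative decoding has to deliver far more than the 2-4x it typically achieves.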