1000 t/s — The Speed Tier

Pushing the limits of generation speed on RTX 3060 12GB. Target: ~300M parameter class.

1000 t/s is not achievable today on RTX 3060 12GB

The memory bandwidth ceiling (360 GB/s) limits single-stream autoregressive generation to ~650 t/s for a 0.5B-class model. For useful-quality models, speculative decoding realistically reaches ~200-300 t/s today; ~400-500 t/s would require dFlash-style drafting, which is not yet available in llama.cpp.

The Physics

For autoregressive decoding (batch=1), every generated token requires reading all model weights from VRAM once, so throughput is bounded by bandwidth / model_size_bytes:

Model	Quant	Size on Disk	Theoretical Max	Realistic (~65%)
0.5B (Qwen3-0.6B)	Q4_K_M	~0.35 GB	~1000 t/s	~650 t/s
1.5B (Qwen3.5-1.5B)	Q4_K_M	~0.9 GB	~400 t/s	~260 t/s
3-4B (Qwen3.5-4B)	Q4_K_M	~2.2 GB	~163 t/s	~105 t/s
7-8B	Q4_K_M	~4.5 GB	~80 t/s	~52 t/s
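The table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the 360 GB/s bandwidth and the ~65% efficiency factor quoted above; the function names are illustrative, and the model sizes are the approximate on-disk sizes from the table.

```python
# Bandwidth-bound decode estimate: each generated token reads all weights once.
BANDWIDTH_GB_S = 360.0  # RTX 3060 12GB memory bandwidth, as quoted in the text
EFFICIENCY = 0.65       # "realistic" fraction used in the table (an estimate)


def max_tokens_per_s(model_size_gb, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Upper bound on batch=1 decode speed for a model of the given size."""
    return bandwidth_gb_s / model_size_gb


def realistic_tokens_per_s(model_size_gb, efficiency=EFFICIENCY):
    """Upper bound scaled by the assumed achievable fraction of bandwidth."""
    return efficiency * max_tokens_per_s(model_size_gb)


# Approximate Q4_K_M sizes in GB, taken from the table above.
MODELS = {"0.5B": 0.35, "1.5B": 0.9, "3-4B": 2.2, "7-8B": 4.5}

for name, size_gb in MODELS.items():
    print(f"{name:>5}: max {max_tokens_per_s(size_gb):6.0f} t/s, "
          f"realistic {realistic_tokens_per_s(size_gb):6.0f} t/s")
```

The estimate ignores KV-cache reads and kernel launch overhead, which is part of why measured speeds fall below the theoretical maximum.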

Candidate Small Models

Sub-1B: Maximum Speed (~500-650 t/s)

Qwen3-0.6B — 0.6B params, decent for its size. Expect ~500-650 t/s in Q4 on 3060. Limited capability but useful as a speculative decoding draft model.

1-2B: Best Speed/Quality Tradeoff (~200-260 t/s)

Qwen3.5-1.5B — Hybrid architecture (Gated DeltaNet + sparse MoE), 262K native context extensible to 1M. Competitive with much larger models.

3-4B: Best Quality That Fits Fast (~80-150 t/s)

Model	Params	Highlights
Qwen3.5-4B	4B (MoE hybrid)	MMLU-Pro 79.1, IFEval 89.8, 262K context, native MTP for speculative decoding, 201 languages, built-in vision
Phi-4-mini	3.8B (dense)	GSM8K 88.6, MATH 64.0, 128K context. Good reasoning for size.
Gemma 3n E2B	2B effective (6B raw)	MatFormer selective activation, multimodal (text+image+audio), 32K context. ~4-8GB VRAM.

Speculative Decoding

The most practical path to high speed: a small draft model proposes N tokens, and the large target model verifies them in one forward pass. When the acceptance rate is high, this yields a 2-4x speedup.
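The propose-and-verify loop can be illustrated with a toy greedy version. This is a sketch, not how llama.cpp implements it: `target_next` and `draft_next` are hypothetical stand-ins for real models, tokens are plain integers, and the per-position check that a real engine performs in one batched forward pass is written as a loop for readability.

```python
def speculative_decode(target_next, draft_next, prompt, n_draft, max_new):
    """Greedy speculative decoding over lists of integer tokens.

    Returns the generated sequence and the number of target verify passes.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The draft model proposes n_draft tokens autoregressively.
        proposal = []
        for _ in range(n_draft):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target verifies all positions (counted as one pass).
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # target's token replaces the miss
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], target_passes


def target_next(tokens):
    """Toy target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft model: agrees with the target except after token 7."""
    return 0 if tokens[-1] == 7 else (tokens[-1] + 1) % 10


output, passes = speculative_decode(target_next, draft_next, [0], n_draft=4, max_new=12)
print(output, "target passes:", passes)
```

With greedy sampling the output is identical to what the target alone would produce; the gain is that the target runs fewer passes than the number of tokens generated.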

Recommended Pairings

Target	Draft	Combined VRAM	Expected t/s
Qwen3.5-4B Q4	Qwen3-0.6B Q4	~2.5 GB	150-250
Qwen3.5-4B Q4	Qwen3.5-1.5B Q4	~3.1 GB	120-170
Qwen3-8B Q4	Qwen3-0.6B Q4	~5 GB	80-130
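A rough way to sanity-check these ranges is the standard expected-tokens formula for speculative decoding. The sketch below assumes each draft token is accepted independently with probability `accept_rate`, and that verifying a batch of draft tokens costs about one ordinary target decode step; both are simplifying assumptions, and the acceptance rates are hypothetical inputs rather than measurements. The 105 t/s and 650 t/s figures are the realistic estimates from the physics table.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verify step under independent acceptance."""
    if accept_rate >= 1.0:
        return n_draft + 1
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


def speculative_tokens_per_s(target_tps, draft_tps, accept_rate, n_draft):
    """Estimated throughput: tokens per step divided by time per step."""
    # One step = n_draft sequential draft passes + one target verify pass.
    step_time = n_draft / draft_tps + 1 / target_tps
    return expected_tokens_per_step(accept_rate, n_draft) / step_time


TARGET_TPS = 105.0  # 4B-class target, realistic estimate from the table
DRAFT_TPS = 650.0   # 0.5B-class draft, realistic estimate from the table

for accept_rate in (0.5, 0.7, 0.8, 0.9):  # hypothetical acceptance rates
    tps = speculative_tokens_per_s(TARGET_TPS, DRAFT_TPS, accept_rate, n_draft=5)
    print(f"accept={accept_rate:.1f}: ~{tps:.0f} t/s")
```

The same model shows the downside: when the acceptance rate is low, the time spent drafting is wasted and throughput drops below the target model running alone.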

# Speculative decoding: 4B target + 0.6B draft
llama-speculative-simple \
  -m qwen3.5-4b-q4_k_m.gguf \
  -md qwen3-0.6b-q4_k_m.gguf \
  -ngl 99 -ngld 99 -fa -c 4096 \
  --draft-max 16 --draft-min 5 --draft-p-min 0.9 \
  --sampling-seq k --top-k 1 --temp 0.0

Qwen3.5 Native MTP (Multi-Token Prediction)

Qwen3.5 models have built-in speculative decoding via MTP heads — no separate draft model needed. Simpler setup, ~1.5-2x speedup:

# Via vLLM (if it works on 3060)
vllm serve Qwen/Qwen3.5-4B \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

dFlash: Block Diffusion

The most promising path to breaking the bandwidth ceiling. dFlash replaces autoregressive draft generation with block diffusion — generating all draft tokens in parallel in a single forward pass.
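To see why parallel drafting matters, compare the time per speculative step when the draft costs N sequential passes versus a single pass. This is a back-of-the-envelope sketch, not a dFlash measurement: it assumes a block drafter pass costs the same as one pass of a 0.5B-class model (650 t/s) and a 4B-class target at 105 t/s, and it does not model acceptance rates.

```python
def step_time_s(target_tps, draft_tps, n_draft, parallel_draft):
    """Time for one propose-and-verify step, in seconds."""
    # Autoregressive drafting needs one pass per token; block drafting needs one.
    draft_passes = 1 if parallel_draft else n_draft
    return draft_passes / draft_tps + 1 / target_tps


TARGET_TPS = 105.0  # assumed target speed (4B-class estimate)
DRAFT_TPS = 650.0   # assumed cost of one drafter pass (0.5B-class estimate)

for n_draft in (4, 8, 16):
    sequential = step_time_s(TARGET_TPS, DRAFT_TPS, n_draft, parallel_draft=False)
    parallel = step_time_s(TARGET_TPS, DRAFT_TPS, n_draft, parallel_draft=True)
    print(f"N={n_draft:2d}: sequential {sequential * 1000:5.1f} ms, "
          f"parallel {parallel * 1000:5.1f} ms, ratio {sequential / parallel:.2f}x")
```

Under these assumptions the advantage grows with the draft length, because the sequential draft cost scales with N while the parallel draft cost stays fixed.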

6.1x speedup on Qwen3-8B (vs autoregressive baseline)
2.5x speedup over EAGLE-3 (previous SOTA speculative method)
5.1x average speedup on the SGLang framework

Not practical on RTX 3060 today

Tested only on H200 and B200 GPUs
Requires SGLang (datacenter-oriented, no CPU offloading)
Not available in llama.cpp
dFlash drafter adds VRAM overhead, tight on 12GB

Watch for llama.cpp integration. If dFlash lands there, a 4B target could theoretically reach 400-500 t/s on 3060.

The Reddit discussion confirms community interest, but there are no consumer GPU benchmarks yet.

RotorQuant & TurboQuant

KV-cache compression techniques (not weight quantization). They apply rotations before quantizing the KV cache, allowing higher compression with less quality loss.
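The general rotate-then-quantize idea can be shown with a toy example. This is not the RotorQuant or TurboQuant algorithm: it uses a normalized Hadamard transform as the rotation and a single-scale 4-bit quantizer, both chosen here purely for illustration, on an artificial vector with one large outlier.

```python
import math
import random


def hadamard_rotate(v):
    """Normalized fast Walsh-Hadamard transform (orthogonal, its own inverse)."""
    v = list(v)
    n = len(v)  # must be a power of two
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return [x / math.sqrt(n) for x in v]


def quantize_4bit(v):
    """Symmetric 4-bit quantization with one scale per vector, then dequantize."""
    scale = max(abs(x) for x in v) / 7
    return [round(x / scale) * scale for x in v]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
x = [rng.uniform(-0.5, 0.5) for _ in range(64)]
x[0] = 10.0  # artificial outlier that dominates the quantization scale

# Quantize directly: the outlier forces a coarse scale on every other value.
plain_err = squared_error(x, quantize_4bit(x))

# Rotate, quantize, rotate back: the outlier's energy is spread across all slots.
restored = hadamard_rotate(quantize_4bit(hadamard_rotate(x)))
rotated_err = squared_error(x, restored)

print(f"error without rotation: {plain_err:.3f}")
print(f"error with rotation:    {rotated_err:.3f}")
```

Because the rotation is orthogonal, it can be undone exactly after dequantization, so the only loss comes from the quantizer, and that loss is smaller once the outlier no longer sets the scale on its own.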

Metric	RotorQuant	TurboQuant
Perplexity (Llama 3.1 8B)	6.91	7.07
Decode speed	119 t/s	93 t/s
Prefill speed	3,822 t/s	722 t/s
Parameters	128	16,384

RotorQuant decodes 28% faster and prefills 5.3x faster than TurboQuant, with lower perplexity. Both are useful for long-context scenarios because they free VRAM, but they do not directly increase single-token speed at short contexts.
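How much VRAM KV-cache compression frees depends on context length. The sketch below uses the standard KV-cache size formula with the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); the 4-bit setting is an assumed example rather than the bit-width of either method, and per-block scale metadata is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """KV-cache size in bytes; the factor 2 covers keys and values."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bits_per_value // 8


GIB = 2 ** 30
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for context_len in (4096, 32768, 131072):
    fp16 = kv_cache_bytes(context_len=context_len, bits_per_value=16, **LLAMA_31_8B)
    q4 = kv_cache_bytes(context_len=context_len, bits_per_value=4, **LLAMA_31_8B)
    print(f"ctx {context_len:6d}: fp16 {fp16 / GIB:5.2f} GiB, 4-bit {q4 / GIB:5.2f} GiB")
```

At short contexts the cache is small next to the weights, which is why compression there saves little; at long contexts the fp16 cache alone can exceed what a 12GB card has left after loading the model.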

Practical Recipes

Recipe A: Fastest useful generation (~200-300 t/s)

# Qwen3.5-4B + Qwen3-0.6B speculative decoding
llama-speculative-simple \
  -m qwen3.5-4b-q4_k_m.gguf \
  -md qwen3-0.6b-q4_k_m.gguf \
  -ngl 99 -ngld 99 -fa -c 4096 \
  --draft-max 16 --draft-min 5 --draft-p-min 0.9

Recipe B: Maximum raw speed (~500-650 t/s, lower quality)

# Qwen3-0.6B Q4_K_M — smallest useful model
llama-cli -m qwen3-0.6b-q4_k_m.gguf -ngl 99 -fa -c 2048

Recipe C: Voice pipeline (Parakeet + small LLM)

# Parakeet (2GB) + Qwen3.5-4B (2.2GB) = 4.2GB, leaves 7.8GB free
# See multimodal.html for Parakeet setup

Realistic Speed Ceiling Summary

Setup	t/s	Quality	Status
Qwen3-0.6B Q4 (raw)	500-650	Low	Available now
Qwen3.5-4B + spec decode	200-300	Good	Available now
Qwen3.5-4B + dFlash	400-500?	Good	Waiting for llama.cpp
True 1000 t/s (useful quality)	1000	Medium+	Not achievable today
