1000 t/s — The Speed Tier

Pushing the limits of generation speed on RTX 3060 12GB. Target: ~300M parameter class.

1000 t/s is not achievable today on RTX 3060 12GB

The memory bandwidth ceiling (360 GB/s) limits single-stream autoregressive generation to ~650 t/s for a 0.5B model. For useful-quality models, speculative decoding realistically reaches ~200-300 t/s today; ~400-500 t/s would require a technique such as dFlash to land in llama.cpp.

The Physics

For autoregressive decoding (batch=1), every generated token requires reading all model weights from VRAM once, so throughput (tokens/s) ≈ bandwidth / model_size_bytes:

| Model | Quant | Size on Disk | Theoretical Max | Realistic (~65%) |
| --- | --- | --- | --- | --- |
| 0.5B (Qwen3-0.6B) | Q4_K_M | ~0.35 GB | ~1000 t/s | ~650 t/s |
| 1.5B (Qwen3.5-1.5B) | Q4_K_M | ~0.9 GB | ~400 t/s | ~260 t/s |
| 3-4B (Qwen3.5-4B) | Q4_K_M | ~2.2 GB | ~163 t/s | ~105 t/s |
| 7-8B | Q4_K_M | ~4.5 GB | ~80 t/s | ~52 t/s |
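
The figures in the table follow directly from the bandwidth formula. A minimal Python sketch, assuming the 360 GB/s bandwidth and the ~65% efficiency factor quoted in this section (the function name `decode_tps` is illustrative, not from any library):

```python
# Bandwidth-bound decode estimate: at batch=1, each generated token
# requires reading every model weight from VRAM once.
BANDWIDTH_GB_S = 360.0  # RTX 3060 12GB memory bandwidth, as quoted above
EFFICIENCY = 0.65       # "realistic" fraction used in the table above


def decode_tps(model_size_gb: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Theoretical tokens/second for a model of the given size in GB."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Sizes are the approximate Q4_K_M sizes listed in the table.
    for name, size_gb in [("0.6B", 0.35), ("1.5B", 0.9), ("4B", 2.2), ("8B", 4.5)]:
        peak = decode_tps(size_gb)
        print(f"{name}: theoretical {peak:.0f} t/s, realistic {peak * EFFICIENCY:.0f} t/s")
```

The estimate ignores KV-cache reads and kernel launch overhead, which is part of what the 65% factor is standing in for.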

Candidate Small Models

Sub-1B: Maximum Speed (~500-650 t/s)

Qwen3-0.6B — 0.6B params, decent for its size. Expect ~500-650 t/s in Q4 on 3060. Limited capability but useful as a speculative decoding draft model.

1-2B: Best Speed/Quality Tradeoff (~200-260 t/s)

Qwen3.5-1.5B — Hybrid architecture (Gated DeltaNet + sparse MoE), 262K native context extensible to 1M. Competitive with much larger models.

3-4B: Best Quality That Fits Fast (~80-150 t/s)

| Model | Params | Highlights |
| --- | --- | --- |
| Qwen3.5-4B | 4B (MoE hybrid) | MMLU-Pro 79.1, IFEval 89.8, 262K context, native MTP for speculative decoding, 201 languages, built-in vision |
| Phi-4-mini | 3.8B (dense) | GSM8K 88.6, MATH 64.0, 128K context. Good reasoning for size. |
| Gemma 3n E2B | 2B effective (6B raw) | MatFormer selective activation, multimodal (text+image+audio), 32K context. ~4-8GB VRAM. |

Speculative Decoding

The most practical path to high speed: a small draft model proposes N tokens, and the large target model verifies them in a single forward pass. When the acceptance rate is high, this yields a 2-4x speedup.
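
To make the propose-and-verify loop concrete, here is a toy sketch of one greedy speculative step. It is an illustration under simplifying assumptions, not how llama.cpp implements it: the "models" are plain Python functions (`toy_draft`, `toy_target`, both hypothetical), and the target is called once per position for clarity, whereas a real engine scores all draft positions in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function (stand-in for a model)


def speculative_step(context: List[int], draft_next: NextToken,
                     target_next: NextToken, n_draft: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify cycle (greedy decoding)."""
    # 1. The draft model proposes n_draft tokens autoregressively.
    proposed: List[int] = []
    for _ in range(n_draft):
        proposed.append(draft_next(context + proposed))

    # 2. The target checks each proposed token in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched: the target pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy draft: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, n_draft=4))
```

With greedy sampling the emitted tokens are always the ones the target alone would have produced; the draft only changes how many arrive per target pass.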

Recommended Pairings

| Target | Draft | Combined VRAM | Expected t/s |
| --- | --- | --- | --- |
| Qwen3.5-4B Q4 | Qwen3-0.6B Q4 | ~2.5 GB | 150-250 |
| Qwen3.5-4B Q4 | Qwen3.5-1.5B Q4 | ~3.1 GB | 120-170 |
| Qwen3-8B Q4 | Qwen3-0.6B Q4 | ~5 GB | 80-130 |

# Speculative decoding: 4B target + 0.6B draft
llama-speculative-simple \
  -m qwen3.5-4b-q4_k_m.gguf \
  -md qwen3-0.6b-q4_k_m.gguf \
  -ngl 99 -ngld 99 -fa -c 4096 \
  --draft-max 16 --draft-min 5 --draft-p-min 0.9 \
  --sampling-seq k --top-k 1 --temp 0.0
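
How much a pairing gains depends on the draft length and on how often the target accepts the draft's tokens. The sketch below uses the standard estimate from the speculative decoding literature, which assumes each draft token is accepted independently with the same probability. The acceptance rate of 0.8 is an illustrative assumption, not a measurement, and the draft cost ratio is approximated from the model sizes quoted in this section (bandwidth-bound cost scales with size). Function names are hypothetical.

```python
def expected_tokens_per_cycle(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per target pass, assuming i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return float(n_draft + 1)  # every draft token plus the target's bonus token
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def speculative_speedup(accept_rate: float, n_draft: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding.

    draft_cost_ratio is the cost of one draft pass relative to one target pass.
    Verifying n_draft tokens is assumed to cost one target pass (bandwidth-bound).
    """
    cycle_cost = n_draft * draft_cost_ratio + 1.0
    return expected_tokens_per_cycle(accept_rate, n_draft) / cycle_cost


if __name__ == "__main__":
    ratio = 0.35 / 2.2  # ~0.6B draft vs ~4B target, Q4_K_M sizes from this section
    for n_draft in (2, 5, 8, 16):
        s = speculative_speedup(accept_rate=0.8, n_draft=n_draft, draft_cost_ratio=ratio)
        print(f"n_draft={n_draft:2d}: ~{s:.2f}x")
```

Under these assumptions, longer drafts stop paying off once the added draft passes cost more than the extra accepted tokens are worth. Note that llama.cpp drafts adaptively (for example via `--draft-p-min`), so a fixed draft length is a simplification.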

Qwen3.5 Native MTP (Multi-Token Prediction)

Qwen3.5 models have built-in speculative decoding via MTP heads, so no separate draft model is needed. Setup is simpler, with a ~1.5-2x speedup:

# Via vLLM (if it works on 3060)
vllm serve Qwen/Qwen3.5-4B \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

dFlash: Block Diffusion

The most promising path to breaking the bandwidth ceiling. dFlash replaces autoregressive draft generation with block diffusion — generating all draft tokens in parallel in a single forward pass.

6.1x speedup on Qwen3-8B (vs autoregressive baseline)
2.5x over EAGLE-3 (previous SOTA speculative)
5.1x average on SGLang framework

Not practical on RTX 3060 today

Watch for llama.cpp integration. If dFlash lands there, a 4B target could theoretically reach 400-500 t/s on 3060.
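
As a sanity check on that range, the short sketch below computes the speedup a 4B target would need over its realistic baseline. The ~105 t/s baseline is the figure from the table in "The Physics"; whether dFlash's reported speedups carry over to an RTX 3060 is an open assumption, since those numbers were not measured on consumer GPUs.

```python
BASELINE_4B_TPS = 105.0  # realistic Q4_K_M decode speed for a 4B model, from "The Physics"


def required_speedup(target_tps: float, baseline_tps: float = BASELINE_4B_TPS) -> float:
    """Speedup factor needed to reach target_tps from the baseline."""
    return target_tps / baseline_tps


if __name__ == "__main__":
    for goal in (400.0, 500.0):
        print(f"{goal:.0f} t/s needs ~{required_speedup(goal):.1f}x over {BASELINE_4B_TPS:.0f} t/s")
```

Both required factors sit below the 5.1x average quoted above, which is why the 400-500 t/s projection is plausible in principle but unverified on this hardware.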

The Reddit discussion confirms community interest, but there are no consumer GPU benchmarks yet.

RotorQuant & TurboQuant

KV-cache compression techniques (not weight quantization). They apply rotations before quantizing the KV cache, allowing higher compression with less quality loss.

| Metric | RotorQuant | TurboQuant |
| --- | --- | --- |
| Perplexity (Llama 3.1 8B) | 6.91 | 7.07 |
| Decode speed | 119 t/s | 93 t/s |
| Prefill speed | 3,822 t/s | 722 t/s |
| Parameters | 128 | 16,384 |

RotorQuant decodes 28% faster and prefills 5.3x faster than TurboQuant, with lower perplexity. Both are useful for long-context scenarios because they free VRAM, but neither directly increases single-token speed for short contexts.
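
To see why KV-cache compression matters mainly at long context, the sketch below estimates cache size for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128). The 4-bit setting is an illustrative assumption, not the actual bit width used by RotorQuant or TurboQuant, and the estimate ignores per-block scale and metadata overhead.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bits_per_value: float = 16.0) -> float:
    """Approximate KV-cache size in bytes (defaults: Llama 3.1 8B shape, FP16)."""
    # Factor of 2: one key vector and one value vector per layer, head, and token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return n_values * bits_per_value / 8.0


if __name__ == "__main__":
    gib = 2 ** 30
    for ctx in (4096, 32768):
        fp16 = kv_cache_bytes(ctx) / gib
        q4 = kv_cache_bytes(ctx, bits_per_value=4.0) / gib  # hypothetical 4-bit cache
        print(f"context {ctx}: FP16 {fp16:.2f} GiB, 4-bit {q4:.2f} GiB")
```

At a few thousand tokens the cache is small next to the weights, so compression changes little; at tens of thousands of tokens it becomes a large share of a 12GB card.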

Practical Recipes

Recipe A: Fastest useful generation (~200-300 t/s)

# Qwen3.5-4B + Qwen3-0.6B speculative decoding
llama-speculative-simple \
  -m qwen3.5-4b-q4_k_m.gguf \
  -md qwen3-0.6b-q4_k_m.gguf \
  -ngl 99 -ngld 99 -fa -c 4096 \
  --draft-max 16 --draft-min 5 --draft-p-min 0.9

Recipe B: Maximum raw speed (~500-650 t/s, lower quality)

# Qwen3-0.6B Q4_K_M — smallest useful model
llama-cli -m qwen3-0.6b-q4_k_m.gguf -ngl 99 -fa -c 2048

Recipe C: Voice pipeline (Parakeet + small LLM)

# Parakeet (2GB) + Qwen3.5-4B (2.2GB) = 4.2GB, leaves 7.8GB free
# See multimodal.html for Parakeet setup
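
The budget in Recipe C can be checked with a few lines of Python. This is a rough sketch using only the weight sizes quoted above; it does not account for KV cache, activations, or runtime overhead, so the real free figure will be lower.

```python
TOTAL_VRAM_GB = 12.0  # RTX 3060 12GB


def vram_free_gb(components_gb: dict, total_gb: float = TOTAL_VRAM_GB) -> float:
    """VRAM left after loading the listed components (weights only)."""
    used = sum(components_gb.values())
    if used > total_gb:
        raise ValueError(f"components need {used:.1f} GB, only {total_gb:.1f} GB available")
    return total_gb - used


if __name__ == "__main__":
    pipeline = {"Parakeet": 2.0, "Qwen3.5-4B Q4_K_M": 2.2}  # sizes from Recipe C
    print(f"free: {vram_free_gb(pipeline):.1f} GB")
```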

Realistic Speed Ceiling Summary

| Setup | t/s | Quality | Status |
| --- | --- | --- | --- |
| Qwen3-0.6B Q4 (raw) | 500-650 | Low | Available now |
| Qwen3.5-4B + spec decode | 200-300 | Good | Available now |
| Qwen3.5-4B + dFlash | 400-500? | Good | Waiting for llama.cpp |
| True 1000 t/s (useful quality) | 1000 | Medium+ | Not achievable today |

Sources