Parallel Inference & API Fallback

Real parallel benchmarks, context scaling tables, and API as the true "speed tier"

Key finding: speed without quality is useless

0.8B achieves 343 t/s aggregate with --parallel 5, but scores only 5/25 on the polecat quality suite. It fabricates bugs, uses wrong frameworks, and is not viable for agentic coding. API fallback (Haiku at 22/25, Groq at 500+ t/s) replaces the local speed tier.

Real Benchmarks (measured on this hardware)

Single-stream generation (measured, @34K context)

| Model | Size | TG t/s | PP t/s | Quality | Verdict |
| --- | --- | --- | --- | --- | --- |
| 0.8B Q8_0 | 764MB | 148 | 6,913 | 5/25 | Not viable for agentic work |
| 2B Q8_0 | 1.86GB | 96 | 4,972 | 9/25 | Not viable |
| 4B Q8_0 | 4.48GB | 43 | 2,060 | 10/25 | Same speed as 9B, worse quality |
| 9B IQ4_NL | 5.37GB | 41 | 1,518 | 15/25 | Interactive tier winner |
| 35B-A3B IQ4_NL | 17.8GB | 21 (regex) | 420 | 17/25 | Workhorse tier winner |
| Haiku 4.5 API | N/A | ~100 | N/A | 22/25 | Quality baseline ($0.80/M) |

Parallel scaling: 0.8B Q4_K_M (real concurrent requests, 128 tokens each)

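| N | Aggregate t/s | Per-slot t/s | Wall time | Note |
| --- | --- | --- | --- | --- |
| 1 | 160 | 160 | 798ms | |
| 2 | 252 | 126 | 1017ms | |
| 3 | 281 | 94 | 1367ms | |
| 4 | 310 | 78 | 1653ms | |
| 5 | 343 | 69 | 1865ms | Sweet spot |
| 6 | 357 | 60 | 2154ms | |
| 7 | 351 | 50 | 2550ms | Aggregate drops (CPU contention) |
| 8 | 369 | 46 | 2775ms | |

CPU is the bottleneck, not GPU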

GPU utilization sits at ~20% during parallel generation because token sampling runs on the CPU, one slot at a time. The 2014 Xeon E5-1650v3 creates a hard ceiling at ~350-370 t/s aggregate regardless of parallel count. N=5 is the sweet spot: it is the last step with a sizable gain (310→343 from N=4); beyond it, aggregate throughput plateaus while per-slot speed keeps falling.
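
The aggregate and per-slot figures follow directly from the wall times. The sketch below recomputes them, assuming every slot generates exactly 128 tokens (as the table heading states); the function and constant names are illustrative only. Per-slot values can differ from the table by 1 t/s because of rounding.

```python
# Recompute aggregate and per-slot throughput from the measured wall times.
# Assumption: every slot generates exactly TOKENS_PER_REQUEST tokens.
TOKENS_PER_REQUEST = 128

# (parallel slots N, wall time in ms), copied from the parallel scaling table
WALL_TIMES_MS = [(1, 798), (2, 1017), (3, 1367), (4, 1653),
                 (5, 1865), (6, 2154), (7, 2550), (8, 2775)]


def throughput(n_slots, wall_ms, tokens=TOKENS_PER_REQUEST):
    """Return (aggregate t/s, per-slot t/s) for one measurement."""
    aggregate = n_slots * tokens / (wall_ms / 1000.0)
    return aggregate, aggregate / n_slots


rows = [(n, *throughput(n, ms)) for n, ms in WALL_TIMES_MS]

if __name__ == "__main__":
    prev = None
    for n, agg, per_slot in rows:
        # Show how much aggregate throughput each extra slot adds
        gain = "" if prev is None else f"  ({agg - prev:+.0f} vs N-1)"
        print(f"N={n}: {agg:5.0f} t/s aggregate, {per_slot:4.0f} t/s per slot{gain}")
        prev = agg
```

Printing the per-step gain makes the plateau visible: the increments shrink after N=5 and turn negative at N=7.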

Context Scaling (real measured data)

All models use Q8 KV cache + 262K max context + activation rotation. Clean restart per test.

9B IQ4_NL (Interactive tier)

| Context | PP t/s | TG t/s | VRAM |
| --- | --- | --- | --- |
| 34K | 1,518 | 41 | 10,807MB |
| 93K | 1,178 | 30 | 11,073MB |
| 200K | 872 | 21 | 11,454MB |
| 241K | 778 | 19 | |

35B-A3B IQ4_NL regex (Workhorse fast mode)

| Context | PP t/s | TG t/s | VRAM |
| --- | --- | --- | --- |
| 34K | 420 | 21 | 11,589MB |
| 93K | 382 | 16 | 11,750MB |
| 200K | ~254 | ~14 | 11,896MB |
| >188K | OOM; switch to -cmoe mode | | |

35B-A3B IQ4_NL -cmoe (Workhorse full context mode)

| Context | PP t/s | TG t/s | VRAM |
| --- | --- | --- | --- |
| 241K | 325 | 12 | 6.7GB (fits easily) |

Cache hits are 15x faster

Prompt cache is on by default. Same-prefix follow-up queries skip prompt eval:

For 35B on video-platform (140K tokens): first query ~15 min, follow-ups ~1-2s each.
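
A minimal sketch of how same-prefix follow-ups might be sent to llama-server's /completion endpoint. The URL, port, helper names, and default values are assumptions for illustration. The point is only that the large context must be byte-identical at the start of every prompt so the cached prefix can be reused.

```python
import json
import urllib.request

# Assumption: llama-server is listening locally on port 8080
SERVER_URL = "http://localhost:8080/completion"


def build_payload(shared_prefix, question, n_predict=256):
    """Build a /completion request whose prompt starts with the shared prefix.

    Keeping the prefix identical across requests is what allows the server
    to skip prompt evaluation for that part on follow-up queries.
    """
    return {
        "prompt": shared_prefix + "\n\n" + question,
        "n_predict": n_predict,
        "cache_prompt": True,  # set explicitly; described above as the default
    }


def ask(payload, url=SERVER_URL, timeout=1800):
    """POST one request and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In use, the first `ask(build_payload(context, "..."))` call pays the full prompt-processing cost, and later calls with the same `context` and a different question should only process the new suffix. The long timeout is there because the first query on a 140K-token prefix can take many minutes.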

Prompt Processing Time (real)

| Context Size | 35B IQ4_NL (regex) | 9B IQ4_NL |
| --- | --- | --- |
| 60K tokens | ~6 min | ~40s |
| 140K tokens | ~6 min | ~2 min |
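
As a rough sanity check, prompt processing time can be estimated as prompt tokens divided by the PP rate. The sketch below brackets the estimate for the 9B model between the PP rates measured at a shorter and a longer context. It ignores model load and any other overhead, so treat it as an approximation; the helper names are illustrative.

```python
def prefill_seconds(prompt_tokens, pp_tokens_per_s):
    """Estimated time to process a prompt at a constant PP rate."""
    return prompt_tokens / pp_tokens_per_s


def prefill_range(prompt_tokens, fast_rate, slow_rate):
    """Bracket the estimate between a short-context and a long-context PP rate."""
    return (prefill_seconds(prompt_tokens, fast_rate),
            prefill_seconds(prompt_tokens, slow_rate))


# PP rates (t/s) taken from the 9B IQ4_NL context scaling table
PP_9B = {"34K": 1518, "93K": 1178, "200K": 872}

if __name__ == "__main__":
    for tokens, fast, slow in [(60_000, "34K", "93K"), (140_000, "93K", "200K")]:
        low, high = prefill_range(tokens, PP_9B[fast], PP_9B[slow])
        print(f"9B, {tokens:,} tokens: roughly {low:.0f}-{high:.0f} s")
```

For the 9B model this gives a bracket that contains the ~40s and ~2 min figures in the table.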

VRAM Budget: Qwen3.5-4B Q4_K_M

Qwen3.5-4B's hybrid architecture means KV cache per token is only ~32KB (vs ~144KB for a standard 4B transformer):

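| Component | VRAM |
| --- | --- |
| CUDA overhead | ~0.75 GB |
| Model weights (Q4_K_M) | ~2.50 GB |
| Available for KV cache | ~8.75 GB |

| Parallel N | Context/Slot | Total KV (FP16) | Total KV (Q8_0) | Fits? |
| --- | --- | --- | --- | --- |
| 4 | 8K | 1.0 GB | 0.5 GB | Plenty of room |
| 8 | 8K | 2.0 GB | 1.0 GB | Yes |
| 4 | 32K | 4.0 GB | 2.0 GB | Yes |
| 8 | 32K | 8.0 GB | 4.0 GB | Tight (FP16) / OK (Q8_0) |
| 16 | 8K | 4.0 GB | 2.0 GB | Yes |
| 16 | 16K | 8.0 GB | 4.0 GB | Tight (FP16) / OK (Q8_0) |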

You could theoretically run 16 parallel slots at 16K context each on your 3060 with Qwen3.5-4B. The Gated DeltaNet layers use O(1) state, so only the 8 attention layers contribute to KV cache.
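
The KV budget table can be reproduced from the ~32KB/token figure. This is a sketch under the same simplifications the table uses: Q8_0 is treated as exactly half of FP16 (in practice it is slightly more, since Q8_0 also stores a scale per block), and the ~8.75 GB budget is taken as given. The names are illustrative.

```python
KV_BYTES_PER_TOKEN_FP16 = 32 * 1024  # ~32KB per token, from the text above
Q8_0_RATIO = 0.5                     # simplification used by the table
AVAILABLE_KV_GB = 8.75               # VRAM left after CUDA overhead and weights


def kv_cache_gb(n_slots, ctx_per_slot, quantized=False):
    """Total KV cache in GB for n_slots slots of ctx_per_slot tokens each."""
    total_bytes = n_slots * ctx_per_slot * KV_BYTES_PER_TOKEN_FP16
    if quantized:
        total_bytes *= Q8_0_RATIO
    return total_bytes / 1024**3


def fits(n_slots, ctx_per_slot, quantized=False):
    """True if the KV cache stays within the available budget."""
    return kv_cache_gb(n_slots, ctx_per_slot, quantized) <= AVAILABLE_KV_GB


if __name__ == "__main__":
    for n, ctx in [(4, 8192), (8, 8192), (4, 32768), (8, 32768), (16, 8192), (16, 16384)]:
        fp16 = kv_cache_gb(n, ctx)
        q8 = kv_cache_gb(n, ctx, quantized=True)
        print(f"N={n:2d} ctx={ctx:5d}: FP16 {fp16:.1f} GB, Q8_0 {q8:.1f} GB, fits={fits(n, ctx)}")
```

Total KV depends only on the product of slot count and context per slot, which is why 8 slots at 32K and 16 slots at 16K land on the same 8.0 GB.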

Gemma 4 vs Qwen 3.5: Head-to-Head for Parallel Agentic Use

Small Models (sub-4B)

| | Qwen3.5-4B | Gemma 4 E4B (4.5B eff) | Gemma 4 E2B (2.3B eff) |
| --- | --- | --- | --- |
| Single-stream t/s (3060) | ~80-120 (est.) | 12-15 | ~20 |
| VRAM (Q4) | ~2.5 GB | ~2.5 GB | ~1.3 GB |
| MMLU-Pro | 79.1 | Lower | Lower |
| TAU2 (agent/tool-use) | +37.7 pts vs Gemma | Baseline | N/A |
| Audio input | No | Yes | Yes |
| Image input | Yes (when llama.cpp lands) | Yes (buggy) | Yes (buggy) |
| Thinking mode toggle | Yes (per-request) | No | No |
| Tool calling | Slightly more reliable JSON | Cleaner JSON output | N/A |
| Parallel slots fit | 16 @ 16K ctx | ~8-12 @ 8K ctx | ~16+ @ 8K ctx |

Gemma 4 E2B/E4B are surprisingly slow

Despite being smaller, Gemma 4 E2B gets only ~20 t/s on RTX 3060 (vs Qwen3.5-4B's estimated ~80-120 t/s). The Per-Layer Embeddings (PLE) architecture adds overhead. For pure speed, Qwen 3.5 wins decisively.

MoE Tier (26B class)

| | Qwen3.5-35B-A3B | Gemma 4 26B-A4B |
| --- | --- | --- |
| Active params | 3B | 3.8B |
| t/s (3060, --fit) | ~80-100 (UD-IQ2_M in VRAM) | ~48 (partial CPU offload) |
| Arena Elo | High | #6 at 1441 Elo |
| Vision | No | Yes (text+image) |
| Parallel viability | Possible at IQ2_M (11.4GB) | Not viable (needs --fit CPU offload) |

Candidates to Manually Benchmark Per Tier

| Tier | Qwen 3.5 Candidate | Gemma 4 Candidate | Benchmark Command |
| --- | --- | --- | --- |
| 10 t/s | 122B-A10B IQ3_XXS | 26B-A4B IQ3_S (with --fit) | llama-bench -m model.gguf -ngl 99 -fa 1 |
| 100 t/s | 9B Q5_K_M / 35B-A3B IQ2_M | 26B-A4B IQ2_M / E4B Q8 | llama-bench -m model.gguf -ngl 99 -fa 1 -d 8192 |
| 1000 t/s | 4B Q4_K_M (--parallel 4-8) | E2B Q8 (--parallel 4-8) | llama-batched-bench -m model.gguf -ngl 99 -fa 1 -npl 1,2,4,8 |

The -d 8192 flag measures speed at an 8K-token context depth (llama-bench does not accept -c).

Gemma 4 Known Bugs in llama.cpp

| Bug | Impact | Status | Tracking |
| --- | --- | --- | --- |
| <unused24>/<unused25> token spam | Garbage tokens in output | Fixed | #21321, PR #21326 + #21343 |
| Vision CUDA crash (SIGABRT) | mmproj crashes on CUDA for 31B/26B | Open | #21402 |
| Audio support missing | E2B/E4B audio not implemented | Open | #21325 |
| OOM with partial offload on 26B | Even on RTX 5080 16GB | Open | #21323 |
| Flash Attention hang (Apple Silicon) | Hybrid head dimensions break FA | Open | N/A (Apple-specific) |

Parallel Inference: Gotchas

  1. Quadratic attention overhead: Unified KV cache means attention is computed over the entire cache then masked per-slot — wasteful with many slots
  2. CPU sampling bottleneck: Sampling is per-slot on CPU. Your 2014 Xeon will struggle at high N — watch for GPU utilization dropping below 40%
  3. KV cache full = sequence killed: When cache fills, llama.cpp terminates the largest sequence. No graceful offloading.
  4. Nondeterministic outputs: cache_prompt can cause non-identical logits across different batch sizes
  5. One server, not four: Multiple llama-server instances each load model weights separately. On 12GB, only one server with --parallel N is viable.

Benchmark Commands

Run this to get your actual numbers

# Qwen3.5-4B: parallel scaling benchmark
llama-batched-bench -m Qwen3.5-4B-Q4_K_M.gguf -ngl 99 -fa 1 \
  -c 32768 -b 2048 -ub 512 \
  -npp 512,2048 -ntg 128,512 -npl 1,2,4,8

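# Gemma 4 E2B: parallel scaling benchmark
# -c must cover npl * (npp + ntg) = 8 * (2048 + 512) = 20480 tokens,
# otherwise the largest combinations are skipped
llama-batched-bench -m gemma-4-E2B-it-Q8_0.gguf -ngl 99 -fa 1 \
  -c 20480 -b 2048 -ub 512 \
  -npp 512,2048 -ntg 128,512 -npl 1,2,4,8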

# Single-stream baseline for each candidate
llama-bench -m Qwen3.5-4B-Q4_K_M.gguf -ngl 99 -fa 1
llama-bench -m gemma-4-E2B-it-Q8_0.gguf -ngl 99 -fa 1
llama-bench -m gemma-4-E4B-it-Q8_0.gguf -ngl 99 -fa 1

Parallel server for agentic use

# Qwen3.5-4B with 4 parallel agent slots
# (-fa takes on|off|auto in recent llama.cpp builds; older builds use a bare -fa)
llama-server -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M \
  -ngl 99 -fa on --jinja \
  --parallel 4 -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --port 8080

# Each slot gets 8K context (32768 / 4)
# Total KV cache: ~0.5GB (Q8_0), leaving ~8.25GB of the ~8.75GB KV budget free
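
To measure what the slots deliver under concurrent load, a small client can fire N requests at once and time them. The sketch below assumes the server is reachable on localhost:8080 and uses llama-server's /completion endpoint. The helper names, the prompt, and the fallback when the response carries no `timings.predicted_n` field are assumptions for illustration.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER_URL = "http://localhost:8080/completion"  # assumed local server address


def send_completion(prompt, n_predict=128, url=SERVER_URL):
    """POST one request to llama-server; return the number of generated tokens."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    # Assumption: use the server-reported token count when present,
    # otherwise fall back to the requested length.
    return data.get("timings", {}).get("predicted_n", n_predict)


def measure(n_parallel, send=send_completion, prompt="Write a haiku about GPUs."):
    """Fire n_parallel requests at once; return (aggregate t/s, per-slot t/s, wall s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        token_counts = list(pool.map(send, [prompt] * n_parallel))
    wall = time.perf_counter() - start
    aggregate = sum(token_counts) / wall
    return aggregate, aggregate / n_parallel, wall
```

Calling `measure(n)` for n from 1 up to the `--parallel` value yields the same three columns as the parallel scaling table. The `send` parameter is injectable so the timing logic can be exercised without a running server.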

API Fallback: Models Too Big for Local

For frontier models that can't run on 12GB, cheap API options ranked by price:

| Model | Input/1M | Output/1M | Provider | Tool Calling | Context |
| --- | --- | --- | --- | --- | --- |
| MiMo-V2-Flash | $0.09 | $0.29 | OpenRouter | Yes | 262K |
| MiniMax M2.5 | $0.118 | $0.99 | OpenRouter | Yes | 196K |
| DeepSeek V3.2 (cached) | $0.028 | $0.42 | Direct | Yes | 128K |
| DeepSeek V3.2 (uncached) | $0.28 | $0.42 | Direct | Yes | 128K |
| MiMo-V2-Pro | $1.00 | $3.00 | OpenRouter | Yes | 1M (!) |
| MiniMax M2.7 | $0.30 | $1.20 | OpenRouter | Yes | 204K |
| Haiku 4.5 | $0.80 | $4.00 | Anthropic direct | Yes | 200K |
| Kimi K2.5 | $0.38 | $1.72 | OpenRouter | Yes | 262K |
| GLM 5.1 | $1.395 | $4.40 | OpenRouter | Yes | 202K |

Speed-Optimized APIs (for latency-sensitive agent loops)

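| Model | Input/1M | Output/1M | Server-side t/s | Provider |
| --- | --- | --- | --- | --- |
| GPT-OSS 20B | $0.075 | $0.30 | 1,000 | Groq |
| Llama 3.1 8B | $0.05 | $0.08 | 840 | Groq |
| Qwen3 32B | $0.29 | $0.59 | 662 | Groq |
| Llama 4 Scout | $0.11 | $0.34 | 594 | Groq |
| GPT-OSS 120B | $0.15 | $0.60 | 500 | Groq |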

Monthly Cost Comparison (1000 queries/day, ~1K input + 500 output tokens each)

| Option | Monthly cost |
| --- | --- |
| DeepSeek V3.2 direct (cached) | ~$7.14 |
| MiMo-V2-Flash (OpenRouter) | $7.05 |
| GPT-OSS 120B (Groq, fast) | $13.50 |
| Local electricity (RTX 3060, 4h/day) | ~$2 |
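
The API figures follow from the per-million-token prices and the stated workload. The sketch below assumes a 30-day month, which is the assumption that reproduces the MiMo-V2-Flash and GPT-OSS 120B figures; the function name and defaults are illustrative.

```python
def monthly_cost(input_per_m, output_per_m, queries_per_day=1000,
                 input_tokens=1000, output_tokens=500, days=30):
    """Monthly API cost in dollars for a fixed per-query token budget.

    input_per_m / output_per_m are prices in dollars per 1M tokens.
    """
    queries = queries_per_day * days
    input_cost = queries * input_tokens / 1_000_000 * input_per_m
    output_cost = queries * output_tokens / 1_000_000 * output_per_m
    return input_cost + output_cost


if __name__ == "__main__":
    # Prices taken from the API tables above
    print(f"MiMo-V2-Flash: ${monthly_cost(0.09, 0.29):.2f}")
    print(f"GPT-OSS 120B (Groq): ${monthly_cost(0.15, 0.60):.2f}")
```

At this workload the month totals 30M input and 15M output tokens. Output is the larger cost for every model listed here, so output pricing matters more than the headline input price.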

Aggregators

Hybrid Strategy: Local + API (data-driven)

The optimal setup after benchmarking

Sources