Parallel Inference & API Fallback

Real parallel benchmarks, context scaling tables, and API as the true "speed tier"

Key finding: speed without quality is useless

0.8B achieves 343 t/s aggregate with --parallel 5, but scores only 5/25 on the polecat quality suite. It fabricates bugs, uses wrong frameworks, and is not viable for agentic coding. API fallback (Haiku at 22/25, Groq at 500+ t/s) replaces the local speed tier.

Real Benchmarks (measured on this hardware)

Single-stream generation (measured, @34K context)

Model	Size	TG t/s	PP t/s	Quality	Verdict
0.8B Q8_0	764MB	148	6,913	5/25	Not viable for agentic work
2B Q8_0	1.86GB	96	4,972	9/25	Not viable
4B Q8_0	4.48GB	43	2,060	10/25	Same speed as 9B, worse quality
9B IQ4_NL	5.37GB	41	1,518	15/25	Interactive tier winner
35B-A3B IQ4_NL	17.8GB	21 (regex)	420	17/25	Workhorse tier winner
Haiku 4.5 API	N/A	~100	N/A	22/25	Quality baseline ($0.80/M)

Parallel scaling: 0.8B Q4_K_M (real concurrent requests, 128 tokens each)

N	Aggregate t/s	Per-slot t/s	Wall time	Note
1	160	160	798ms
2	252	126	1017ms
3	281	94	1367ms
4	310	78	1653ms
5	343	69	1865ms	Sweet spot
6	357	60	2154ms
7	351	50	2550ms	Aggregate drops — CPU contention
8	369	46	2775ms

CPU is the bottleneck, not GPU

GPU utilization sits at ~20% during parallel generation. Token sampling happens on CPU per-slot sequentially. The 2014 Xeon E5-1650v3 creates a hard ceiling at ~350-370 t/s aggregate regardless of parallel count. N=5 is the sweet spot: biggest jump from N=4 (310→343), after that aggregate plateaus while per-slot collapses.

Context Scaling (real measured data)

All models use Q8 KV cache + 262K max context + activation rotation. Clean restart per test.

9B IQ4_NL (Interactive tier)

Context	PP t/s	TG t/s	VRAM
34K	1,518	41	10,807MB
93K	1,178	30	11,073MB
200K	872	21	11,454MB
241K	778	19	—

35B-A3B IQ4_NL regex (Workhorse fast mode)

Context	PP t/s	TG t/s	VRAM
34K	420	21	11,589MB
93K	382	16	11,750MB
200K	~254	~14	11,896MB
>188K	OOM — switch to -cmoe mode

35B-A3B IQ4_NL -cmoe (Workhorse full context mode)

Context	PP t/s	TG t/s	VRAM
241K	325	12	6.7GB (fits easily)

Cache hits are 15x faster

Prompt cache is on by default. Same-prefix follow-up queries skip prompt eval:

Q1 (cold, 34K context): 5,386ms wall
Q2 (cached, same prefix): 346ms wall (15x faster)
Q3 (cached): 578ms wall

For 35B on video-platform (140K tokens): first query ~15 min, follow-ups ~1-2s each.

Prompt Processing Time (real)

Context Size	35B IQ4_NL (regex)	9B IQ4_NL
60K tokens	~6 min	~40s
140K tokens	~6 min	~2 min

VRAM Budget: Qwen3.5-4B Q4_K_M

Qwen3.5-4B's hybrid architecture means KV cache per token is only ~32KB (vs ~144KB for a standard 4B transformer):

Component	VRAM
CUDA overhead	~0.75 GB
Model weights (Q4_K_M)	~2.50 GB
Available for KV cache	~8.75 GB

Parallel N	Context/Slot	Total KV (FP16)	Total KV (Q8_0)	Fits?
4	8K	1.0 GB	0.5 GB	Plenty of room
8	8K	2.0 GB	1.0 GB	Yes
4	32K	4.0 GB	2.0 GB	Yes
8	32K	8.0 GB	4.0 GB	Tight (FP16) / OK (Q8_0)
16	8K	4.0 GB	2.0 GB	Yes
16	16K	8.0 GB	4.0 GB	Tight (FP16) / OK (Q8_0)

You could theoretically run 16 parallel slots at 16K context each on your 3060 with Qwen3.5-4B. The Gated DeltaNet layers use O(1) state, so only the 8 attention layers contribute to KV cache.

Gemma 4 vs Qwen 3.5: Head-to-Head for Parallel Agentic Use

Small Models (sub-4B)

	Qwen3.5-4B	Gemma 4 E4B (4.5B eff)	Gemma 4 E2B (2.3B eff)
Single-stream t/s (3060)	~80-120 (est.)	12-15	~20
VRAM (Q4)	~2.5 GB	~2.5 GB	~1.3 GB
MMLU-Pro	79.1	Lower	Lower
TAU2 (agent/tool-use)	+37.7 pts vs Gemma	Baseline	N/A
Audio input	No	Yes	Yes
Image input	Yes (when llama.cpp lands)	Yes (buggy)	Yes (buggy)
Thinking mode toggle	Yes (per-request)	No	No
Tool calling	Slightly more reliable JSON	Cleaner JSON output	N/A
Parallel slots fit	16 @ 16K ctx	~8-12 @ 8K ctx	~16+ @ 8K ctx

Gemma 4 E2B/E4B are surprisingly slow

Despite being smaller, Gemma 4 E2B gets only ~20 t/s on RTX 3060 (vs Qwen3.5-4B's estimated ~80-120 t/s). The Per-Layer Embeddings (PLE) architecture adds overhead. For pure speed, Qwen 3.5 wins decisively.

MoE Tier (26B class)

	Qwen3.5-35B-A3B	Gemma 4 26B-A4B
Active params	3B	3.8B
t/s (3060, --fit)	~80-100 (UD-IQ2_M in VRAM)	~48 (partial CPU offload)
Arena Elo	High	#6 at 1441 Elo
Vision	No	Yes (text+image)
Parallel viability	Possible at IQ2_M (11.4GB)	Not viable (needs --fit CPU offload)

Candidates to Manually Benchmark Per Tier

Tier	Qwen 3.5 Candidate	Gemma 4 Candidate	Benchmark Command
10 t/s	122B-A10B IQ3_XXS	26B-A4B IQ3_S (with --fit)	`llama-bench -m model.gguf -ngl 99 -fa 1`
100 t/s	9B Q5_K_M / 35B-A3B IQ2_M	26B-A4B IQ2_M / E4B Q8	`llama-bench -m model.gguf -ngl 99 -fa 1 -c 8192`
1000 t/s	4B Q4_K_M (--parallel 4-8)	E2B Q8 (--parallel 4-8)	`llama-batched-bench -m model.gguf -ngl 99 -fa 1 -npl 1,2,4,8`

Gemma 4 Known Bugs in llama.cpp

Bug	Impact	Status	Tracking
`<unused24>`/`<unused25>` token spam	Garbage tokens in output	Fixed	#21321, PR #21326 + #21343
Vision CUDA crash (SIGABRT)	mmproj crashes on CUDA for 31B/26B	Open	#21402
Audio support missing	E2B/E4B audio not implemented	Open	#21325
OOM with partial offload on 26B	Even on RTX 5080 16GB	Open	#21323
Flash Attention hang (Apple Silicon)	Hybrid head dimensions break FA	Open	N/A (Apple-specific)

Parallel Inference: Gotchas

Quadratic attention overhead: Unified KV cache means attention is computed over the entire cache then masked per-slot — wasteful with many slots
CPU sampling bottleneck: Sampling is per-slot on CPU. Your 2014 Xeon will struggle at high N — watch for GPU utilization dropping below 40%
KV cache full = sequence killed: When cache fills, llama.cpp terminates the largest sequence. No graceful offloading.
Nondeterministic outputs: cache_prompt can cause non-identical logits across different batch sizes
One server, not four: Multiple llama-server instances each load model weights separately. On 12GB, only one server with --parallel N is viable.

Benchmark Commands

Run this to get your actual numbers

# Qwen3.5-4B: parallel scaling benchmark
llama-batched-bench -m Qwen3.5-4B-Q4_K_M.gguf -ngl 99 -fa 1 \
  -c 32768 -b 2048 -ub 512 \
  -npp 512,2048 -ntg 128,512 -npl 1,2,4,8

# Gemma 4 E2B: parallel scaling benchmark
llama-batched-bench -m gemma-4-E2B-it-Q8_0.gguf -ngl 99 -fa 1 \
  -c 16384 -b 2048 -ub 512 \
  -npp 512,2048 -ntg 128,512 -npl 1,2,4,8

# Single-stream baseline for each candidate
llama-bench -m Qwen3.5-4B-Q4_K_M.gguf -ngl 99 -fa 1
llama-bench -m gemma-4-E2B-it-Q8_0.gguf -ngl 99 -fa 1
llama-bench -m gemma-4-E4B-it-Q8_0.gguf -ngl 99 -fa 1

Parallel server for agentic use

# Qwen3.5-4B with 4 parallel agent slots
llama-server -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M \
  -ngl 99 -fa --jinja \
  --parallel 4 -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --port 8080

# Each slot gets 8K context (32768 / 4)
# Total KV cache: ~1GB (Q8_0) — leaves 6.75GB free

API Fallback: Models Too Big for Local

For frontier models that can't run on 12GB, cheap API options ranked by price:

Model	Input/1M	Output/1M	Provider	Tool Calling	Context
MiMo-V2-Flash	$0.09	$0.29	OpenRouter	Yes	262K
MiniMax M2.5	$0.118	$0.99	OpenRouter	Yes	196K
DeepSeek V3.2 (cached)	$0.028	$0.42	Direct	Yes	128K
DeepSeek V3.2 (uncached)	$0.28	$0.42	Direct	Yes	128K
MiMo-V2-Pro	$1.00	$3.00	OpenRouter	Yes	1M (!)
MiniMax M2.7	$0.30	$1.20	OpenRouter	Yes	204K
Haiku 4.5	$0.80	$4.00	Anthropic direct	Yes	200K
Kimi K2.5	$0.38	$1.72	OpenRouter	Yes	262K
GLM 5.1	$1.395	$4.40	OpenRouter	Yes	202K

Speed-Optimized APIs (for latency-sensitive agent loops)

Model	Input/1M	Output/1M	Server-side t/s	Provider
GPT-OSS 20B	$0.075	$0.30	1,000	Groq
Llama 3.1 8B	$0.05	$0.08	840	Groq
Qwen3 32B	$0.29	$0.59	662	Groq
Llama 4 Scout	$0.11	$0.34	594	Groq
GPT-OSS 120B	$0.15	$0.60	500	Groq

Monthly Cost Comparison (1000 queries/day, ~1K input + 500 output tokens each)

$1.50

DeepSeek V3.2 direct (cached)

$7.05

MiMo-V2-Flash (OpenRouter)

$13.50

GPT-OSS 120B (Groq, fast)

~$2

Local electricity (RTX 3060, 4h/day)

Aggregators

OpenRouter — single API key for all models, automatic failover, often cheapest. OpenAI-compatible format.
Groq — unbeatable for latency (500-1000 t/s server-side). Limited model selection (Llama/Qwen/GPT-OSS only).
Together.ai — batch API at 50% off. Free tier available.
Fireworks.ai — prompt caching at ~50% discount. Good for repeated system prompts in agentic loops.
NanoGPT — pay-as-you-go aggregator starting at $0.01 top-up. Not a dedicated plan for any model.

Hybrid Strategy: Local + API (data-driven)

The optimal setup after benchmarking

Local Interactive (free): 9B IQ4_NL at 41 t/s for code gen, tool calling (15/25)
Local Workhorse (free): 35B-A3B IQ4_NL at 21 t/s for code review, bug finding (17/25)
API speed tier: Haiku 4.5 at $0.80/M for quality-critical tasks (22/25)
API budget: DeepSeek V3.2 cached at $0.028/M, MiMo-V2-Flash at $0.09/M
API fast: Groq at 500-1000 t/s server-side for latency-sensitive loops

← 1000 t/s Tier Inference Engines →