Parallel Inference & API Fallback
Real parallel benchmarks, context scaling tables, and API as the true "speed tier"
Key finding: speed without quality is useless
0.8B achieves 343 t/s aggregate with --parallel 5, but scores only 5/25 on the polecat quality suite. It fabricates bugs, uses wrong frameworks, and is not viable for agentic coding. API fallback (Haiku at 22/25, Groq at 500+ t/s) replaces the local speed tier.
Real Benchmarks (measured on this hardware)
Single-stream generation (measured, @34K context)
| Model | Size | TG t/s | PP t/s | Quality | Verdict |
| 0.8B Q8_0 | 764MB | 148 | 6,913 | 5/25 | Not viable for agentic work |
| 2B Q8_0 | 1.86GB | 96 | 4,972 | 9/25 | Not viable |
| 4B Q8_0 | 4.48GB | 43 | 2,060 | 10/25 | Same speed as 9B, worse quality |
| 9B IQ4_NL | 5.37GB | 41 | 1,518 | 15/25 | Interactive tier winner |
| 35B-A3B IQ4_NL | 17.8GB | 21 (regex) | 420 | 17/25 | Workhorse tier winner |
| Haiku 4.5 API | N/A | ~100 | N/A | 22/25 | Quality baseline ($0.80/M) |
Parallel scaling: 0.8B Q4_K_M (real concurrent requests, 128 tokens each)
| N | Aggregate t/s | Per-slot t/s | Wall time | Note |
| 1 | 160 | 160 | 798ms | |
| 2 | 252 | 126 | 1017ms | |
| 3 | 281 | 94 | 1367ms | |
| 4 | 310 | 78 | 1653ms | |
| 5 | 343 | 69 | 1865ms | Sweet spot |
| 6 | 357 | 60 | 2154ms | |
| 7 | 351 | 50 | 2550ms | Aggregate drops — CPU contention |
| 8 | 369 | 46 | 2775ms | |
CPU is the bottleneck, not GPU
GPU utilization sits at ~20% during parallel generation. Token sampling runs on the CPU, one slot at a time, so the 2014 Xeon E5-1650v3 imposes a hard ceiling of ~350-370 t/s aggregate regardless of parallel count. N=5 is the sweet spot: it is the last sizable gain (310→343 from N=4). Beyond it, aggregate throughput plateaus while per-slot throughput keeps falling.
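The sweet-spot reasoning can be checked against the scaling table with a few lines of Python. This is a minimal sketch: the throughput figures are copied from the table, while the 25 t/s cutoff in `sweet_spot` is a hypothetical threshold chosen for illustration, not something measured.

```python
# Aggregate throughput (t/s) per parallel count N, copied from the scaling table.
AGGREGATE_TPS = {1: 160, 2: 252, 3: 281, 4: 310, 5: 343, 6: 357, 7: 351, 8: 369}


def per_slot_tps(n: int) -> float:
    """Throughput each slot sees when N requests share the server."""
    return AGGREGATE_TPS[n] / n


def marginal_gain(n: int) -> int:
    """Aggregate t/s gained by going from N-1 to N slots."""
    return AGGREGATE_TPS[n] - AGGREGATE_TPS[n - 1]


def sweet_spot(min_gain: int = 25) -> int:
    """Largest N reached before the marginal gain first drops below min_gain.

    min_gain is an assumed cutoff, not a measured value.
    """
    best = 1
    for n in sorted(AGGREGATE_TPS)[1:]:
        if marginal_gain(n) < min_gain:
            break
        best = n
    return best


if __name__ == "__main__":
    for n in sorted(AGGREGATE_TPS)[1:]:
        print(f"N={n}: {marginal_gain(n):+d} t/s aggregate, "
              f"{per_slot_tps(n):.0f} t/s per slot")
    print("sweet spot:", sweet_spot())
```

With the table's numbers, the gain from N=5 to N=6 is only 14 t/s, and N=7 is slightly negative, so the loop stops at N=5.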
Context Scaling (real measured data)
All models use Q8 KV cache + 262K max context + activation rotation. Clean restart per test.
9B IQ4_NL (Interactive tier)
| Context | PP t/s | TG t/s | VRAM |
| 34K | 1,518 | 41 | 10,807MB |
| 93K | 1,178 | 30 | 11,073MB |
| 200K | 872 | 21 | 11,454MB |
| 241K | 778 | 19 | — |
35B-A3B IQ4_NL regex (Workhorse fast mode)
| Context | PP t/s | TG t/s | VRAM |
| 34K | 420 | 21 | 11,589MB |
| 93K | 382 | 16 | 11,750MB |
| 200K | ~254 | ~14 | 11,896MB |
| >188K | OOM — switch to -cmoe mode |
35B-A3B IQ4_NL -cmoe (Workhorse full context mode)
| Context | PP t/s | TG t/s | VRAM |
| 241K | 325 | 12 | 6.7GB (fits easily) |
Cache hits are 15x faster
Prompt cache is on by default. Same-prefix follow-up queries skip prompt eval:
- Q1 (cold, 34K context): 5,386ms wall
- Q2 (cached, same prefix): 346ms wall (15x faster)
- Q3 (cached): 578ms wall
For 35B on video-platform (140K tokens): first query ~15 min, follow-ups ~1-2s each.
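The speedup figures follow directly from the wall times listed above. A small sketch, using only those three measurements:

```python
# Wall times in milliseconds for the three queries listed above.
COLD_MS = 5386
CACHED_MS = [346, 578]


def speedup(cold_ms: float, warm_ms: float) -> float:
    """How many times faster a cached query is than the cold one."""
    return cold_ms / warm_ms


for warm in CACHED_MS:
    print(f"{warm} ms cached -> {speedup(COLD_MS, warm):.1f}x faster than cold")
```

The 15x headline is the best case (Q2, about 15.6x). Q3 comes out at about 9.3x, so a range of roughly 9-15x is the safer expectation.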
Prompt Processing Time (real)
| Context Size | 35B IQ4_NL (regex) | 9B IQ4_NL |
| 60K tokens | ~6 min | ~40s |
| 140K tokens | ~6 min | ~2 min |
VRAM Budget: Qwen3.5-4B Q4_K_M
Qwen3.5-4B's hybrid architecture means KV cache per token is only ~32KB (vs ~144KB for a standard 4B transformer):
| Component | VRAM |
| CUDA overhead | ~0.75 GB |
| Model weights (Q4_K_M) | ~2.50 GB |
| Available for KV cache | ~8.75 GB |
| Parallel N | Context/Slot | Total KV (FP16) | Total KV (Q8_0) | Fits? |
| 4 | 8K | 1.0 GB | 0.5 GB | Plenty of room |
| 8 | 8K | 2.0 GB | 1.0 GB | Yes |
| 4 | 32K | 4.0 GB | 2.0 GB | Yes |
| 8 | 32K | 8.0 GB | 4.0 GB | Tight (FP16) / OK (Q8_0) |
| 16 | 8K | 4.0 GB | 2.0 GB | Yes |
| 16 | 16K | 8.0 GB | 4.0 GB | Tight (FP16) / OK (Q8_0) |
You could theoretically run 16 parallel slots at 16K context each on your 3060 with Qwen3.5-4B. The Gated DeltaNet layers use O(1) state, so only the 8 attention layers contribute to KV cache.
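The slot table can be reproduced from the ~32KB-per-token figure. The sketch below assumes, as the table does, that Q8_0 halves the FP16 footprint exactly (real Q8_0 carries a little block overhead) and that the full 8.75 GB budget is available to the KV cache.

```python
KV_BYTES_PER_TOKEN_FP16 = 32 * 1024  # ~32KB per token, from the text above
GIB = 1024 ** 3
KV_BUDGET_GB = 8.75                  # "Available for KV cache" row


def kv_cache_gb(slots: int, ctx_per_slot: int, q8: bool = False) -> float:
    """Total KV cache size in GB for `slots` slots of `ctx_per_slot` tokens.

    Q8_0 is treated as exactly half of FP16 (an approximation that
    matches the table's rounding).
    """
    total = slots * ctx_per_slot * KV_BYTES_PER_TOKEN_FP16 / GIB
    return total / 2 if q8 else total


def fits(slots: int, ctx_per_slot: int, q8: bool = False) -> bool:
    """True when the KV cache stays within the 8.75 GB budget."""
    return kv_cache_gb(slots, ctx_per_slot, q8) <= KV_BUDGET_GB


if __name__ == "__main__":
    for slots, ctx in [(4, 8192), (8, 8192), (4, 32768),
                       (8, 32768), (16, 8192), (16, 16384)]:
        print(f"N={slots:2d} ctx={ctx:5d}: "
              f"FP16 {kv_cache_gb(slots, ctx):.1f} GB, "
              f"Q8_0 {kv_cache_gb(slots, ctx, q8=True):.1f} GB, "
              f"fits FP16: {fits(slots, ctx)}")
```

The 8.0 GB rows pass the check but leave under 1 GB of headroom, which is why the table marks them as tight for FP16.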
Gemma 4 vs Qwen 3.5: Head-to-Head for Parallel Agentic Use
Small Models (4B class and smaller)
| Metric | Qwen3.5-4B | Gemma 4 E4B (4.5B eff) | Gemma 4 E2B (2.3B eff) |
| Single-stream t/s (3060) | ~80-120 (est.) | 12-15 | ~20 |
| VRAM (Q4) | ~2.5 GB | ~2.5 GB | ~1.3 GB |
| MMLU-Pro | 79.1 | Lower | Lower |
| TAU2 (agent/tool-use) | +37.7 pts vs Gemma | Baseline | N/A |
| Audio input | No | Yes | Yes |
| Image input | Yes (when llama.cpp lands) | Yes (buggy) | Yes (buggy) |
| Thinking mode toggle | Yes (per-request) | No | No |
| Tool calling | Slightly more reliable JSON | Cleaner JSON output | N/A |
| Parallel slots fit | 16 @ 16K ctx | ~8-12 @ 8K ctx | ~16+ @ 8K ctx |
Gemma 4 E2B/E4B are surprisingly slow
Despite being smaller, Gemma 4 E2B gets only ~20 t/s on RTX 3060 (vs Qwen3.5-4B's estimated ~80-120 t/s). The Per-Layer Embeddings (PLE) architecture adds overhead. For pure speed, Qwen 3.5 wins decisively.
MoE Tier (26B class)
| Metric | Qwen3.5-35B-A3B | Gemma 4 26B-A4B |
| Active params | 3B | 3.8B |
| t/s (3060, --fit) | ~80-100 (UD-IQ2_M in VRAM) | ~48 (partial CPU offload) |
| Arena Elo | High | #6 at 1441 Elo |
| Vision | No | Yes (text+image) |
| Parallel viability | Possible at IQ2_M (11.4GB) | Not viable (needs --fit CPU offload) |
Candidates to Manually Benchmark Per Tier
| Tier | Qwen 3.5 Candidate | Gemma 4 Candidate | Benchmark Command |
| 10 t/s | 122B-A10B IQ3_XXS | 26B-A4B IQ3_S (with --fit) | llama-bench -m model.gguf -ngl 99 -fa 1 |
| 100 t/s | 9B Q5_K_M / 35B-A3B IQ2_M | 26B-A4B IQ2_M / E4B Q8 | llama-bench -m model.gguf -ngl 99 -fa 1 -c 8192 |
| 1000 t/s | 4B Q4_K_M (--parallel 4-8) | E2B Q8 (--parallel 4-8) | llama-batched-bench -m model.gguf -ngl 99 -fa 1 -npl 1,2,4,8 |
Gemma 4 Known Bugs in llama.cpp
| Bug | Impact | Status | Tracking |
| <unused24>/<unused25> token spam | Garbage tokens in output | Fixed | #21321, PR #21326 + #21343 |
| Vision CUDA crash (SIGABRT) | mmproj crashes on CUDA for 31B/26B | Open | #21402 |
| Audio support missing | E2B/E4B audio not implemented | Open | #21325 |
| OOM with partial offload on 26B | Even on RTX 5080 16GB | Open | #21323 |
| Flash Attention hang (Apple Silicon) | Hybrid head dimensions break FA | Open | N/A (Apple-specific) |
Parallel Inference: Gotchas
- Quadratic attention overhead: Unified KV cache means attention is computed over the entire cache then masked per-slot — wasteful with many slots
- CPU sampling bottleneck: Sampling is per-slot on CPU. Your 2014 Xeon will struggle at high N — watch for GPU utilization dropping below 40%
- KV cache full = sequence killed: When cache fills, llama.cpp terminates the largest sequence. No graceful offloading.
- Nondeterministic outputs: cache_prompt can cause non-identical logits across different batch sizes
- One server, not four: Multiple llama-server instances each load model weights separately. On 12GB, only one server with --parallel N is viable.
Benchmark Commands
Run this to get your actual numbers
# Qwen3.5-4B: parallel scaling benchmark
llama-batched-bench -m Qwen3.5-4B-Q4_K_M.gguf -ngl 99 -fa 1 \
-c 32768 -b 2048 -ub 512 \
-npp 512,2048 -ntg 128,512 -npl 1,2,4,8
# Gemma 4 E2B: parallel scaling benchmark
llama-batched-bench -m gemma-4-E2B-it-Q8_0.gguf -ngl 99 -fa 1 \
-c 16384 -b 2048 -ub 512 \
-npp 512,2048 -ntg 128,512 -npl 1,2,4,8
# Single-stream baseline for each candidate
llama-bench -m Qwen3.5-4B-Q4_K_M.gguf -ngl 99 -fa 1
llama-bench -m gemma-4-E2B-it-Q8_0.gguf -ngl 99 -fa 1
llama-bench -m gemma-4-E4B-it-Q8_0.gguf -ngl 99 -fa 1
Parallel server for agentic use
# Qwen3.5-4B with 4 parallel agent slots
llama-server -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M \
-ngl 99 -fa on --jinja \
--parallel 4 -c 32768 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--port 8080
# Each slot gets 8K context (32768 / 4)
# Total KV cache: ~0.5GB (Q8_0), leaving ~8.25GB of the 8.75GB KV budget free
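To exercise the parallel slots, a client has to keep several requests in flight at once. The sketch below is one way to do that from Python using only the standard library. It assumes the server above is listening on 127.0.0.1:8080 and uses llama-server's /completion endpoint. `run_parallel` and the injectable `send` parameter are illustrative names, and the throughput figure assumes every request generates all `n_predict` tokens.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumes the llama-server command above is running locally on port 8080.
SERVER_URL = "http://127.0.0.1:8080/completion"


def complete(prompt: str, n_predict: int = 128) -> str:
    """POST one prompt to /completion and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]


def run_parallel(prompts, n_predict=128, send=complete):
    """Send all prompts concurrently.

    Returns (outputs, wall_seconds, aggregate_tps). The throughput is an
    upper-bound estimate: it assumes each request produced n_predict tokens.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        outputs = list(pool.map(lambda p: send(p, n_predict), prompts))
    wall = time.perf_counter() - start
    return outputs, wall, len(prompts) * n_predict / wall
```

Calling `run_parallel(["Summarize this diff."] * 4)` would fill all four slots. Passing more prompts than `--parallel` allows simply queues the extras on the server side.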
API Fallback: Models Too Big for Local
For frontier models that can't run on 12GB, these are the cheap API options (prices in USD per 1M tokens):
| Model | Input/1M | Output/1M | Provider | Tool Calling | Context |
| MiMo-V2-Flash | $0.09 | $0.29 | OpenRouter | Yes | 262K |
| MiniMax M2.5 | $0.118 | $0.99 | OpenRouter | Yes | 196K |
| DeepSeek V3.2 (cached) | $0.028 | $0.42 | Direct | Yes | 128K |
| DeepSeek V3.2 (uncached) | $0.28 | $0.42 | Direct | Yes | 128K |
| MiMo-V2-Pro | $1.00 | $3.00 | OpenRouter | Yes | 1M (!) |
| MiniMax M2.7 | $0.30 | $1.20 | OpenRouter | Yes | 204K |
| Haiku 4.5 | $0.80 | $4.00 | Anthropic direct | Yes | 200K |
| Kimi K2.5 | $0.38 | $1.72 | OpenRouter | Yes | 262K |
| GLM 5.1 | $1.395 | $4.40 | OpenRouter | Yes | 202K |
Speed-Optimized APIs (for latency-sensitive agent loops)
| Model | Input/1M | Output/1M | Server-side t/s | Provider |
| GPT-OSS 20B | $0.075 | $0.30 | 1,000 | Groq |
| Llama 3.1 8B | $0.05 | $0.08 | 840 | Groq |
| Qwen3 32B | $0.29 | $0.59 | 662 | Groq |
| Llama 4 Scout | $0.11 | $0.34 | 594 | Groq |
| GPT-OSS 120B | $0.15 | $0.60 | 500 | Groq |
Monthly Cost Comparison (1000 queries/day, ~1K input + 500 output tokens each)
| Option | Monthly cost |
| DeepSeek V3.2 direct (cached) | $7.14 |
| MiMo-V2-Flash (OpenRouter) | $7.05 |
| GPT-OSS 120B (Groq, fast) | $13.50 |
| Local electricity (RTX 3060, 4h/day) | ~$2 |
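The API figures come from multiplying monthly token volume by the per-million prices in the tables above. A minimal sketch, assuming a 30-day month (the function name and defaults are illustrative):

```python
def monthly_cost(input_per_m: float, output_per_m: float,
                 queries_per_day: int = 1000, input_tokens: int = 1000,
                 output_tokens: int = 500, days: int = 30) -> float:
    """Monthly API cost in USD, given prices per 1M input/output tokens."""
    queries = queries_per_day * days
    input_millions = queries * input_tokens / 1_000_000    # 30M per month
    output_millions = queries * output_tokens / 1_000_000  # 15M per month
    return input_millions * input_per_m + output_millions * output_per_m


print(f"MiMo-V2-Flash: ${monthly_cost(0.09, 0.29):.2f}")
print(f"GPT-OSS 120B:  ${monthly_cost(0.15, 0.60):.2f}")
```

At this input/output ratio the output price dominates the bill, so a cheap cached-input rate helps less than it might first appear.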
Aggregators
- OpenRouter — single API key for all models, automatic failover, often cheapest. OpenAI-compatible format.
- Groq — unbeatable for latency (500-1000 t/s server-side). Limited model selection (Llama/Qwen/GPT-OSS only).
- Together.ai — batch API at 50% off. Free tier available.
- Fireworks.ai — prompt caching at ~50% discount. Good for repeated system prompts in agentic loops.
- NanoGPT — pay-as-you-go aggregator starting at $0.01 top-up. Not a dedicated plan for any model.
Hybrid Strategy: Local + API (data-driven)
The optimal setup after benchmarking
- Local Interactive (free): 9B IQ4_NL at 41 t/s for code gen, tool calling (15/25)
- Local Workhorse (free): 35B-A3B IQ4_NL at 21 t/s for code review, bug finding (17/25)
- API speed tier: Haiku 4.5 at $0.80/M for quality-critical tasks (22/25)
- API budget: DeepSeek V3.2 cached at $0.028/M, MiMo-V2-Flash at $0.09/M
- API fast: Groq at 500-1000 t/s server-side for latency-sensitive loops
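One way to wire these tiers together is a small routing function. The sketch below is hypothetical: the task labels, the flag names, and the precedence (quality first, then latency, then local) are assumptions made for illustration, and the budget API tier is omitted for brevity. Only the model names and scores come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str
    quality: str  # quality-suite score where measured, else "n/a"


TIERS = {
    "local_interactive": Tier("Local Interactive", "9B IQ4_NL", "15/25"),
    "local_workhorse": Tier("Local Workhorse", "35B-A3B IQ4_NL", "17/25"),
    "api_quality": Tier("API quality", "Haiku 4.5", "22/25"),
    "api_fast": Tier("API fast", "Groq", "n/a"),
}


def pick_tier(task: str, quality_critical: bool = False,
              latency_sensitive: bool = False) -> Tier:
    """Hypothetical routing rule; labels and precedence are assumptions."""
    if quality_critical:
        return TIERS["api_quality"]
    if latency_sensitive:
        return TIERS["api_fast"]
    if task in ("code_review", "bug_finding"):
        return TIERS["local_workhorse"]
    # Default: code generation and tool calling stay on the fast local model.
    return TIERS["local_interactive"]


if __name__ == "__main__":
    print(pick_tier("code_gen").model)
    print(pick_tier("code_review").model)
    print(pick_tier("code_gen", quality_critical=True).model)
```

Keeping the rule in one function makes it easy to adjust once more tasks have been scored on the quality suite.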