Multimodal & OCR

Japanese OCR, vision models, speech-to-text on RTX 3060 12GB

Japanese OCR: Decision Tree

Use Case	Best Tool	Speed	VRAM
Fast, simple text (printed docs)	PaddleOCR PP-OCRv5	Instant (~2M params)	Negligible
Complex layouts (manga, mixed text/images)	Qwen 2.5 VL 3B	Fast	~6-8 GB
OCR + translate in one shot	Qwen3.5-4B (when llama.cpp vision lands)	Fast	~4-8 GB
Complex documents, charts, seals	PaddleOCR-VL (0.9B)	Fast	~2-3 GB

PaddleOCR PP-OCRv5

github.com/PaddlePaddle/PaddleOCR

Japanese explicitly supported (unified Chinese/English/Japanese framework)
PP-OCRv5_server_rec is the high-accuracy server variant (~2M params)
13% accuracy boost over v4, 109 languages
Nearly instant for individual images
Production-ready, battle-tested

PaddleOCR-VL

0.9B parameter vision-language model for document parsing
94.5% accuracy on OmniDocBench
111 languages, seal recognition, text spotting
Robust to skew, warping, scanning artifacts, illumination
Significantly more capable than traditional PaddleOCR for complex layouts

Qwen 2.5 VL for OCR

Available in 3B, 7B, 32B, 72B. The 3B model is the sweet spot for 12GB VRAM:

Benchmark	Score
DocVQA	93.9%
InfoVQA	77.1%
TextVQA	79.3%

Fully supported in llama.cpp with mmproj. At ~6-8GB in BF16, fits your GPU.

# Qwen 2.5 VL 3B with vision
llama-server -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF \
  -ngl 99 -fa -c 4096

Qwen3.5-4B with Vision

The all-in-one option: OCR + translate + converse in 201 languages.

Benchmark	Score
CC-OCR	76.7
OCRBench	85.0
OmniDocBench	86.2
WMT24++ (translation)	66.6

Not yet in llama.cpp multimodal docs

Qwen3.5 vision is not yet listed in llama.cpp multimodal docs. Qwen 2.5 VL and Gemma 3 are. Wait for support or use Qwen 2.5 VL 3B today.

Gemma 4 26B-A4B Multimodal

unsloth GGUFs | Model card

Modality	26B-A4B	E2B/E4B
Text	Yes	Yes
Images	Yes	Yes
Audio	No	Yes (~300M audio encoder, up to 30s)
Video	As frame sequences (up to 60s)	As frame sequences

The "multi lingual/audio/image/video" claim is partially wrong for 26B-A4B. Audio requires the smaller E2B/E4B models. "E" stands for "effective parameters" (not e2b.dev sandbox).

140+ languages, 35+ with out-of-the-box quality. Japanese included. 256K context.

Quantizations That Fit

Quant	Size	Fits 12GB?
IQ2_XXS	9.66 GB	Yes
UD-IQ2_M	9.97-10.7 GB	Yes
UD-IQ3_XXS	11.2 GB	Tight
UD-IQ4_XS	13.4 GB	No — needs CPU offload

Gemma 4 vision is buggy in llama.cpp

PR #21309 merged support but open issues report: image processing failures, CUDA mmproj crashes, large image assertion errors, very long generation latency. Wait for stabilization.

Vision in llama.cpp (mmproj)

What are mmproj files?

The multimodal projector (mmproj) is a separate GGUF file that bridges visual input to the language model. It projects image features (from SigLIP or ViT) into the LLM's token embedding space.

Key Flags

Flag	Effect
`--mmproj file.gguf`	Load a custom projector file
`--no-mmproj`	Disable vision — saves VRAM when text-only
`--no-mmproj-offload`	Keep projector on CPU (saves GPU VRAM at cost of speed)

Models with Working Vision in llama.cpp

Model	Sizes
Gemma 3	4B, 12B, 27B
Gemma 4	26B-A4B (merged but buggy)
Qwen 2 VL	2B, 7B
Qwen 2.5 VL	3B, 7B, 32B, 72B
InternVL 2.5	1B, 4B
InternVL 3	1B, 2B, 8B, 14B
SmolVLM	256M, 500M, 2.2B
Pixtral	12B
Moondream2	~1.4B

Qwen3.5 is NOT yet listed. Audio models: Ultravox 0.5 (1B, 8B), Voxtral Mini 3B, Qwen2.5 Omni (3B, 7B).

Speech-to-Text

Parakeet TDT 0.6B v2 (English)

HuggingFace — NVIDIA's FastConformer-TDT architecture

600M

Parameters (~2GB VRAM)

3,386x

Real-time factor (batch 128, A100)

1.69%

WER on LibriSpeech clean

CC-BY-4.0

License (commercial OK)

16kHz mono audio, up to 24 minutes per segment
Automatic punctuation, capitalization, word-level timestamps
English only — no Japanese
v3 adds 25 European languages (still no Japanese)

pip install nemo_toolkit[asr]

import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
output = model.transcribe(['audio.wav'], timestamps=True)
print(output[0].text)

For Japanese speech-to-text: use Whisper

Parakeet does not support Japanese in any version. Use Whisper large-v3 (~3GB VRAM, 99 languages including Japanese). Competitive English WER (~1.8% on LibriSpeech clean) but Parakeet is faster on NVIDIA GPUs.

LFM2.5-VL-450M (Liquid AI)

Blog | 450M param vision-language model

8 languages including Japanese
OCRBench: 684 (mediocre vs Qwen's 850+)
Best for: object detection, bounding boxes (RefCOCO-M: 81.28)
Speed: 242ms on Jetson Orin for 512x512 images
Q4 would be ~0.25GB — trivial VRAM
Gated model — apply for access on HuggingFace

Not recommended for rich OCR or description — a 3-4B VLM produces much better output. Useful only for lightweight structured vision tasks.

Tesseract vs VLM-based OCR

From r/LocalLLaMA discussion:

	Tesseract	VLM-based
Clean printed text	Good	Overkill
Complex layouts (manga)	Poor	Excellent
Handwriting	Poor	Good
Scene text (signs)	Poor	Good
Vertical Japanese	Poor	Good
Speed	Fast	Slower
Context understanding	None	Full

For Japanese manga/comics: VLM-based OCR is strongly preferred. Text appears in vertical columns, speech bubbles, and artistic fonts that Tesseract handles poorly.

Recommended Setup

Daily OCR Pipeline (Gallery/Manga app)

PaddleOCR PP-OCRv5 for fast pre-screening (instant, CPU)
Qwen 2.5 VL 3B for complex images needing context (llama.cpp, ~6GB VRAM)
Both can run simultaneously — PaddleOCR on CPU, Qwen on GPU

Voice Pipeline

Parakeet TDT 0.6B v2 for English STT (2GB VRAM)
Whisper large-v3 for Japanese STT (3GB VRAM)
Feed transcription to Qwen3.5-4B for processing (2.2GB VRAM)
Total: ~7GB VRAM — fits comfortably with room for context

Sources

← Harness & Agents Command Reference →