Multimodal & OCR

Japanese OCR, vision models, speech-to-text on RTX 3060 12GB

Japanese OCR: Decision Tree

Use CaseBest ToolSpeedVRAM
Fast, simple text (printed docs) PaddleOCR PP-OCRv5 Instant (~2M params) Negligible
Complex layouts (manga, mixed text/images) Qwen 2.5 VL 3B Fast ~6-8 GB
OCR + translate in one shot Qwen3.5-4B (when llama.cpp vision lands) Fast ~4-8 GB
Complex documents, charts, seals PaddleOCR-VL (0.9B) Fast ~2-3 GB

PaddleOCR PP-OCRv5

github.com/PaddlePaddle/PaddleOCR

PaddleOCR-VL

Qwen 2.5 VL for OCR

Available in 3B, 7B, 32B, 72B. The 3B model is the sweet spot for 12GB VRAM:

BenchmarkScore
DocVQA93.9%
InfoVQA77.1%
TextVQA79.3%

Fully supported in llama.cpp with mmproj. At ~6-8GB in BF16, fits your GPU.

# Qwen 2.5 VL 3B with vision
llama-server -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF \
  -ngl 99 -fa -c 4096

Qwen3.5-4B with Vision

The all-in-one option: OCR + translate + converse in 201 languages.

BenchmarkScore
CC-OCR76.7
OCRBench85.0
OmniDocBench86.2
WMT24++ (translation)66.6
Not yet in llama.cpp multimodal docs

Qwen3.5 vision is not yet listed in llama.cpp multimodal docs. Qwen 2.5 VL and Gemma 3 are. Wait for support or use Qwen 2.5 VL 3B today.

Gemma 4 26B-A4B Multimodal

unsloth GGUFs | Model card

Modality26B-A4BE2B/E4B
TextYesYes
ImagesYesYes
AudioNoYes (~300M audio encoder, up to 30s)
VideoAs frame sequences (up to 60s)As frame sequences

The "multi lingual/audio/image/video" claim is partially wrong for 26B-A4B. Audio requires the smaller E2B/E4B models. "E" stands for "effective parameters" (not e2b.dev sandbox).

140+ languages, 35+ with out-of-the-box quality. Japanese included. 256K context.

Quantizations That Fit

QuantSizeFits 12GB?
IQ2_XXS9.66 GBYes
UD-IQ2_M9.97-10.7 GBYes
UD-IQ3_XXS11.2 GBTight
UD-IQ4_XS13.4 GBNo — needs CPU offload
Gemma 4 vision is buggy in llama.cpp

PR #21309 merged support but open issues report: image processing failures, CUDA mmproj crashes, large image assertion errors, very long generation latency. Wait for stabilization.

Vision in llama.cpp (mmproj)

What are mmproj files?

The multimodal projector (mmproj) is a separate GGUF file that bridges visual input to the language model. It projects image features (from SigLIP or ViT) into the LLM's token embedding space.

Key Flags

FlagEffect
--mmproj file.ggufLoad a custom projector file
--no-mmprojDisable vision — saves VRAM when text-only
--no-mmproj-offloadKeep projector on CPU (saves GPU VRAM at cost of speed)

Models with Working Vision in llama.cpp

ModelSizes
Gemma 34B, 12B, 27B
Gemma 426B-A4B (merged but buggy)
Qwen 2 VL2B, 7B
Qwen 2.5 VL3B, 7B, 32B, 72B
InternVL 2.51B, 4B
InternVL 31B, 2B, 8B, 14B
SmolVLM256M, 500M, 2.2B
Pixtral12B
Moondream2~1.4B

Qwen3.5 is NOT yet listed. Audio models: Ultravox 0.5 (1B, 8B), Voxtral Mini 3B, Qwen2.5 Omni (3B, 7B).

Speech-to-Text

Parakeet TDT 0.6B v2 (English)

HuggingFace — NVIDIA's FastConformer-TDT architecture

600M
Parameters (~2GB VRAM)
3,386x
Real-time factor (batch 128, A100)
1.69%
WER on LibriSpeech clean
CC-BY-4.0
License (commercial OK)
pip install nemo_toolkit[asr]

import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
output = model.transcribe(['audio.wav'], timestamps=True)
print(output[0].text)
For Japanese speech-to-text: use Whisper

Parakeet does not support Japanese in any version. Use Whisper large-v3 (~3GB VRAM, 99 languages including Japanese). Competitive English WER (~1.8% on LibriSpeech clean) but Parakeet is faster on NVIDIA GPUs.

LFM2.5-VL-450M (Liquid AI)

Blog | 450M param vision-language model

Not recommended for rich OCR or description — a 3-4B VLM produces much better output. Useful only for lightweight structured vision tasks.

Tesseract vs VLM-based OCR

From r/LocalLLaMA discussion:

TesseractVLM-based
Clean printed textGoodOverkill
Complex layouts (manga)PoorExcellent
HandwritingPoorGood
Scene text (signs)PoorGood
Vertical JapanesePoorGood
SpeedFastSlower
Context understandingNoneFull

For Japanese manga/comics: VLM-based OCR is strongly preferred. Text appears in vertical columns, speech bubbles, and artistic fonts that Tesseract handles poorly.

Recommended Setup

Daily OCR Pipeline (Gallery/Manga app)

  1. PaddleOCR PP-OCRv5 for fast pre-screening (instant, CPU)
  2. Qwen 2.5 VL 3B for complex images needing context (llama.cpp, ~6GB VRAM)
  3. Both can run simultaneously — PaddleOCR on CPU, Qwen on GPU

Voice Pipeline

  1. Parakeet TDT 0.6B v2 for English STT (2GB VRAM)
  2. Whisper large-v3 for Japanese STT (3GB VRAM)
  3. Feed transcription to Qwen3.5-4B for processing (2.2GB VRAM)
  4. Total: ~7GB VRAM — fits comfortably with room for context

Sources