Multimodal & OCR
Japanese OCR, vision models, speech-to-text on RTX 3060 12GB
Japanese OCR: Decision Tree
| Use Case | Best Tool | Speed | VRAM |
|---|---|---|---|
| Fast, simple text (printed docs) | PaddleOCR PP-OCRv5 | Instant (~2M params) | Negligible |
| Complex layouts (manga, mixed text/images) | Qwen 2.5 VL 3B | Fast | ~6-8 GB |
| OCR + translate in one shot | Qwen3.5-4B (when llama.cpp vision lands) | Fast | ~4-8 GB |
| Complex documents, charts, seals | PaddleOCR-VL (0.9B) | Fast | ~2-3 GB |
PaddleOCR PP-OCRv5
github.com/PaddlePaddle/PaddleOCR
- Japanese explicitly supported (unified Chinese/English/Japanese framework)
PP-OCRv5_server_recis the high-accuracy server variant (~2M params)- 13% accuracy boost over v4, 109 languages
- Nearly instant for individual images
- Production-ready, battle-tested
PaddleOCR-VL
- 0.9B parameter vision-language model for document parsing
- 94.5% accuracy on OmniDocBench
- 111 languages, seal recognition, text spotting
- Robust to skew, warping, scanning artifacts, illumination
- Significantly more capable than traditional PaddleOCR for complex layouts
Qwen 2.5 VL for OCR
Available in 3B, 7B, 32B, 72B. The 3B model is the sweet spot for 12GB VRAM:
| Benchmark | Score |
|---|---|
| DocVQA | 93.9% |
| InfoVQA | 77.1% |
| TextVQA | 79.3% |
Fully supported in llama.cpp with mmproj. At ~6-8GB in BF16, fits your GPU.
# Qwen 2.5 VL 3B with vision
llama-server -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF \
-ngl 99 -fa -c 4096
Qwen3.5-4B with Vision
The all-in-one option: OCR + translate + converse in 201 languages.
| Benchmark | Score |
|---|---|
| CC-OCR | 76.7 |
| OCRBench | 85.0 |
| OmniDocBench | 86.2 |
| WMT24++ (translation) | 66.6 |
Qwen3.5 vision is not yet listed in llama.cpp multimodal docs. Qwen 2.5 VL and Gemma 3 are. Wait for support or use Qwen 2.5 VL 3B today.
Gemma 4 26B-A4B Multimodal
| Modality | 26B-A4B | E2B/E4B |
|---|---|---|
| Text | Yes | Yes |
| Images | Yes | Yes |
| Audio | No | Yes (~300M audio encoder, up to 30s) |
| Video | As frame sequences (up to 60s) | As frame sequences |
The "multi lingual/audio/image/video" claim is partially wrong for 26B-A4B. Audio requires the smaller E2B/E4B models. "E" stands for "effective parameters" (not e2b.dev sandbox).
140+ languages, 35+ with out-of-the-box quality. Japanese included. 256K context.
Quantizations That Fit
| Quant | Size | Fits 12GB? |
|---|---|---|
| IQ2_XXS | 9.66 GB | Yes |
| UD-IQ2_M | 9.97-10.7 GB | Yes |
| UD-IQ3_XXS | 11.2 GB | Tight |
| UD-IQ4_XS | 13.4 GB | No — needs CPU offload |
PR #21309 merged support but open issues report: image processing failures, CUDA mmproj crashes, large image assertion errors, very long generation latency. Wait for stabilization.
Vision in llama.cpp (mmproj)
What are mmproj files?
The multimodal projector (mmproj) is a separate GGUF file that bridges visual input to the language model. It projects image features (from SigLIP or ViT) into the LLM's token embedding space.
Key Flags
| Flag | Effect |
|---|---|
--mmproj file.gguf | Load a custom projector file |
--no-mmproj | Disable vision — saves VRAM when text-only |
--no-mmproj-offload | Keep projector on CPU (saves GPU VRAM at cost of speed) |
Models with Working Vision in llama.cpp
| Model | Sizes |
|---|---|
| Gemma 3 | 4B, 12B, 27B |
| Gemma 4 | 26B-A4B (merged but buggy) |
| Qwen 2 VL | 2B, 7B |
| Qwen 2.5 VL | 3B, 7B, 32B, 72B |
| InternVL 2.5 | 1B, 4B |
| InternVL 3 | 1B, 2B, 8B, 14B |
| SmolVLM | 256M, 500M, 2.2B |
| Pixtral | 12B |
| Moondream2 | ~1.4B |
Qwen3.5 is NOT yet listed. Audio models: Ultravox 0.5 (1B, 8B), Voxtral Mini 3B, Qwen2.5 Omni (3B, 7B).
Speech-to-Text
Parakeet TDT 0.6B v2 (English)
HuggingFace — NVIDIA's FastConformer-TDT architecture
- 16kHz mono audio, up to 24 minutes per segment
- Automatic punctuation, capitalization, word-level timestamps
- English only — no Japanese
- v3 adds 25 European languages (still no Japanese)
pip install nemo_toolkit[asr]
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
output = model.transcribe(['audio.wav'], timestamps=True)
print(output[0].text)
Parakeet does not support Japanese in any version. Use Whisper large-v3 (~3GB VRAM, 99 languages including Japanese). Competitive English WER (~1.8% on LibriSpeech clean) but Parakeet is faster on NVIDIA GPUs.
LFM2.5-VL-450M (Liquid AI)
Blog | 450M param vision-language model
- 8 languages including Japanese
- OCRBench: 684 (mediocre vs Qwen's 850+)
- Best for: object detection, bounding boxes (RefCOCO-M: 81.28)
- Speed: 242ms on Jetson Orin for 512x512 images
- Q4 would be ~0.25GB — trivial VRAM
- Gated model — apply for access on HuggingFace
Not recommended for rich OCR or description — a 3-4B VLM produces much better output. Useful only for lightweight structured vision tasks.
Tesseract vs VLM-based OCR
From r/LocalLLaMA discussion:
| Tesseract | VLM-based | |
|---|---|---|
| Clean printed text | Good | Overkill |
| Complex layouts (manga) | Poor | Excellent |
| Handwriting | Poor | Good |
| Scene text (signs) | Poor | Good |
| Vertical Japanese | Poor | Good |
| Speed | Fast | Slower |
| Context understanding | None | Full |
For Japanese manga/comics: VLM-based OCR is strongly preferred. Text appears in vertical columns, speech bubbles, and artistic fonts that Tesseract handles poorly.
Recommended Setup
Daily OCR Pipeline (Gallery/Manga app)
- PaddleOCR PP-OCRv5 for fast pre-screening (instant, CPU)
- Qwen 2.5 VL 3B for complex images needing context (llama.cpp, ~6GB VRAM)
- Both can run simultaneously — PaddleOCR on CPU, Qwen on GPU
Voice Pipeline
- Parakeet TDT 0.6B v2 for English STT (2GB VRAM)
- Whisper large-v3 for Japanese STT (3GB VRAM)
- Feed transcription to Qwen3.5-4B for processing (2.2GB VRAM)
- Total: ~7GB VRAM — fits comfortably with room for context