DeepSeek-OCR-2 Goes Hardcore Open Source!
DeepSeek-OCR-2 replaces the CLIP visual encoder with a Qwen2-0.5B LLM architecture and introduces Visual Causal Flow for intelligent document reading. The open-source OCR model achieves 91.09% on OmniDocBench while processing 200K pages.
DeepSeek has released DeepSeek-OCR-2, completely replacing the traditional CLIP visual encoder with an LLM architecture. This is a more radical, more fundamental approach.
If Kimi K2.5 has pushed the task of “understanding the interface → writing code” to a practical level, then DeepSeek-OCR-2 is tackling an even more foundational question:
Can AI “read documents” like a human?
The answer is: Yes—and this time, it’s genuinely different.
Project Background
We all know that CLIP excels at “getting the big picture”—it can instantly recognize “this is a photo of a cat,” but it struggles with “sequential fine-grained reading.”
This is why traditional models frequently produce scrambled reading order when handling complex documents, such as multi-column layouts or nested tables.
CLIP processes images like this: a quick global scan to capture overall semantics.
But true OCR requires reading block by block, just like a human.
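The difference between these two reading styles comes down to the attention mask. Here is a toy sketch of that contrast (this is an illustration of the general idea, not DeepSeek's actual implementation; the `attention_mask` helper is hypothetical): a CLIP-style encoder lets every image patch attend to every other patch in one global pass, while a causal, LLM-style encoder lets each patch attend only to the patches that come before it, enforcing a reading order.

```python
import numpy as np

def attention_mask(n_patches: int, causal: bool = False) -> np.ndarray:
    """Return a boolean attention mask over image patches.

    causal=False: full mask -- every patch sees every patch
                  (CLIP-style global scan).
    causal=True:  lower-triangular mask -- patch i sees only
                  patches 0..i (sequential, order-aware reading).
    """
    if causal:
        return np.tril(np.ones((n_patches, n_patches), dtype=bool))
    return np.ones((n_patches, n_patches), dtype=bool)

# With 4 patches: the global mask has 16 allowed pairs,
# the causal mask only 10 (4 + 3 + 2 + 1).
full = attention_mask(4)
causal = attention_mask(4, causal=True)
```

The causal mask is what forces a model to commit to an ordering of the patches, which is exactly the property a multi-column or nested-table layout stresses.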