New Open-Source OCR on GitHub: Fast Local PDF Parsing!
Fast local PDF parsing with OCR. Open-source, offline, and agent-friendly. No API keys needed.
Parsing PDF documents with online tools is not only slow but also prone to losing formatting information. When you need to quickly extract text with precise positional data locally, there aren’t many good options available.
The decent solutions that do exist mostly require an internet connection, rely on APIs, and involve uploading your data. That is unacceptable in enterprise scenarios such as contract review, financial report analysis, or medical records, where documents often cannot leave the internal network.
Recently, I discovered LiteParse, an open-source PDF parsing tool that emphasizes local execution, lightweight design, and high speed. It was developed by the LlamaIndex team.
It uses the PDFium engine for text extraction and includes built-in OCR (Optical Character Recognition). It works out of the box with no extra configuration required, and the parsing results preserve the exact positional information of the text.
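To make "positional information" concrete, here is a minimal Python sketch of what you can do once each text fragment comes with a bounding box. It does not call LiteParse and does not reflect its actual output format: the TextItem fields, the top-left coordinate origin, and the group_into_lines helper are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """One extracted text fragment with its bounding box (hypothetical schema)."""
    text: str
    x: float       # left edge, in page units
    y: float       # top edge, in page units (assumed origin: top-left, y grows downward)
    width: float
    height: float


def group_into_lines(items, y_tolerance=2.0):
    """Rebuild reading order from coordinates alone.

    Fragments whose top edge lies within y_tolerance of the first fragment
    of the current line are treated as the same line; each line is then
    ordered left to right.
    """
    lines = []
    for item in sorted(items, key=lambda i: (i.y, i.x)):
        if lines and abs(lines[-1][0].y - item.y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [
        " ".join(i.text for i in sorted(line, key=lambda i: i.x))
        for line in lines
    ]


# Fragments arrive in arbitrary order, as they often do in raw PDF text streams.
items = [
    TextItem("Total:", 50, 100.4, 40, 10),
    TextItem("Invoice", 50, 20, 60, 12),
    TextItem("$1,200", 300, 100, 45, 10),
    TextItem("#1042", 120, 20.5, 40, 12),
]

for line in group_into_lines(items):
    print(line)
```

Without coordinates, "Total:" and "$1,200" would be unrelated strings. With them, a downstream script or agent can tell that they sit on the same row, which is what makes tables, forms, and multi-column layouts recoverable.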
In addition to PDFs, it can automatically handle Word, Excel, PowerPoint, images, and other formats. It supports batch processing of entire folders and can generate page screenshots, making it easy for AI agents to extract visual information.



