| liteparse | pdf | Faster than LlamaParse, runs locally |
| kreuzberg | framework | Extraction and orchestration framework that supports multiple programming languages and document formats; OCR backends include Tesseract, PaddleOCR, and EasyOCR |
| GLM-OCR | ocr | From Zhipu |
| langchain-paddleocr | ocr | From Baidu. Install with `pip install langchain-paddleocr`; good performance; can run against a cloud backend |
| OpenDataLoader | pdf to markdown parsing | Java; free and open source; runs entirely locally on CPU with no GPU required. The original post claims "100+ pages/second", so thousands of pages can be converted to Markdown in minutes, which suits feeding a local LLM. Deployed with Docker and called over HTTP. |
| [embed-pdf-viewer](https://github.com/embedpdf/embed-pdf-viewer) | pdf preview | Relatively new |
| tika | document extraction | Good Java project, parses multiple document types |
| BabelDOC | ocr | Paper translation tool |
| OpenDoc-0.1B | ocr | Ultra-lightweight document parsing system, open-sourced by the Fudan University Vision and Learning Lab |
| imagepdf2txt | ocr | Handles image-based PDFs; uses Paddle |
| OCRmyPDF | ocr | Uses Tesseract OCR; command-line tool that supports batch processing |
| MinerU | pdf parsing | Can parse LaTeX formulas; requires a GPU with at least 16 GB of memory |
| PDFMathTranslate | pdf translation tool | |
| zerox | ocr | Python OCR tool |
| Stirling-PDF | format conversion | Converts PDFs to and from multiple formats; supports private (self-hosted) deployment |
| ParseStudio | pdf export | Integrates multiple tools |
| itext-dotnet | pdf | .NET PDF toolkit |