PDF Processing Related Resources Independent Resource Development

| Title | Category | Remarks |
| --- | --- | --- |
| liteparse | pdf | Faster than LlamaParse, runs locally |
| kreuzberg | framework | Supports multiple programming languages and document formats for extraction and orchestration; backend supports Tesseract, PaddleOCR, EasyOCR |
| GLM-OCR | ocr | Zhipu |
| langchain-paddleocr | | Baidu; `pip install langchain-paddleocr`, good performance, can be cloud-based |
| OpenDataLoader | pdf to markdown parsing | Java, runs completely locally, uses only CPU (no GPU), free and open source. The original post states "100+ pages/second": thousands of pages of material converted to Markdown in minutes, well suited for feeding to a local LLM. Docker deployment, HTTP calls. |
| [embed-pdf-viewer](https://github.com/embedpdf/embed-pdf-viewer) | pdf preview | Relatively new |
| tika | document extraction | Good Java project, parses multiple document types |
| BabelDOC | ocr | Paper translation tool |
| OpenDoc-0.1B | ocr | Ultra-lightweight document parsing system open-sourced by the Fudan University Vision and Learning Lab |
| imagepdf2txt | ocr | Handles image-based PDFs, uses paddle |
| OCRmyPDF | ocr | Uses Tesseract OCR, supports command line, batch processing |
| MinerU | pdf parsing | Can parse LaTeX formulas, requires at least 16GB GPU |
| PDFMathTranslate | pdf | Translation tool |
| zerox | ocr | Python OCR tool |
| Stirling-PDF | format conversion | Multiple PDF format conversions, private deployment |
| ParseStudio | pdf export | Integrates multiple tools |
| itext-dotnet | pdf | .NET PDF toolkit |
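
The OCRmyPDF entry mentions command-line use and batch processing. A minimal sketch of what that could look like when driven from Python is shown here. It assumes the `ocrmypdf` CLI is installed and on `PATH`; the helper names (`build_ocr_command`, `plan_batch`, `run_batch`) and the directory names in the usage note are hypothetical and exist only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Build the argument list for one `ocrmypdf` run (input PDF -> searchable PDF)."""
    # `-l` selects the Tesseract language pack; "eng" is only an assumed default.
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def plan_batch(in_dir: Path, out_dir: Path, language: str = "eng") -> list[list[str]]:
    """Return one command per PDF found in in_dir, in sorted order."""
    return [
        build_ocr_command(pdf, out_dir / pdf.name, language)
        for pdf in sorted(in_dir.glob("*.pdf"))
    ]


def run_batch(in_dir: Path, out_dir: Path, language: str = "eng") -> None:
    """Run OCR over every PDF in in_dir, writing results to out_dir."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in plan_batch(in_dir, out_dir, language):
        # check=True stops the batch on the first file that fails.
        subprocess.run(cmd, check=True)
```

Calling `run_batch(Path("scans"), Path("searchable"))` would process every PDF in a hypothetical `scans` folder. Keeping command construction separate from execution makes the plan easy to inspect before any file is touched.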