PDF Processing Related Resources Independent Resource Development

| Title | Category | Remarks |
| --- | --- | --- |
| liteparse | pdf | Faster than LlamaParse; runs locally |
| kreuzberg | framework | Supports multiple programming languages and document formats for extraction and orchestration; backend supports Tesseract, PaddleOCR, and EasyOCR |
| GLM-OCR | ocr | Zhipu |
| langchain-paddleocr | | Baidu; `pip install langchain-paddleocr`; good performance; can be cloud-based |
| OpenDataLoader | pdf to markdown parsing | Java; runs entirely locally on CPU only (no GPU); free and open source. The original post states "100+ pages/second", so thousands of pages of material can be converted to Markdown in minutes, which suits feeding a local LLM. Docker deployment, HTTP calls. |
| [embed-pdf-viewer](https://github.com/embedpdf/embed-pdf-viewer) | pdf preview | Relatively new |
| tika | document extraction | Solid Java project; parses multiple document types |
| BabelDOC | ocr | Paper translation tool |
| OpenDoc-0.1B | ocr | Ultra-lightweight document parsing system open-sourced by the Fudan University Vision and Learning Lab |
| imagepdf2txt | ocr | Handles image-based PDFs; uses Paddle |
| OCRmyPDF | ocr | Uses Tesseract OCR; supports command-line use and batch processing |
| MinerU | pdf parsing | Can parse LaTeX formulas; requires a GPU with at least 16 GB of memory |
| PDFMathTranslate | pdf | Translation tool |
| zerox | ocr | Python OCR tool |
| Stirling-PDF | format conversion | Multiple PDF format conversions; private deployment |
| ParseStudio | pdf export | Integrates multiple tools |
| itext-dotnet | pdf | .NET PDF toolkit |
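
The OCRmyPDF entry above mentions command-line use and batch processing. A minimal sketch of what a batch run might look like follows, assuming the `ocrmypdf` executable is installed and on `PATH`. The helper names `build_ocrmypdf_commands` and `run_batch`, the directory layout, and the default language are hypothetical choices made for illustration; `--skip-text` and `-l` are OCRmyPDF command-line options.

```python
from pathlib import Path
import subprocess


def build_ocrmypdf_commands(input_dir, output_dir, language="eng"):
    """Return one `ocrmypdf` argument list per *.pdf file in input_dir.

    Hypothetical helper: it only builds the commands, so the plan can be
    inspected (or logged) before anything is executed.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    commands = []
    for pdf in sorted(in_dir.glob("*.pdf")):
        commands.append([
            "ocrmypdf",
            "--skip-text",    # leave pages that already contain text untouched
            "-l", language,   # Tesseract language code, e.g. "eng"
            str(pdf),                  # input file
            str(out_dir / pdf.name),   # output file, same name in output_dir
        ])
    return commands


def run_batch(input_dir, output_dir, language="eng"):
    """Run the planned commands one by one, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_ocrmypdf_commands(input_dir, output_dir, language):
        subprocess.run(cmd, check=True)  # requires ocrmypdf to be installed
```

Calling `run_batch("scans", "searchable")` would then process every PDF in a `scans` folder sequentially. Separating command construction from execution is just one possible design; it keeps the file-matching logic checkable without OCRmyPDF or Tesseract installed.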