PDF Processing Resources (Independent Resource Development)

Title	Category	Remarks
liteparse	pdf	Faster than LlamaParse, runs locally
kreuzberg	framework	Extraction and orchestration framework supporting multiple programming languages and document formats; OCR backends include Tesseract, PaddleOCR, and EasyOCR
GLM-OCR	ocr	Zhipu
langchain-paddleocr	ocr	From Baidu. Install with pip install langchain-paddleocr; good performance; can be cloud-based
OpenDataLoader	pdf to markdown parsing	Java; runs completely locally on CPU only (no GPU); free and open source. The original post claims '100+ pages/second', so thousands of pages can be converted to Markdown in minutes, which suits feeding a local LLM. Deployed with Docker and called over HTTP.
[embed-pdf-viewer](https://github.com/embedpdf/embed-pdf-viewer)	pdf preview	Relatively new
tika	document extraction	Good Java project, parses multiple document types
BabelDOC	ocr	Paper translation tool
OpenDoc-0.1B	ocr	Ultra-lightweight document parsing system open-sourced by the Fudan University Vision and Learning Lab
imagepdf2txt	ocr	Handles image-based PDFs, uses paddle
OCRmyPDF	ocr	Uses Tesseract OCR, supports command line, batch processing
MinerU	pdf parsing	Can parse LaTeX formulas; requires a GPU with at least 16 GB of memory
PDFMathTranslate	pdf translation tool
zerox	ocr	Python OCR tool
Stirling-PDF	format conversion	Multiple PDF format conversions, private deployment
ParseStudio	pdf export	Integrates multiple tools
itext-dotnet	pdf	.NET PDF toolkit
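
The OCRmyPDF entry above mentions command-line use and batch processing. As a rough illustration only, the sketch below builds one `ocrmypdf` command per PDF in a folder and runs them in sequence. It assumes the `ocrmypdf` executable is installed and on the PATH; the helper names `build_ocrmypdf_commands` and `run_batch`, the folder names, and the default language `eng` are hypothetical choices made for this example, not part of any of the listed projects.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_commands(input_dir, output_dir, language="eng"):
    """Return one ocrmypdf argument list per *.pdf file directly inside input_dir.

    Hypothetical helper for illustration; nothing is executed here.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    commands = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        target = output_dir / pdf.name
        # -l selects the Tesseract language; --skip-text leaves pages that
        # already contain a text layer untouched instead of failing on them.
        commands.append(
            ["ocrmypdf", "-l", language, "--skip-text", str(pdf), str(target)]
        )
    return commands


def run_batch(input_dir, output_dir, language="eng"):
    """Run the commands one at a time and return the input paths that failed.

    Assumes the ocrmypdf executable is available on the PATH.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for cmd in build_ocrmypdf_commands(input_dir, output_dir, language):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[-2])  # second-to-last argument is the input PDF
    return failed
```

Calling `run_batch("scans", "searchable")` would write an OCR'd copy of every PDF in `scans` into `searchable` and return the list of inputs that could not be processed. Keeping command construction separate from execution makes it possible to inspect or log the commands before anything runs.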