Discover tools for parsing PDFs, Word documents, spreadsheets, and extracting structured data from unstructured files.
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.
Convert documentation websites, GitHub repositories, and PDFs into Claude AI skills with automatic conflict detection
Office document creation and editing skills for Claude Code - PPTX, DOCX, XLSX, and PDF workflows with automation support
A comprehensive collection of Claude Code skills for document generation, styling, and manipulation. Includes Document Polisher with 10 premium brand themes (McKinsey, Deloitte, Stripe, Apple, Notion, etc.) plus docx, pdf, xlsx, pptx skills.
📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage
An Agent Skill and Dify plugin to transform Markdown to files of DOCX, PPTX, XLSX, PNG, PDF, Mermaid, HTML, MD, CSV, JSON, XML.
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
For GenAI and LLM usage. This package converts codebase (folder structure with files) into a single text file or a Microsoft Word document (.docx), preserving folder structure and file contents. The tool extracts file contents from various file types, including text files, documents, and more, while retaining their formatting for easy readability.
A privacy-first, self-hosted, fully open-source personal knowledge management tool, written in TypeScript and Go.
| Tool | Stars | Language | License | Score |
|---|---|---|---|---|
| ExtractThinker | ★ 1.5k | Python | Apache-2.0 | 31 |
| pdfmux | ★ 47 | Python | MIT | 42 |
| Skill_Seekers | ★ 11.4k | Python | MIT | 51 |
| claude-office-skills | ★ 348 | Python | — | 32 |
| claude-code-polished-documents-skills | ★ 51 | Python | — | 33 |
| pdf-reader-mcp | ★ 580 | TypeScript | MIT | 46 |
| markdown-exporter | ★ 189 | Python | Apache-2.0 | 41 |
| ragflow | ★ 76.5k | Python | Apache-2.0 | 53 |
| codebase_to_text | ★ 102 | Python | Apache-2.0 | 30 |
| siyuan | ★ 42.2k | TypeScript | AGPL-3.0 | 50 |
The tools featured first on this page are ExtractThinker, pdfmux, and Skill_Seekers. By composite score (based on GitHub stars, community activity, and code quality), the highest-ranked entries in the table are ragflow (53), Skill_Seekers (51), and siyuan (50).
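Most tools listed here are open source: 8 of the 10 declare an explicit license (MIT, Apache-2.0, or AGPL-3.0), making them free to use and modify under that license's terms. The two without a listed license are claude-office-skills and claude-code-polished-documents-skills.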
Consider your tech stack (language compatibility), project scale (stars indicate community trust), and specific features you need. Use the comparison table above to evaluate side by side.
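As a rough illustration of that side-by-side evaluation, the sketch below copies the table rows into Python, filters by language and license, and sorts by score. The `Tool` class and `shortlist` function are hypothetical names made up for this example, and star counts are approximated from the rounded figures in the table (for example, 1.5k becomes 1500).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tool:
    name: str
    stars: int               # approximate, expanded from the rounded table values
    language: str
    license: Optional[str]   # None where the table shows no license
    score: int


# Rows transcribed from the comparison table.
TOOLS = [
    Tool("ExtractThinker", 1500, "Python", "Apache-2.0", 31),
    Tool("pdfmux", 47, "Python", "MIT", 42),
    Tool("Skill_Seekers", 11400, "Python", "MIT", 51),
    Tool("claude-office-skills", 348, "Python", None, 32),
    Tool("claude-code-polished-documents-skills", 51, "Python", None, 33),
    Tool("pdf-reader-mcp", 580, "TypeScript", "MIT", 46),
    Tool("markdown-exporter", 189, "Python", "Apache-2.0", 41),
    Tool("ragflow", 76500, "Python", "Apache-2.0", 53),
    Tool("codebase_to_text", 102, "Python", "Apache-2.0", 30),
    Tool("siyuan", 42200, "TypeScript", "AGPL-3.0", 50),
]


def shortlist(
    tools: List[Tool],
    language: Optional[str] = None,
    require_license: bool = True,
) -> List[Tool]:
    """Filter by language and license presence, then sort by score, highest first."""
    picked = [
        t
        for t in tools
        if (language is None or t.language == language)
        and (not require_license or t.license is not None)
    ]
    return sorted(picked, key=lambda t: t.score, reverse=True)


if __name__ == "__main__":
    for tool in shortlist(TOOLS, language="Python"):
        print(f"{tool.score:>3}  {tool.name}  ({tool.license})")
```

Swapping `language="Python"` for `language="TypeScript"`, or passing `require_license=False`, shows how the shortlist changes with different constraints.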