agentic-doc: Extract Structured Data from Complex PDFs in Python (100+ Pages Supported)
Extract structured data from complex PDFs with the agentic-doc Python library. Supports 100+ page documents, batch processing, and automatic retries.
LandingAI’s Agentic Document Extraction API extracts structured data from visually complex documents (those containing tables, images, and charts) and returns hierarchical JSON with precise element locations.
The agentic-doc Python library wraps this API and adds the following features:
Long document support – Process 100+ page PDFs in a single call
Automatic retry/pagination – Handle concurrency, timeouts, and rate limits
Utility tools – Bounding box snippets, visualization debugger, etc.
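As a rough illustration of what "wrapping the API" looks like in practice, the sketch below calls the library's `parse_documents` helper and collects the Markdown for each input. It assumes the helper is importable from `agentic_doc.parse` and that each result exposes a `markdown` attribute; both should be checked against the version you install. The wrapper name `extract_markdown` is made up for this example.

```python
def extract_markdown(paths):
    """Parse each document and return its Markdown rendering, one string per input."""
    # Imported lazily so this module still loads when agentic-doc is not installed.
    # Assumption: parse_documents lives in agentic_doc.parse.
    from agentic_doc.parse import parse_documents

    results = parse_documents(paths)
    # Assumption: each parsed result carries a `markdown` attribute.
    return [doc.markdown for doc in results]
```

Calling `extract_markdown(["report.pdf", "invoice.png"])` (hypothetical file names) would send both files through the API, which requires a LandingAI API key to be configured for the library.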
Features
Simple installation: pip install agentic-doc – no other packages to install by hand
Flexible input: Parse PDFs of any length, single images, or URLs
Long document ready: Automatically split and process 1000+ page PDFs in parallel, then merge results
Structured output: Returns hierarchical JSON and directly renderable Markdown
Visual grounding: Optional bounding box snippets and full-page visualizations
Batch parallel processing: Pass a list of documents; the library manages threads and rate limits (tuned via BATCH_SIZE and MAX_WORKERS)
High fault tolerance: Exponential backoff retries for HTTP 408/429/502/503/504 responses and rate-limit hits
Ready-to-use helper functions: parse_documents, parse_and_save_documents, parse_and_save_document
Configuration via environment variables/.env: Adjust parallelism, logging style, retry limits without code changes
Native API support: Advanced users can still call the REST endpoints directly
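To make the fault-tolerance feature concrete, here is a minimal, generic sketch of exponential backoff over the retryable status codes listed above. It is not the library's own implementation: `HTTPStatusError`, `call_with_backoff`, and the default values for `max_retries`, `base_delay`, and `max_delay` are all assumptions chosen for illustration.

```python
import random
import time

# Status codes the feature list describes as retryable.
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


class HTTPStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(request, max_retries=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request() and retry retryable failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except HTTPStatusError as err:
            # Give up on non-retryable codes or once retries are exhausted.
            if err.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            # The delay doubles on each attempt, is capped at max_delay,
            # and is scaled by random jitter so clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. In a real client, `request` would be a zero-argument callable that performs the HTTP call and raises on an error status.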