Unstructured.io
Unstructured is the de-facto open-source ETL layer for RAG pipelines. Its `partition` functions recognise 30+ file types, emit a common element model (titles, tables, list items, narrative text), and integrate with LangChain, LlamaIndex, and Haystack loaders. A hosted API and Enterprise Platform provide a managed path for teams that don't want to run the models themselves.
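As a rough illustration of how a downstream step might consume the common element model, the sketch below counts elements by category and groups them into title-led sections. It assumes only that each element exposes `category` and `text` attributes, as in the quickstart; `StubElement`, `summarize`, and `section_chunks` are hypothetical names invented for this example so it runs without the library installed, and the category strings are illustrative values.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StubElement:
    """Hypothetical stand-in exposing the two attributes the quickstart reads."""
    category: str
    text: str


def summarize(elements):
    """Count how many elements fall into each category."""
    return Counter(el.category for el in elements)


def section_chunks(elements):
    """Group elements into (title, body) pairs, starting a new pair at each Title."""
    chunks = []
    title, body = None, []
    for el in elements:
        if el.category == "Title":
            # Flush the section collected so far before starting a new one.
            if title is not None or body:
                chunks.append((title, " ".join(body)))
            title, body = el.text, []
        else:
            body.append(el.text)
    if title is not None or body:
        chunks.append((title, " ".join(body)))
    return chunks


# Illustrative data only; real elements would come from a partition function.
sample = [
    StubElement("Title", "Risk Factors"),
    StubElement("NarrativeText", "Our business depends on suppliers."),
    StubElement("ListItem", "Supply chain disruption"),
    StubElement("Title", "Financial Statements"),
    StubElement("Table", "Revenue 2023 2024"),
]

print(summarize(sample))
for title, body in section_chunks(sample):
    print(title, "->", body)
```

Because the functions rely only on attribute access, the same code should work on the list returned by a partition call, provided those elements carry the same two attributes.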
Framework facts
- Category: rag
- Language: Python
- License: Apache-2.0 (core) / commercial (Platform)
- Repository: https://github.com/Unstructured-IO/unstructured
Install
pip install 'unstructured[pdf]'
# or, to pull in the extras for every supported document type:
pip install 'unstructured[all-docs]'

Quickstart
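from unstructured.partition.pdf import partition_pdf

# Partition a PDF into typed elements (titles, tables, narrative text, ...)
elements = partition_pdf(
    filename='10k.pdf',
    strategy='hi_res',           # layout-aware model
    infer_table_structure=True,  # keep table structure for table elements
)

# Print the category and the first 80 characters of the first five elements
for el in elements[:5]:
    print(el.category, '->', el.text[:80])

Alternatives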
- LlamaParse — hosted, strong on slides/tables
- Docling — IBM open-source, HTML-first
- PyMuPDF — raw PDF extraction, no layout model
- Marker — academic-quality PDF → Markdown
Frequently asked questions
Do I need a GPU?
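No. The `hi_res` PDF strategy runs a layout model and benefits from a GPU, but it also runs on CPU, only more slowly. The `fast` strategy extracts the embedded text without a model, is CPU-only, and is sufficient for text-heavy PDFs.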
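One way to encode the choice between the two strategies is a small helper that picks `fast` unless the document needs layout analysis. This is a sketch under stated assumptions: `choose_strategy` and `partition_with_strategy` are hypothetical helpers, and the two boolean inputs are heuristics you would have to supply yourself; only `partition_pdf` and its `filename`, `strategy`, and `infer_table_structure` parameters come from the quickstart.

```python
def choose_strategy(has_text_layer: bool, needs_tables: bool) -> str:
    """Return a partition strategy name (hypothetical heuristic).

    - Scanned PDFs without an embedded text layer need the layout model.
    - Table extraction also relies on the layout model.
    - Everything else can use the cheaper CPU-only path.
    """
    if not has_text_layer or needs_tables:
        return "hi_res"
    return "fast"


def partition_with_strategy(path: str, has_text_layer: bool, needs_tables: bool):
    """Partition a PDF with the chosen strategy.

    The import is deferred so the helper above can be used and tested
    without the library installed; this function is not called below.
    """
    from unstructured.partition.pdf import partition_pdf

    strategy = choose_strategy(has_text_layer, needs_tables)
    return partition_pdf(
        filename=path,
        strategy=strategy,
        infer_table_structure=needs_tables,
    )


print(choose_strategy(has_text_layer=True, needs_tables=False))
print(choose_strategy(has_text_layer=True, needs_tables=True))
print(choose_strategy(has_text_layer=False, needs_tables=False))
```

Keeping the decision in a pure function makes it easy to log which documents were sent down the slower path and to adjust the heuristic later.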
How does Unstructured compare to LlamaParse?
Unstructured's core library is open-source (Apache-2.0) and self-hostable. LlamaParse is hosted-only but often wins on complex slides and nested tables. Many pipelines use both: Unstructured for most documents, LlamaParse for edge cases.
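A pipeline that uses both parsers needs some routing rule. The sketch below shows one possible shape for it; the function name, the suffix set, and the `has_nested_tables` flag are all assumptions made for illustration, and the returned strings are just labels, not calls into either product.

```python
from pathlib import Path

# Hypothetical set of suffixes treated as slide decks.
SLIDE_SUFFIXES = {".ppt", ".pptx"}


def route_parser(path: str, has_nested_tables: bool = False) -> str:
    """Return the label of the parser to use for a document.

    Slides and documents flagged as having nested tables go to the
    hosted parser; everything else stays on the self-hosted path.
    """
    suffix = Path(path).suffix.lower()
    if suffix in SLIDE_SUFFIXES or has_nested_tables:
        return "llamaparse"
    return "unstructured"


print(route_parser("10k.pdf"))
print(route_parser("board-deck.PPTX"))
print(route_parser("pricing.pdf", has_nested_tables=True))
```

In practice the flag would come from a cheap pre-check or from a failed first pass, so that only the edge cases incur the hosted call.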
Sources
- Unstructured — docs — accessed 2026-04-20
- Unstructured GitHub — accessed 2026-04-20