Unstructured.io

Unstructured is a widely used open-source ETL layer for RAG pipelines. Its `partition` functions handle more than 30 file types and emit a common element model (titles, tables, list items, narrative text) that the LangChain, LlamaIndex, and Haystack loaders can consume. A hosted API and an Enterprise Platform offer a managed path for teams that don't want to run the models themselves.

Framework facts

Category: rag
Language: Python
License: Apache-2.0 (core) / commercial (Platform)
Repository: https://github.com/Unstructured-IO/unstructured

Install

pip install 'unstructured[pdf]'
# or the extras for every supported document type:
pip install 'unstructured[all-docs]'

Quickstart

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename='10k.pdf',
    strategy='hi_res',           # layout-aware model; slower than 'fast'
    infer_table_structure=True,  # keep table structure in element metadata
)
for el in elements[:5]:
    print(el.category, '->', el.text[:80])
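Because every element carries a category and its text, downstream code can regroup the flat element list into sections before chunking. The sketch below is only an illustration of that idea, not the library's own chunking logic: `FakeElement` is a hypothetical stand-in exposing the same two attributes the quickstart reads (`category` and `text`), and `group_by_title` is a helper name invented here.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a partitioned element (category + text only)."""
    category: str
    text: str


def group_by_title(elements):
    """Return (title, [body texts]) sections in document order.

    Text that appears before the first title is collected under title None.
    """
    sections = []
    current_title, current_body = None, []
    for el in elements:
        if el.category == "Title":
            # A new title closes the section collected so far, if any.
            if current_title is not None or current_body:
                sections.append((current_title, current_body))
            current_title, current_body = el.text, []
        else:
            current_body.append(el.text)
    # Flush the final section.
    if current_title is not None or current_body:
        sections.append((current_title, current_body))
    return sections


sample = [
    FakeElement("Title", "Risk Factors"),
    FakeElement("NarrativeText", "Our business depends on ..."),
    FakeElement("ListItem", "Supply chain disruption"),
    FakeElement("Title", "Legal Proceedings"),
    FakeElement("NarrativeText", "None."),
]
sections = group_by_title(sample)
for title, body in sections:
    print(title, "->", len(body), "body elements")
```

With real output from `partition_pdf`, the same function should work unchanged as long as the elements expose `category` and `text` as in the quickstart. For production chunking, check the library's own chunking helpers first.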

Alternatives

  • LlamaParse — hosted, strong on slides/tables
  • Docling — IBM open-source, layout-aware document conversion
  • PyMuPDF — raw PDF extraction, no layout model
  • Marker — academic-quality PDF → Markdown

Frequently asked questions

Do I need a GPU?

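Not strictly. The `hi_res` PDF strategy runs a layout model, which benefits from a GPU but also runs on CPU, more slowly. The `fast` strategy uses no layout model, runs on CPU only, and is sufficient for text-heavy PDFs that have an extractable text layer.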
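The choice between strategies can be written down as a small decision rule. The following is a minimal sketch under stated assumptions: `choose_strategy` and its two boolean inputs are hypothetical names introduced here, and `ocr_only` is a third strategy value, not mentioned above, assumed for scanned PDFs without a text layer. How you detect a text layer or decide that layout matters is left to the caller.

```python
def choose_strategy(has_text_layer: bool, needs_layout: bool) -> str:
    """Pick a value for the `strategy` argument of partition_pdf.

    has_text_layer: the PDF contains extractable text (not just page images).
    needs_layout:   tables or multi-column layout must be preserved.
    """
    if needs_layout:
        # Layout model: most accurate, slowest, benefits from a GPU.
        return "hi_res"
    if has_text_layer:
        # Direct text extraction: CPU-only and quick.
        return "fast"
    # Assumption: scanned pages with no text layer fall back to OCR.
    return "ocr_only"


for flags in [(True, False), (True, True), (False, False)]:
    print(flags, "->", choose_strategy(*flags))
```

The returned string can be passed as `strategy=` in the quickstart call, so cheap documents never pay for the layout model.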

How does Unstructured compare to LlamaParse?

The Unstructured core library is open-source (Apache-2.0) and self-hostable; the Platform is commercial. LlamaParse is hosted-only but often does better on complex slides and nested tables. Many pipelines use both: Unstructured for most documents, LlamaParse for edge cases.
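A two-parser pipeline of this kind usually comes down to a primary parser, a quality check, and a fallback. The sketch below shows only that routing pattern. Every name in it is hypothetical: `parse_with_fallback`, the stub parsers, and the `looks_complete` check stand in for real calls to Unstructured and LlamaParse and do not reflect either library's API.

```python
from typing import Callable, List, Tuple

Parser = Callable[[str], List[str]]


def parse_with_fallback(
    path: str,
    primary: Parser,
    fallback: Parser,
    is_good_enough: Callable[[List[str]], bool],
) -> Tuple[str, List[str]]:
    """Run the primary parser; use the fallback only if the check fails."""
    result = primary(path)
    if is_good_enough(result):
        return "primary", result
    return "fallback", fallback(path)


# Stub parsers standing in for the real services (hypothetical behaviour).
def stub_primary(path: str) -> List[str]:
    # Pretend slide decks come back empty from the primary parser.
    return [] if path.endswith(".pptx") else ["text from " + path]


def stub_fallback(path: str) -> List[str]:
    return ["fallback text from " + path]


def looks_complete(chunks: List[str]) -> bool:
    # Simplest possible check: at least one non-empty chunk.
    return any(chunk.strip() for chunk in chunks)


for doc in ["report.pdf", "deck.pptx"]:
    route, chunks = parse_with_fallback(doc, stub_primary, stub_fallback, looks_complete)
    print(doc, "->", route, chunks)
```

Keeping the quality check as a separate callable lets the edge-case definition (empty output, missing tables, low text density) change without touching the routing.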

Sources

  1. Unstructured — docs — accessed 2026-04-20
  2. Unstructured GitHub — accessed 2026-04-20