ColPali
ColPali changed how many teams think about document RAG. Instead of OCRing pages, chunking, and embedding text, ColPali embeds each page image with a late-interaction vision model (PaliGemma fine-tuned with a ColBERT-style head). Retrieval works directly on page pixels, preserving tables, figures, and layout with state-of-the-art accuracy on the ViDoRe benchmark.
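The late-interaction score behind ColBERT-style retrieval is MaxSim: every query token embedding is matched against its best-scoring page-patch embedding, and those maxima are summed into one relevance score per page. A minimal NumPy sketch with toy random embeddings (not real ColPali outputs):

```python
import numpy as np

def maxsim(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """Late-interaction (ColBERT/ColPali-style) relevance score.

    q_emb: (n_query_tokens, dim) query token embeddings
    d_emb: (n_doc_tokens, dim)   page patch embeddings
    """
    sim = q_emb @ d_emb.T                # (n_q, n_d) token-pair dot products
    return float(sim.max(axis=1).sum())  # best patch per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))                        # 8 query tokens
pages = [rng.normal(size=(1024, 128)) for _ in range(3)]  # 3 pages of patches
scores = [maxsim(q, p) for p in pages]
print(scores)  # one relevance score per page
```

Because each token keeps its own vector, fine-grained matches (a number in a table cell, a label in a figure) survive into the score instead of being averaged away as in single-vector embeddings.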
Framework facts
- Category
- rag
- Language
- Python
- License
- MIT (code); model weights under Gemma licence
- Repository
- https://github.com/illuin-tech/colpali
Install

```shell
pip install colpali-engine
```

Quickstart
```python
from PIL import Image
import torch

from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained(
    "vidore/colpali-v1.3",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
proc = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3")

images = [Image.open("page1.png"), Image.open("page2.png")]
queries = ["what is the total revenue?"]

with torch.no_grad():
    # Embed pages and queries into multivector representations.
    img_emb = model(**proc.process_images(images).to(model.device))
    q_emb = model(**proc.process_queries(queries).to(model.device))

# Late-interaction (MaxSim) scores: one row per query, one column per page.
print(proc.score_multi_vector(q_emb, img_emb))
```

Alternatives
- Nomic Embed Vision — image + text embeddings
- Classic RAG: Unstructured + text embeddings
- VLM end-to-end: pass full doc to Claude / Gemini
- ColBERT v2 — text-only late interaction
Frequently asked questions
Do I still need a text RAG pipeline?
Often no — ColPali can replace the ingestion-chunk-embed stack for document-heavy corpora. You still need a vector store that can handle late-interaction scoring (Vespa, Qdrant with multivector, or custom).
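Absent a multivector-aware store, a tiny brute-force index shows what "late-interaction scoring" asks of the database: one matrix of patch vectors per page, scored with MaxSim at query time. A toy NumPy sketch; the class and method names are illustrative, not any real library's API:

```python
import numpy as np

class BruteForceLateInteractionIndex:
    """Toy in-memory index: one (n_patches, dim) matrix per page."""

    def __init__(self):
        self.pages: list[np.ndarray] = []

    def add(self, page_emb: np.ndarray) -> None:
        self.pages.append(page_emb)

    def search(self, q_emb: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
        # MaxSim: best patch per query token, summed over query tokens.
        scores = [float((q_emb @ p.T).max(axis=1).sum()) for p in self.pages]
        top = np.argsort(scores)[::-1][:k]
        return [(int(i), scores[i]) for i in top]

rng = np.random.default_rng(1)
index = BruteForceLateInteractionIndex()
for _ in range(5):
    index.add(rng.normal(size=(64, 128)))   # 5 pages, 64 patches each
print(index.search(rng.normal(size=(8, 128)), k=2))  # top-2 (page_id, score)
```

Production stores avoid this linear scan by pre-filtering with a cheap first stage (single-vector ANN or token-level retrieval) and reserving full MaxSim for re-ranking the shortlist.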
What hardware do I need?
A single 16-24 GB GPU is enough for indexing and inference with bfloat16. CPU inference works for tiny corpora but is impractical at scale.
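The "impractical at scale" caveat is as much about index size as compute: multivector embeddings are heavy. A back-of-envelope sketch, assuming roughly 1,030 128-dimensional bfloat16 vectors per page (approximate ColPali defaults; check your model's actual output shape):

```python
# Back-of-envelope index size for ColPali-style page embeddings.
# Assumed (not authoritative): ~1,030 multivectors/page, dim 128, bfloat16.
vectors_per_page = 1030
dim = 128
bytes_per_value = 2  # bfloat16

bytes_per_page = vectors_per_page * dim * bytes_per_value
print(f"per page: {bytes_per_page / 1024:.0f} KiB")
print(f"100k pages: {100_000 * bytes_per_page / 2**30:.1f} GiB")
```

That is roughly two orders of magnitude more storage per page than a single 1,024-dim text embedding, which is why pooling, quantization, or binary compression of the multivectors is common at corpus scale.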
Sources
- ColPali paper — accessed 2026-04-20
- ColPali GitHub — accessed 2026-04-20