Capability · Framework — rag

ColPali

ColPali changed how many teams think about document RAG. Instead of OCRing pages, chunking, and embedding text, ColPali embeds each page image with a late-interaction vision model (PaliGemma fine-tuned with a ColBERT-style head). Retrieval works directly on page pixels, preserving tables, figures, and layout with state-of-the-art accuracy on the ViDoRe benchmark.
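"Late interaction" here means each page is stored as a bag of patch-level vectors rather than one pooled vector, and a query scores a page by taking, for each query-token vector, the maximum similarity against any patch vector, then summing those maxima (ColBERT's MaxSim rule). A minimal numpy sketch of that scoring rule, using toy vectors rather than real model output:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """ColBERT-style late-interaction score.

    query_vecs: (n_query_tokens, dim) query token embeddings
    page_vecs:  (n_patches, dim)      page patch embeddings
    For each query token, take its best-matching patch (max dot
    product), then sum those maxima over all query tokens.
    """
    sim = query_vecs @ page_vecs.T       # (n_query_tokens, n_patches)
    return float(sim.max(axis=1).sum())  # max over patches, sum over tokens

# Toy example: a 2-token query against two 3-patch pages.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 0.2]])
page_b = np.array([[0.1, 0.1], [0.2, 0.0], [0.0, 0.3]])
print(maxsim(q, page_a))  # 1.2 — best match for the query
print(maxsim(q, page_b))  # 0.5
```

Because the max runs over all patches, a query token can latch onto a table cell or figure region anywhere on the page, which is why no OCR or chunking step is needed.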

Framework facts

Category
rag
Language
Python
License
MIT (code); model weights under Gemma licence
Repository
https://github.com/illuin-tech/colpali

Install

pip install colpali-engine

Quickstart

from colpali_engine.models import ColPali, ColPaliProcessor
from PIL import Image
import torch

# Load the model in bfloat16 on the GPU; the processor handles both
# image and query preprocessing.
model = ColPali.from_pretrained('vidore/colpali-v1.3', torch_dtype=torch.bfloat16, device_map='cuda')
proc = ColPaliProcessor.from_pretrained('vidore/colpali-v1.3')

images = [Image.open('page1.png'), Image.open('page2.png')]
queries = ['what is the total revenue?']

with torch.no_grad():
    # Each forward pass returns multi-vector embeddings: one vector
    # per image patch (pages) or per token (queries).
    img_emb = model(**proc.process_images(images).to(model.device))
    q_emb = model(**proc.process_queries(queries).to(model.device))

# Late-interaction (MaxSim) scores, shape (n_queries, n_pages).
print(proc.score_multi_vector(q_emb, img_emb))

Alternatives

  • Nomic Embed Vision — image + text embeddings
  • Classic RAG — Unstructured + text embeddings
  • VLM end-to-end — pass the full document to Claude / Gemini
  • ColBERT v2 — text-only late interaction

Frequently asked questions

Do I still need a text RAG pipeline?

Often no — ColPali can replace the ingestion-chunk-embed stack for document-heavy corpora. You still need a vector store that can handle late-interaction scoring (Vespa, Qdrant with multivector, or custom).
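A store that "handles late-interaction scoring" typically does it in two stages: a cheap first pass over one pooled vector per page (what an ANN index would hold), then exact MaxSim re-scoring of the survivors. A sketch of that pattern over an in-memory corpus — the names `search`, `maxsim`, and the mean-pooling first stage are illustrative assumptions, not colpali-engine or Vespa/Qdrant API:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    # Sum over query tokens of the best-matching patch similarity.
    return float((query_vecs @ page_vecs.T).max(axis=1).sum())

def search(query_vecs, pages, k=10, prefilter=100):
    """Two-stage late-interaction search over an in-memory corpus.

    pages: list of (page_id, patch_matrix) pairs.
    Stage 1 ranks by dot product against each page's mean-pooled
    vector; stage 2 re-scores the top `prefilter` candidates with
    exact MaxSim.
    """
    q_pooled = query_vecs.mean(axis=0)
    pooled = [(pid, patches, float(patches.mean(axis=0) @ q_pooled))
              for pid, patches in pages]
    pooled.sort(key=lambda t: t[2], reverse=True)
    exact = [(pid, maxsim(query_vecs, patches))
             for pid, patches, _ in pooled[:prefilter]]
    return sorted(exact, key=lambda t: t[1], reverse=True)[:k]

# Toy corpus: the 'revenue' page has patches aligned with the query.
q = np.eye(2)
pages = [
    ("intro", np.array([[0.1, 0.0], [0.0, 0.1]])),
    ("revenue", np.array([[1.0, 0.0], [0.0, 1.0]])),
]
print(search(q, pages, k=1))  # [('revenue', 2.0)]
```

Production stores replace stage 1 with a real ANN index, but the shape of the problem — coarse candidate generation, then exact multi-vector re-ranking — is the same.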

What hardware do I need?

A single 16-24 GB GPU is enough for indexing and inference with bfloat16. CPU inference works for tiny corpora but is impractical at scale.
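Storage is worth estimating too, since each indexed page becomes roughly a thousand patch vectors rather than one. A back-of-envelope calculation, assuming ~1,030 vectors per page at dimension 128 in bfloat16 — figures consistent with the colpali-v1.x setup, but check the model card for your exact variant:

```python
# Rough index-size estimate for a late-interaction page index.
vectors_per_page = 1030  # assumption: ~32x32 image patches + special tokens
dim = 128                # ColPali projects embeddings to 128 dimensions
bytes_per_value = 2      # bfloat16

bytes_per_page = vectors_per_page * dim * bytes_per_value
print(f"{bytes_per_page / 1024:.0f} KiB per page")                     # 258 KiB
print(f"{100_000 * bytes_per_page / 1024**3:.1f} GiB per 100k pages")  # 24.6 GiB
```

At tens of gigabytes per 100k pages, larger corpora usually need vector compression (e.g. scalar or binary quantization) on top of the raw embeddings.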

Sources

  1. ColPali paper — accessed 2026-04-20
  2. ColPali GitHub — accessed 2026-04-20