Curiosity · Concept
Vision-Language Models (VLMs)
A Vision-Language Model (VLM) can read an image and answer questions about it, describe it, extract text, or reason over charts and documents. Most VLMs pair a vision encoder (usually a Vision Transformer trained with CLIP-style contrastive learning) with a language model via a projection layer that maps visual features into the language model's embedding space. GPT-4o, Claude 3.5 Sonnet, Gemini, Qwen-VL, and LLaVA are well-known examples.
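The encoder-projector-LLM pipeline can be sketched in a few lines of NumPy. This is a toy illustration, not any real model: the dimensions are made up, the projection is a random matrix rather than a learned layer, and the "LLM" step is omitted; it only shows how projected visual tokens and text tokens end up in one sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from any specific model).
num_patches, vision_dim, llm_dim = 256, 1024, 4096

# 1. Vision encoder output: one feature vector per image patch.
patch_features = rng.standard_normal((num_patches, vision_dim))

# 2. Projection layer maps vision features into the LLM's embedding space.
W_proj = rng.standard_normal((vision_dim, llm_dim)) / np.sqrt(vision_dim)
visual_tokens = patch_features @ W_proj            # shape (256, 4096)

# 3. Embeddings for a short text prompt, e.g. "Describe this image."
text_tokens = rng.standard_normal((5, llm_dim))

# 4. The LLM then attends over one interleaved sequence of both kinds of token.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (261, 4096)
```

In real systems the projector is trained (a linear layer or small MLP in LLaVA-style models), but the data flow is the same: image features become ordinary tokens from the LLM's point of view.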
Quick reference
- Proficiency
- Intermediate
- Also known as
- VLM, multimodal LLM, MLLM, image-text model
- Prerequisites
- Transformer architecture, Embeddings
Frequently asked questions
What is a Vision-Language Model?
It is a model that jointly processes images and text. You can show it a photo, a screenshot, a chart, or a PDF page and it will describe, answer questions, or extract information using natural language.
How does a VLM 'see' an image?
The image is split into patches (e.g. 14×14-pixel tiles), each patch is embedded by a vision encoder, and the resulting vectors are projected into the LLM's embedding space. The LLM then processes these visual tokens and text tokens as one sequence.
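The patch-splitting step can be shown concretely. This sketch assumes a 224×224 RGB image and 14×14 patches (the geometry used by CLIP's ViT-L/14); real models follow the split with a learned embedding rather than raw flattened pixels.

```python
import numpy as np

image = np.zeros((224, 224, 3))  # toy 224x224 RGB image
patch = 14                       # ViT-style patch size (assumption: 14x14 pixels)

# Split into non-overlapping 14x14 tiles, then flatten each tile to a vector.
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)  # (256, 588): a 16x16 grid of patches, each 14*14*3 values
```

So a single 224×224 image becomes 256 visual tokens before projection, which is why high-resolution inputs are expensive: token count grows with image area.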
What's CLIP and why does it matter?
CLIP (OpenAI, 2021) trains a vision encoder and a text encoder so that matching image-caption pairs end up close in a shared embedding space. That contrastively trained vision encoder is the starting point for most modern VLMs.
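The contrastive objective can be sketched with a symmetric cross-entropy over a similarity matrix, as in the CLIP paper. The embeddings below are random stand-ins for encoder outputs; the temperature 0.07 matches CLIP's initial value, but everything else is a minimal illustration.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, dim = 4, 32

# Stand-ins for encoder outputs on a batch of matched image-caption pairs.
img = l2norm(rng.standard_normal((batch, dim)))
txt = l2norm(rng.standard_normal((batch, dim)))

logits = img @ txt.T / 0.07   # cosine similarities scaled by a temperature
labels = np.arange(batch)     # image i's correct caption is caption i

def xent(logits, labels):
    # Row-wise softmax cross-entropy.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# Symmetric loss: classify the right caption per image, and vice versa.
loss = (xent(logits, labels) + xent(logits.T, labels)) / 2
print(float(loss))
```

Minimizing this loss pulls each matching image-caption pair together in the shared space while pushing mismatched pairs apart, which is what makes the resulting vision encoder such a useful starting point for VLMs.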
What are VLMs good and bad at today?
Strong: captioning, OCR, document QA, chart reading, UI understanding, coarse spatial relations. Weak: fine pixel-level localization, precise counting of many small objects, complex geometric reasoning, and any task where the image was downsampled below the resolution the answer requires (e.g. tiny text in a dense screenshot).
Sources
- Radford et al. — Learning Transferable Visual Models From Natural Language Supervision (CLIP) — accessed 2026-04-20
- Liu et al. — Visual Instruction Tuning (LLaVA) — accessed 2026-04-20
- Hugging Face — Vision Language Models explained — accessed 2026-04-20