Capability · Framework — inference / serving

vLLM

vLLM (from UC Berkeley's Sky Computing Lab) is a high-throughput, memory-efficient inference and serving engine for LLMs. Its PagedAttention and continuous batching algorithms deliver significantly higher throughput than naive serving. vLLM exposes an OpenAI-compatible HTTP server, supports tensor/pipeline parallelism, prefix caching, speculative decoding, quantised models (AWQ, GPTQ, FP8), and runs on NVIDIA, AMD, and Intel accelerators. It's the default inference engine for many open-model deployments.

Framework facts

Category
inference / serving
Language
Python / CUDA
License
Apache 2.0
Repository
https://github.com/vllm-project/vllm

Install

pip install vllm

Quickstart

from vllm import LLM, SamplingParams

# Downloads the weights from the Hugging Face Hub on first use
# (gated models require an accepted license and an HF token).
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outs = llm.generate(['Capital of India?'], sampling)  # batched offline generation
print(outs[0].outputs[0].text)

# Or serve an OpenAI-compatible API:
# vllm serve meta-llama/Llama-3.1-8B-Instruct
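When run with `vllm serve`, the server speaks the standard OpenAI chat-completions protocol, so any OpenAI SDK pointed at its base URL works. A minimal sketch of the request body it expects (the default host/port `localhost:8000` and the model name are assumptions about your deployment):

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions,
# the OpenAI-compatible endpoint exposed by `vllm serve`.
body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Capital of India?"}],
    "temperature": 0.7,
    "max_tokens": 256,
}
print(json.dumps(body, indent=2))
```

Equivalently, the official `openai` Python client can be used with `base_url="http://localhost:8000/v1"` and a dummy API key.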

Alternatives

  • TGI — Hugging Face Text Generation Inference
  • SGLang — structured generation-first serving
  • TensorRT-LLM — NVIDIA's optimised runtime
  • Ollama — local desktop-first runtime

Frequently asked questions

Is vLLM for training or inference?

vLLM is strictly inference / serving. For fine-tuning use frameworks like TRL, Unsloth, Axolotl, or Hugging Face Transformers. vLLM is often paired with them to serve the resulting model.

What is PagedAttention?

PagedAttention (Kwon et al., SOSP 2023) manages the KV cache in fixed-size blocks, analogous to pages in OS virtual memory, which eliminates memory fragmentation and lets many sequences be batched with near-zero wasted cache space. It is the main reason vLLM delivers 2-24x higher throughput than naive serving on many workloads.
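The core idea can be sketched with a toy page allocator. This is an illustrative simplification, not vLLM's actual implementation: the block size, class, and method names are all assumptions made here for clarity.

```python
BLOCK_SIZE = 16  # tokens per KV-cache page (illustrative; vLLM's block size is configurable)

class PagedKVAllocator:
    """Toy paged KV-cache bookkeeping in the spirit of PagedAttention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical pages
        self.tables = {}                     # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        """Allocate a new page only when the sequence crosses a page boundary."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # first token of a new page
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE]      # physical page holding this token's KV

    def free_seq(self, seq_id):
        """Return a finished sequence's pages to the pool, intact and reusable."""
        self.free.extend(self.tables.pop(seq_id))

alloc = PagedKVAllocator(num_blocks=8)
for pos in range(40):                        # a 40-token sequence needs ceil(40/16) pages
    alloc.append_token("seq-A", pos)
print(len(alloc.tables["seq-A"]))            # -> 3: pages need not be contiguous
alloc.free_seq("seq-A")
print(len(alloc.free))                       # -> 8: no fragmentation left behind
```

Because no sequence reserves a contiguous region up front, freed pages are immediately reusable by any other request, which is what makes dense continuous batching possible.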

Sources

  1. vLLM — docs — accessed 2026-04-20
  2. vLLM — GitHub — accessed 2026-04-20