Capability · Framework — inference / serving
vLLM
vLLM (from UC Berkeley's Sky Computing Lab) is a high-throughput, memory-efficient inference and serving engine for LLMs. Its PagedAttention and continuous batching algorithms deliver significantly higher throughput than naive serving. vLLM exposes an OpenAI-compatible HTTP server, supports tensor/pipeline parallelism, prefix caching, speculative decoding, quantised models (AWQ, GPTQ, FP8), and runs on NVIDIA, AMD, and Intel accelerators. It's the default inference engine for many open-model deployments.
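Continuous batching means the engine re-forms the batch at every decode step, admitting waiting requests as soon as a running one finishes, rather than waiting for the whole batch to drain. The following is a minimal toy simulation of that scheduling idea only; it is not vLLM's actual scheduler, and the function name and tuple format are invented for illustration.

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy iteration-level scheduler.

    requests: list of (name, tokens_to_generate) tuples.
    Returns the order in which requests complete.
    """
    waiting = deque(requests)
    running = {}          # name -> tokens still to generate
    finished = []
    while waiting or running:
        # Admit new requests into free slots at every step, not per batch.
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        # One decode step: each running request emits one token.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
                finished.append(name)
    return finished

order = continuous_batching([("a", 3), ("b", 1), ("c", 2)])
# "b" finishes after one step and frees its slot, so "c" starts
# immediately instead of waiting for "a" to drain.
print(order)  # ['b', 'a', 'c']
```

With naive static batching, "c" could not start until both "a" and "b" finished; iteration-level admission is what lets vLLM keep the GPU saturated under mixed-length workloads.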
Framework facts
- Category
- inference / serving
- Language
- Python / CUDA
- License
- Apache 2.0
- Repository
- https://github.com/vllm-project/vllm
Install
pip install vllm
Quickstart
from vllm import LLM, SamplingParams
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outs = llm.generate(['Capital of India?'], sampling)
print(outs[0].outputs[0].text)
# Or serve an OpenAI-compatible API:
# vllm serve meta-llama/Llama-3.1-8B-Instruct
Alternatives
- TGI — Hugging Face Text Generation Inference
- SGLang — structured generation-first serving
- TensorRT-LLM — NVIDIA's optimised runtime
- Ollama — local desktop-first runtime
Frequently asked questions
Is vLLM for training or inference?
vLLM is strictly an inference and serving engine. For fine-tuning, use frameworks like TRL, Unsloth, Axolotl, or Hugging Face Transformers; vLLM is commonly paired with them to serve the resulting model.
What is PagedAttention?
PagedAttention (Kwon et al., SOSP 2023) manages the KV cache in fixed-size pages like virtual memory, which eliminates fragmentation and enables very efficient batching. It's the main reason vLLM delivers 2-24x higher throughput than naive serving on many workloads.
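The bookkeeping idea can be sketched in a few lines: carve the KV cache into fixed-size blocks, give each sequence a block table mapping logical positions to physical blocks, and hand blocks out from a free list only as tokens arrive. This is an illustrative toy under invented names (`PagedKVCache`, etc.); vLLM's real block manager runs on the GPU and additionally handles prefix sharing, copy-on-write, and eviction.

```python
class PagedKVCache:
    """Toy page-table allocator for a KV cache (illustration only)."""

    def __init__(self, num_blocks, page_size=4):
        self.page_size = page_size              # tokens per physical block
        self.free = list(range(num_blocks))     # free physical block ids
        self.tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                       # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:             # last page full, or no page yet
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, page_size=4)
for _ in range(6):
    cache.append_token("req-0")
print(len(cache.tables["req-0"]))  # 2 blocks for 6 tokens (ceil(6/4))
```

Because no sequence ever needs a contiguous reservation sized for its worst-case length, internal waste is bounded by one partially filled page per sequence, which is what makes large, dynamic batches practical.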
Sources
- vLLM — docs — accessed 2026-04-20
- vLLM — GitHub — accessed 2026-04-20