Capability · Framework — inference / serving

vLLM

vLLM (from UC Berkeley's Sky Computing Lab) is a high-throughput, memory-efficient inference and serving engine for LLMs. Its PagedAttention and continuous batching algorithms deliver significantly higher throughput than naive serving. vLLM exposes an OpenAI-compatible HTTP server, supports tensor/pipeline parallelism, prefix caching, speculative decoding, quantised models (AWQ, GPTQ, FP8), and runs on NVIDIA, AMD, and Intel accelerators. It's the default inference engine for many open-model deployments.

Framework facts

Category
inference / serving
Language
Python / CUDA
License
Apache 2.0
Repository
https://github.com/vllm-project/vllm

Install

pip install vllm

Quickstart

from vllm import LLM, SamplingParams

# Downloads the weights from the Hugging Face Hub on first use
# (gated models require an accepted license and an HF token).
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outs = llm.generate(['Capital of India?'], sampling)  # batched offline generation
print(outs[0].outputs[0].text)

# Or serve an OpenAI-compatible API:
# vllm serve meta-llama/Llama-3.1-8B-Instruct
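When run with `vllm serve`, the server speaks the standard OpenAI chat-completions protocol, so any OpenAI SDK pointed at its base URL works. A minimal sketch of the request body it expects (the default host/port `localhost:8000` and the model name are assumptions about your deployment):

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions,
# the OpenAI-compatible endpoint exposed by `vllm serve`.
body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Capital of India?"}],
    "temperature": 0.7,
    "max_tokens": 256,
}
print(json.dumps(body, indent=2))
```

Equivalently, the official `openai` Python client can be used with `base_url="http://localhost:8000/v1"` and a dummy API key.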

Alternatives

  • TGI — Hugging Face Text Generation Inference
  • SGLang — structured generation-first serving
  • TensorRT-LLM — NVIDIA's optimised runtime
  • Ollama — local desktop-first runtime

Frequently asked questions

Is vLLM for training or inference?

vLLM is strictly inference / serving. For fine-tuning use frameworks like TRL, Unsloth, Axolotl, or Hugging Face Transformers. vLLM is often paired with them to serve the resulting model.

What is PagedAttention?

PagedAttention (Kwon et al., SOSP 2023) manages the KV cache in fixed-size blocks, analogous to pages in OS virtual memory, which eliminates memory fragmentation and lets many sequences be batched with near-zero wasted cache space. It is the main reason vLLM delivers 2-24x higher throughput than naive serving on many workloads.
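The core idea can be sketched with a toy page allocator. This is an illustrative simplification, not vLLM's actual implementation: the block size, class, and method names are all assumptions made here for clarity.

```python
BLOCK_SIZE = 16  # tokens per KV-cache page (illustrative; vLLM's block size is configurable)

class PagedKVAllocator:
    """Toy paged KV-cache bookkeeping in the spirit of PagedAttention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical pages
        self.tables = {}                     # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        """Allocate a new page only when the sequence crosses a page boundary."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # first token of a new page
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE]      # physical page holding this token's KV

    def free_seq(self, seq_id):
        """Return a finished sequence's pages to the pool, intact and reusable."""
        self.free.extend(self.tables.pop(seq_id))

alloc = PagedKVAllocator(num_blocks=8)
for pos in range(40):                        # a 40-token sequence needs ceil(40/16) pages
    alloc.append_token("seq-A", pos)
print(len(alloc.tables["seq-A"]))            # -> 3: pages need not be contiguous
alloc.free_seq("seq-A")
print(len(alloc.free))                       # -> 8: no fragmentation left behind
```

Because no sequence reserves a contiguous region up front, freed pages are immediately reusable by any other request, which is what makes dense continuous batching possible.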

Sources

  1. vLLM — docs — accessed 2026-04-20
  2. vLLM — GitHub — accessed 2026-04-20