Capability · Framework — inference
TensorRT-LLM
If you're running LLMs at scale on NVIDIA GPUs and every token of throughput matters, TensorRT-LLM is the ceiling. It compiles models into fused kernels tuned for specific GPU architectures, with in-flight batching, speculative decoding, FP8 and FP4 quantisation, and first-class support for Hopper and Blackwell features. The trade-off: setup is far more involved than vLLM or SGLang, and it's tied to NVIDIA.
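In-flight (continuous) batching is the scheduling idea behind much of that throughput: finished sequences leave the batch immediately and queued requests take their slots, rather than the whole batch waiting for its slowest member. This toy simulation (plain Python, no TensorRT-LLM APIs; the request lengths and batch size are illustrative) makes the difference concrete:

```python
from collections import deque

def run_inflight(requests, batch_size):
    """Simulate in-flight batching: each step decodes one token per active
    request; a finished request is replaced from the queue immediately.
    `requests` is a list of remaining-token counts. Returns total steps."""
    queue = deque(requests)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())      # admit new work mid-flight
        active = [r - 1 for r in active]        # decode one token per sequence
        active = [r for r in active if r > 0]   # retire finished sequences
        steps += 1
    return steps

def run_static(requests, batch_size):
    """Static batching: each batch runs until its longest member finishes."""
    return sum(max(requests[i:i + batch_size])
               for i in range(0, len(requests), batch_size))

# One long generation amid many short ones: the classic pathological case.
lengths = [100] + [4] * 20
print(run_inflight(lengths, batch_size=4))  # 100 steps
print(run_static(lengths, batch_size=4))    # 120 steps
```

With static batching the three short slots sit idle while the 100-token request finishes; in-flight batching keeps refilling them, so total steps drop to the length of the longest request.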
Framework facts
- Category
- inference
- Language
- C++ / Python / CUDA
- License
- Apache 2.0
- Repository
- https://github.com/NVIDIA/TensorRT-LLM
Install
pip install tensorrt-llm
# Or use the NGC container: nvcr.io/nvidia/tensorrt-llm
Quickstart
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct',
          tensor_parallel_size=1, dtype='float16')
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(['Hello, who are you?'], params)
print(outputs[0].outputs[0].text)
Alternatives
- vLLM — easier setup, wide adoption
- SGLang — faster on shared-prefix workloads
- Triton Inference Server — NVIDIA's full serving layer (pairs with TRT-LLM)
- llama.cpp — CPU/small-GPU alternative
Frequently asked questions
When is TensorRT-LLM worth the complexity?
When you're running at scale on H100s/H200s/B200s and inference cost is a top-3 line item. The engine build step is slower than vLLM's 'just load the weights' model, but the per-token latency and throughput wins pay back quickly at scale.
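The payback argument is simple arithmetic. The numbers below are illustrative assumptions, not benchmarks; plug in your own build time and measured per-token saving:

```python
# Back-of-envelope payback for an ahead-of-time engine build.
# Both inputs are assumed, illustrative values.
build_time_s = 30 * 60        # assume a 30-minute engine build
saving_ms_per_token = 2.0     # assume 2 ms saved per generated token vs baseline

tokens_to_break_even = build_time_s * 1000 / saving_ms_per_token
print(f"break-even after {tokens_to_break_even:,.0f} tokens")  # 900,000 tokens
```

Under these assumptions the build cost is recovered after roughly a million generated tokens, i.e. well within a day for most production deployments.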
FP8 and FP4 — are the quality losses real?
FP8 is generally quality-preserving on H100+ with proper calibration. FP4 shows small degradations on some benchmarks but can 2x throughput on Blackwell. Always eval on your specific task before shipping a quantised engine.
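Why fewer bits cost accuracy is easy to see with a toy fake-quantization pass. This sketch uses a symmetric integer grid as a stand-in for the actual FP8/FP4 formats (it is not how TensorRT-LLM quantizes; it only shows how rounding error grows as the grid coarsens, and why scale calibration matters):

```python
import math
import random

def fake_quantize(xs, bits):
    """Symmetric per-tensor quantize-dequantize onto a (2**bits - 1)-level
    integer grid. The scale is 'calibrated' from the observed max value."""
    scale = max(abs(x) for x in xs) / (2 ** (bits - 1) - 1)
    return [round(x / scale) * scale for x in xs]

def rel_error(xs, ys):
    """Relative L2 error between the original and quantized tensors."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))
    den = math.sqrt(sum(x * x for x in xs))
    return num / den

random.seed(0)
acts = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in activations
for bits in (8, 4):
    err = rel_error(acts, fake_quantize(acts, bits))
    print(f"{bits}-bit grid: relative error {err:.4f}")
```

On Gaussian data the 8-bit grid lands under about 1% relative error while the 4-bit grid is an order of magnitude worse, which is the shape of the FP8-vs-FP4 trade-off; real FP8/FP4 formats behave better than a plain integer grid, but the "always eval on your task" advice stands.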
Sources
- TensorRT-LLM — docs — accessed 2026-04-20
- TensorRT-LLM on GitHub — accessed 2026-04-20