Capability · Framework — inference

TensorRT-LLM

If you're running LLMs at scale on NVIDIA GPUs and every token of throughput matters, TensorRT-LLM is the ceiling. It compiles models into fused kernels tuned for specific GPU architectures, with in-flight batching, speculative decoding, FP8 and FP4 quantisation, and first-class support for Hopper and Blackwell features. The trade-off: setup is far more involved than vLLM or SGLang, and it's tied to NVIDIA.
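In-flight (continuous) batching is a big part of why these engines sustain high throughput: finished sequences free their slot and queued requests join at every decode step, instead of the whole batch waiting on its slowest member. A toy scheduler in pure Python — entirely hypothetical, not TensorRT-LLM's actual scheduler or API — shows the idea:

```python
from collections import deque

def inflight_batching(requests, max_batch=4):
    """Toy in-flight batching. Each request is (id, tokens_to_generate).
    Returns, for each decode step, the ids active in that step."""
    queue = deque(requests)
    active = {}            # id -> tokens remaining
    steps = []
    while queue or active:
        # Admit queued requests the moment slots free up (the key idea).
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        steps.append(sorted(active))
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # finished sequences leave immediately
    return steps

# Short requests finish and free their slot while long ones keep running.
steps = inflight_batching([('a', 1), ('b', 3), ('c', 2), ('d', 1), ('e', 2)],
                          max_batch=3)
print(steps)   # [['a', 'b', 'c'], ['b', 'c', 'd'], ['b', 'e'], ['e']]
```

With static batching the same workload would take five steps (each batch waits for its longest sequence); in-flight admission finishes in four, and the gap grows with more skewed output lengths.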

Framework facts

Category
inference
Language
C++ / Python / CUDA
License
Apache 2.0
Repository
https://github.com/NVIDIA/TensorRT-LLM

Install

pip install tensorrt-llm
# Or use the NGC container: nvcr.io/nvidia/tensorrt-llm

Quickstart

from tensorrt_llm import LLM, SamplingParams

# First call compiles a TensorRT engine for this model and GPU,
# so expect a build step before the first token.
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct',
          tensor_parallel_size=1, dtype='float16')
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(['Hello, who are you?'], params)
print(outputs[0].outputs[0].text)

Alternatives

  • vLLM — easier setup, wide adoption
  • SGLang — faster on shared-prefix workloads
  • Triton Inference Server — NVIDIA's full serving layer (pairs with TRT-LLM)
  • llama.cpp — CPU/small-GPU alternative

Frequently asked questions

When is TensorRT-LLM worth the complexity?

When you're running at scale on H100s/H200s/B200s and inference cost is a top-3 line item. The engine build step is slower than vLLM's 'just load the weights' model, but the per-token latency and throughput wins pay back quickly at scale.

FP8 and FP4 — are the quality losses real?

FP8 is generally quality-preserving on H100 and newer with proper calibration. FP4 shows small quality degradations on some benchmarks but can roughly double throughput on Blackwell. Always eval on your specific task before shipping a quantised engine.
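The calibration step behind those trade-offs is easy to sketch. Below is a toy per-tensor symmetric quantisation round trip in pure Python — hypothetical illustration, not TensorRT-LLM code: calibration picks a scale from the observed absolute maximum, and the quality loss is just the round-trip error at the chosen bit width.

```python
def quantise_roundtrip(values, num_levels):
    """Symmetric per-tensor quantisation: derive a scale from the
    calibration absmax, snap each value to the nearest grid point,
    and report the worst-case round-trip error."""
    absmax = max(abs(v) for v in values)   # 'calibration' pass
    qmax = num_levels // 2 - 1             # e.g. 127 for an 8-bit grid
    scale = absmax / qmax
    deq = [round(v / scale) * scale for v in values]
    err = max(abs(a - b) for a, b in zip(values, deq))
    return deq, err

weights = [0.8, -1.2, 0.05, 2.4, -0.33]
_, err8 = quantise_roundtrip(weights, 256)   # 8-bit-style grid
_, err4 = quantise_roundtrip(weights, 16)    # 4-bit-style grid
print(err8, err4)   # the coarser 4-bit grid loses noticeably more precision
```

Real FP8/FP4 formats are floating-point grids rather than uniform integer steps, and production calibration uses activation statistics over a dataset, but the mechanism — fit a scale during calibration, then live with the rounding error — is the same.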

Sources

  1. TensorRT-LLM — docs — accessed 2026-04-20
  2. TensorRT-LLM on GitHub — accessed 2026-04-20