Capability · Framework — inference serving
NVIDIA Triton Inference Server
Triton is NVIDIA's general-purpose serving runtime: a polyglot framework that runs models on GPUs or CPUs through pluggable backends. For LLMs, the TensorRT-LLM backend delivers some of the highest GPU throughput available; for classic ML and vision, the ONNX Runtime and PyTorch backends are the usual choices. Triton is the workhorse behind NVIDIA AI Enterprise and many hyperscaler serving stacks.
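Backends are selected per model through Triton's model repository: every model lives in its own directory containing a config.pbtxt plus numbered version folders. A minimal sketch of that layout in Python, using a hypothetical ONNX vision model — the model name, tensor names, and shapes are illustrative placeholders, not part of any real deployment:

```python
# Sketch: build a minimal Triton model repository for one ONNX model.
# "my-model" and its tensor names/shapes are hypothetical examples.
from pathlib import Path

CONFIG = """\
name: "my-model"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
"""

def make_repo(root: str) -> Path:
    """Create model_repo/my-model/{config.pbtxt, 1/} — the layout
    that `tritonserver --model-repository` expects to find."""
    model_dir = Path(root) / "my-model"
    # Version directory "1" is where the model file (model.onnx) would go.
    (model_dir / "1").mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(CONFIG)
    return model_dir

if __name__ == "__main__":
    repo = make_repo("model_repo")
    print(repo / "config.pbtxt")
```

Mounting the resulting directory into the container (as in the Install snippet below) is enough for Triton to load the model at startup.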
Framework facts
- Category — inference serving
- Language — C++ / Python
- License — BSD-3-Clause
- Repository — https://github.com/triton-inference-server/server
Install
docker pull nvcr.io/nvidia/tritonserver:24.10-py3
docker run --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $PWD/model_repo:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 tritonserver --model-repository=/models
Quickstart
# with TensorRT-LLM backend: compile your model into engines first,
# drop them under model_repo/my-model/1/ and expose an OpenAI-style proxy.
curl -X POST localhost:8000/v2/models/my-model/generate \
  -d '{"text_input":"hello","max_tokens":32}'
Alternatives
- vLLM — simpler LLM-only engine
- TGI — Hugging Face-native LLM server
- Ray Serve — higher-level orchestration
- TorchServe — PyTorch-centric
Frequently asked questions
Is Triton only for NVIDIA GPUs?
Primarily yes, but Triton also runs on CPU-only hosts through backends such as ONNX Runtime and OpenVINO, and a single server can mix GPU and CPU model instances.
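As a sketch, pinning a model's instances to CPU is a single instance_group setting in its config.pbtxt (the model name here is an illustrative placeholder):

```protobuf
name: "my-cpu-model"
backend: "onnxruntime"
instance_group [
  { count: 2, kind: KIND_CPU }
]
```

With count: 2, Triton runs two CPU execution instances of the model and load-balances requests across them.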
Triton vs vLLM for LLMs?
vLLM is simpler to set up and competitive on throughput. Triton + TensorRT-LLM wins on peak throughput and multi-model, multi-tenant serving where you want a single server for vision, LLM, and classic ML side-by-side.
Sources
- Triton docs — accessed 2026-04-20
- Triton GitHub — accessed 2026-04-20