Capability · Framework — inference serving

NVIDIA Triton Inference Server

Triton is NVIDIA's general-purpose serving runtime: a polyglot server that runs models from many frameworks on GPUs or CPUs via pluggable backends. For LLMs, the TensorRT-LLM backend delivers state-of-the-art GPU throughput; for classic ML and vision workloads, the ONNX Runtime and PyTorch backends are the usual choices. Triton ships as part of NVIDIA AI Enterprise and underpins many hyperscaler serving stacks.

Framework facts

Category
inference serving
Language
C++ / Python
License
BSD-3-Clause
Repository
https://github.com/triton-inference-server/server

Install

docker pull nvcr.io/nvidia/tritonserver:24.10-py3
docker run --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $PWD/model_repo:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 tritonserver --model-repository=/models
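The docker run above mounts a local model_repo directory, which must follow Triton's repository layout: one directory per model containing a config.pbtxt and numbered version subdirectories. A minimal scaffolding sketch, assuming a hypothetical ONNX model whose tensor names and shapes are placeholders, not taken from any real model:

```python
import os

def scaffold_repo(root="model_repo", name="my-model", version="1"):
    """Create the layout Triton expects:
    <root>/<name>/config.pbtxt plus <root>/<name>/<version>/ for weights."""
    model_dir = os.path.join(root, name)
    os.makedirs(os.path.join(model_dir, version), exist_ok=True)
    # Minimal config for an ONNX model; input/output names and dims
    # are illustrative placeholders.
    config = (
        'name: "my-model"\n'
        'backend: "onnxruntime"\n'
        'max_batch_size: 8\n'
        'input [ { name: "INPUT0", data_type: TYPE_FP32, dims: [ 16 ] } ]\n'
        'output [ { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 16 ] } ]\n'
    )
    path = os.path.join(model_dir, "config.pbtxt")
    with open(path, "w") as f:
        f.write(config)
    return path

print(scaffold_repo())
```

Drop the model file (e.g. model.onnx) into the version directory and Triton will load it on startup.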

Quickstart

# with TensorRT-LLM backend: compile your model into engines first,
# drop them under model_repo/my-model/1/ and expose an OpenAI-style proxy.
curl -X POST localhost:8000/v2/models/my-model/generate \
  -d '{"text_input":"hello","max_tokens":32}'
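The same request can be issued from any HTTP client; a minimal sketch that builds the URL and JSON body matching the curl call above (the base URL assumes the port mapping from the install step):

```python
import json

TRITON_URL = "http://localhost:8000"  # HTTP port from the docker run above

def generate_request(model, prompt, max_tokens=32):
    """Build the URL and JSON body for Triton's generate extension,
    mirroring the quickstart curl call."""
    url = f"{TRITON_URL}/v2/models/{model}/generate"
    body = {"text_input": prompt, "max_tokens": max_tokens}
    return url, json.dumps(body)

url, body = generate_request("my-model", "hello")
# POST `body` to `url` with any HTTP client; a correctly configured
# LLM backend replies with a JSON object containing "text_output".
```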

Alternatives

  • vLLM — simpler LLM-only engine
  • TGI — Hugging Face-native LLM serving
  • Ray Serve — higher-level orchestration
  • TorchServe — PyTorch-centric

Frequently asked questions

Is Triton only for NVIDIA GPUs?

Primarily, yes, but Triton also ships CPU backends (ONNX Runtime CPU, OpenVINO) and can serve CPU and GPU models side by side in a single server.
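Device placement is controlled per model in its config.pbtxt via instance_group; a minimal sketch pinning a model to CPU (model name and backend are placeholders):

```
name: "my-cpu-model"
backend: "onnxruntime"
instance_group [ { count: 2, kind: KIND_CPU } ]
```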

Triton vs vLLM for LLMs?

vLLM is simpler to set up and competitive on throughput. Triton with TensorRT-LLM wins on peak throughput and on multi-model, multi-tenant serving, where a single server hosts vision, LLM, and classic ML models side by side.

Sources

  1. Triton docs — accessed 2026-04-20
  2. Triton GitHub — accessed 2026-04-20