Capability · Framework — inference serving
Text Generation Inference (TGI)
TGI powers the Hugging Face Inference API and Inference Endpoints. It's a Rust/Python server built around continuous batching, FlashAttention-based kernels, tensor parallelism, and support for GPTQ / AWQ / EETQ quantisation. For open-weights serving on your own GPUs, TGI and vLLM are the two benchmark engines most teams evaluate.
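For quantised checkpoints, the launcher takes a `--quantize` flag. A sketch only: the model id is illustrative and accepted flag values vary by TGI version, so check `text-generation-launcher --help` for your image.

```shell
# Serve a GPTQ-quantised checkpoint (model id is illustrative).
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-2-7B-GPTQ \
  --quantize gptq
```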
Framework facts
- Category
- inference / serving
- Language
- Rust / Python
- License
- Apache-2.0
- Repository
- https://github.com/huggingface/text-generation-inference
Install
docker pull ghcr.io/huggingface/text-generation-inference:latest
docker run --gpus all --shm-size 1g -p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Meta-Llama-3.1-8B-Instruct
Quickstart
curl http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"tgi","messages":[{"role":"user","content":"hi"}]}'
Alternatives
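The same request can be made from Python without any extra dependencies. A minimal sketch, assuming the container above is running on localhost:8080; the payload mirrors the curl call, and `build_chat_request`/`send` are helper names invented here.

```python
import json

TGI_URL = "http://localhost:8080/v1/chat/completions"  # assumes the docker run above

def build_chat_request(user_message: str, model: str = "tgi") -> dict:
    """Build an OpenAI-compatible chat-completions payload for TGI."""
    return {
        "model": model,  # TGI serves a single model; the name is effectively ignored
        "messages": [{"role": "user", "content": user_message}],
    }

def send(payload: dict) -> dict:
    """POST the payload to a running TGI server (requires the container above)."""
    import urllib.request
    req = urllib.request.Request(
        TGI_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("hi")
print(json.dumps(payload))
```

Because the endpoint is OpenAI-compatible, the official `openai` client pointed at `base_url="http://localhost:8080/v1"` works the same way.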
- vLLM — strong competitor, often faster on same hardware
- Triton Inference Server (NVIDIA) — broader ML serving
- TensorRT-LLM — NVIDIA-optimised engine
- LoRAX — multi-LoRA specialization of TGI
Frequently asked questions
TGI vs vLLM?
Both are excellent continuous-batching servers. vLLM usually wins on throughput for single-model setups with paged attention; TGI wins on operational polish, HF Hub integration, and wider quantisation format support. Benchmark on your model and hardware.
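When benchmarking the two, the usual comparison metric is generated tokens per second of wall-clock time. A minimal sketch of the bookkeeping; the numbers below are hypothetical, for illustration only.

```python
from dataclasses import dataclass

@dataclass
class BenchRun:
    """One benchmark run: engine name, total generated tokens, wall-clock seconds."""
    engine: str
    generated_tokens: int
    wall_seconds: float

    @property
    def tokens_per_second(self) -> float:
        return self.generated_tokens / self.wall_seconds

# Hypothetical numbers, for illustration only; measure on your own model and GPUs.
runs = [
    BenchRun("tgi", 12_000, 10.0),
    BenchRun("vllm", 13_500, 10.0),
]
for run in runs:
    print(f"{run.engine}: {run.tokens_per_second:.0f} tok/s")
```

Measure under realistic concurrency: continuous-batching servers behave very differently at 1 request in flight versus 100.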
Does TGI support tool calling?
Yes — TGI exposes an OpenAI-compatible /chat/completions endpoint with tool-calling, and applies the chat template shipped in the model's tokenizer config.
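Tool definitions follow the OpenAI function-calling schema. A sketch of the request body; `get_weather` and its parameters are invented for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format
# accepted by TGI's /v1/chat/completions endpoint.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # invented example name
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [get_weather_tool],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload, indent=2))
```

If the model decides to call the tool, the response's `choices[0].message.tool_calls` carries the function name and JSON-encoded arguments, as in the OpenAI API.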
Sources
- TGI — docs — accessed 2026-04-20
- TGI GitHub — accessed 2026-04-20