Capability · Framework — inference

Text Generation Inference (TGI)

TGI powers the Hugging Face Inference API and Inference Endpoints. It's a Rust/Python server built around continuous batching, FlashAttention-based kernels, tensor parallelism, and support for GPTQ / AWQ / EETQ quantisation. For open-weights serving on your own GPUs, TGI and vLLM are the two benchmark engines most teams evaluate.

Framework facts

Category
inference
Language
Rust / Python
License
Apache-2.0
Repository
https://github.com/huggingface/text-generation-inference

Install

docker pull ghcr.io/huggingface/text-generation-inference:latest
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct

Quickstart

curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"tgi","messages":[{"role":"user","content":"hi"}]}'
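Because the endpoint is OpenAI-compatible, the same request can be made from Python with only the standard library. A minimal sketch, assuming the server from the install step is listening on localhost:8080 (the network call is left commented so the snippet runs without a live server):

```python
import json
import urllib.request

# Same payload as the curl example. TGI serves whatever --model-id it
# was launched with, so the "model" field is effectively a placeholder.
payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "hi"}],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the TGI container is running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])

print(json.dumps(payload))
```

Any OpenAI-compatible client (the official `openai` package included) can be pointed at the same base URL instead of hand-rolling requests.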

Alternatives

  • vLLM — strong competitor, often faster on the same hardware
  • Triton Inference Server (NVIDIA) — broader ML serving
  • TensorRT-LLM — NVIDIA-optimised engine
  • LoRAX — multi-LoRA specialization of TGI

Frequently asked questions

TGI vs vLLM?

Both are excellent continuous-batching servers. vLLM usually wins on throughput for single-model setups with paged attention; TGI wins on operational polish, HF Hub integration, and wider quantisation format support. Benchmark on your model and hardware.

Does TGI support tool calling?

Yes — TGI exposes an OpenAI-compatible /chat/completions endpoint with tool-calling, and applies the chat template shipped in the model's tokenizer config.
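A tool-calling request follows the OpenAI function-calling schema. A hedged sketch of the payload — the `get_weather` tool and its parameters are illustrative, not something TGI ships:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema;
# "get_weather" and its parameters are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call a tool
}

# POST this body to /v1/chat/completions; if the model elects to call a
# tool, the response carries it in choices[0].message.tool_calls with
# JSON-encoded arguments.
print(json.dumps(payload, indent=2))
```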

Sources

  1. TGI — docs — accessed 2026-04-20
  2. TGI GitHub — accessed 2026-04-20