Capability · Framework — fine-tuning
LoRAX
LoRAX (LoRA eXchange) started as a fork of TGI (Text Generation Inference) and became the reference server for 'one base model, many adapters' serving. You pay for one GPU running, say, Llama 3.1 8B, and LoRAX can serve requests for dozens or hundreds of LoRA adapters that specialise the model per tenant, task, or user, with dynamic batching across adapters.
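As a sketch of that multi-tenant pattern (the tenant names and adapter IDs below are hypothetical, and the payload shape follows the TGI-style /generate request that LoRAX inherits):

```python
# One shared base-model deployment; each tenant gets its own LoRA adapter.
# Tenant names and adapter IDs are hypothetical illustrations.
BASE_URL = "http://127.0.0.1:8080"  # single LoRAX deployment

TENANT_ADAPTERS = {
    "acme-corp": "acme/support-ticket-lora",
    "globex": "globex/contract-summary-lora",
}

def build_request(tenant: str, prompt: str) -> dict:
    """Build a /generate payload; only adapter_id differs per tenant."""
    return {
        "inputs": prompt,
        "parameters": {
            "adapter_id": TENANT_ADAPTERS[tenant],
            "max_new_tokens": 64,
        },
    }

req_a = build_request("acme-corp", "Summarise this support ticket: ...")
req_b = build_request("globex", "Summarise this contract: ...")
# Both requests hit the same base model; LoRAX batches them together
# and applies each tenant's adapter within the batch.
```

The point is that routing is just a field in the request: no per-tenant deployment, no model reload between adapters.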
Framework facts
- Category: fine-tuning
- Language: Rust / Python
- License: Apache-2.0
- Repository: https://github.com/predibase/lorax
Install
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/predibase/lorax:main \
    --model-id mistralai/Mistral-7B-v0.1

Quickstart
from lorax import Client

c = Client('http://127.0.0.1:8080')
res = c.generate(
    prompt='Write a haiku about learning.',
    adapter_id='predibase/haiku-adapter',
    max_new_tokens=50,
)
print(res.generated_text)

Alternatives
- TGI — parent project without multi-LoRA optimisation
- vLLM — now supports multi-LoRA too
- SGLang — newer performance-oriented server
- NVIDIA Triton — broader, non-LLM-specific
Frequently asked questions
When should I use LoRAX?
When you have dozens of small fine-tunes (one per customer, one per task) and need to serve them efficiently. For a single adapter, regular vLLM or TGI is simpler.
Is LoRAX still relevant now that vLLM supports multi-LoRA?
Yes — LoRAX remains active, and Predibase also ships vLLM integrations. Benchmark both: LoRAX often wins at very high adapter counts, vLLM on raw per-request latency.
Sources
- LoRAX — docs — accessed 2026-04-20
- LoRAX GitHub — accessed 2026-04-20