
LoRAX

LoRAX (LoRA eXchange) started as a fork of Hugging Face's Text Generation Inference (TGI) and became the reference server for 'one base model, many adapters' serving. You pay for one GPU running (say) Llama 3.1 8B, and LoRAX serves requests against dozens or hundreds of LoRA adapters that specialise the model per tenant, task, or user, dynamically batching requests for different adapters together on the same base model.

Framework facts

Category: fine-tuning
Language: Rust / Python
License: Apache-2.0
Repository: https://github.com/predibase/lorax

Install

docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/predibase/lorax:main \
  --model-id mistralai/Mistral-7B-v0.1
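Because LoRAX inherits TGI's HTTP interface, you can also hit the container without the Python client: POST to `/generate`, with `adapter_id` passed inside the `parameters` object to pick a LoRA per request. A minimal sketch using only the standard library (assumes the container above is reachable at `127.0.0.1:8080`; the adapter name is the one from the Quickstart):

```python
import json
import urllib.request

# adapter_id rides along in "parameters", so each request can
# target a different LoRA on the same base model.
payload = {
    "inputs": "Write a haiku about learning.",
    "parameters": {
        "adapter_id": "predibase/haiku-adapter",
        "max_new_tokens": 50,
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the container from the Install step is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["generated_text"])
```

Omitting `adapter_id` entirely falls through to the unadapted base model.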

Quickstart

from lorax import Client  # pip install lorax-client

client = Client('http://127.0.0.1:8080')

# adapter_id selects which LoRA to apply; omit it to query the base model
response = client.generate(
    prompt='Write a haiku about learning.',
    adapter_id='predibase/haiku-adapter',
    max_new_tokens=50,
)
print(response.generated_text)
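The same call is how multi-tenant serving works in practice: one process, one base model, a different `adapter_id` per request. A sketch of per-customer routing (the tenant table and adapter repo names here are invented for illustration; `client` is the `lorax.Client` from the Quickstart):

```python
# Hypothetical mapping: each tenant gets its own fine-tune, all served
# from the single base model loaded by the LoRAX container.
TENANT_ADAPTERS = {
    "acme": "acme-corp/support-lora",   # made-up adapter repos
    "globex": "globex/legal-lora",
}

def adapter_for(tenant: str):
    """Resolve a tenant to its LoRA; None falls back to the base model."""
    return TENANT_ADAPTERS.get(tenant)

def answer(client, tenant: str, prompt: str) -> str:
    # client is a lorax.Client. LoRAX batches concurrent requests across
    # adapters, so calls for different tenants still share the same GPU.
    return client.generate(
        prompt,
        adapter_id=adapter_for(tenant),
        max_new_tokens=100,
    ).generated_text
```

Adapters are loaded on first use and cached, so an unfamiliar `adapter_id` costs a one-time download rather than a new deployment.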

Alternatives

  • TGI — the parent project; has since gained multi-LoRA support, but it is not its focus
  • vLLM — now supports multi-LoRA too
  • SGLang — newer performance-oriented server
  • NVIDIA Triton — broader, non-LLM-specific

Frequently asked questions

When should I use LoRAX?

When you have dozens of small fine-tunes (one per customer, one per task) and need to serve them efficiently. For a single adapter, regular vLLM or TGI is simpler.

Does LoRAX still exist if Predibase supports vLLM multi-LoRA?

Yes — LoRAX remains active, and Predibase also ships vLLM integrations. Benchmark both; LoRAX is often better on very-high adapter counts, vLLM on raw per-request latency.
