Capability · Framework — fine-tuning
LoRAX
LoRAX (LoRA eXchange) started as a fork of TGI (Text Generation Inference) and became the reference server for 'one base model, many adapters' serving. You pay for one GPU running, say, Llama 3.1 8B, and LoRAX can serve requests for dozens or hundreds of LoRA adapters that specialise the model per tenant, task, or user, with dynamic batching across adapters.
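As a sketch of that multi-tenant pattern (the tenant names and adapter IDs below are hypothetical, and the payload shape follows the TGI-style /generate request that LoRAX inherits):

```python
# One shared base-model deployment; each tenant gets its own LoRA adapter.
# Tenant names and adapter IDs are hypothetical illustrations.
BASE_URL = "http://127.0.0.1:8080"  # single LoRAX deployment

TENANT_ADAPTERS = {
    "acme-corp": "acme/support-ticket-lora",
    "globex": "globex/contract-summary-lora",
}

def build_request(tenant: str, prompt: str) -> dict:
    """Build a /generate payload; only adapter_id differs per tenant."""
    return {
        "inputs": prompt,
        "parameters": {
            "adapter_id": TENANT_ADAPTERS[tenant],
            "max_new_tokens": 64,
        },
    }

req_a = build_request("acme-corp", "Summarise this support ticket: ...")
req_b = build_request("globex", "Summarise this contract: ...")
# Both requests hit the same base model; LoRAX batches them together
# and applies each tenant's adapter within the batch.
```

The point is that routing is just a field in the request: no per-tenant deployment, no model reload between adapters.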
Framework facts
- Category: fine-tuning
- Language: Rust / Python
- License: Apache-2.0
- Repository: https://github.com/predibase/lorax
Install
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/predibase/lorax:main \
    --model-id mistralai/Mistral-7B-v0.1

Quickstart
from lorax import Client

c = Client('http://127.0.0.1:8080')
res = c.generate(
    prompt='Write a haiku about learning.',
    adapter_id='predibase/haiku-adapter',
    max_new_tokens=50,
)
print(res.generated_text)

Alternatives
- TGI — parent project without multi-LoRA optimisation
- vLLM — now supports multi-LoRA too
- SGLang — newer performance-oriented server
- NVIDIA Triton — broader, non-LLM-specific
Frequently asked questions
When should I use LoRAX?
When you have dozens of small fine-tunes (one per customer, one per task) and need to serve them efficiently. For a single adapter, regular vLLM or TGI is simpler.
Is LoRAX still relevant now that vLLM supports multi-LoRA?
Yes — LoRAX remains active, and Predibase also ships vLLM integrations. Benchmark both: LoRAX often wins at very high adapter counts, vLLM on raw per-request latency.
Sources
- LoRAX — docs — accessed 2026-04-20
- LoRAX GitHub — accessed 2026-04-20