Capability · Framework — fine-tuning
TRL (Transformer Reinforcement Learning)
TRL is the canonical toolkit for RLHF, DPO, and modern preference-optimisation methods on Hugging Face Transformers. If you're training reward models, running DPO on preference pairs, or experimenting with GRPO (the algorithm behind DeepSeek-R1), TRL is the reference implementation. It's used as the training backbone under Unsloth, Axolotl, and many custom pipelines.
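Since GRPO comes up repeatedly, its core idea, group-relative advantages, can be sketched in a few lines of plain Python. This is an illustrative sketch only, not TRL's implementation, and the function name is ours:

```python
def group_relative_advantages(rewards):
    """Normalise per-completion rewards within one sampled group.

    GRPO scores each completion against the group's mean reward
    instead of using a learned value model (illustrative sketch).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Completions scoring above the group mean get positive advantages and are reinforced; those below get negative ones.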
Framework facts
- Category: fine-tuning
- Language: Python
- License: Apache 2.0
- Repository: https://github.com/huggingface/trl
Install
```shell
pip install trl
```

Quickstart
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset('trl-lib/Capybara', split='train')

trainer = SFTTrainer(
    model='meta-llama/Llama-3.1-8B',
    train_dataset=dataset,
    args=SFTConfig(output_dir='./out', num_train_epochs=1),
)
trainer.train()
```

Alternatives
- Unsloth — faster kernels layered on TRL, optimised for single-GPU training
- Axolotl — config-driven layer built on TRL
- torchtune — native PyTorch alternative
- OpenRLHF — RLHF-specialised distributed framework
Frequently asked questions
What's the difference between DPO, ORPO, and GRPO?
DPO trains directly on preference pairs, with no separate reward model. ORPO folds SFT and preference optimisation into a single loss, so no reference model is needed. GRPO (used by DeepSeek-R1) samples a group of completions per prompt and reinforces each one relative to the group's mean reward, avoiding a learned value model. TRL implements all three with similar trainer APIs.
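For intuition, the DPO objective for a single preference pair can be written in plain Python. This is an illustrative sketch, not TRL's API; the function and its summed log-probability inputs are our own naming:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair.

    Inputs are the summed log-probabilities of the chosen and
    rejected completions under the policy and the frozen reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy prefers the
    # chosen completion more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialisation, when policy and reference agree, the loss is exactly log 2; it falls below that as the policy learns to favour the chosen completion.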
Is TRL production-ready?
Yes. It is the library Hugging Face uses for its own model training, and its trainer APIs are stable. For multi-node distributed runs, pair it with Accelerate and DeepSpeed, both of which TRL integrates with natively.
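A minimal launch sketch for the Accelerate pairing, assuming a training script named train_sft.py (the file name is a placeholder):

```shell
pip install trl accelerate deepspeed
accelerate config                # interactively select multi-GPU / DeepSpeed ZeRO
accelerate launch train_sft.py   # runs the script under the saved config
```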
Sources
- TRL — docs — accessed 2026-04-20
- TRL on GitHub — accessed 2026-04-20