Capability · Framework — fine-tuning
TRL (Transformer Reinforcement Learning)
TRL is the canonical toolkit for RLHF, DPO, and modern preference-optimisation methods on Hugging Face Transformers. If you're training reward models, running DPO on preference pairs, or experimenting with GRPO (the algorithm behind DeepSeek-R1), TRL is the reference implementation. It's used as the training backbone under Unsloth, Axolotl, and many custom pipelines.
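Since GRPO comes up repeatedly, its core idea, group-relative advantages, can be sketched in a few lines of plain Python. This is an illustrative sketch only, not TRL's implementation, and the function name is ours:

```python
def group_relative_advantages(rewards):
    """Normalise per-completion rewards within one sampled group.

    GRPO scores each completion against the group's mean reward
    instead of using a learned value model (illustrative sketch).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Completions scoring above the group mean get positive advantages and are reinforced; those below get negative ones.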
Framework facts
- Category: fine-tuning
- Language: Python
- License: Apache 2.0
- Repository: https://github.com/huggingface/trl
Install
```shell
pip install trl
```

Quickstart
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset('trl-lib/Capybara', split='train')

trainer = SFTTrainer(
    model='meta-llama/Llama-3.1-8B',
    train_dataset=dataset,
    args=SFTConfig(output_dir='./out', num_train_epochs=1),
)
trainer.train()
```

Alternatives
- Unsloth — faster kernels layered on TRL, optimised for single-GPU training
- Axolotl — config-driven layer built on TRL
- torchtune — native PyTorch alternative
- OpenRLHF — RLHF-specialised distributed framework
Frequently asked questions
What's the difference between DPO, ORPO, and GRPO?
DPO trains directly on preference pairs, with no separate reward model. ORPO folds SFT and preference optimisation into a single loss, so no reference model is needed. GRPO (used by DeepSeek-R1) samples a group of completions per prompt and reinforces each one relative to the group's mean reward, avoiding a learned value model. TRL implements all three with similar trainer APIs.
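For intuition, the DPO objective for a single preference pair can be written in plain Python. This is an illustrative sketch, not TRL's API; the function and its summed log-probability inputs are our own naming:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair.

    Inputs are the summed log-probabilities of the chosen and
    rejected completions under the policy and the frozen reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy prefers the
    # chosen completion more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialisation, when policy and reference agree, the loss is exactly log 2; it falls below that as the policy learns to favour the chosen completion.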
Is TRL production-ready?
Yes. It is the library Hugging Face uses for its own model training, and its trainer APIs are stable. For multi-node distributed runs, pair it with Accelerate and DeepSpeed, both of which TRL integrates with natively.
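A minimal launch sketch for the Accelerate pairing, assuming a training script named train_sft.py (the file name is a placeholder):

```shell
pip install trl accelerate deepspeed
accelerate config                # interactively select multi-GPU / DeepSpeed ZeRO
accelerate launch train_sft.py   # runs the script under the saved config
```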
Sources
- TRL — docs — accessed 2026-04-20
- TRL on GitHub — accessed 2026-04-20