Unsloth
Unsloth has become the dominant fine-tuning library for anyone training LoRA/QLoRA adapters on open models. Through custom Triton kernels, manual backprop, and aggressive memory optimisation, it can fine-tune a 14B model on a single consumer GPU and a 70B model on a single 80GB H100 — territory that previously required multi-GPU clusters. The free open-source version handles single-GPU training; paid tiers add multi-GPU and distributed training.
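Those memory claims can be sanity-checked with back-of-envelope arithmetic: in 4-bit quantisation each frozen base weight occupies roughly half a byte, and LoRA keeps optimizer state only for the small adapters. A rough sketch (the per-weight byte count is an approximation, not Unsloth's exact accounting):

```python
def qlora_weight_gb(n_params_billions: float, bits: int = 4) -> float:
    """Approximate GB needed to hold the frozen base weights alone."""
    bytes_per_param = bits / 8
    return n_params_billions * 1e9 * bytes_per_param / 1e9

# 70B model in 4-bit: ~35 GB of weights, leaving headroom on an 80GB H100
print(qlora_weight_gb(70))  # 35.0
# 14B model in 4-bit: ~7 GB of weights, feasible on a consumer GPU
print(qlora_weight_gb(14))  # 7.0
```

Activations, gradients for the adapters, and optimizer state add on top of this, which is where Unsloth's kernel-level memory optimisations matter.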
Framework facts
- Category: fine-tuning
- Language: Python
- License: Apache 2.0 + commercial
- Repository: https://github.com/unslothai/unsloth
Install

pip install unsloth

Quickstart

from unsloth import FastLanguageModel
import torch

model, tok = FastLanguageModel.from_pretrained(
    'unsloth/Llama-3.1-8B', max_seq_length=4096,
    dtype=None, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(
    model, r=16, target_modules=['q_proj', 'v_proj'])
# pass the model to trl.SFTTrainer for training

Alternatives
- Axolotl — config-driven fine-tuning on top of HF
- torchtune — PyTorch-native alternative
- TRL — HF's RLHF/DPO toolkit
- LLaMA-Factory — GUI + config-driven
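The quickstart's closing comment hands the PEFT model to TRL for the actual training loop. A minimal sketch of that hand-off — the dataset name and hyperparameters are placeholders, and SFTTrainer's keyword arguments have shifted across trl versions, so treat this as an outline rather than a pinned recipe:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# placeholder dataset; any corpus with a formatted text column works
dataset = load_dataset('yahma/alpaca-cleaned', split='train')

trainer = SFTTrainer(
    model=model,                # the get_peft_model output from the quickstart
    tokenizer=tok,
    train_dataset=dataset,
    dataset_text_field='text',  # column holding the formatted prompts
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        output_dir='outputs',
    ),
)
trainer.train()
```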
Frequently asked questions
Why is Unsloth faster?
Hand-written Triton kernels for the attention and MLP layers, manual backward passes that avoid autograd overhead, and aggressive use of 4-bit quantisation. Unsloth's team publishes benchmarks against Hugging Face's defaults showing 2-5x higher training throughput on the same GPU.
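The "manual backward" idea can be shown at toy scale: derive the gradients by hand once instead of replaying an autograd tape. This pure-Python sketch is only an illustration of the algebra — Unsloth does the same thing inside fused Triton kernels, not in Python:

```python
# Toy model: loss = sum(x_i * w1 * w2), with gradients written by hand.

def forward(x, w1, w2):
    h = [xi * w1 for xi in x]        # "hidden" activations
    loss = sum(hi * w2 for hi in h)  # scalar loss
    return loss, h                   # h is cached for the backward pass

def backward(x, h, w2):
    # dL/dw1 = sum(x_i) * w2 ; dL/dw2 = sum(h_i) -- no graph bookkeeping
    return sum(x) * w2, sum(h)

x = [1.0, 2.0, 3.0]
loss, h = forward(x, w1=0.5, w2=2.0)
dw1, dw2 = backward(x, h, w2=2.0)
print(dw1, dw2)  # 12.0 3.0

# sanity-check dw1 against a finite-difference estimate
eps = 1e-6
assert abs((forward(x, 0.5 + eps, 2.0)[0] - loss) / eps - dw1) < 1e-3
```

Skipping the tape removes per-op bookkeeping and lets forward and backward share fused kernels, which is where much of the speedup comes from.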
Does it support RLHF / DPO?
Yes — Unsloth integrates with TRL so you can run SFT, DPO, ORPO, and KTO training on any supported base model with the same speed and memory benefits.
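A sketch of the DPO path, with the same caveats as above: `pref_ds` is a hypothetical preference dataset with the "prompt"/"chosen"/"rejected" columns TRL expects, and the placement of arguments such as `beta` differs across trl versions:

```python
from transformers import TrainingArguments
from trl import DPOTrainer

trainer = DPOTrainer(
    model=model,           # Unsloth PEFT model from the quickstart
    ref_model=None,        # with LoRA, TRL can derive the reference by
                           # disabling the adapters instead of copying weights
    beta=0.1,              # DPO temperature
    train_dataset=pref_ds, # hypothetical prompt/chosen/rejected dataset
    tokenizer=tok,
    args=TrainingArguments(per_device_train_batch_size=2, max_steps=60,
                           learning_rate=5e-6, output_dir='outputs'),
)
trainer.train()
```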
Sources
- Unsloth — docs — accessed 2026-04-20
- Unsloth on GitHub — accessed 2026-04-20