Capability · Framework — fine-tuning

Unsloth

Unsloth has become one of the most widely used fine-tuning libraries for training LoRA/QLoRA adapters on open models. Through custom Triton kernels, manual backward passes, and aggressive memory optimisation, it can fine-tune a 14B model on a single consumer GPU and a 70B model on a single 80GB H100, workloads that previously required multi-GPU clusters. The free open-source version handles single-GPU training; paid tiers add multi-GPU and distributed training.
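
The memory claims above follow from back-of-the-envelope QLoRA arithmetic. A rough sketch (the bit widths are standard; the "few GiB of overhead" framing is an illustrative assumption, not Unsloth's published measurement):

```python
def qlora_weight_gib(n_params: float, bits: int = 4) -> float:
    """Approximate GiB needed to hold the frozen base weights at a given bit width."""
    return n_params * bits / 8 / 2**30

# Frozen base weights dominate the footprint; LoRA adapters, optimizer
# states and activations add a few GiB on top (assumed, varies by config).
for n, label in [(14e9, "14B"), (70e9, "70B")]:
    fp16 = qlora_weight_gib(n, 16)
    q4 = qlora_weight_gib(n, 4)
    print(f"{label}: fp16 ~{fp16:.0f} GiB vs 4-bit ~{q4:.0f} GiB")
```

A 70B model's 4-bit weights come to roughly 33 GiB, which is why it fits on one 80GB H100 with headroom for activations and adapter states, while fp16 weights alone (~130 GiB) would not.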

Framework facts

Category
fine-tuning
Language
Python
License
Apache 2.0 + commercial
Repository
https://github.com/unslothai/unsloth

Install

pip install unsloth

Quickstart

from unsloth import FastLanguageModel
import torch

# dtype=None auto-detects (bf16 on Ampere+); load_in_4bit=True enables QLoRA
model, tok = FastLanguageModel.from_pretrained(
    'unsloth/Llama-3.1-8B', max_seq_length=4096,
    dtype=None, load_in_4bit=True)
# attach rank-16 LoRA adapters; only these small matrices are trained
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=['q_proj','v_proj'])
# pass to trl.SFTTrainer for training
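
The r=16 adapter above trains only a tiny fraction of the network. A rough count, assuming Llama-3.1-8B's published shapes (hidden size 4096, 32 layers, 8 KV heads so v_proj maps 4096 to 1024); the shapes are assumptions for illustration:

```python
# LoRA adds two low-rank matrices per targeted projection:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) params each.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

r, layers, hidden, kv_dim = 16, 32, 4096, 1024  # assumed Llama-3.1-8B shapes
per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r)
total = per_layer * layers
print(f"trainable LoRA params: {total:,} (~{total / 8e9:.3%} of 8B)")
```

Under these assumptions only ~6.8M of the 8B parameters receive gradients, which is what makes the optimizer-state and gradient memory so small compared with full fine-tuning.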

Alternatives

  • Axolotl — config-driven fine-tuning on top of HF
  • torchtune — PyTorch-native alternative
  • TRL — HF's RLHF/DPO toolkit
  • LLaMA-Factory — GUI + config-driven

Frequently asked questions

Why is Unsloth faster?

Hand-written Triton kernels for attention and MLP layers, manual backward passes that skip autograd overhead, and aggressive use of 4-bit quantisation. The Unsloth team publishes benchmarks against Hugging Face's default training stack showing consistently 2-5x higher training throughput on the same GPU.

Does it support RLHF / DPO?

Yes — Unsloth integrates with TRL so you can run SFT, DPO, ORPO, and KTO training on any supported base model with the same speed and memory benefits.
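
For intuition, the per-pair DPO objective that TRL optimises (and Unsloth accelerates) is simple enough to sketch in plain Python; the sequence log-probabilities below are made-up placeholders, not real model outputs:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy log-ratio minus reference log-ratio))."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Placeholder log-probs: the policy prefers the chosen response more
# strongly than the reference model does, so loss falls below log(2).
loss = dpo_loss(-12.0, -20.0, -14.0, -18.0)
```

When the policy and reference agree (margin 0) the loss is log 2 ≈ 0.693; pushing probability toward the chosen response drives it toward zero, which is the gradient signal DPO trains on.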

Sources

  1. Unsloth — docs — accessed 2026-04-20
  2. Unsloth on GitHub — accessed 2026-04-20