Curiosity · Concept
QLoRA — 4-bit Quantized LoRA Fine-Tuning
QLoRA, introduced by Dettmers et al. in 2023, combines 4-bit NormalFloat (NF4) quantization of the frozen base model with LoRA low-rank adapters and paged optimizers. The base weights stay quantized in memory and are dequantized on the fly for each forward pass; only the small LoRA adapters receive updates, and paged optimizer states (backed by NVIDIA unified memory, paging to CPU RAM) absorb transient memory spikes. The net effect: fine-tuning a 65B-parameter model (e.g., Guanaco) on a single 48 GB GPU with quality indistinguishable from full 16-bit fine-tuning on standard benchmarks.
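In practice this recipe maps onto the Hugging Face stack (transformers, peft, bitsandbytes). A minimal configuration sketch; the model id and hyperparameter values here are illustrative assumptions, not prescriptions from the paper:

```python
# Sketch of a typical QLoRA setup; model id and hyperparameters are
# illustrative, not from the paper.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # freeze base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attach adapters to attention projections
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter params are trainable

args = TrainingArguments(
    output_dir="out",
    optim="paged_adamw_32bit",              # paged optimizer from bitsandbytes
)
```

The `optim="paged_adamw_32bit"` setting is what activates the paged optimizer states; everything else is a standard LoRA setup on top of a 4-bit base model.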
Quick reference
- Proficiency
- Intermediate
- Also known as
- QLoRA, quantized LoRA
- Prerequisites
- LoRA, Quantization, Fine-tuning
Frequently asked questions
What is QLoRA?
QLoRA is a parameter-efficient fine-tuning method that keeps the base LLM frozen in 4-bit NF4 quantization while training small LoRA low-rank adapters on top. Only adapter weights are updated, so memory usage drops dramatically.
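The division of labor can be sketched in a few lines of NumPy: the base weight lives as frozen integer codes and is dequantized on the fly, while only the small adapter matrices A and B would receive gradients. This is a simplified illustration (plain absmax int4 rather than block-wise NF4):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8

# Frozen base weight, stored as 4-bit integer codes plus one scale
# (sketched as simple absmax int4; real QLoRA uses block-wise NF4).
W = rng.normal(size=(d, d)).astype(np.float32)
scale = np.abs(W).max() / 7.0
W_q = np.clip(np.round(W / scale), -8, 7)   # frozen int4 codes
W_deq = W_q * scale                         # dequantized for the forward pass

# Trainable LoRA adapters: only A and B would be updated during training.
A = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)      # zero-init: adapter starts as a no-op
alpha = 16.0

def forward(x):
    # y = x @ (W_deq + (alpha/r) * B @ A).T, without materializing the sum
    return x @ W_deq.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d)).astype(np.float32)
y = forward(x)
```

With B zero-initialized, the adapted model initially reproduces the quantized base model exactly; training then moves only the d·r + r·d adapter parameters instead of all d·d base weights.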
What is 4-bit NormalFloat (NF4)?
NF4 is a 4-bit data type introduced with QLoRA, quantile-matched to the typical normal distribution of pre-trained LLM weights. It stores weights more accurately than uniform INT4 for the same bit budget.
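The construction can be sketched directly: place the 16 code values at quantiles of the standard normal and rescale, then quantize a block of weights by absmax scaling plus nearest-code lookup. This is a simplified symmetric version (the paper's exact codebook is asymmetric so that zero is represented exactly):

```python
from statistics import NormalDist
import numpy as np

def nf4_like_codebook():
    # 16 code values at evenly spaced quantiles of N(0, 1),
    # rescaled so the codebook spans [-1, 1].
    nd = NormalDist()
    probs = np.linspace(0.5 / 16, 1 - 0.5 / 16, 16)
    q = np.array([nd.inv_cdf(p) for p in probs])
    return q / np.abs(q).max()

def quantize(block, codebook):
    # absmax scaling maps the block into [-1, 1]; each weight is then
    # snapped to the nearest code and stored as a 4-bit index
    scale = np.abs(block).max()
    idx = np.abs(block[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize(idx, scale, codebook):
    return codebook[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096)                   # pre-trained weights are roughly normal
idx, scale = quantize(w, nf4_like_codebook())
w_hat = dequantize(idx, scale, nf4_like_codebook())
```

Because the codes are dense where normally distributed weights actually cluster, this quantile codebook reconstructs such weights with lower mean squared error than a uniform 16-level grid (`np.linspace(-1, 1, 16)`) at the same 4-bit budget.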
How much memory does QLoRA save?
A 65B-parameter model in 16-bit needs ~130 GB for the weights alone; in 4-bit NF4 it needs ~33 GB. With LoRA adapters and paged optimizers on top, QLoRA fine-tunes that 65B model on a single 48 GB GPU such as an RTX A6000 or RTX 6000 Ada.
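The arithmetic behind those numbers is a quick back-of-the-envelope check; the double-quantization overhead figure of roughly 0.127 bits per parameter is from the QLoRA paper:

```python
params = 65e9                        # 65B parameters
GB = 1e9

fp16_gb = params * 2 / GB            # 2 bytes/param -> 130.0 GB
nf4_gb = params * 0.5 / GB           # 4 bits/param  -> 32.5 GB

# Block-wise quantization constants add overhead: ~0.5 bits/param with an
# fp32 absmax per 64-weight block, cut to ~0.127 bits/param by double
# quantization (per the QLoRA paper).
dq_overhead_gb = params * 0.127 / 8 / GB   # ~1.0 GB

print(fp16_gb, nf4_gb + dq_overhead_gb)    # 130.0 GB vs ~33.5 GB
```

Optimizer state for the LoRA adapters and activation memory come on top of this, which is where gradient checkpointing and paged optimizers make the difference on a 48 GB card.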
Does QLoRA hurt quality?
The original paper reported no significant quality loss versus full 16-bit fine-tuning on MMLU, the Vicuna chatbot evaluation, and reasoning benchmarks. NF4 plus double quantization preserves enough precision in the frozen weights, and the 16-bit LoRA adapters can absorb what little the quantization loses during fine-tuning.
Sources
- Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs — accessed 2026-04-20
- Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models — accessed 2026-04-20