Curiosity · Concept
INT4 Quantization
INT4 quantization stores each model weight in just 4 bits, representing one of 16 discrete values, instead of the 16 bits used by bfloat16 training weights. Naive round-to-nearest INT4 destroys quality, so production pipelines use algorithms like GPTQ (one-shot second-order), AWQ (activation-aware), or NF4 (QLoRA). Combined with dequantization-on-the-fly in fused kernels, INT4 shrinks a 70B model from ~140 GB to ~35 GB, making single-GPU inference and consumer-card deployment realistic.
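To make the mechanics concrete, here is a minimal NumPy sketch of the naive baseline the paragraph describes: symmetric group-wise round-to-nearest quantization to the 16 signed 4-bit levels [-8, 7], followed by on-the-fly dequantization. The function names and the group size of 128 are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int4(w, group_size=128):
    """Naive symmetric group-wise INT4 quantization (round-to-nearest sketch).

    Each group of `group_size` weights shares one float scale; codes live
    in [-8, 7], the 16 levels a signed 4-bit integer can represent.
    """
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map group absmax -> 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit codes
    return q, scale

def dequantize_int4(q, scale):
    """Dequantize on the fly: multiply each code by its group's scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096,)).astype(np.float32)

q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale).reshape(-1)

# Round-to-nearest error is bounded by half a quantization step per group.
err = np.abs(w - w_hat)
```

In a real inference kernel the codes stay packed two-per-byte in GPU memory and this dequantization happens inside the fused matmul; the sketch only shows the numerics.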
Quick reference
- Proficiency: Intermediate
- Also known as: 4-bit quantization, INT4
- Prerequisites: Quantization, LLM inference
Frequently asked questions
What is INT4 quantization?
INT4 quantization represents model weights using 4 bits per value — 16 discrete levels per scaling group — instead of 16-bit floats. That gives roughly 4x compression versus bfloat16 training weights.
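The compression arithmetic behind the "roughly 4x" figure is worth spelling out; a small back-of-the-envelope sketch (the 70B size and group size of 128 are illustrative assumptions):

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
params = 70e9

bf16_gb = params * 2 / 1e9     # bfloat16: 2 bytes per weight -> 140 GB
int4_gb = params * 0.5 / 1e9   # INT4: 4 bits = 0.5 bytes     -> 35 GB

# Real INT4 checkpoints also store one fp16 scale per scaling group;
# with a typical group size of 128 that adds 2/128 bytes per weight (~3%).
group_size = 128
int4_with_scales_gb = params * (0.5 + 2 / group_size) / 1e9  # ~36.1 GB
```

The scale overhead is why compression is "roughly" rather than exactly 4x, and why smaller group sizes trade a little extra memory for finer-grained scales.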
Which INT4 algorithms matter in practice?
GPTQ (one-shot second-order quantization), AWQ (activation-aware weight quantization), and NF4 (the data type used in QLoRA) are the three most common. Each calibrates quantization scales so that the weights that matter most are preserved better than naive round-to-nearest would manage.
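A toy NumPy sketch of the intuition behind activation-aware calibration (all names and constants are illustrative, not AWQ's actual algorithm): multiply a small-but-salient group of weights by a factor s before quantizing and fold 1/s into the activations. The layer output is unchanged in exact arithmetic, but those weights now use finer effective quantization steps.

```python
import numpy as np

def dequant_rtn(w, step):
    """Round-to-nearest INT4 with one shared group scale; returns dequantized w."""
    return np.clip(np.round(w / step), -8, 7) * step

rng = np.random.default_rng(0)
group = rng.normal(scale=0.1, size=128)  # one 128-weight quantization group
group[0] = 4.0                           # outlier weight fixes the group absmax
sal = slice(1, 33)                       # hypothetical "salient" channels

step = np.abs(group).max() / 7.0         # group scale is set by the outlier

# Naive RTN: small salient weights land on a grid sized for the outlier.
err_naive = np.abs(dequant_rtn(group, step) - group)[sal].mean()

# Activation-aware trick: scale salient weights up by s before quantizing;
# the inverse 1/s is folded into the activations, so W @ x is unchanged
# in exact arithmetic while the salient weights get finer effective steps.
s = 4.0
scaled = group.copy()
scaled[sal] *= s
step_s = np.abs(scaled).max() / 7.0      # the outlier is still the absmax
deq = dequant_rtn(scaled, step_s)
deq[sal] /= s                            # undo the scale -> effective weights
err_aware = np.abs(deq - group)[sal].mean()
```

In expectation the salient channels' rounding error shrinks by roughly the factor s, at the cost of slightly coarser treatment elsewhere; AWQ searches for per-channel scales that balance exactly this trade-off, while GPTQ instead uses second-order information to compensate rounding errors weight by weight.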
How much quality do you lose with INT4?
With modern methods (GPTQ/AWQ), typical quality loss on MMLU-style benchmarks is 1-3% for instruction-tuned 7B-70B models — often acceptable given the 4x memory win and throughput gains.
What's the difference between INT4 quantization and QLoRA?
INT4 quantization is a weight-compression technique for inference (and for the frozen base in QLoRA). QLoRA is a fine-tuning recipe that uses 4-bit NF4 for the frozen base model and trains LoRA adapters on top.
Sources
- Frantar et al. — GPTQ: Accurate Post-Training Quantization — accessed 2026-04-20
- Lin et al. — AWQ: Activation-aware Weight Quantization — accessed 2026-04-20