Curiosity · Concept
INT4 Quantization
INT4 quantization stores each model weight in just 4 bits, representing one of 16 discrete values, instead of the 16 bits used by bfloat16 training weights. Naive round-to-nearest INT4 destroys quality, so production pipelines use algorithms like GPTQ (one-shot second-order), AWQ (activation-aware), or NF4 (QLoRA). Combined with dequantization-on-the-fly in fused kernels, INT4 shrinks a 70B model from ~140 GB to ~35 GB, making single-GPU inference and consumer-card deployment realistic.
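To make the mechanics concrete, here is a minimal NumPy sketch of the naive baseline the paragraph describes: symmetric group-wise round-to-nearest quantization to the 16 signed 4-bit levels [-8, 7], followed by on-the-fly dequantization. The function names and the group size of 128 are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int4(w, group_size=128):
    """Naive symmetric group-wise INT4 quantization (round-to-nearest sketch).

    Each group of `group_size` weights shares one float scale; codes live
    in [-8, 7], the 16 levels a signed 4-bit integer can represent.
    """
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map group absmax -> 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit codes
    return q, scale

def dequantize_int4(q, scale):
    """Dequantize on the fly: multiply each code by its group's scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096,)).astype(np.float32)

q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale).reshape(-1)

# Round-to-nearest error is bounded by half a quantization step per group.
err = np.abs(w - w_hat)
```

In a real inference kernel the codes stay packed two-per-byte in GPU memory and this dequantization happens inside the fused matmul; the sketch only shows the numerics.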
Quick reference
- Proficiency: Intermediate
- Also known as: 4-bit quantization, INT4
- Prerequisites: Quantization, LLM inference
Frequently asked questions
What is INT4 quantization?
INT4 quantization represents model weights using 4 bits per value — 16 discrete levels per scaling group — instead of 16-bit floats. That gives roughly 4x compression versus bfloat16 training weights.
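The compression arithmetic behind the "roughly 4x" figure is worth spelling out; a small back-of-the-envelope sketch (the 70B size and group size of 128 are illustrative assumptions):

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
params = 70e9

bf16_gb = params * 2 / 1e9     # bfloat16: 2 bytes per weight -> 140 GB
int4_gb = params * 0.5 / 1e9   # INT4: 4 bits = 0.5 bytes     -> 35 GB

# Real INT4 checkpoints also store one fp16 scale per scaling group;
# with a typical group size of 128 that adds 2/128 bytes per weight (~3%).
group_size = 128
int4_with_scales_gb = params * (0.5 + 2 / group_size) / 1e9  # ~36.1 GB
```

The scale overhead is why compression is "roughly" rather than exactly 4x, and why smaller group sizes trade a little extra memory for finer-grained scales.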
Which INT4 algorithms matter in practice?
GPTQ (one-shot second-order quantization), AWQ (activation-aware weight quantization), and NF4 (the data type used in QLoRA) are the three most common. Each calibrates quantization scales so that the weights that matter most are preserved better than naive round-to-nearest would manage.
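A toy NumPy sketch of the intuition behind activation-aware calibration (all names and constants are illustrative, not AWQ's actual algorithm): multiply a small-but-salient group of weights by a factor s before quantizing and fold 1/s into the activations. The layer output is unchanged in exact arithmetic, but those weights now use finer effective quantization steps.

```python
import numpy as np

def dequant_rtn(w, step):
    """Round-to-nearest INT4 with one shared group scale; returns dequantized w."""
    return np.clip(np.round(w / step), -8, 7) * step

rng = np.random.default_rng(0)
group = rng.normal(scale=0.1, size=128)  # one 128-weight quantization group
group[0] = 4.0                           # outlier weight fixes the group absmax
sal = slice(1, 33)                       # hypothetical "salient" channels

step = np.abs(group).max() / 7.0         # group scale is set by the outlier

# Naive RTN: small salient weights land on a grid sized for the outlier.
err_naive = np.abs(dequant_rtn(group, step) - group)[sal].mean()

# Activation-aware trick: scale salient weights up by s before quantizing;
# the inverse 1/s is folded into the activations, so W @ x is unchanged
# in exact arithmetic while the salient weights get finer effective steps.
s = 4.0
scaled = group.copy()
scaled[sal] *= s
step_s = np.abs(scaled).max() / 7.0      # the outlier is still the absmax
deq = dequant_rtn(scaled, step_s)
deq[sal] /= s                            # undo the scale -> effective weights
err_aware = np.abs(deq - group)[sal].mean()
```

In expectation the salient channels' rounding error shrinks by roughly the factor s, at the cost of slightly coarser treatment elsewhere; AWQ searches for per-channel scales that balance exactly this trade-off, while GPTQ instead uses second-order information to compensate rounding errors weight by weight.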
How much quality do you lose with INT4?
With modern methods (GPTQ/AWQ), typical quality loss on MMLU-style benchmarks is 1-3% for instruction-tuned 7B-70B models — often acceptable given the 4x memory win and throughput gains.
What's the difference between INT4 quantization and QLoRA?
INT4 quantization is a weight-compression technique for inference (and for the frozen base in QLoRA). QLoRA is a fine-tuning recipe that uses 4-bit NF4 for the frozen base model and trains LoRA adapters on top.
Sources
- Frantar et al. — GPTQ: Accurate Post-Training Quantization — accessed 2026-04-20
- Lin et al. — AWQ: Activation-aware Weight Quantization — accessed 2026-04-20