Curiosity · Concept

Quantization

Modern LLMs are trained in 16- or 32-bit floating point but are almost always quantized for deployment. Quantization maps weight values to lower-precision formats (INT8, INT4, NF4, FP8), cutting model memory 2-8x and accelerating matrix multiplies on compatible hardware. Done well, quality loss is negligible; done poorly, reasoning and long-tail knowledge degrade.
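The core mapping can be sketched in a few lines. A minimal, illustrative symmetric per-tensor INT8 scheme (not any particular library's implementation — real kernels quantize per channel or per group and handle outliers more carefully):

```python
def quantize_int8(weights):
    """Map float weights to int8 codes plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

w = [0.42, -1.27, 0.05, 0.98]
codes, scale = quantize_int8(w)
w_hat = dequantize(codes, scale)
# Round-trip error per weight is at most scale / 2
```

Storing one int8 code per weight (plus one float scale per tensor) is where the ~2x saving over FP16 comes from; INT4 halves it again.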

Quick reference

Proficiency
Intermediate
Also known as
weight quantization, low-bit inference, model compression
Prerequisites
Neural networks (basics), Numerical representation / floating point

Frequently asked questions

What is model quantization?

It is the process of reducing the bit-width used to store and compute a model's weights (and sometimes activations). A 70B-parameter model in FP16 takes ~140 GB; quantized to 4 bits, the same model takes ~35 GB and often fits on a single GPU.
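The arithmetic behind those numbers is simple; a quick helper (using decimal gigabytes, as the round figures above assume):

```python
def model_memory_gb(n_params, bits_per_weight):
    """Approximate weight memory in decimal GB. Ignores overhead such as
    the KV cache, activations, and quantization scale factors."""
    return n_params * bits_per_weight / 8 / 1e9

print(model_memory_gb(70e9, 16))  # FP16:  140.0 GB
print(model_memory_gb(70e9, 4))   # 4-bit:  35.0 GB
```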

What's the difference between INT8, INT4, and FP8?

INT8 and INT4 are integer formats: floating-point weights are mapped into an integer range via a scale factor (and sometimes a zero point). FP8 (E4M3/E5M2) is a true floating-point format, supported natively on NVIDIA Hopper/Ada GPUs, that preserves more dynamic range. FP8 is also used for training; INT4 is mostly inference-only.
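How a scale factor plus a zero point maps floats into an integer range can be shown with a toy asymmetric (affine) scheme; the function names and the per-tensor simplification are mine, and real INT8/INT4 kernels apply this per channel or per group:

```python
def quantize_affine(weights, bits):
    """Toy affine quantization of a float list to a signed integer range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)  # shifts lo onto qmin
    codes = [max(qmin, min(qmax, round(w / scale) + zero_point))
             for w in weights]
    return codes, scale, zero_point

def dequantize_affine(codes, scale, zero_point):
    return [(c - zero_point) * scale for c in codes]
```

With `bits=4` there are only 16 representable levels, which is why 4-bit methods lean so heavily on careful grouping and calibration.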

What are GPTQ, AWQ, and GGUF?

GPTQ and AWQ are post-training quantization algorithms: GPTQ rounds weights to minimize per-layer reconstruction error, while AWQ rescales channels to protect the weights that matter most to activations, so quality loss stays minimal. GGUF is a file format used by llama.cpp for efficient CPU/GPU inference with various quantization levels (Q4_K_M, Q5_K_M, etc.). Together they are the standard ways quantized open-source LLMs ship.
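As one concrete example, loading a model with on-the-fly 4-bit NF4 quantization via Hugging Face transformers and bitsandbytes looks roughly like this (the model ID is a placeholder, and exact argument names can shift between library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# "some-org/some-model" is a placeholder, not a real checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantize the scale factors
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-model",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Pre-quantized GPTQ/AWQ checkpoints and GGUF files skip this step: the low-bit weights are already baked into the download.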

How much quality do I lose with quantization?

INT8 is usually indistinguishable from FP16. 4-bit schemes (GPTQ, AWQ, NF4) typically lose 1-3% on benchmarks. Below 4 bits (3-bit, 2-bit) you start seeing real degradation, especially on reasoning-heavy tasks and rare-knowledge questions. Large mixture-of-experts (MoE) models tend to be more robust to quantization than dense models.
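The NF4 idea mentioned above — non-uniform levels spaced for roughly normal weight distributions — can be sketched as codebook quantization. The 8 levels below are illustrative only; the actual NF4 codebook is a fixed 16-level table of standard-normal quantiles:

```python
# Toy codebook quantization in the spirit of NF4: normalize by absmax,
# then snap each weight to the nearest codebook level.
LEVELS = [-1.0, -0.6, -0.35, -0.15, 0.0, 0.2, 0.5, 1.0]  # illustrative

def quantize_codebook(weights):
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - w / absmax))
             for w in weights]
    return codes, absmax

def dequantize_codebook(codes, absmax):
    return [LEVELS[c] * absmax for c in codes]
```

Because the levels cluster near zero, where most trained weights live, a non-uniform codebook wastes fewer of its scarce 4-bit codes than a uniform integer grid.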

Sources

  1. Frantar et al. — GPTQ — accessed 2026-04-20
  2. Lin et al. — AWQ: Activation-aware Weight Quantization — accessed 2026-04-20
  3. Dettmers et al. — LLM.int8() — accessed 2026-04-20
  4. Hugging Face — Quantization overview — accessed 2026-04-20