Curiosity · Concept
INT8 Quantization
INT8 quantization represents each weight as an 8-bit integer plus a per-tensor or per-channel scale factor. It is the default conservative compression level for LLMs: memory roughly halves versus FP16/BF16, INT8 matmuls run at higher throughput on A100/H100 tensor cores, and quality stays within fractions of a percent of the original for most models. LLM.int8() (Dettmers et al., 2022) showed that INT8 works even at the 175B-parameter scale, using a mixed-precision trick that keeps a small fraction of outlier activations in FP16.
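The core mechanic can be shown in a few lines. This is a minimal sketch of symmetric per-tensor quantization in NumPy; the function names are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8: q = round(w / scale), scale = max|w| / 127."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Round-to-nearest error is bounded by half a quantization step (scale / 2).
```

Per-channel quantization works the same way, except the scale is computed along one axis (one scale per output channel), which tightens the error bound for channels with small dynamic range.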
Quick reference
- Proficiency: Beginner
- Also known as: 8-bit quantization, INT8
- Prerequisites: Quantization
Frequently asked questions
What is INT8 quantization?
INT8 quantization maps weights (and optionally activations) from floating point to 8-bit integers using a scale factor derived from the tensor's value range, typically via simple max-abs statistics or a calibration pass. It is a lossy compression that usually costs very little accuracy on LLMs.
When should I use INT8 vs INT4?
Use INT8 as a conservative first step: roughly 2x memory savings and throughput gains with almost no quality drop. Use INT4 when you need to fit a model into tightly constrained memory and can accept slightly more degradation.
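The memory arithmetic is easy to sanity-check. A back-of-the-envelope weight footprint for a hypothetical 7B-parameter model (weights only; KV cache, activations, and the small scale-factor overhead are ignored):

```python
params = 7e9                     # 7B parameters
gib = 2**30                      # bytes per GiB

fp16_gib = params * 2.0 / gib    # 2 bytes per weight -> ~13.0 GiB
int8_gib = params * 1.0 / gib    # 1 byte per weight  -> ~6.5 GiB
int4_gib = params * 0.5 / gib    # 4 bits per weight  -> ~3.3 GiB
```

In practice INT4 also carries per-group scales (e.g. one FP16 scale per 128 weights), so the real footprint sits slightly above the raw 4-bit figure.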
What does LLM.int8() do differently?
LLM.int8() (Dettmers et al.) keeps the rare 'outlier' activation dimensions in FP16 while running the rest in INT8. This recovers quality on very large models (175B+) where naive INT8 would otherwise collapse.
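The decomposition can be sketched in NumPy. This is a simplified toy version: the paper uses vector-wise scales and an outlier threshold of 6.0, while here per-tensor scales and that same threshold are assumed, and the float path runs in FP32 rather than FP16:

```python
import numpy as np

def llm_int8_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Toy LLM.int8()-style matmul: outlier feature columns stay in float,
    everything else goes through an INT8 x INT8 -> INT32 integer path."""
    outliers = np.any(np.abs(x) > threshold, axis=0)   # feature dims with outliers
    # Float path for the (rare) outlier dimensions.
    y = x[:, outliers] @ w[outliers, :]
    # INT8 path for the rest: quantize both operands, matmul in integers, rescale.
    x_r, w_r = x[:, ~outliers], w[~outliers, :]
    if x_r.size:
        sx = max(np.abs(x_r).max() / 127.0, 1e-12)
        sw = max(np.abs(w_r).max() / 127.0, 1e-12)
        qx = np.round(x_r / sx).astype(np.int8)
        qw = np.round(w_r / sw).astype(np.int8)
        y = y + (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
    return y
```

Because the outlier dimensions bypass quantization entirely, a single huge activation value no longer inflates the scale for every other feature, which is exactly the failure mode that makes naive INT8 collapse at large scale.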
Is INT8 supported on most GPUs?
Yes. NVIDIA tensor cores from Turing (T4) onward natively accelerate INT8 matmuls. Most inference runtimes (TensorRT-LLM, vLLM, TGI) support INT8 weight quantization out of the box.