Curiosity · Concept
INT8 Quantization
INT8 quantization represents each weight as an 8-bit integer plus a per-tensor or per-channel scale factor. It is the default conservative compression level for LLMs: memory roughly halves versus FP16/BF16, INT8 matmuls run at higher throughput on A100/H100 tensor cores, and quality stays within fractions of a percent of the original for most models. LLM.int8() (Dettmers et al., 2022) showed that INT8 works even at the 175B-parameter scale, using a mixed-precision trick that keeps a small fraction of outlier activations in FP16.
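The core mechanic can be shown in a few lines. This is a minimal sketch of symmetric per-tensor quantization in NumPy; the function names are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8: q = round(w / scale), scale = max|w| / 127."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Round-to-nearest error is bounded by half a quantization step (scale / 2).
```

Per-channel quantization works the same way, except the scale is computed along one axis (one scale per output channel), which tightens the error bound for channels with small dynamic range.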
Quick reference
- Proficiency: Beginner
- Also known as: 8-bit quantization, INT8
- Prerequisites: Quantization
Frequently asked questions
What is INT8 quantization?
INT8 quantization maps weights (and optionally activations) from floating point to 8-bit integers using a scale factor derived from the tensor's value range, typically via simple max-abs statistics or a calibration pass. It is a lossy compression that usually costs very little accuracy on LLMs.
When should I use INT8 vs INT4?
Use INT8 as a conservative first step: roughly 2x memory savings and throughput gains with almost no quality drop. Use INT4 when you need to fit a model into tightly constrained memory and can accept slightly more degradation.
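The memory arithmetic is easy to sanity-check. A back-of-the-envelope weight footprint for a hypothetical 7B-parameter model (weights only; KV cache, activations, and the small scale-factor overhead are ignored):

```python
params = 7e9                     # 7B parameters
gib = 2**30                      # bytes per GiB

fp16_gib = params * 2.0 / gib    # 2 bytes per weight -> ~13.0 GiB
int8_gib = params * 1.0 / gib    # 1 byte per weight  -> ~6.5 GiB
int4_gib = params * 0.5 / gib    # 4 bits per weight  -> ~3.3 GiB
```

In practice INT4 also carries per-group scales (e.g. one FP16 scale per 128 weights), so the real footprint sits slightly above the raw 4-bit figure.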
What does LLM.int8() do differently?
LLM.int8() (Dettmers et al.) keeps the rare 'outlier' activation dimensions in FP16 while running the rest in INT8. This recovers quality on very large models (175B+) where naive INT8 would otherwise collapse.
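The decomposition can be sketched in NumPy. This is a simplified toy version: the paper uses vector-wise scales and an outlier threshold of 6.0, while here per-tensor scales and that same threshold are assumed, and the float path runs in FP32 rather than FP16:

```python
import numpy as np

def llm_int8_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Toy LLM.int8()-style matmul: outlier feature columns stay in float,
    everything else goes through an INT8 x INT8 -> INT32 integer path."""
    outliers = np.any(np.abs(x) > threshold, axis=0)   # feature dims with outliers
    # Float path for the (rare) outlier dimensions.
    y = x[:, outliers] @ w[outliers, :]
    # INT8 path for the rest: quantize both operands, matmul in integers, rescale.
    x_r, w_r = x[:, ~outliers], w[~outliers, :]
    if x_r.size:
        sx = max(np.abs(x_r).max() / 127.0, 1e-12)
        sw = max(np.abs(w_r).max() / 127.0, 1e-12)
        qx = np.round(x_r / sx).astype(np.int8)
        qw = np.round(w_r / sw).astype(np.int8)
        y = y + (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
    return y
```

Because the outlier dimensions bypass quantization entirely, a single huge activation value no longer inflates the scale for every other feature, which is exactly the failure mode that makes naive INT8 collapse at large scale.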
Is INT8 supported on most GPUs?
Yes. NVIDIA tensor cores from Turing (T4) onward natively accelerate INT8 matmuls. Most inference runtimes (TensorRT-LLM, vLLM, TGI) support INT8 weight quantization out of the box.