Curiosity · Concept

GGUF Format

GGUF (GPT-Generated Unified Format) is the successor to the GGML library's older GGJT format, introduced by the llama.cpp project in August 2023. A single .gguf file bundles model architecture metadata, the tokenizer, the chat template, and quantized tensor blocks, so a user needs only one artifact to run a model on CPU, Apple Silicon (Metal), or any GPU backend that llama.cpp targets. GGUF is the dominant format for local LLM inference: Ollama, LM Studio, Jan, and llamafile all load GGUF under the hood, and quant schemes such as Q4_K_M, Q5_K_M, and Q8_0 are llama.cpp conventions distributed almost exclusively as GGUF files.
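The single-file layout begins with a small fixed header before any metadata or tensor data. A minimal parsing sketch in Python, assuming the layout documented in the GGUF specification (magic bytes b"GGUF", a uint32 version, then uint64 tensor and metadata-key counts, all little-endian); the helper name is illustrative, not part of any library:

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, counts.

    Layout per the GGUF spec: 4 magic bytes b"GGUF", uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count (all little-endian).
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}
```

Immediately after this header come the metadata key/value pairs (architecture name, tokenizer vocabulary, chat template) and the tensor info table, which is what makes a single .gguf file self-describing.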

Quick reference

Proficiency
Beginner
Also known as
GGUF, GPT-Generated Unified Format
Prerequisites
Quantization

Frequently asked questions

What is GGUF?

GGUF is a single-file binary format used by llama.cpp for quantized LLMs. One .gguf file contains the architecture config, tokenizer, chat template, and quantized tensor blocks — everything needed to load and run the model.

How does GGUF differ from safetensors?

Safetensors is a generic, framework-agnostic tensor format for full-precision or Hugging Face-style quantized weights. GGUF is purpose-built for llama.cpp's quant schemes, ships tokenizer/chat template inline, and is optimized for CPU + Metal + mmap loading.
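The mmap point is that llama.cpp can map a .gguf file into memory and read tensor blocks in place, paging them from disk on demand rather than copying everything into a buffer up front. A small illustration of the mechanism itself (not llama.cpp's actual loader) using Python's standard mmap module:

```python
import mmap

def map_readonly(path):
    """Memory-map a file read-only; slices fault pages in lazily, no upfront copy."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Usage sketch: mm = map_readonly("model.gguf"); mm[0:4] would be b"GGUF".
```

Because the OS page cache backs the mapping, a second process loading the same model reuses the already-cached pages, which is one reason GGUF loads are fast on repeated runs.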

What do the quant names like Q4_K_M mean?

They name llama.cpp's quantization schemes, most of them K-quants. Q4_K_M means 4-bit weights in the K-quant super-block layout, with the M (medium) variant keeping some critical tensors at higher precision. Smaller schemes (Q2_K) trade quality for memory; larger ones (Q6_K, Q8_0) are near-FP16 quality.
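As a back-of-envelope check, file size scales with effective bits per weight. A rough Python sketch; Q8_0's 8.5 bpw follows exactly from its 34-byte blocks of 32 weights, while the K-quant figures (which include block scales and the mixed-precision upcasts) are ballpark values, not exact:

```python
# Approximate effective bits per weight for common llama.cpp schemes.
# Q8_0 is exact (34-byte blocks of 32 weights = 8.5 bpw); the K-quant
# figures are ballpark, since scales and upcast tensors add overhead.
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def estimate_gib(n_params, scheme):
    """Rough weight-data size in GiB for a model with n_params parameters."""
    return n_params * BPW[scheme] / 8 / 2**30
```

For example, a 7B-parameter model at Q4_K_M works out to roughly 4 GiB of weight data; real files run slightly larger because of metadata and the upcast tensors.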

Which tools consume GGUF?

llama.cpp is the reference implementation. Ollama, LM Studio, Jan, llamafile, and KoboldCpp all load GGUF files. TheBloke (and successors) publish tens of thousands of GGUF conversions on Hugging Face.

Sources

  1. llama.cpp — GGUF specification — accessed 2026-04-20
  2. Gerganov — llama.cpp repository — accessed 2026-04-20