Capability · Framework — inference
llama.cpp
llama.cpp by Georgi Gerganov kicked off the whole local-LLM movement in 2023 and remains its backbone. It runs quantised GGUF models on CPU, Apple Metal, CUDA, ROCm, Vulkan, and Intel GPUs — no Python, no CUDA toolkit, no Docker. Ollama, LM Studio, Jan, GPT4All, and most other 'run LLMs locally' apps bundle llama.cpp under the hood.
Framework facts
- Category
- inference
- Language
- C / C++
- License
- MIT
- Repository
- https://github.com/ggml-org/llama.cpp
Install
# macOS
brew install llama.cpp
# Or build from source (llama.cpp builds with CMake)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release
Quickstart
# Download a GGUF model from HF, then:
llama-cli -m ./llama-3.1-8b-q4_k_m.gguf \
-p "Write a haiku about RAG:" -n 64
# Or run the OpenAI-compatible server
llama-server -m ./model.gguf --port 8080
Alternatives
- Ollama — friendlier CLI wrapper around llama.cpp
- MLC LLM — cross-platform compiler approach
- vLLM — server-focused, Python-based
- Candle — Rust alternative from Hugging Face
Frequently asked questions
llama.cpp or Ollama?
Ollama is llama.cpp with a model registry, CLI, and daemon — easier for most users. llama.cpp gives finer control (custom quantisation, new model architectures, server tuning) and a smaller install footprint. Use Ollama for personal use, llama.cpp when you want to embed inference in a product.
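Embedding llama.cpp in a product often means talking to llama-server over HTTP. A minimal sketch of querying its OpenAI-compatible chat endpoint, assuming a server is already running on port 8080 as in the quickstart (the prompt and token limit are illustrative):

```shell
# Requires: llama-server -m ./model.gguf --port 8080 running locally
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a haiku about RAG:"}],
    "max_tokens": 64
  }'
```

Because the endpoint follows the OpenAI API shape, existing OpenAI client libraries can be pointed at the local server by overriding the base URL.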
What quantisation should I use?
Q4_K_M is the usual sweet spot for 7-14B models on consumer hardware — near-full quality at roughly a quarter of fp16's memory footprint. Q6_K and Q5_K_M are close to fp16 quality. Q2_K and Q3_K_M degrade quality noticeably on smaller models.
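Producing one of these quantisations yourself is a two-step process with the tools that ship in the llama.cpp repo: convert the Hugging Face checkpoint to an fp16 GGUF, then quantise it. A sketch, with hypothetical model directory and file names:

```shell
# 1. Convert HF weights to fp16 GGUF (script lives in the llama.cpp repo;
#    ./llama-3.1-8b is a hypothetical local checkout of the HF checkpoint)
python convert_hf_to_gguf.py ./llama-3.1-8b --outfile llama-3.1-8b-f16.gguf

# 2. Quantise to the Q4_K_M scheme discussed above
llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
```

Run `llama-quantize --help` to list every supported quantisation type.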
Sources
- llama.cpp on GitHub — accessed 2026-04-20