Capability · Framework — inference

llama.cpp

llama.cpp by Georgi Gerganov kicked off the whole local-LLM movement in 2023 and remains its backbone. It runs quantised GGUF models on CPU, Apple Metal, CUDA, ROCm, Vulkan, and Intel GPUs — no Python, no CUDA toolkit, no Docker. Ollama, LM Studio, Jan, GPT4All, and most other 'run LLMs locally' apps bundle llama.cpp under the hood.

Framework facts

Category
inference
Language
C / C++
License
MIT
Repository
https://github.com/ggml-org/llama.cpp

Install

# macOS
brew install llama.cpp
# Or build from source (the old Makefile build was removed; use CMake)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release

Quickstart

# Download a GGUF model from HF, then:
llama-cli -m ./llama-3.1-8b-q4_k_m.gguf \
    -p "Write a haiku about RAG:" -n 64

# Or run the OpenAI-compatible server
llama-server -m ./model.gguf --port 8080
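llama-server speaks the standard OpenAI chat-completions protocol, so any OpenAI client works against it. A minimal sketch of a raw request, assuming the server started above is listening on localhost:8080 (the `|| true` just keeps the snippet harmless when no server is running):

```shell
# Minimal chat request against llama-server's OpenAI-compatible endpoint.
BODY='{"messages":[{"role":"user","content":"Write a haiku about RAG"}],"max_tokens":64}'

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$BODY" || true
```

The response mirrors OpenAI's shape (`choices[0].message.content` holds the completion), which is what lets existing OpenAI SDKs point at a local llama-server with only a base-URL change.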

Alternatives

  • Ollama — friendlier CLI wrapper around llama.cpp
  • MLC LLM — cross-platform compiler approach
  • vLLM — server-focused, Python-based
  • Candle — Rust alternative from Hugging Face

Frequently asked questions

llama.cpp or Ollama?

Ollama is llama.cpp plus a model registry, CLI, and daemon, which makes it easier for most users. llama.cpp gives finer control (custom quantisation, new model architectures, server tuning) and a smaller install footprint. Pick Ollama for personal use, llama.cpp when you want to embed inference in a product.
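The "finer control" point is concrete: llama-server exposes knobs that Ollama abstracts away. A sketch of one tuned invocation (flag names as in recent llama.cpp builds; confirm against `llama-server --help` for your version):

```shell
# Illustrative llama-server tuning:
#   --ctx-size      context window in tokens
#   --n-gpu-layers  transformer layers to offload to the GPU (99 ≈ all)
#   --parallel      number of concurrent request slots
llama-server -m ./model.gguf --port 8080 \
  --ctx-size 8192 --n-gpu-layers 99 --parallel 4
```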

What quantisation should I use?

Q4_K_M is the usual sweet spot for 7-14B models on consumer hardware — near-full quality at 1/4 memory. Q6_K and Q5_K_M are close to fp16 quality. Q2_K and Q3_K_M degrade quality noticeably on smaller models.
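The memory claim follows from simple arithmetic: file size is roughly parameters × bits-per-weight ÷ 8 bytes. A sketch with a hypothetical `estimate_gib` helper; the bits-per-weight figures are approximations, since K-quants mix precisions and Q4_K_M lands a little above 4 bits per weight in practice:

```shell
# Back-of-the-envelope GGUF size: parameters × bits-per-weight / 8 bytes.
estimate_gib() {  # usage: estimate_gib <param_count> <bits_per_weight>
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1024^3 }'
}

estimate_gib 8000000000 16    # fp16 baseline for an 8B model
estimate_gib 8000000000 4.8   # roughly Q4_K_M
```

For an 8B model this works out to about 15 GiB at fp16 versus roughly 4.5 GiB at Q4_K_M, which is why the quarter-memory rule of thumb holds.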

Sources

  1. llama.cpp on GitHub — accessed 2026-04-20