Capability · Framework — inference

llama-cpp-python

llama-cpp-python by Andrei Betlen is the most widely used Python binding for llama.cpp. It exposes a low-level API that mirrors llama.cpp's C interface (including token-level sampling), a high-level `Llama` class, and a bundled FastAPI server that speaks OpenAI's chat-completions API, so you can point any OpenAI client at a local GGUF model. Metal, CUDA, Vulkan, and HIP backends are supported via CMake build flags.

Framework facts

Category
inference
Language
Python / C++
License
MIT
Repository
https://github.com/abetlen/llama-cpp-python

Install

# Basic (CPU)
pip install llama-cpp-python
# CUDA build
CMAKE_ARGS='-DGGML_CUDA=on' pip install llama-cpp-python --force-reinstall --no-cache-dir

Quickstart

from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id='bartowski/Meta-Llama-3.1-8B-Instruct-GGUF',
    filename='*Q4_K_M.gguf',
    n_gpu_layers=-1,
)
out = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': 'hi'}],
)
print(out['choices'][0]['message']['content'])
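The OpenAI-compatible server mentioned above is shipped as an optional extra; a minimal sketch of launching it, assuming a local GGUF file at `./model.gguf` (a placeholder path):

```shell
# Install the server extra, then serve a local GGUF model
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model ./model.gguf --n_gpu_layers -1

# Any OpenAI client can now target http://localhost:8000/v1, e.g.:
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "hi"}]}'
```

The server defaults to port 8000; existing OpenAI SDK code only needs its `base_url` pointed at the local endpoint.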

Alternatives

  • Ollama — higher-level runtime
  • ctransformers — similar Python bindings for GGML-family models (no longer actively maintained)
  • vLLM — high-throughput GPU serving

Frequently asked questions

llama-cpp-python or Ollama?

Ollama is easier if you just want to chat with a local model. llama-cpp-python is what you reach for when you need fine-grained Python control — custom sampling, tool-call parsing, embedding extraction, or integration into a larger Python application.
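As one example of that fine-grained control, embedding extraction is a sketch like the following. The model path is an assumption (any local GGUF embedding-capable model works); the `cosine` helper is a hypothetical addition for comparing the returned vectors, and the model-dependent code only runs if the file exists:

```python
# Sketch: embedding extraction via llama-cpp-python (model path is assumed).
from pathlib import Path

try:
    from llama_cpp import Llama
except ImportError:
    Llama = None  # llama-cpp-python not installed

MODEL_PATH = Path('model.gguf')  # hypothetical local GGUF file

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

if Llama is not None and MODEL_PATH.exists():
    # embedding=True switches the model into embedding mode
    llm = Llama(model_path=str(MODEL_PATH), embedding=True, verbose=False)
    resp = llm.create_embedding(['a cat', 'a kitten'])
    vecs = [d['embedding'] for d in resp['data']]
    print(cosine(vecs[0], vecs[1]))
```

The response follows the OpenAI embeddings shape (`data[i]['embedding']`), so the vectors drop into existing similarity-search code.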

How do I get GPU acceleration?

Install with the CMake flag for your backend (`GGML_CUDA`, `GGML_METAL`, `GGML_VULKAN`, `GGML_HIPBLAS`). A plain `pip install llama-cpp-python` from PyPI builds from source for CPU; prebuilt wheels for some backend/Python combinations are published on the project's own wheel index (see the install docs).
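The other backends follow the same pattern as the CUDA install shown earlier; a sketch, assuming the matching toolchain (Xcode, Vulkan SDK, or ROCm) is already installed:

```shell
# Metal (Apple Silicon macOS)
CMAKE_ARGS='-DGGML_METAL=on' pip install llama-cpp-python --force-reinstall --no-cache-dir
# Vulkan
CMAKE_ARGS='-DGGML_VULKAN=on' pip install llama-cpp-python --force-reinstall --no-cache-dir
# HIP (AMD ROCm)
CMAKE_ARGS='-DGGML_HIPBLAS=on' pip install llama-cpp-python --force-reinstall --no-cache-dir
```

`--force-reinstall --no-cache-dir` ensures pip rebuilds the native extension instead of reusing a cached CPU-only wheel.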

Sources

  1. llama-cpp-python docs — accessed 2026-04-20
  2. llama-cpp-python GitHub — accessed 2026-04-20