Capability · Framework — inference
llama-cpp-python
llama-cpp-python, maintained by Andrei Betlen, is the de facto standard Python binding for llama.cpp. It exposes low-level token sampling, a high-level `Llama` class, and ships a FastAPI server that speaks OpenAI's chat-completions API, so you can point any OpenAI client at your local GGUF model. Metal, CUDA, Vulkan, and HIP backends are all supported via build flags.
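A minimal sketch of the server workflow described above; the model path and port here are placeholders, not values from this page:

```shell
# Install the server extra and launch the OpenAI-compatible endpoint
# (./model.gguf and port 8000 are placeholder assumptions).
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model ./model.gguf --port 8000

# Any OpenAI client can now target http://localhost:8000/v1, e.g.:
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "hi"}]}'
```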
Framework facts
- Category
- inference
- Language
- Python / C++
- License
- MIT
- Repository
- https://github.com/abetlen/llama-cpp-python
Install
# Basic (CPU)
pip install llama-cpp-python
# CUDA build
CMAKE_ARGS='-DGGML_CUDA=on' pip install llama-cpp-python --force-reinstall --no-cache-dir
Quickstart
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id='bartowski/Meta-Llama-3.1-8B-Instruct-GGUF',
filename='*Q4_K_M.gguf',
n_gpu_layers=-1,
)
print(llm.create_chat_completion([{'role':'user','content':'hi'}]))
Alternatives
- Ollama — higher-level local runtime with built-in model management
- ctransformers — similar Python bindings for GGML/GGUF models
- vLLM — high-throughput GPU inference server
Frequently asked questions
llama-cpp-python or Ollama?
Ollama is easier if you just want to chat with a local model. llama-cpp-python is what you reach for when you need fine-grained Python control: custom sampling, tool-call parsing, embedding extraction, or integrating the runtime into a larger Python application.
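A sketch of the kind of per-call control meant here, using the `Llama` class; the model path is an assumption, and you need a local GGUF file for this to run:

```python
from llama_cpp import Llama

# ./model.gguf is a placeholder path; embedding=True enables
# create_embedding on the same loaded model.
llm = Llama(model_path="./model.gguf", embedding=True, n_gpu_layers=-1)

# Custom sampling: temperature, top-p, and token budget set per call.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Name three colors."}],
    temperature=0.2,
    top_p=0.9,
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])

# Embedding extraction from the same model.
emb = llm.create_embedding("hello world")["data"][0]["embedding"]
print(len(emb))
```

This is the trade-off in practice: Ollama hides these knobs behind its own API, while llama-cpp-python surfaces them directly as Python arguments.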
How do I get GPU acceleration?
Install with the right CMake flag for your backend (`GGML_CUDA`, `GGML_METAL`, `GGML_VULKAN`, `GGML_HIPBLAS`). Prebuilt wheels for common combinations are published on PyPI.
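For reference, the non-CUDA backends follow the same pattern as the CUDA install shown above; these commands assume a recent llama.cpp where the flags carry the `GGML_` prefix:

```shell
# Pick the flag matching your backend (one command, not all three):
CMAKE_ARGS='-DGGML_METAL=on'   pip install llama-cpp-python --force-reinstall --no-cache-dir
CMAKE_ARGS='-DGGML_VULKAN=on'  pip install llama-cpp-python --force-reinstall --no-cache-dir
CMAKE_ARGS='-DGGML_HIPBLAS=on' pip install llama-cpp-python --force-reinstall --no-cache-dir
```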
Sources
- llama-cpp-python docs — accessed 2026-04-20
- llama-cpp-python GitHub — accessed 2026-04-20