Capability · Framework — local inference
llamafile
llamafile is a Mozilla Ocho project that combines llama.cpp with Cosmopolitan Libc to produce a single Actually Portable Executable (APE): one file that runs as a native binary on Linux, macOS, Windows, and the BSDs, so the same file serves a local LLM on nearly any desktop OS with no dependencies. It has become the easiest way to hand someone a model they can double-click, and has influenced how llama.cpp distributes releases.
Framework facts
- Category
- local inference
- Language
- C / C++
- License
- Apache-2.0 (embedded model weights keep their own licenses)
- Repository
- https://github.com/Mozilla-Ocho/llamafile
Install
# macOS/Linux — download a llamafile and chmod +x
curl -LO https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/resolve/main/Llama-3.2-3B-Instruct.Q6_K.llamafile
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile
Quickstart
# Run the model — opens a local chat UI and OpenAI-compatible API
./Llama-3.2-3B-Instruct.Q6_K.llamafile
# → http://localhost:8080 (chat + /v1/chat/completions)
Alternatives
- Ollama — model registry + CLI manager
- LM Studio — desktop app
- llama.cpp — raw inference library
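The server started in Quickstart exposes an OpenAI-compatible endpoint at /v1/chat/completions, so it can be exercised with plain curl. A minimal sketch, assuming the default port 8080; the model name and prompt are illustrative (a single-model server accepts but largely ignores the model field):

```shell
# Build the request body with a heredoc (no jq needed).
BODY=$(cat << 'EOF'
{
  "model": "Llama-3.2-3B-Instruct",
  "messages": [
    {"role": "user", "content": "Say hello in five words."}
  ],
  "temperature": 0.2
}
EOF
)

# POST to the local OpenAI-compatible endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$BODY"
```

Because the protocol matches OpenAI's, existing OpenAI client libraries work by pointing their base URL at http://localhost:8080/v1.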
Frequently asked questions
Do I need Python or CUDA?
No. A llamafile is a single binary with weights baked in. Metal on Apple Silicon and CUDA/ROCm on discrete GPUs are auto-detected if present, otherwise it falls back to CPU.
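GPU offload can also be requested explicitly rather than auto-detected, via the llama.cpp layer-offload flag that llamafiles pass through; a sketch (999 is shorthand for "offload every layer"):

```shell
# Offload all model layers to the detected GPU (Metal/CUDA/ROCm);
# without a usable GPU the run falls back or errors depending on build.
./Llama-3.2-3B-Instruct.Q6_K.llamafile -ngl 999
```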
Can I package my own model as a llamafile?
Yes. Use the repo's zipalign tool to embed a GGUF (and, optionally, a .args file of default flags) into the bare llamafile-<version> engine; the repo has docs and a release script.
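The packaging flow can be sketched end to end as follows; the engine filename, model filename, and flag set are illustrative, and zipalign here is the tool built from the llamafile repo, not Android's:

```shell
# 1. Start from a bare llamafile engine (a release asset with no weights inside).
cp llamafile-0.8 mymodel.llamafile          # version illustrative

# 2. Write the default CLI arguments the binary should launch with, one per line.
cat > .args << 'EOF'
-m
mymodel.Q6_K.gguf
EOF

# 3. Embed the GGUF weights and .args into the binary's zip section, aligned for mmap.
./zipalign -j0 mymodel.llamafile mymodel.Q6_K.gguf .args

# 4. Ship the result: one executable, weights included.
chmod +x mymodel.llamafile
./mymodel.llamafile
```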
Sources
- llamafile GitHub — accessed 2026-04-20
- Mozilla blog — llamafile — accessed 2026-04-20