Capability · Framework — local inference

llamafile

llamafile is a Mozilla Ocho project that combines llama.cpp with Cosmopolitan Libc to produce a single "Actually Portable Executable" (APE) — one file that runs a local LLM on Linux, macOS, Windows, and the BSDs with no dependencies to install. It has become one of the easiest ways to hand someone a model they can simply double-click, and has influenced how llama.cpp distributes its own releases.

Framework facts

Category
local inference
Language
C / C++
License
Apache-2.0 (bundled model weights remain under their own licenses)
Repository
https://github.com/Mozilla-Ocho/llamafile

Install

# macOS/Linux — download a llamafile and chmod +x
# (Windows: download the same file and rename it to end in .exe)
curl -LO https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/resolve/main/Llama-3.2-3B-Instruct.Q6_K.llamafile
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile

Quickstart

# Run the model — opens a local chat UI and OpenAI-compatible API
./Llama-3.2-3B-Instruct.Q6_K.llamafile
# → http://localhost:8080  (chat + /v1/chat/completions)
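Because the server exposes an OpenAI-compatible endpoint, any OpenAI-style client can talk to it. A minimal stdlib-only sketch, assuming the default port 8080 from the quickstart above (the `build_payload` helper and the `"local"` model name are illustrative choices, not part of llamafile's API):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # llamafile's default listen address

def build_payload(prompt: str, temperature: float = 0.7) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": "local",  # a single local model is served; the name is not used for routing
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST to the llamafile server and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (with the llamafile running): print(chat("Say hello in five words."))
```

The same endpoint also works with the official `openai` SDK by pointing its `base_url` at `http://localhost:8080/v1`.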

Alternatives

  • Ollama — model registry + CLI manager
  • LM Studio — desktop app
  • llama.cpp — raw inference library

Frequently asked questions

Do I need Python or CUDA?

No. A llamafile is a single binary with the weights baked in. GPU support (Metal on Apple Silicon, CUDA or ROCm on discrete GPUs) is detected at runtime if present; otherwise it falls back to CPU inference.

Can I package my own model as a llamafile?

Yes — start from the bare llamafile engine shipped in each release, then use the repo's zipalign tool to embed your GGUF weights (plus a .args file of default flags) into it; the repo has docs and a release script.
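The mechanism behind this is that a llamafile is an executable with a zip section appended, from which the engine reads the weights. A rough sketch of that idea using Python's stdlib zipfile (all file names and contents below are placeholders; the real workflow uses the repo's zipalign so the GGUF ends up page-aligned for mmap, which plain zip storage does not guarantee):

```python
import pathlib
import zipfile

# Placeholders standing in for the real release engine and quantized weights.
engine = pathlib.Path("mymodel.llamafile")
engine.write_bytes(b"\x7fELF-placeholder-engine")        # in reality: cp llamafile-<version> mymodel.llamafile
pathlib.Path("mymodel.gguf").write_bytes(b"GGUF-placeholder-weights")
pathlib.Path(".args").write_text("-m\nmymodel.gguf\n")   # default CLI args, one per line

# Append a zip section holding the weights and default args onto the
# executable. zipfile mode "a" on a non-zip file appends a new archive,
# leaving the executable bytes at the front untouched.
with zipfile.ZipFile(engine, "a", compression=zipfile.ZIP_STORED) as zf:
    zf.write("mymodel.gguf")
    zf.write(".args")
```

After this, the file is still a valid executable at the front and a valid zip at the back — which is why the repo's zipalign step (with compression disabled and alignment applied) is the supported way to do it.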

Sources

  1. llamafile GitHub — accessed 2026-04-20
  2. Mozilla blog — llamafile — accessed 2026-04-20