Capability · Framework — inference
MLC LLM
MLC LLM's premise is simple: compile an LLM once and run it anywhere — iOS, Android, browsers via WebGPU, Windows/Mac/Linux desktops, AMD/NVIDIA/Apple Silicon. Built on Apache TVM, it is one of the leading stacks for on-device LLM inference. Its WebLLM sibling project runs quantized models client-side in WebGPU-capable browsers such as Chrome and Edge, and the stack is widely used for offline apps and privacy-sensitive deployments.
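Prebuilt model artifacts published by MLC encode the quantization scheme in the model ID, as in Llama-3.1-8B-Instruct-q4f16_1-MLC from the quickstart below: q4f16_1 means 4-bit quantized weights, float16 compute, scheme version 1. A small sketch of pulling those fields apart (the regex and field names here are my own illustration, not part of MLC's API):

```python
import re

# MLC model IDs end with "-q<bits>f<dtype>_<ver>-MLC", e.g. "q4f16_1":
# 4-bit weights, float16 activations, scheme version 1.
PATTERN = re.compile(r"-q(\d+)f(\d+)_(\d+)-MLC$")

def parse_mlc_id(model_id: str) -> dict:
    """Split an MLC model ID into base name and quantization fields."""
    m = PATTERN.search(model_id)
    if m is None:
        raise ValueError(f"not an MLC quantized model ID: {model_id}")
    return {
        "base": model_id[: m.start()],
        "weight_bits": int(m.group(1)),
        "compute_dtype": f"float{m.group(2)}",
        "version": int(m.group(3)),
    }

info = parse_mlc_id("Llama-3.1-8B-Instruct-q4f16_1-MLC")
print(info["base"], info["weight_bits"], info["compute_dtype"])
# Llama-3.1-8B-Instruct 4 float16
```

The same scheme covers variants like q4f32_1 (float32 compute) and q0f16 (unquantized float16 weights).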
Framework facts
- Category
- inference
- Language
- Python / C++ / JavaScript
- License
- Apache 2.0
- Repository
- https://github.com/mlc-ai/mlc-llm
Install
pip install mlc-llm mlc-ai
# or for web
npm install @mlc-ai/web-llm
Quickstart
// Web-LLM: run Llama in the browser
import * as webllm from '@mlc-ai/web-llm';
const engine = await webllm.CreateMLCEngine(
'Llama-3.1-8B-Instruct-q4f16_1-MLC',
{ initProgressCallback: p => console.log(p) }
);
const resp = await engine.chat.completions.create({
messages: [{ role: 'user', content: 'Hello' }]
});
Alternatives
- llama.cpp — CPU-first, mature
- Ollama — local model manager, uses llama.cpp
- ONNX Runtime — alternative cross-platform runtime
- Candle — Rust inference from Hugging Face
Frequently asked questions
Is browser LLM actually usable?
Yes, for small models (1-8B). WebGPU on modern Chrome/Edge/Safari runs quantised 3-8B models at usable speeds on consumer laptops. The practical ceiling is memory, not compute — most devices can hold 4-6GB worth of weights.
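That memory ceiling can be estimated from first principles: 4-bit weights take params × 4/8 bytes, plus some overhead for per-group quantization scales. A back-of-the-envelope sketch (the 12.5% scale overhead is a rough assumption on my part, not a figure MLC publishes):

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: int,
                        scale_overhead: float = 0.125) -> float:
    """Rough memory for weights alone: params * bits/8 bytes, plus an
    assumed ~12.5% for per-group quantization scales. Excludes KV cache."""
    raw_bytes = params_billion * 1e9 * bits_per_weight / 8
    return raw_bytes * (1 + scale_overhead) / 1e9

# An 8B model at 4-bit lands around 4.5 GB of weights,
# right at the 4-6 GB ceiling of typical consumer devices.
print(f"{quantized_weight_gb(8, 4):.1f} GB")
```

The KV cache adds more on top, which is why long contexts push even 8B models past what a laptop GPU can hold.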
How is this different from llama.cpp?
llama.cpp is a C++ runtime that targets CPUs first, with optional GPU offload. MLC compiles models through TVM to many backends, including WebGPU and mobile GPUs. For web and mobile deployment MLC is the stronger choice; for servers and macOS desktops llama.cpp has broader adoption.
Sources
- MLC LLM — docs — accessed 2026-04-20
- Web-LLM — accessed 2026-04-20