Capability · Framework — inference

MLC LLM

MLC LLM's premise is simple: compile an LLM once and run it anywhere: iOS, Android, browsers via WebGPU, Windows/Mac/Linux desktops, AMD/NVIDIA/Apple Silicon GPUs. Built on Apache TVM, it is one of the leading stacks for on-device LLM inference. Its Web-LLM project pioneered running LLMs client-side in the browser over WebGPU, and MLC is widely used for offline apps and privacy-sensitive deployments.

Framework facts

Category
inference
Language
Python / C++ / JavaScript
License
Apache 2.0
Repository
https://github.com/mlc-ai/mlc-llm

Install

# prebuilt nightly wheels; see https://mlc.ai/wheels for per-platform variants
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
# or for web
npm install @mlc-ai/web-llm

Quickstart

// Web-LLM: run Llama in the browser
import * as webllm from '@mlc-ai/web-llm';

const engine = await webllm.CreateMLCEngine(
  'Llama-3.1-8B-Instruct-q4f16_1-MLC',
  { initProgressCallback: p => console.log(p) }
);
const resp = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }]
});
console.log(resp.choices[0].message.content);
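The pip packages above expose an OpenAI-style Python API as well. A minimal sketch based on MLC's documented `MLCEngine` interface; the model id is one of MLC's prebuilt weight repos on Hugging Face and is downloaded on first use:

```python
# MLC LLM: Python quickstart (requires a supported GPU and downloads weights)
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
):
    for choice in chunk.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()
```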

Alternatives

  • llama.cpp — CPU-first, mature
  • Ollama — local model manager, uses llama.cpp
  • ONNX Runtime — alternative cross-platform runtime
  • Candle — Rust inference from Hugging Face

Frequently asked questions

Is browser LLM actually usable?

Yes, for small models (roughly 1-8B parameters). WebGPU on modern Chrome/Edge/Safari runs 4-bit-quantised 3-8B models at usable speeds on consumer laptops. The practical ceiling is memory, not compute: most consumer devices can hold only 4-6 GB of weights in GPU-accessible memory.
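The memory ceiling is easy to estimate with back-of-envelope arithmetic (plain Python, not MLC-specific): a 4-bit quantised model needs roughly 4-5 bits per parameter once per-group scale overhead is included, which is the ballpark of layouts like q4f16_1.

```python
def quantised_weight_gb(n_params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate on-device weight footprint in GB for a quantised model.

    bits_per_weight ~4.5 assumes 4-bit weights plus per-group
    scale/zero-point overhead; KV cache and activations come on top.
    """
    n_params = n_params_b * 1e9
    return n_params * bits_per_weight / 8 / 1e9

for size_b in (3, 8):
    print(f"{size_b}B params -> ~{quantised_weight_gb(size_b):.1f} GB of weights")
# An 8B model lands around 4.5 GB, right at the edge of what a
# consumer laptop GPU can hold alongside the KV cache.
```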

How is this different from llama.cpp?

llama.cpp is a hand-written C++ engine that targets CPUs first, with GPU acceleration via backends such as Metal and CUDA. MLC instead compiles models through TVM to many backends, including WebGPU and mobile GPUs. For web and mobile deployment MLC is the stronger choice; for servers and macOS desktops llama.cpp has broader adoption and tooling.

Sources

  1. MLC LLM — docs — accessed 2026-04-20
  2. Web-LLM — accessed 2026-04-20