Capability · Framework — serving

OpenLLM

OpenLLM lets you run `openllm serve qwen2.5:7b` and get an OpenAI-compatible chat endpoint. Under the hood it uses BentoML to package the model into a Bento, which can be deployed to Kubernetes, BentoCloud, or serverless GPUs. The CLI fits nicely into the 'try many models locally, then productionise the winner' workflow.

Framework facts

Category
serving
Language
Python
License
Apache-2.0
Repository
https://github.com/bentoml/OpenLLM

Install

pip install openllm
openllm hello          # pick a model to try

Quickstart

# serve a model
openllm serve llama3.2:3b

# call from python
from openai import OpenAI
client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')
print(client.chat.completions.create(
    model='llama3.2:3b',
    messages=[{'role':'user','content':'hello'}]
).choices[0].message.content)
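Because the endpoint is OpenAI-compatible, you can also skip the `openai` package and speak raw HTTP. The sketch below shows what the client above sends under the hood; the base URL assumes the default `openllm serve` address (`http://localhost:3000`), and `chat` only works against a running server.

```python
# Sketch of the raw HTTP request behind the OpenAI-client quickstart.
# Assumes the default `openllm serve` address; adjust base_url if needed.
import json
import urllib.request

def build_chat_request(model, prompt, base_url='http://localhost:3000/v1'):
    """Build the OpenAI-style chat-completions URL and JSON body."""
    url = f'{base_url}/chat/completions'
    payload = {
        'model': model,
        'messages': [{'role': 'user', 'content': prompt}],
    }
    return url, payload

def chat(model, prompt):
    """Send the request. Requires a running `openllm serve` instance."""
    url, payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Same shape the OpenAI client exposes as .choices[0].message.content
    return body['choices'][0]['message']['content']
```

This is handy for environments where you cannot install the `openai` package, or for debugging exactly what goes over the wire.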

Alternatives

  • Ollama — simpler desktop-first peer
  • vLLM — higher-performance serving engine
  • LM Studio — GUI-first alternative
  • LocalAI — similar OpenAI-compatible proxy

Frequently asked questions

OpenLLM vs Ollama?

Ollama optimises for developer ergonomics on a laptop; OpenLLM optimises for turning the served model into a deployable Bento for production. If you intend to ship, OpenLLM saves a step.

Does OpenLLM need BentoML to deploy?

No — you can run `openllm serve` as a plain Docker container. BentoML is an option for packaging, not a requirement.
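One possible containerisation is a plain Dockerfile around the CLI. This is a hypothetical sketch, not an official image: it assumes `openllm` installs cleanly from PyPI and that the server listens on its default port 3000.

```dockerfile
# Hypothetical container for `openllm serve` -- not an official image.
FROM python:3.11-slim
RUN pip install --no-cache-dir openllm
# Default OpenLLM port
EXPOSE 3000
# Model weights are pulled on first start; models of this size
# typically need a GPU-enabled runtime (e.g. `docker run --gpus all`).
CMD ["openllm", "serve", "llama3.2:3b"]
```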

Sources

  1. OpenLLM — docs — accessed 2026-04-20
  2. OpenLLM GitHub — accessed 2026-04-20