Capability · Framework — model serving
BentoML
BentoML is a Python-first framework for productionising ML and AI models. You define a Service with typed inputs and outputs, BentoML packages it as a 'Bento' (a self-contained bundle of model, code, and dependencies), and serves it as a high-performance HTTP / gRPC / streaming server with adaptive micro-batching. OpenLLM is its dedicated LLM wrapper, and BentoCloud hosts Bentos with GPU autoscaling. It is used heavily by teams that serve mixed ML / LLM workloads.
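The adaptive micro-batching idea can be sketched in plain Python (illustrative only, not BentoML's actual implementation): requests that arrive within a short window are grouped and handed to the model as one batch, flushing early once the batch is full.

```python
import time
from queue import Queue, Empty


def micro_batch(queue, max_batch_size=8, max_wait_s=0.01):
    """Collect queued requests into one batch: wait up to max_wait_s,
    but flush early once max_batch_size items have arrived."""
    batch = [queue.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window elapsed: flush whatever we have
        try:
            batch.append(queue.get(timeout=remaining))
        except Empty:
            break  # no more requests arrived in time
    return batch


q = Queue()
for text in ['alpha', 'beta', 'gamma']:
    q.put(text)

print(micro_batch(q))  # → ['alpha', 'beta', 'gamma']
```

The trade-off is latency versus throughput: a longer window yields bigger batches (better GPU utilisation) at the cost of added per-request latency; the names and parameters here are hypothetical.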
Framework facts
- Category
- model serving
- Language
- Python
- License
- Apache-2.0
- Repository
- https://github.com/bentoml/BentoML
Install
pip install bentoml
Quickstart
import bentoml

@bentoml.service(resources={'gpu': 1})
class Summariser:
    @bentoml.api
    def summarise(self, text: str) -> str:
        # placeholder logic: naive truncation to the first 100 characters
        return text[:100] + '…'
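Once the service is running (e.g. via bentoml serve service:Summariser), the API method is exposed over plain HTTP. A hedged sketch, assuming BentoML's default port 3000 and its JSON-kwargs request convention:

```shell
# Assumes the quickstart service above is running locally on the default port 3000.
curl -s -X POST http://localhost:3000/summarise \
  -H 'Content-Type: application/json' \
  -d '{"text": "BentoML packages the model and code into a Bento and serves it."}'
```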
# bentoml serve service:Summariser
Alternatives
- Ray Serve — Python-native
- KServe — K8s-native
- NVIDIA Triton — GPU-heavy
Frequently asked questions
BentoML or Ray Serve?
BentoML is simpler if you just want to serve models with nice HTTP ergonomics. Ray Serve wins if you already run on Ray clusters or need fine-grained actor composition. For LLM-only workloads, vLLM + OpenLLM on BentoML is a common combo.
What is OpenLLM?
OpenLLM is BentoML's LLM wrapper that turns Hugging Face / vLLM models into OpenAI-compatible Bentos with one command — see our OpenLLM page.
Sources
- BentoML docs — accessed 2026-04-20
- BentoML GitHub — accessed 2026-04-20