Capability · Framework — serving

BentoML

BentoML is a Python-first framework for productionising ML and AI models. You define a Service with typed inputs and outputs; BentoML packages it as a 'Bento' (a self-contained bundle of model and code) and runs it as a high-performance HTTP / gRPC / streaming server with adaptive micro-batching. OpenLLM is the dedicated LLM wrapper, and BentoCloud hosts Bentos with GPU autoscaling. It is used heavily by teams serving mixed ML / LLM workloads.
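Adaptive micro-batching means requests that arrive within a short window are grouped and executed as one batch, trading a few milliseconds of latency for much higher GPU throughput. The sketch below is NOT BentoML's implementation — the class name, parameters, and logic are illustrative stand-ins for the idea only:

```python
import threading

class ToyMicroBatcher:
    """Illustrative sketch of adaptive micro-batching (not BentoML's code):
    requests arriving within a short window are grouped into one batch."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_ms=5):
        self.batch_fn = batch_fn            # processes a list of inputs at once
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self._pending = []                  # (input, done-event, result-slot)
        self._lock = threading.Lock()

    def submit(self, item):
        done = threading.Event()
        slot = [None]
        with self._lock:
            self._pending.append((item, done, slot))
            flush_now = len(self._pending) >= self.max_batch_size
        if flush_now:
            self._flush()
        else:
            # Wait briefly for more requests to accumulate, then flush.
            done.wait(self.max_wait_ms / 1000)
            if not done.is_set():
                self._flush()
        done.wait()
        return slot[0]

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, []
        if not batch:
            return
        results = self.batch_fn([item for item, _, _ in batch])
        for (_, done, slot), result in zip(batch, results):
            slot[0] = result
            done.set()
```

In a real server the batch function would be a model forward pass; here any list-in, list-out callable works.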

Framework facts

Category
serving
Language
Python
License
Apache-2.0
Repository
https://github.com/bentoml/BentoML

Install

pip install bentoml

Quickstart

import bentoml

@bentoml.service(resources={'gpu': 1})
class Summariser:
    @bentoml.api
    def summarise(self, text: str) -> str:
        # Placeholder logic: truncate to 100 characters
        return text[:100] + '…'

# bentoml serve service:Summariser
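Once served, each `@bentoml.api` method is exposed as an HTTP endpoint accepting JSON keyword arguments. A minimal stdlib sketch of constructing such a request — the port (3000, BentoML's default) and the `/summarise` route derived from the method name are assumptions to verify against your `bentoml serve` output:

```python
import json
import urllib.request

# Build (but do not send) a request against the running service.
# Port 3000 and the /summarise route are assumptions -- check the
# `bentoml serve` startup log for the actual address and paths.
payload = json.dumps({"text": "BentoML is a Python-first framework."}).encode()
req = urllib.request.Request(
    "http://localhost:3000/summarise",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the server running: urllib.request.urlopen(req).read()
```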

Alternatives

  • Ray Serve — Python-native
  • KServe — K8s-native
  • NVIDIA Triton — GPU-heavy

Frequently asked questions

BentoML or Ray Serve?

BentoML is simpler if you just want to serve models with nice HTTP ergonomics. Ray Serve wins if you already run on Ray clusters or need fine-grained actor composition. For LLM-only workloads, vLLM + OpenLLM on BentoML is a common combo.

What is OpenLLM?

OpenLLM is BentoML's LLM wrapper that turns Hugging Face / vLLM models into OpenAI-compatible Bentos with one command — see our OpenLLM page.
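Because the server is OpenAI-compatible, any standard chat-completions payload works against it. A sketch of the request body — the model name is a placeholder, and the host/port depend on how the server was launched:

```python
import json

# Shape of a request an OpenAI-compatible server accepts.
# "my-model" is a placeholder -- use whatever model the server loaded.
body = {
    "model": "my-model",
    "messages": [
        {"role": "user", "content": "Summarise BentoML in one line."}
    ],
    "max_tokens": 64,
}
encoded = json.dumps(body)
# POST this to <server>/v1/chat/completions with Content-Type: application/json
```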

Sources

  1. BentoML docs — accessed 2026-04-20
  2. BentoML GitHub — accessed 2026-04-20