Capability · Framework — LLM serving
Ray Serve LLM
Ray Serve LLM (`ray.serve.llm`) is the production-grade way to serve LLMs on Ray. It provides an LLMServer deployment that wraps vLLM, Ray autoscaling that adjusts replicas to traffic, model multiplexing via LoRA, and a router that speaks OpenAI's API. It's the engine behind Anyscale's hosted endpoints.
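The autoscaling mentioned above can be sketched as follows. This is a simplified model, not Serve's actual scheduler: Serve targets a configurable number of ongoing requests per replica and clamps the result to the configured bounds. The function and parameter names here are illustrative, not Ray's API:

```python
import math

def desired_replicas(total_ongoing_requests: int,
                     target_per_replica: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified sketch of request-based autoscaling: scale to
    handle current load at the target per-replica concurrency,
    clamped to the configured bounds."""
    if total_ongoing_requests <= 0:
        return min_replicas
    wanted = math.ceil(total_ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# With min=1, max=4 (as in the quickstart below) and a target of
# 8 ongoing requests per replica:
print(desired_replicas(0, 8, 1, 4))    # idle -> stays at the floor: 1
print(desired_replicas(20, 8, 1, 4))   # ceil(20/8) = 3
print(desired_replicas(100, 8, 1, 4))  # capped at max_replicas: 4
```

The real policy also smooths over a time window to avoid thrashing, but the clamp-to-bounds shape is the core idea.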
Framework facts
- Category
- LLM serving
- Language
- Python
- License
- Apache-2.0
- Repository
- https://github.com/ray-project/ray
Install
pip install 'ray[serve-llm]' vllm

Quickstart
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
llm_config = LLMConfig(
    model_loading_config={'model_id': 'meta-llama/Llama-3.2-3B-Instruct'},
    deployment_config={'autoscaling_config': {'min_replicas': 1, 'max_replicas': 4}},
    engine_kwargs={'tensor_parallel_size': 1},
)
app = build_openai_app({'llm_configs': [llm_config]})
serve.run(app)  # OpenAI-compatible server at :8000

Alternatives
- vLLM standalone — simpler single-node
- TGI + Kubernetes — HF-native stack
- NVIDIA Triton — broader ML serving
- BentoML — deployment-focused peer
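Because the router speaks OpenAI's API, any OpenAI client can talk to the quickstart server. A minimal sketch of the chat-completions request body a client would POST to `http://localhost:8000/v1/chat/completions` — the field names are OpenAI's; the port and model id come from the quickstart above:

```python
import json

def chat_request(model: str, user_message: str, max_tokens: int = 128) -> str:
    """Build an OpenAI-style chat completions request body as JSON."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = chat_request("meta-llama/Llama-3.2-3B-Instruct", "Hello!")
print(payload)
```

The same request shape works through the official `openai` Python client by pointing its `base_url` at the router instead of api.openai.com.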
Frequently asked questions
Do I need Anyscale to use this?
No — Ray Serve LLM is open-source and runs on any Ray cluster (including KubeRay). Anyscale is Ray's managed commercial platform but not a requirement.
Ray Serve LLM vs vLLM?
vLLM is the inference engine. Ray Serve LLM adds multi-replica autoscaling, multi-model routing, and the OpenAI-compatible API on top. For more than one replica or model, Ray is the natural step up.
Sources
- Ray Serve LLM docs — accessed 2026-04-20
- Ray GitHub — accessed 2026-04-20