Capability · Framework — serving

Ray Serve LLM

Ray Serve LLM (`ray.serve.llm`) is the production-grade way to serve LLMs on Ray. It provides an LLMServer deployment that wraps vLLM, Ray autoscaling that adjusts replicas to traffic, model multiplexing via LoRA, and a router that speaks OpenAI's API. It's the engine behind Anyscale's hosted endpoints.

Framework facts

Category
serving
Language
Python
License
Apache-2.0
Repository
https://github.com/ray-project/ray

Install

pip install 'ray[serve-llm]' vllm

Quickstart

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={'model_id': 'meta-llama/Llama-3.2-3B-Instruct'},
    deployment_config={'autoscaling_config': {'min_replicas': 1, 'max_replicas': 4}},
    engine_kwargs={'tensor_parallel_size': 1},
)
app = build_openai_app({'llm_configs': [llm_config]})
serve.run(app)  # OpenAI-compatible server at :8000
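Once the app is running, clients talk to it over the OpenAI chat-completions wire format. A minimal sketch of such a client using only the standard library, assuming the Quickstart server is up on the default port 8000 (the endpoint path follows the OpenAI convention; `build_chat_request` and `query` are illustrative helper names, not part of the Ray API):

```python
import json
import urllib.request


def build_chat_request(model_id: str, prompt: str) -> bytes:
    """Build an OpenAI-style chat-completions payload as JSON bytes."""
    body = {
        'model': model_id,  # must match model_id in model_loading_config
        'messages': [{'role': 'user', 'content': prompt}],
        'stream': False,
    }
    return json.dumps(body).encode('utf-8')


def query(prompt: str, base_url: str = 'http://localhost:8000') -> str:
    """Send one chat request to the running Ray Serve app (server must be up)."""
    req = urllib.request.Request(
        f'{base_url}/v1/chat/completions',
        data=build_chat_request('meta-llama/Llama-3.2-3B-Instruct', prompt),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    return reply['choices'][0]['message']['content']
```

Because the payload is plain OpenAI format, the official `openai` client also works by pointing its `base_url` at the Ray Serve endpoint.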

Alternatives

  • vLLM standalone — simpler single-node
  • TGI + Kubernetes — HF-native stack
  • NVIDIA Triton — broader ML serving
  • BentoML — deployment-focused peer

Frequently asked questions

Do I need Anyscale to use this?

No — Ray Serve LLM is open-source and runs on any Ray cluster (including KubeRay). Anyscale is Ray's managed commercial platform but not a requirement.

Ray Serve LLM vs vLLM?

vLLM is the inference engine. Ray Serve LLM adds multi-replica autoscaling, multi-model routing, and the OpenAI-compatible API on top. For more than one replica or model, Ray is the natural step up.
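To make the multi-model routing concrete, a hedged configuration sketch serving two models behind one router, following the same `LLMConfig`/`build_openai_app` usage as the Quickstart. The second model ID is illustrative only, and the exact autoscaling values are assumptions, not recommendations:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Sketch: two models behind one OpenAI-compatible router.
# The router dispatches on the `model` field of each incoming request.
llama = LLMConfig(
    model_loading_config={'model_id': 'meta-llama/Llama-3.2-3B-Instruct'},
    deployment_config={'autoscaling_config': {'min_replicas': 1, 'max_replicas': 4}},
)
# Illustrative second model, not taken from the Ray docs.
qwen = LLMConfig(
    model_loading_config={'model_id': 'Qwen/Qwen2.5-7B-Instruct'},
    deployment_config={'autoscaling_config': {'min_replicas': 0, 'max_replicas': 2}},
)

app = build_openai_app({'llm_configs': [llama, qwen]})
serve.run(app)  # clients select a model via the request's `model` field
```

Each model scales independently; setting `min_replicas: 0` lets a rarely used model scale to zero between requests.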

Sources

  1. Ray Serve LLM docs — accessed 2026-04-20
  2. Ray GitHub — accessed 2026-04-20