Capability · Framework — rag
Llama Stack
Llama Stack is Meta's attempt to create a common contract for the AI application stack. Instead of binding your code to a particular provider, you call a stable set of APIs — inference, safety, RAG, tool runtime, agents, evals — implemented by plug-in providers (Ollama, Fireworks, Together, NVIDIA, or local). The reference server makes it easy to swap implementations behind the same code.
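The "stable contract, swappable providers" idea can be sketched in miniature. This is an illustrative toy, not Llama Stack's actual internals; every class and function name here is invented for the example:

```python
from typing import Protocol


class InferenceProvider(Protocol):
    """The contract: any backend (Ollama, Fireworks, Together, ...) implements it."""

    def chat(self, model: str, prompt: str) -> str: ...


class FakeLocalProvider:
    """Stand-in for a local backend such as Ollama."""

    def chat(self, model: str, prompt: str) -> str:
        return f"[local/{model}] echo: {prompt}"


class FakeHostedProvider:
    """Stand-in for a hosted backend."""

    def chat(self, model: str, prompt: str) -> str:
        return f"[hosted/{model}] echo: {prompt}"


def ask(provider: InferenceProvider, prompt: str) -> str:
    # Application code depends only on the contract,
    # so the backing provider can be swapped without code changes.
    return provider.chat("llama-3.2-3b", prompt)


print(ask(FakeLocalProvider(), "Hello"))
print(ask(FakeHostedProvider(), "Hello"))
```

The same pattern applies to each API surface (inference, safety, RAG, agents): the client code targets the contract, and the server's configured providers supply the implementation.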
Framework facts
- Category: rag
- Language: Python / TypeScript / Swift / Kotlin
- License: MIT
- Repository: https://github.com/meta-llama/llama-stack
Install

```shell
pip install llama-stack
llama stack build --template ollama --image-type venv
llama stack run ollama
```

Quickstart
```python
from llama_stack_client import LlamaStackClient

# Point the client at the locally running stack server
client = LlamaStackClient(base_url="http://localhost:5001")

resp = client.inference.chat_completion(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.completion_message.content)
```

Alternatives
- LangChain + LangGraph — richer ecosystem, less standardisation
- LiteLLM — just the inference-provider abstraction
- OpenAI SDK — de facto standard inference API
- Vercel AI SDK — JS-centric alternative
Frequently asked questions
Is Llama Stack only for Llama models?
No — despite the name, providers can plug in any model. Llama Stack is about the API contract, not the weights.
Should I replace LangChain with Llama Stack?
They operate at different levels. Llama Stack is the plumbing (providers + capabilities); LangChain / LangGraph are the orchestration on top. Many teams use both together.
Sources
- Llama Stack — docs — accessed 2026-04-20
- Llama Stack GitHub — accessed 2026-04-20