Capability · Framework — rag

Llama Stack

Llama Stack is Meta's attempt to create a common contract for the AI application stack. Instead of binding your code to a particular provider, you call a stable set of APIs — inference, safety, RAG, tool runtime, agents, evals — implemented by plug-in providers (Ollama, Fireworks, Together, NVIDIA, or local). The reference server makes it easy to swap implementations behind the same code.
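The value of that contract is easiest to see in miniature. The sketch below is plain Python with no Llama Stack dependency; `InferenceProvider`, `EchoProvider`, `ShoutProvider`, and `ask` are illustrative names, not Llama Stack APIs. It shows the shape of the idea: application code targets one stable interface, and implementations swap behind it.

```python
from typing import Protocol


class InferenceProvider(Protocol):
    """The stable contract that application code depends on."""
    def chat(self, model: str, prompt: str) -> str: ...


class EchoProvider:
    """A stand-in 'local' provider; a real one would call Ollama, etc."""
    def chat(self, model: str, prompt: str) -> str:
        return f'[{model}] {prompt}'


class ShoutProvider:
    """A second implementation of the same contract."""
    def chat(self, model: str, prompt: str) -> str:
        return f'[{model}] {prompt.upper()}'


def ask(provider: InferenceProvider, prompt: str) -> str:
    # Application code: written once against the contract,
    # unchanged when the provider behind it is swapped.
    return provider.chat('demo-model', prompt)


print(ask(EchoProvider(), 'hello'))   # same call ...
print(ask(ShoutProvider(), 'hello'))  # ... different provider
```

Swapping `EchoProvider` for `ShoutProvider` changes behaviour without touching `ask` — the same decoupling Llama Stack's provider plug-ins give you at the level of inference, safety, RAG, and the other APIs.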

Framework facts

Category: rag
Language: Python / TypeScript / Swift / Kotlin
License: MIT
Repository: https://github.com/meta-llama/llama-stack

Install

pip install llama-stack
llama stack build --template ollama --image-type venv
llama stack run ollama

Quickstart

from llama_stack_client import LlamaStackClient

# Point the client at a running server (`llama stack run` prints the port).
client = LlamaStackClient(base_url='http://localhost:5001')
resp = client.inference.chat_completion(
    model_id='meta-llama/Llama-3.2-3B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
print(resp.completion_message.content)
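Since this entry's category is RAG: the retrieval loop that Llama Stack's RAG and vector APIs abstract over can be sketched in plain Python. No Llama Stack dependency here, and the word-overlap scoring is a toy stand-in for the embedding-based relevance a real provider would supply; all names are illustrative.

```python
# Toy in-memory retrieval: index documents, score them against a query,
# and return the best matches to place into the prompt as context.

def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(docs: list[str], query: str, k: int = 2) -> list[str]:
    q = tokenize(query)
    # Rank documents by word overlap with the query (toy relevance score);
    # a real provider would compare embedding vectors in a vector store.
    ranked = sorted(docs, key=lambda d: len(tokenize(d) & q), reverse=True)
    return ranked[:k]


docs = [
    'Llama Stack exposes inference, safety, and RAG APIs.',
    'Providers include Ollama, Fireworks, and Together.',
    'The capital of France is Paris.',
]
context = retrieve(docs, 'Which providers does Llama Stack support?')
prompt = 'Context:\n' + '\n'.join(context) + '\n\nQuestion: ...'
```

The point of routing this through a standard API rather than hand-rolling it is the same as for inference: the indexing and retrieval backend (local, hosted, or a different vector database) can change without the application code changing.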

Alternatives

  • LangChain + LangGraph — richer ecosystem, less standardisation
  • LiteLLM — just the inference-provider abstraction
  • OpenAI SDK — the de facto standard inference API
  • Vercel AI SDK — JS-centric alternative

Frequently asked questions

Is Llama Stack only for Llama models?

No — despite the name, providers can plug in any model. Llama Stack is about the API contract, not the weights.

Should I replace LangChain with Llama Stack?

They operate at different levels. Llama Stack is the plumbing (providers + capabilities); LangChain / LangGraph are the orchestration on top. Many teams use both together.

Sources

  1. Llama Stack — docs — accessed 2026-04-20
  2. Llama Stack GitHub — accessed 2026-04-20