Hugging Face Inference Endpoints
Inference Endpoints let you pick a model from the Hugging Face Hub, choose a GPU and region, and get a production HTTPS endpoint in minutes. Under the hood it uses TGI for LLMs, TEI for embeddings, and the Inference Toolkit for classic transformer tasks, with autoscaling and SSO built in.
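A deployed endpoint is ultimately just an authenticated HTTPS service. A minimal sketch of the request TGI expects on its /generate route, using only the standard library; the endpoint URL and token below are placeholders, and the live call itself is left out:

```python
import json
import urllib.request

# Placeholder — substitute your endpoint's URL from the Endpoints UI.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"

def build_generate_request(prompt: str, token: str, max_new_tokens: int = 128):
    """Assemble the POST request TGI expects on its /generate route."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"{ENDPOINT_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_generate_request("Hello", token="hf_xxx")
# Sending it would be urllib.request.urlopen(req) — skipped here,
# since there is no live endpoint behind the placeholder URL.
print(req.full_url.endswith("/generate"))  # True
```

In practice the huggingface_hub client shown in the Quickstart wraps this request/response cycle for you.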
Framework facts
- Category
- Model serving / inference
- Language
- HTTP / Python client
- License
- Proprietary (runtimes are Apache-2.0)
- Documentation
- https://huggingface.co/docs/inference-endpoints/
Install
pip install huggingface_hub
huggingface-cli login

Quickstart
from huggingface_hub import InferenceClient
client = InferenceClient(base_url='https://<your-endpoint>.endpoints.huggingface.cloud')
resp = client.chat_completion(
messages=[{'role': 'user', 'content': 'Summarise VSET in one sentence.'}],
max_tokens=128,
)
print(resp.choices[0].message.content)

Alternatives
- Modal — serverless GPU for custom code
- Replicate — model marketplace
- Baseten — model serving platform
- Together.ai — hosted open models
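The intro mentions TEI for embeddings; an embeddings endpoint speaks the same JSON-over-HTTPS protocol as the chat Quickstart. A hedged sketch of TEI's /embed request and response shape — the URL, token, and response bytes are placeholders:

```python
import json
import urllib.request

def build_embed_request(endpoint_url: str, texts: list, token: str):
    """TEI's /embed route takes a JSON body of the form {"inputs": [...]}."""
    return urllib.request.Request(
        endpoint_url.rstrip("/") + "/embed",
        data=json.dumps({"inputs": texts}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def parse_embed_response(raw: bytes):
    """TEI returns one vector (a list of floats) per input string."""
    return json.loads(raw)

req = build_embed_request(
    "https://<your-endpoint>.endpoints.huggingface.cloud",
    ["hello", "world"],
    token="hf_xxx",
)
# Illustrative response bytes — a real endpoint returns model-sized vectors.
vectors = parse_embed_response(b"[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]")
print(len(vectors), len(vectors[0]))  # 2 3
```

With huggingface_hub installed, `client.feature_extraction("hello")` against the same base_url does this round trip for you.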
Frequently asked questions
Endpoints vs the free Inference API?
The free API is rate-limited and shared. Endpoints are dedicated hardware you control — predictable latency, higher throughput, private networking, and SLAs.
Can I deploy a private fine-tune?
Yes. Push the fine-tuned model to a private Hub repo, then deploy it as a private endpoint with your org credentials.
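That flow can be sketched with the huggingface_hub API. The repo id, checkpoint path, and hardware values below are illustrative, and the imports are deferred so the helper definitions run without a network call:

```python
def endpoint_name(repo_id: str) -> str:
    # Endpoint names are DNS-style: derive one from the repo name,
    # lowercased, with underscores replaced by hyphens.
    return repo_id.split("/")[-1].lower().replace("_", "-")

def deploy_private_finetune(repo_id: str, checkpoint_dir: str):
    # Deferred import: the sketch's helpers run without the package.
    from huggingface_hub import HfApi, create_inference_endpoint

    api = HfApi()  # uses the token stored by `huggingface-cli login`
    api.create_repo(repo_id, private=True, exist_ok=True)
    api.upload_folder(folder_path=checkpoint_dir, repo_id=repo_id)
    # "protected" gates the endpoint behind your org token; a fully
    # "private" endpoint needs extra VPC networking setup. Hardware
    # values are illustrative — pick them to fit the model.
    return create_inference_endpoint(
        endpoint_name(repo_id),
        repository=repo_id,
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        type="protected",
    )

print(endpoint_name("my-org/Llama3_ft"))  # llama3-ft
```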
Sources
- Inference Endpoints — docs — accessed 2026-04-20
- TGI — GitHub — accessed 2026-04-20