Capability · Framework · inference

Hugging Face Inference Endpoints

Inference Endpoints let you pick a model from the Hugging Face Hub, choose a GPU and region, and get a production HTTPS endpoint in minutes. Under the hood it uses Text Generation Inference (TGI) for LLMs, Text Embeddings Inference (TEI) for embeddings, and the Inference Toolkit for classic transformer tasks, with autoscaling and SSO built in.
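The pick-a-model, pick-a-GPU flow can also be driven programmatically via `huggingface_hub.create_inference_endpoint`. A minimal sketch; the endpoint name, hardware, vendor, and region below are illustrative assumptions (check the docs for valid combinations), and the call only fires when an HF_TOKEN is present:

```python
import os

# Deployment parameters. All values are illustrative assumptions, not
# defaults; valid vendor/region/instance combinations are listed in the
# Inference Endpoints docs.
config = {
    "repository": "mistralai/Mistral-7B-Instruct-v0.2",  # any Hub model
    "framework": "pytorch",
    "task": "text-generation",   # served by TGI
    "accelerator": "gpu",
    "vendor": "aws",
    "region": "us-east-1",
    "instance_size": "x1",
    "instance_type": "nvidia-a10g",
    "type": "protected",         # callers must present a token
}

if os.environ.get("HF_TOKEN"):  # only attempt with real credentials
    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint("my-mistral-endpoint", **config)
    endpoint.wait()  # block until the endpoint reports "running"
    print(endpoint.url)
```

`type="protected"` means the endpoint URL rejects unauthenticated requests; use `"private"` for VPC-only networking.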

Framework facts

Category
inference / model serving
Language
HTTP / Python client
License
Proprietary (runtimes are Apache-2.0)
Documentation
https://huggingface.co/docs/inference-endpoints/

Install

pip install huggingface_hub
huggingface-cli login

Quickstart

from huggingface_hub import InferenceClient

client = InferenceClient(base_url='https://<your-endpoint>.endpoints.huggingface.cloud')
resp = client.chat_completion(
    messages=[{'role': 'user', 'content': 'Summarise VSET in one sentence.'}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
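Because TGI-backed endpoints expose an OpenAI-compatible REST route, any HTTP client works too. A sketch with `requests`; the endpoint URL is the same placeholder as above, and the POST is skipped when no HF_TOKEN is set:

```python
import json
import os

ENDPOINT = "https://<your-endpoint>.endpoints.huggingface.cloud"

# OpenAI-compatible chat payload, as accepted by TGI's
# /v1/chat/completions route.
payload = {
    "messages": [{"role": "user", "content": "Summarise VSET in one sentence."}],
    "max_tokens": 128,
    "stream": False,
}
headers = {
    "Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}",
    "Content-Type": "application/json",
}

if os.environ.get("HF_TOKEN"):  # skip the network call without credentials
    import requests

    r = requests.post(
        f"{ENDPOINT}/v1/chat/completions",
        headers=headers,
        data=json.dumps(payload),
        timeout=30,
    )
    print(r.json()["choices"][0]["message"]["content"])
```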

Alternatives

  • Modal — serverless GPU for custom code
  • Replicate — model marketplace
  • Baseten — model serving platform
  • Together.ai — hosted open models

Frequently asked questions

Endpoints vs the free Inference API?

The free API is shared, rate-limited infrastructure. Endpoints give you dedicated hardware you control: predictable latency, higher throughput, private networking, and SLAs.

Can I deploy a private fine-tune?

Yes. Push the fine-tuned model to a private Hub repo, then deploy it as a private endpoint with your org credentials.
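A sketch of that flow with the Hub client; the repo id and local path are hypothetical, and nothing is uploaded or deployed unless an HF_TOKEN is set:

```python
import os

REPO_ID = "my-org/my-finetune"    # hypothetical private Hub repo
LOCAL_DIR = "./finetuned-model"   # hypothetical output of your training run

if os.environ.get("HF_TOKEN"):  # requires real org credentials
    from huggingface_hub import HfApi, create_inference_endpoint

    api = HfApi()

    # 1) Push the fine-tune to a private Hub repo.
    api.create_repo(REPO_ID, private=True, exist_ok=True)
    api.upload_folder(repo_id=REPO_ID, folder_path=LOCAL_DIR)

    # 2) Deploy it as a protected endpoint (hardware values are
    #    illustrative assumptions, as in the earlier sketch).
    endpoint = create_inference_endpoint(
        "my-finetune-endpoint",
        repository=REPO_ID,
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        type="protected",
    )
    print(endpoint.url)
```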

Sources

  1. Inference Endpoints — docs — accessed 2026-04-20
  2. TGI — GitHub — accessed 2026-04-20