Capability · Framework · inference

Hugging Face Inference Endpoints

Inference Endpoints let you pick a model from the Hugging Face Hub, choose a GPU and region, and get a production HTTPS endpoint in minutes. Under the hood it uses Text Generation Inference (TGI) for LLMs, Text Embeddings Inference (TEI) for embeddings, and the Inference Toolkit for classic transformer tasks, with autoscaling and SSO built in.
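The pick-a-model, pick-a-GPU flow can also be driven programmatically via `huggingface_hub.create_inference_endpoint`. A minimal sketch; the endpoint name, hardware, vendor, and region below are illustrative assumptions (check the docs for valid combinations), and the call only fires when an HF_TOKEN is present:

```python
import os

# Deployment parameters. All values are illustrative assumptions, not
# defaults; valid vendor/region/instance combinations are listed in the
# Inference Endpoints docs.
config = {
    "repository": "mistralai/Mistral-7B-Instruct-v0.2",  # any Hub model
    "framework": "pytorch",
    "task": "text-generation",   # served by TGI
    "accelerator": "gpu",
    "vendor": "aws",
    "region": "us-east-1",
    "instance_size": "x1",
    "instance_type": "nvidia-a10g",
    "type": "protected",         # callers must present a token
}

if os.environ.get("HF_TOKEN"):  # only attempt with real credentials
    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint("my-mistral-endpoint", **config)
    endpoint.wait()  # block until the endpoint reports "running"
    print(endpoint.url)
```

`type="protected"` means the endpoint URL rejects unauthenticated requests; use `"private"` for VPC-only networking.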

Framework facts

Category
inference / model serving
Language
HTTP / Python client
License
Proprietary (runtimes are Apache-2.0)
Documentation
https://huggingface.co/docs/inference-endpoints/

Install

pip install huggingface_hub
huggingface-cli login

Quickstart

from huggingface_hub import InferenceClient

client = InferenceClient(base_url='https://<your-endpoint>.endpoints.huggingface.cloud')
resp = client.chat_completion(
    messages=[{'role': 'user', 'content': 'Summarise VSET in one sentence.'}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
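Because TGI-backed endpoints expose an OpenAI-compatible REST route, any HTTP client works too. A sketch with `requests`; the endpoint URL is the same placeholder as above, and the POST is skipped when no HF_TOKEN is set:

```python
import json
import os

ENDPOINT = "https://<your-endpoint>.endpoints.huggingface.cloud"

# OpenAI-compatible chat payload, as accepted by TGI's
# /v1/chat/completions route.
payload = {
    "messages": [{"role": "user", "content": "Summarise VSET in one sentence."}],
    "max_tokens": 128,
    "stream": False,
}
headers = {
    "Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}",
    "Content-Type": "application/json",
}

if os.environ.get("HF_TOKEN"):  # skip the network call without credentials
    import requests

    r = requests.post(
        f"{ENDPOINT}/v1/chat/completions",
        headers=headers,
        data=json.dumps(payload),
        timeout=30,
    )
    print(r.json()["choices"][0]["message"]["content"])
```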

Alternatives

  • Modal — serverless GPU for custom code
  • Replicate — model marketplace
  • Baseten — model serving platform
  • Together.ai — hosted open models

Frequently asked questions

Endpoints vs the free Inference API?

The free API is shared, rate-limited infrastructure. Endpoints give you dedicated hardware you control: predictable latency, higher throughput, private networking, and SLAs.

Can I deploy a private fine-tune?

Yes. Push the fine-tuned model to a private Hub repo, then deploy it as a private endpoint with your org credentials.
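A sketch of that flow with the Hub client; the repo id and local path are hypothetical, and nothing is uploaded or deployed unless an HF_TOKEN is set:

```python
import os

REPO_ID = "my-org/my-finetune"    # hypothetical private Hub repo
LOCAL_DIR = "./finetuned-model"   # hypothetical output of your training run

if os.environ.get("HF_TOKEN"):  # requires real org credentials
    from huggingface_hub import HfApi, create_inference_endpoint

    api = HfApi()

    # 1) Push the fine-tune to a private Hub repo.
    api.create_repo(REPO_ID, private=True, exist_ok=True)
    api.upload_folder(repo_id=REPO_ID, folder_path=LOCAL_DIR)

    # 2) Deploy it as a protected endpoint (hardware values are
    #    illustrative assumptions, as in the earlier sketch).
    endpoint = create_inference_endpoint(
        "my-finetune-endpoint",
        repository=REPO_ID,
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        type="protected",
    )
    print(endpoint.url)
```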

Sources

  1. Inference Endpoints — docs — accessed 2026-04-20
  2. TGI — GitHub — accessed 2026-04-20