Nemotron Ultra 253B

Nemotron Ultra 253B, released by NVIDIA in 2025, is the largest open-weights model in the Llama Nemotron series. Derived from Llama 3.1 405B through architecture slicing and efficient fine-tuning, it targets enterprise-grade reasoning on code, math, and RAG workloads, and it is optimised for high throughput on NVIDIA GPUs with TensorRT-LLM.

Model specs

Vendor: NVIDIA
Family: Llama Nemotron
Released: 2025-03
Context window: 128,000 tokens
Modalities: text

Strengths

  • Top-tier open-weights reasoning at launch
  • Highly optimised for NVIDIA hardware and TensorRT-LLM
  • Includes curated datasets and open training recipes

Limitations

  • Enormous footprint — multi-node serving required
  • Focused on NVIDIA infrastructure; portability outside the stack is uneven
  • Llama community license imposes some commercial restrictions

Use cases

  • Enterprise reasoning workloads served via NVIDIA NIM
  • Research on frontier-scale open-weights post-training
  • High-throughput inference on H100 and B200 fleets
  • Customisation platform for domain-specific agents

Benchmarks

Benchmark    Score    As of
MMLU-Pro     ≈78%     2025-03
MATH         ≈82%     2025-03
HumanEval    ≈89%     2025-03

Frequently asked questions

What is Nemotron Ultra 253B?

Nemotron Ultra 253B is NVIDIA's top-tier open-weights reasoning model, derived from Llama 3.1 405B via extensive post-training and architectural adjustments for efficient inference on NVIDIA GPUs.

How is Nemotron Ultra different from Llama 3.1 405B?

NVIDIA applied pruning, architecture slicing, and reasoning-focused fine-tuning, producing a 253B-parameter model that matches or beats the 405B base on many evaluations while being substantially cheaper to serve.

Where can I run Nemotron Ultra?

Weights are available on Hugging Face under the 'nvidia' organisation, and NVIDIA offers hosted access through NIM microservices in its AI Foundry.
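As a minimal sketch of hosted access: NIM microservices expose an OpenAI-compatible chat-completions API, so a request is an ordinary JSON POST body. The helper below only assembles that body; the model identifier, the "detailed thinking on/off" system prompt used to toggle reasoning, and the sampling values are assumptions to verify against NVIDIA's model catalog and the model card before use.

```python
import json

# Hypothetical model ID for the NIM catalog entry -- confirm the exact
# string against NVIDIA's documentation before sending requests.
DEFAULT_MODEL = "nvidia/llama-3.1-nemotron-ultra-253b-v1"

def build_chat_request(prompt, model=DEFAULT_MODEL, detailed_thinking=True):
    """Assemble a JSON body for POST <base_url>/v1/chat/completions.

    The system prompt toggling reasoning ("detailed thinking on"/"off")
    follows the convention described on the model card; treat it as an
    assumption if your deployment differs.
    """
    system = "detailed thinking on" if detailed_thinking else "detailed thinking off"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        # Illustrative sampling settings, not vendor-recommended values.
        "temperature": 0.6,
        "max_tokens": 1024,
    }

body = build_chat_request("Prove that the square root of 2 is irrational.")
print(json.dumps(body, indent=2))
```

Sending the body with any HTTP client (or the `openai` Python SDK pointed at the NIM base URL) then completes the round trip; only the endpoint URL and API key differ between self-hosted NIM and the hosted AI Foundry service.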

Sources

  1. NVIDIA — Llama Nemotron Ultra — accessed 2026-04-20
  2. Hugging Face — nvidia/Llama-3_1-Nemotron-Ultra-253B — accessed 2026-04-20