Nomic Embed Text v2

Nomic Embed Text v2 is Nomic AI's second-generation open embedding model — published with open weights, open training data, and open training code for full reproducibility. It is multilingual, supports Matryoshka truncation, and is designed as a drop-in replacement for closed APIs when transparency and self-hosting matter.

Model specs

Vendor: Nomic AI
Family: Nomic Embed
Released: 2025-02
Context window: 8,192 tokens
Modalities: text

Strengths

  • Fully open — weights, data, and training code released
  • Matryoshka truncation for compact vectors
  • 8k-token input for longer chunks
  • Multilingual out of the box
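Matryoshka truncation means the model front-loads the most informative components of each embedding, so you can keep only the leading dimensions and re-normalize. A minimal sketch with NumPy (the 768 and 256 dimension counts here are illustrative; check the model card for the dimensions your deployment supports):

```python
import numpy as np

def truncate_matryoshka(emb, dim):
    """Keep the first `dim` components and L2-renormalize.

    Matryoshka-trained models pack the most important information
    into the leading dimensions, so truncation preserves most of
    the retrieval quality at a fraction of the storage cost."""
    truncated = emb[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / norms

# Example: shrink two 768-dim embeddings down to 256 dims.
emb = np.random.randn(2, 768).astype(np.float32)
small = truncate_matryoshka(emb, 256)
```

Cosine similarity on the truncated vectors then works exactly as on the full ones, since they are unit-normalized.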

Limitations

  • Trails closed APIs (text-embedding-3-large, voyage-3) on some English retrieval benchmarks
  • Self-hosted deployment requires GPU infrastructure
  • Smaller ecosystem of framework integrations than OpenAI

Use cases

  • Fully reproducible research embeddings
  • On-prem RAG for regulated industries
  • Multilingual enterprise search
  • Low-cost vector stores with Matryoshka-shrunk dims
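The storage savings from Matryoshka-shrunk dimensions are easy to estimate. A back-of-the-envelope calculation, assuming float32 vectors and illustrative dimension counts:

```python
def store_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw storage for a vector store of float32 embeddings."""
    return n_vectors * dim * bytes_per_float

# One million documents: full 768-dim vs. Matryoshka-truncated 256-dim.
full = store_bytes(1_000_000, 768)   # ~3.07 GB
small = store_bytes(1_000_000, 256)  # ~1.02 GB
```

A 3x reduction in index size, before any additional compression such as scalar or product quantization.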

Benchmarks

Benchmark             Score   As of
MTEB (multilingual)   ≈60     2025-02
MIRACL                ≈55     2025-02

Frequently asked questions

What is Nomic Embed Text v2?

Nomic Embed Text v2 is an open-source multilingual text embedding model from Nomic AI, shipped with open weights, open training data, and open training code so that researchers and engineers can reproduce and audit the model end-to-end.

Why does reproducibility matter?

Many closed embedding APIs give you a vector but no insight into the training data, biases, or safety properties of the model. Nomic's fully open release lets regulators, auditors, and researchers verify the model's behaviour and retrain or fine-tune as needed.

How do I run Nomic Embed v2 locally?

The weights are on Hugging Face; any GPU-equipped host running transformers, sentence-transformers, or Nomic's own GPT4All-style runtime can serve embeddings. CPU inference is feasible for smaller corpora.
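A hedged sketch of serving embeddings locally via sentence-transformers. The model id matches the Hugging Face repository cited below; the `trust_remote_code` flag and the `prompt_name` task prefix are assumptions based on the model card conventions for the Nomic Embed family, so verify them against the card for your version:

```python
def embed_texts(texts, model_name="nomic-ai/nomic-embed-text-v2-moe"):
    """Encode a list of strings into embedding vectors, one per text.

    Downloads the open weights from Hugging Face on first call and
    runs on GPU if available, otherwise CPU."""
    from sentence_transformers import SentenceTransformer

    # The Nomic Embed family uses task prefixes; "passage" is the
    # indexing-side prompt (assumption -- confirm on the model card).
    model = SentenceTransformer(model_name, trust_remote_code=True)
    return model.encode(texts, prompt_name="passage")

# Usage (triggers the weight download):
# vectors = embed_texts(["open weights", "open training data"])
```

Because the import happens inside the function, the module loads even on hosts without sentence-transformers installed; the dependency is only needed when embeddings are actually requested.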

How does Nomic Embed v2 compare to Jina v3 and BGE-M3?

All three are open-weight multilingual embedders with roughly similar MTEB scores. Nomic leads on transparency (open training data and code), Jina v3 on task-specific LoRA adapters, and BGE-M3 on hybrid output (dense, sparse, and multi-vector in one model).

Sources

  1. Nomic — Embed Text v2 — accessed 2026-04-20
  2. Hugging Face — nomic-ai/nomic-embed-text-v2-moe — accessed 2026-04-20