
Phi-4 Multimodal

Phi-4 Multimodal is Microsoft's small multimodal language model, released in 2025 with 5.6 billion parameters that unify text, vision, and speech in a single architecture through a mixture-of-LoRAs design. Published on Hugging Face and Azure AI Foundry, it is the most capable multimodal model in the Phi family, optimised for on-device assistants, meeting summarisers, and visual Q&A on edge hardware.
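
Since the checkpoint is published on Hugging Face, it can be loaded through the standard transformers Auto classes together with the model's own processor. The sketch below is a minimal illustration, assuming the model ID microsoft/Phi-4-multimodal-instruct and the <|user|> / <|assistant|> / <|end|> chat tags from the public model card; both should be verified against the current card rather than taken from here.

```python
# Minimal sketch: load Phi-4 Multimodal from Hugging Face and answer a text prompt.
# Assumes the model ID "microsoft/Phi-4-multimodal-instruct" and the <|user|>/<|assistant|>/<|end|>
# chat tags from the public model card; check both against the current card before use.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # the checkpoint ships custom modelling/processing code
    torch_dtype="auto",
    device_map="auto",
)

prompt = "<|user|>Summarise the Phi-4 Multimodal model in one sentence.<|end|><|assistant|>"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Strip the prompt tokens before decoding so only the generated answer is printed.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```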

Model specs

Vendor: Microsoft
Family: Phi-4
Released: 2025-02
Context window: 128,000 tokens
Modalities: text, vision, audio

Strengths

  • Unified text + vision + speech in one small model
  • Competitive ASR quality vs. larger speech-specific models
  • Runs on laptops with a modest GPU, or on CPU with quantisation (see the sketch after this list)
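
One way to fit the 5.6B checkpoint on a laptop-class GPU is 4-bit quantisation, sketched below with the generic transformers + bitsandbytes path. This is an illustration of the general technique under assumed settings, not an officially documented Phi-4 Multimodal deployment recipe; CPU-only deployments usually go through a separate runtime (for example ONNX Runtime) rather than this path.

```python
# Hedged sketch: load the checkpoint in 4-bit via bitsandbytes to fit a modest GPU.
# Generic transformers quantisation, not an official Phi-4 Multimodal deployment path;
# quality and memory figures are not verified here.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed Hugging Face model ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map="auto",
)
```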

Limitations

  • Not tuned for video or real-time streaming ASR
  • Smaller and less capable than Claude / GPT-4o for complex multimodal reasoning
  • Phi custom licence (MSR) — check for production commercial use

Use cases

  • On-device voice assistants
  • Meeting transcription + summarisation
  • Visual Q&A on images and documents (sketched after this list)
  • Multimodal agents running offline on laptops
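
The visual Q&A use case follows the same pattern as plain text generation, with an image passed to the processor and referenced in the prompt. The sketch below reuses the model and processor objects from the loading example above and assumes the <|image_1|> placeholder tag described on the model card; the file name and question are hypothetical.

```python
# Hedged sketch of visual Q&A: ask a question about a local document image.
# Reuses `model` and `processor` from the loading sketch; assumes the <|image_1|>
# placeholder tag from the model card. "invoice.png" is a hypothetical file.
import torch
from PIL import Image

image = Image.open("invoice.png")
prompt = "<|user|><|image_1|>What is the total amount due on this invoice?<|end|><|assistant|>"

# Some processor versions expect a list of images; a single PIL image is shown here.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=96)

answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```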

Benchmarks

Benchmark       Score        As of
MMLU            ~71%         2026-04
MMMU            ~55%         2026-04
OpenASR (EN)    ~4.8% WER    2026-04

Frequently asked questions

What is Phi-4 Multimodal?

Phi-4 Multimodal is Microsoft's 5.6-billion-parameter small language model that unifies text, vision, and speech in a single architecture via a mixture-of-LoRAs design. It is available on Hugging Face and Azure AI Foundry.

How does Phi-4 Multimodal compare to Gemma 2 2B?

Phi-4 Multimodal is larger (5.6B vs. 2.6B) and supports vision and speech, while Gemma 2 2B is text-only. For multimodal on-device use, Phi-4 Multimodal is the stronger choice.

Sources

  1. Phi-4 Multimodal on HuggingFace — accessed 2026-04-20
  2. Microsoft — Phi-4 Multimodal announcement — accessed 2026-04-20