Phi-4 Multimodal
Phi-4 Multimodal is Microsoft's 2025 small multimodal language model — 5.6 billion parameters unifying text, vision, and speech in a single architecture using a mixture-of-LoRAs design. Released on HuggingFace and Azure AI Foundry, it is the most capable Phi-family multimodal model, optimised for on-device assistants, meeting summarisers, and visual Q&A on edge hardware.
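As a rough illustration of how the unified architecture is consumed in practice, the sketch below loads the model through Hugging Face transformers and asks a question about an image. The repository id `microsoft/Phi-4-multimodal-instruct`, the `<|image_1|>` placeholder token, and the exact processor/generate call signatures are assumptions here; verify them against the model card on HuggingFace before relying on this.

```python
# Minimal visual Q&A sketch for Phi-4 Multimodal via Hugging Face transformers.
# Assumptions (check the model card): the repo id, the <|image_1|> image
# placeholder token, and the processor/generate call signatures.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("invoice.png")
prompt = "<|user|><|image_1|>What is the total amount on this invoice?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```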
Model specs
| Spec | Value |
|---|---|
| Vendor | Microsoft |
| Family | Phi-4 |
| Released | 2025-02 |
| Context window | 128,000 tokens |
| Modalities | text, vision, audio |
Strengths
- Unified text + vision + speech in one small model
- Competitive ASR quality vs. larger speech-specific models
- Runs on laptops with a modest GPU (or CPU with quantisation); see the quantised-load sketch after this list
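A minimal sketch of the quantised load mentioned above, assuming the same `microsoft/Phi-4-multimodal-instruct` repo id and that its custom model code is compatible with bitsandbytes 4-bit quantisation (worth verifying before use):

```python
# Hedged sketch: 4-bit quantised load to fit a ~5.6B model on a small GPU.
# Assumes the microsoft/Phi-4-multimodal-instruct repo id and that the model's
# remote code path works with bitsandbytes quantisation.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```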
Limitations
- Not tuned for video or real-time streaming ASR
- Smaller and less capable than Claude / GPT-4o for complex multimodal reasoning
- Phi custom licence (MSR) — check for production commercial use
Use cases
- On-device voice assistants
- Meeting transcription + summarisation (see the speech sketch after this list)
- Visual Q&A on images and documents
- Multimodal agents running offline on laptops
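A hedged sketch of the meeting transcription + summarisation use case. The `<|audio_1|>` placeholder token and the `audios=` keyword mirror the image example above and are assumptions to verify against the model card:

```python
# Hedged sketch: meeting transcription + summarisation with Phi-4 Multimodal.
# The <|audio_1|> placeholder and the audios= keyword are assumptions to check
# against the model card; the repo id is assumed as in the earlier examples.
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

audio, sample_rate = sf.read("meeting.wav")  # mono waveform + sample rate
prompt = (
    "<|user|><|audio_1|>"
    "Transcribe this meeting, then summarise the key decisions as bullet points."
    "<|end|><|assistant|>"
)

inputs = processor(
    text=prompt, audios=[(audio, sample_rate)], return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
# Keep only the newly generated tokens.
summary = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(summary)
```

For long recordings the audio would need to be chunked before being passed in, since the model is not tuned for real-time streaming ASR (see Limitations above).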
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| MMLU | ~71% | 2026-04 |
| MMMU | ~55% | 2026-04 |
| OpenASR (EN) | ~4.8% WER | 2026-04 |
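Here WER is word error rate: the count of substituted, deleted, and inserted words divided by the number of words in the reference transcript, so lower is better; ~4.8% corresponds to roughly one error every 21 reference words.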
Frequently asked questions
What is Phi-4 Multimodal?
Phi-4 Multimodal is Microsoft's 5.6-billion-parameter small language model that unifies text, vision, and speech in a single architecture via a mixture-of-LoRAs design. It is released on HuggingFace and Azure AI Foundry.
How does Phi-4 Multimodal compare to Gemma 2 2B?
Phi-4 Multimodal is larger (5.6B vs. 2.6B) and supports vision and speech, while Gemma 2 2B is text-only. For multimodal on-device use, Phi-4 Multimodal is the stronger choice.
Sources
- Phi-4 Multimodal on HuggingFace — accessed 2026-04-20
- Microsoft — Phi-4 Multimodal announcement — accessed 2026-04-20