
Phi-4 Multimodal

Phi-4 Multimodal is Microsoft's small multimodal language model, released in 2025 with 5.6 billion parameters that unify text, vision, and speech in a single architecture through a mixture-of-LoRAs design. Published on Hugging Face and Azure AI Foundry, it is the most capable multimodal model in the Phi family, optimised for on-device assistants, meeting summarisers, and visual Q&A on edge hardware.
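
Since the checkpoint is published on Hugging Face, it can be loaded through the standard transformers Auto classes together with the model's own processor. The sketch below is a minimal illustration, assuming the model ID microsoft/Phi-4-multimodal-instruct and the <|user|> / <|assistant|> / <|end|> chat tags from the public model card; both should be verified against the current card rather than taken from here.

```python
# Minimal sketch: load Phi-4 Multimodal from Hugging Face and answer a text prompt.
# Assumes the model ID "microsoft/Phi-4-multimodal-instruct" and the <|user|>/<|assistant|>/<|end|>
# chat tags from the public model card; check both against the current card before use.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # the checkpoint ships custom modelling/processing code
    torch_dtype="auto",
    device_map="auto",
)

prompt = "<|user|>Summarise the Phi-4 Multimodal model in one sentence.<|end|><|assistant|>"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Strip the prompt tokens before decoding so only the generated answer is printed.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```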

Model specs

Vendor: Microsoft
Family: Phi-4
Released: 2025-02
Context window: 128,000 tokens
Modalities: text, vision, audio

Strengths

  • Unified text + vision + speech in one small model
  • Competitive ASR quality vs. larger speech-specific models
  • Runs on laptops with a modest GPU, or on CPU with quantisation (see the sketch after this list)
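
One way to fit the 5.6B checkpoint on a laptop-class GPU is 4-bit quantisation, sketched below with the generic transformers + bitsandbytes path. This is an illustration of the general technique under assumed settings, not an officially documented Phi-4 Multimodal deployment recipe; CPU-only deployments usually go through a separate runtime (for example ONNX Runtime) rather than this path.

```python
# Hedged sketch: load the checkpoint in 4-bit via bitsandbytes to fit a modest GPU.
# Generic transformers quantisation, not an official Phi-4 Multimodal deployment path;
# quality and memory figures are not verified here.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed Hugging Face model ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map="auto",
)
```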

Limitations

  • Not tuned for video or real-time streaming ASR
  • Smaller and less capable than Claude / GPT-4o for complex multimodal reasoning
  • Phi custom licence (MSR) — check for production commercial use

Use cases

  • On-device voice assistants
  • Meeting transcription + summarisation
  • Visual Q&A on images and documents (sketched after this list)
  • Multimodal agents running offline on laptops
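
The visual Q&A use case follows the same pattern as plain text generation, with an image passed to the processor and referenced in the prompt. The sketch below reuses the model and processor objects from the loading example above and assumes the <|image_1|> placeholder tag described on the model card; the file name and question are hypothetical.

```python
# Hedged sketch of visual Q&A: ask a question about a local document image.
# Reuses `model` and `processor` from the loading sketch; assumes the <|image_1|>
# placeholder tag from the model card. "invoice.png" is a hypothetical file.
import torch
from PIL import Image

image = Image.open("invoice.png")
prompt = "<|user|><|image_1|>What is the total amount due on this invoice?<|end|><|assistant|>"

# Some processor versions expect a list of images; a single PIL image is shown here.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=96)

answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```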

Benchmarks

Benchmark       Score        As of
MMLU            ~71%         2026-04
MMMU            ~55%         2026-04
OpenASR (EN)    ~4.8% WER    2026-04

Frequently asked questions

What is Phi-4 Multimodal?

Phi-4 Multimodal is Microsoft's 5.6-billion-parameter small language model that unifies text, vision, and speech in a single architecture via a mixture-of-LoRAs design. It is available on Hugging Face and Azure AI Foundry.

How does Phi-4 Multimodal compare to Gemma 2 2B?

Phi-4 Multimodal is larger (5.6B vs. 2.6B) and supports vision and speech, while Gemma 2 2B is text-only. For multimodal on-device use, Phi-4 Multimodal is the stronger choice.

Sources

  1. Phi-4 Multimodal on HuggingFace — accessed 2026-04-20
  2. Microsoft — Phi-4 Multimodal announcement — accessed 2026-04-20