Microsoft Florence-2
Florence-2 is Microsoft's compact, open vision foundation model, released in two sizes: base (232M parameters) and large (771M parameters). It was trained on FLD-5B, a dataset of 5.4B annotations across 126M images. A single sequence-to-sequence architecture handles captioning, detection, segmentation, grounding, and OCR; the task is selected by the prompt rather than by a task-specific head. The weights are published on Hugging Face under the MIT licence, and the model is commonly used as a small-footprint vision component in research and product pipelines.
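To illustrate what "swapping task prompts" means in practice, here is a minimal sketch of a prompt builder. The task tokens are the ones listed on the Florence-2 Hugging Face model card and should be checked against the checkpoint you use; the dictionary keys, `TASKS_WITH_TEXT_INPUT`, and `build_prompt` are hypothetical names introduced for this example only.

```python
# Task tokens as listed on the Florence-2 model card (verify for your checkpoint).
# The short keys on the left are hypothetical names chosen for this sketch.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
}

# Tasks whose prompt is the task token followed by free text
# (for example, the phrase to ground in the image).
TASKS_WITH_TEXT_INPUT = {"phrase_grounding", "segmentation"}


def build_prompt(task: str, text: str = "") -> str:
    """Return the prompt string for a task, rejecting out-of-template input."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    token = TASK_PROMPTS[task]
    if task in TASKS_WITH_TEXT_INPUT:
        if not text:
            raise ValueError(f"task {task!r} needs accompanying text")
        return token + text
    if text:
        raise ValueError(f"task {task!r} does not take accompanying text")
    return token


if __name__ == "__main__":
    print(build_prompt("object_detection"))
    print(build_prompt("phrase_grounding", "a green car"))
```

Rejecting unexpected text up front reflects the fixed prompt grammar: the same model weights serve every task, and only the prompt string changes.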
Model specs
- Vendor: Microsoft Research
- Family: Florence
- Released: 2024-06
- Context window: 1,024 tokens
- Modalities: text, vision
Strengths
- Unified prompt-driven interface across many vision tasks
- Tiny footprint — runs on consumer GPUs
- Permissive MIT licence
- Strong quality-per-parameter thanks to FLD-5B pre-training
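The footprint claim can be checked with simple arithmetic. The sketch below estimates the memory needed for the weights alone from the parameter counts quoted above; activations, the image encoder's intermediate buffers, and framework overhead are not included, so real usage will be higher. `weight_memory_gb` is a hypothetical helper name.

```python
# Parameter counts as quoted for the two released variants.
PARAM_COUNTS = {"base": 232_000_000, "large": 771_000_000}

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(variant: str, dtype: str = "fp16") -> float:
    """Approximate memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return PARAM_COUNTS[variant] * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for variant in PARAM_COUNTS:
        for dtype in BYTES_PER_PARAM:
            print(f"{variant:>5} {dtype}: {weight_memory_gb(variant, dtype):.2f} GB")
```

Under these assumptions the weights come to roughly 0.46 GB (base) and 1.54 GB (large) in fp16, which leaves ample headroom on a typical consumer GPU.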
Limitations
- Not a conversational VLM — no free-form chat
- Fixed prompt grammar — out-of-template prompts under-perform
- Smaller context window and less fine-grained detail than 70B-class VLMs
- Needs post-processing for structured outputs (bboxes, masks)
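The post-processing mentioned in the last point can be sketched as follows. Florence-2 emits box coordinates as location tokens quantised into 1,000 bins, so the raw text has to be converted back to pixel coordinates. The exact raw string layout and the choice to map each bin to its centre are assumptions made for this illustration; in practice the processor shipped with the model provides a `post_process_generation` helper that is the safer option. `parse_boxes` is a hypothetical name.

```python
import re
from typing import List, Tuple

NUM_BINS = 1000  # coordinates are quantised into 1,000 bins per axis

# A label followed by a run of location tokens, e.g. "car<loc_1><loc_2><loc_3><loc_4>".
_LABELLED_RUN = re.compile(r"([^<>]+)((?:<loc_\d+>)+)")
_LOC = re.compile(r"<loc_(\d+)>")

Box = Tuple[float, float, float, float]


def parse_boxes(text: str, image_width: int, image_height: int) -> List[Tuple[str, Box]]:
    """Convert 'label<loc_x1><loc_y1><loc_x2><loc_y2>' runs into pixel boxes."""
    results: List[Tuple[str, Box]] = []
    for label, loc_run in _LABELLED_RUN.findall(text):
        bins = [int(b) for b in _LOC.findall(loc_run)]
        # Consume location tokens four at a time; ignore an incomplete trailing group.
        for i in range(0, len(bins) - len(bins) % 4, 4):
            x1, y1, x2, y2 = bins[i:i + 4]
            box = (
                (x1 + 0.5) * image_width / NUM_BINS,   # bin centre, scaled to pixels
                (y1 + 0.5) * image_height / NUM_BINS,
                (x2 + 0.5) * image_width / NUM_BINS,
                (y2 + 0.5) * image_height / NUM_BINS,
            )
            results.append((label.strip(), box))
    return results


if __name__ == "__main__":
    raw = "</s><s>car<loc_100><loc_200><loc_500><loc_600></s>"
    print(parse_boxes(raw, image_width=1000, image_height=500))
```

Segmentation output needs a similar step, with polygon vertices in place of box corners.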
Use cases
- Dataset labelling and auto-annotation
- Lightweight visual grounding in agent pipelines
- Edge / on-device OCR and captioning
- Pre-processing stage feeding a downstream LLM
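As a sketch of the last use case, the function below turns already-parsed Florence-2 outputs (a caption, OCR text, and labelled boxes) into a plain-text context block that a downstream LLM can read. The function name, argument names, and prompt layout are all hypothetical; nothing here is prescribed by Florence-2 itself.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def build_llm_context(
    caption: str,
    ocr_text: str,
    detections: Sequence[Tuple[str, Box]],
    question: str,
) -> str:
    """Format vision outputs as a text prompt for a text-only LLM."""
    lines: List[str] = ["Image description:", caption.strip(), ""]
    if ocr_text.strip():
        lines += ["Text found in the image:", ocr_text.strip(), ""]
    if detections:
        lines.append("Detected objects (label, x1, y1, x2, y2 in pixels):")
        for label, (x1, y1, x2, y2) in detections:
            lines.append(f"- {label}: {x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}")
        lines.append("")
    lines += ["Question:", question.strip()]
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_llm_context(
        caption="A red car parked next to a stop sign.",
        ocr_text="STOP",
        detections=[("car", (120.0, 210.0, 640.0, 480.0))],
        question="Is the car parked legally?",
    )
    print(prompt)
```

Keeping the vision stage's output as structured text lets the LLM handle the free-form reasoning that Florence-2 does not attempt.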
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| COCO caption CIDEr (large) | ≈140 | 2024-06 |
| COCO detection mAP (large, fine-tuned) | ≈43 | 2024-06 |
| TextCaps CIDEr | ≈78 | 2024-06 |
Frequently asked questions
What is Florence-2?
Florence-2 is a compact open vision foundation model from Microsoft that uses a single seq2seq architecture to perform captioning, detection, segmentation, grounding, and OCR — triggered by task prompts.
How big is Florence-2?
There are two released variants: Florence-2-base with 232M parameters and Florence-2-large with 771M parameters, both under the MIT licence.
Is Florence-2 a chat model?
No. It is a task-prompt model, not a conversational VLM. For free-form chat over images, pair it with an LLM or use a model like Qwen2-VL.
Sources
- Florence-2 on Hugging Face — accessed 2026-04-20
- Florence-2 paper (arXiv) — accessed 2026-04-20