Shengshu Vidu
Vidu, from Beijing-based Shengshu Technology in collaboration with Tsinghua University, is a text- and image-to-video diffusion-transformer model built on U-ViT, the same team's Universal Vision Transformer architecture, which predates Sora's DiT. Released publicly in July 2024 with international access, Vidu generates clips of up to 8 seconds at 1080p with strong temporal consistency, camera control, and a "Subject Consistency" feature that preserves characters across multiple shots, positioning it as one of the earliest credible alternatives to Sora.
Model specs
- Vendor
- Shengshu Technology
- Family
- Vidu
- Released
- 2024-07
- Context window
- n/a (video generation model; a token context window does not apply)
- Modalities
- text, vision, video
- Input price
- n/a
- Output price
- n/a
- Pricing as of
- 2026-04-20
Strengths
- Subject Consistency across multiple shots
- U-ViT backbone with strong theoretical grounding
- Global access via vidu.studio
- Competitive visual quality at release
Limitations
- Closed model with credit-based API pricing
- Maximum clip length (8 seconds) shorter than Kling's
- Ecosystem outside China less developed
- Prompt adherence behind Veo 3 on complex scenes
Use cases
- Multi-shot character sequences
- Image-to-video animation
- Creative short-form content in Chinese market
- Research reference for U-ViT architecture
Benchmarks
| Benchmark | Result | As of |
|---|---|---|
| U-ViT ablation vs DiT-XL/2 | matches or exceeds at equal compute | 2023-01 |
| Artificial Analysis video ranking | top-10 at 2024 close | 2024-12 |
Frequently asked questions
What is Vidu?
Vidu is a text- and image-to-video model from Shengshu Technology / Tsinghua University, based on the U-ViT diffusion-transformer architecture.
What is U-ViT?
U-ViT is a Universal Vision Transformer architecture for diffusion models developed by the Shengshu team. It treats the diffusion timestep, the condition, and the noisy image patches uniformly as input tokens, and adds U-Net-style long skip connections between shallow and deep transformer layers. It is conceptually similar to DiT but predates it.
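The two ideas above, everything-as-tokens and long skip connections, can be sketched in a few lines. This is a minimal illustration of the U-ViT layout, not Shengshu's implementation: the dimensions, depth, and the `block` stand-in (which omits attention and the MLP) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64        # token embedding dim (illustrative)
N_COND = 4    # text-condition tokens (illustrative)
N_PATCH = 16  # noisy-image patch tokens (illustrative)

# U-ViT's first idea: time, condition, and image patches are ALL tokens.
time_tok = rng.standard_normal((1, D))
cond_toks = rng.standard_normal((N_COND, D))
img_toks = rng.standard_normal((N_PATCH, D))
tokens = np.concatenate([time_tok, cond_toks, img_toks], axis=0)  # (21, D)

def block(x, w, b):
    # Stand-in for a transformer block (attention + MLP omitted for brevity).
    return x + np.tanh(x @ w + b)

depth = 6
ws = [rng.standard_normal((D, D)) * 0.05 for _ in range(depth)]
bs = [np.zeros(D) for _ in range(depth)]

# U-ViT's second idea: U-Net-style long skips from shallow to deep blocks.
skips = []
h = tokens
for i in range(depth // 2):          # "encoder" half: save activations
    h = block(h, ws[i], bs[i])
    skips.append(h)
for i in range(depth // 2, depth):   # "decoder" half: fuse matching skip
    h = np.concatenate([h, skips.pop()], axis=-1)            # (21, 2D)
    h = h @ (rng.standard_normal((2 * D, D)) * 0.05)          # project back to D
    h = block(h, ws[i], bs[i])

# Only the image-patch tokens are read out as the noise prediction.
eps_pred = h[1 + N_COND:]
print(eps_pred.shape)  # (16, 64)
```

The long skips give deep layers direct access to shallow features, which the U-ViT paper credits for stable diffusion training with a plain ViT backbone.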
How do I access Vidu?
Via vidu.studio or the Vidu API. Access is credit-based and available internationally.
Sources
- Vidu Studio — accessed 2026-04-20
- U-ViT paper (arXiv) — accessed 2026-04-20