
Shengshu Vidu

Vidu (from Beijing-based Shengshu Technology, in collaboration with Tsinghua University) is a text- and image-to-video diffusion-transformer model built on U-ViT, the same team's Universal Vision Transformer architecture that predates Sora's DiT. Released to the public, including international users, in July 2024, Vidu generates clips of up to 8 seconds at 1080p with camera control and a 'Subject Consistency' feature that preserves characters across multiple shots, positioning it as one of the earliest credible alternatives to Sora.
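
The core architectural idea behind U-ViT is simple to sketch: the diffusion timestep, the condition, and the noisy image patches are all treated as tokens in a single transformer, with U-Net-style long skip connections between shallow and deep blocks. Below is a minimal, hedged PyTorch sketch of that pattern; the dimensions, embedders, and block layout are illustrative assumptions, not Shengshu's actual configuration.

```python
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    """Illustrative U-ViT-style denoiser: all inputs are tokens, with long skips."""
    def __init__(self, dim=512, depth=12, num_heads=8, patch_dim=3 * 16 * 16):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(
                dim, num_heads, dim * 4, batch_first=True, norm_first=True)
        self.patch_embed = nn.Linear(patch_dim, dim)  # noisy image patches -> tokens
        self.time_embed = nn.Linear(1, dim)           # diffusion timestep -> one token
        self.cond_embed = nn.Linear(dim, dim)         # condition (e.g. text) tokens
        self.in_blocks = nn.ModuleList([block() for _ in range(depth // 2)])
        self.out_blocks = nn.ModuleList([block() for _ in range(depth // 2)])
        # Long skip connections: concatenate a shallow block's tokens with a deep
        # block's input and project back down -- the U-Net-like trait of U-ViT.
        self.skip_proj = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.head = nn.Linear(dim, patch_dim)         # predict noise per patch

    def forward(self, patches, t, cond_tokens):
        # "All are worth words": timestep, condition, and patches share one sequence.
        x = torch.cat([
            self.time_embed(t[:, None, None].float()),
            self.cond_embed(cond_tokens),
            self.patch_embed(patches),
        ], dim=1)
        skips = []
        for blk in self.in_blocks:
            x = blk(x)
            skips.append(x)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            x = blk(proj(torch.cat([x, skips.pop()], dim=-1)))
        n_cond = 1 + cond_tokens.shape[1]
        return self.head(x[:, n_cond:])               # noise estimate for the patches


patches = torch.randn(2, 64, 3 * 16 * 16)    # 64 patches of a 128x128 RGB frame
t = torch.randint(0, 1000, (2,))             # diffusion timesteps
cond = torch.randn(2, 8, 512)                # 8 condition tokens from a text encoder
noise_pred = UViTSketch()(patches, t, cond)  # -> (2, 64, 768)
```

The forward pass returns a per-patch noise estimate, which a standard diffusion sampler would use to denoise step by step; this is the role U-ViT plays inside Vidu's generation pipeline.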

Model specs

Vendor
Shengshu Technology
Family
Vidu
Released
2024-07
Context window
n/a
Modalities
text, vision, video
Input price
n/a
Output price
n/a
Pricing as of
2026-04-20

Strengths

  • Subject Consistency across multiple shots
  • U-ViT backbone with strong theoretical grounding
  • Global access via vidu.studio
  • Competitive visual quality at release

Limitations

  • Closed model — credit-based API
  • Max clip length shorter than Kling's
  • Ecosystem outside China less developed
  • Prompt adherence behind Veo 3 on complex scenes

Use cases

  • Multi-shot character sequences
  • Image-to-video animation
  • Creative short-form content in Chinese market
  • Research reference for U-ViT architecture

Benchmarks

Benchmark | Score | As of
U-ViT ablation vs DiT-XL/2 | matches or exceeds at equal compute | 2023-01
Artificial Analysis video ranking | top-10 at 2024 close | 2024-12

Frequently asked questions

What is Vidu?

Vidu is a text- and image-to-video model from Shengshu Technology / Tsinghua University, based on the U-ViT diffusion-transformer architecture.

What is U-ViT?

U-ViT is a Universal Vision Transformer architecture for diffusion models developed by the Shengshu team. It is conceptually similar to DiT but predates it.

How do I access Vidu?

Through vidu.studio or the Vidu API. Both are credit-based, and international access is available.
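
As a rough orientation, a credit-based, asynchronous text-to-video API call typically looks like the sketch below. The base URL, paths, and JSON fields here are illustrative assumptions, not Vidu's documented interface; consult the official API reference for the real endpoints and parameters.

```python
# Hypothetical sketch of a credit-based, asynchronous text-to-video API call.
# The base URL, paths, and JSON fields are placeholders, not Vidu's actual API.
import os
import time
import requests

API_BASE = "https://api.example.com/v1"  # placeholder, not the real endpoint
HEADERS = {"Authorization": f"Token {os.environ['VIDU_API_KEY']}"}

# Submit a generation job (video APIs are usually asynchronous).
job = requests.post(
    f"{API_BASE}/text2video",
    headers=HEADERS,
    json={"prompt": "a red fox running through snow, tracking shot",
          "duration": 8, "resolution": "1080p"},
    timeout=30,
).json()

# Poll until the clip is rendered, then read the result URL.
while True:
    status = requests.get(f"{API_BASE}/jobs/{job['id']}",
                          headers=HEADERS, timeout=30).json()
    if status["state"] in ("succeeded", "failed"):
        break
    time.sleep(5)
print(status.get("video_url", status["state"]))
```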

Sources

  1. Vidu Studio — accessed 2026-04-20
  2. U-ViT paper (arXiv) — accessed 2026-04-20