
Shengshu Vidu

Vidu (from Beijing-based Shengshu Technology, in collaboration with Tsinghua University) is a text- and image-to-video diffusion-transformer model built on U-ViT, the same team's Universal Vision Transformer architecture that predates Sora's DiT. Released to the public, including international users, in July 2024, Vidu generates clips of up to 8 seconds at 1080p with camera control and a 'Subject Consistency' feature that preserves characters across multiple shots, positioning it as one of the earliest credible alternatives to Sora.
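
The core architectural idea behind U-ViT is simple to sketch: the diffusion timestep, the condition, and the noisy image patches are all treated as tokens in a single transformer, with U-Net-style long skip connections between shallow and deep blocks. Below is a minimal, hedged PyTorch sketch of that pattern; the dimensions, embedders, and block layout are illustrative assumptions, not Shengshu's actual configuration.

```python
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    """Illustrative U-ViT-style denoiser: all inputs are tokens, with long skips."""
    def __init__(self, dim=512, depth=12, num_heads=8, patch_dim=3 * 16 * 16):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(
                dim, num_heads, dim * 4, batch_first=True, norm_first=True)
        self.patch_embed = nn.Linear(patch_dim, dim)  # noisy image patches -> tokens
        self.time_embed = nn.Linear(1, dim)           # diffusion timestep -> one token
        self.cond_embed = nn.Linear(dim, dim)         # condition (e.g. text) tokens
        self.in_blocks = nn.ModuleList([block() for _ in range(depth // 2)])
        self.out_blocks = nn.ModuleList([block() for _ in range(depth // 2)])
        # Long skip connections: concatenate a shallow block's tokens with a deep
        # block's input and project back down -- the U-Net-like trait of U-ViT.
        self.skip_proj = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.head = nn.Linear(dim, patch_dim)         # predict noise per patch

    def forward(self, patches, t, cond_tokens):
        # "All are worth words": timestep, condition, and patches share one sequence.
        x = torch.cat([
            self.time_embed(t[:, None, None].float()),
            self.cond_embed(cond_tokens),
            self.patch_embed(patches),
        ], dim=1)
        skips = []
        for blk in self.in_blocks:
            x = blk(x)
            skips.append(x)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            x = blk(proj(torch.cat([x, skips.pop()], dim=-1)))
        n_cond = 1 + cond_tokens.shape[1]
        return self.head(x[:, n_cond:])               # noise estimate for the patches


patches = torch.randn(2, 64, 3 * 16 * 16)    # 64 patches of a 128x128 RGB frame
t = torch.randint(0, 1000, (2,))             # diffusion timesteps
cond = torch.randn(2, 8, 512)                # 8 condition tokens from a text encoder
noise_pred = UViTSketch()(patches, t, cond)  # -> (2, 64, 768)
```

The forward pass returns a per-patch noise estimate, which a standard diffusion sampler would use to denoise step by step; this is the role U-ViT plays inside Vidu's generation pipeline.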

Model specs

Vendor
Shengshu Technology
Family
Vidu
Released
2024-07
Context window
n/a
Modalities
text, vision, video
Input price
n/a
Output price
n/a
Pricing as of
2026-04-20

Strengths

  • Subject Consistency across multiple shots
  • U-ViT backbone with strong theoretical grounding
  • Global access via vidu.studio
  • Competitive visual quality at release

Limitations

  • Closed model — credit-based API
  • Max clip length shorter than Kling's
  • Ecosystem outside China less developed
  • Prompt adherence behind Veo 3 on complex scenes

Use cases

  • Multi-shot character sequences
  • Image-to-video animation
  • Creative short-form content in Chinese market
  • Research reference for U-ViT architecture

Benchmarks

Benchmark | Score | As of
U-ViT ablation vs DiT-XL/2 | matches or exceeds at equal compute | 2023-01
Artificial Analysis video ranking | top-10 at 2024 close | 2024-12

Frequently asked questions

What is Vidu?

Vidu is a text- and image-to-video model from Shengshu Technology / Tsinghua University, based on the U-ViT diffusion-transformer architecture.

What is U-ViT?

U-ViT is a Universal Vision Transformer architecture for diffusion models developed by the Shengshu team. It is conceptually similar to DiT but predates it.

How do I access Vidu?

Through vidu.studio or the Vidu API. Both are credit-based, and international access is available.
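
As a rough orientation, a credit-based, asynchronous text-to-video API call typically looks like the sketch below. The base URL, paths, and JSON fields here are illustrative assumptions, not Vidu's documented interface; consult the official API reference for the real endpoints and parameters.

```python
# Hypothetical sketch of a credit-based, asynchronous text-to-video API call.
# The base URL, paths, and JSON fields are placeholders, not Vidu's actual API.
import os
import time
import requests

API_BASE = "https://api.example.com/v1"  # placeholder, not the real endpoint
HEADERS = {"Authorization": f"Token {os.environ['VIDU_API_KEY']}"}

# Submit a generation job (video APIs are usually asynchronous).
job = requests.post(
    f"{API_BASE}/text2video",
    headers=HEADERS,
    json={"prompt": "a red fox running through snow, tracking shot",
          "duration": 8, "resolution": "1080p"},
    timeout=30,
).json()

# Poll until the clip is rendered, then read the result URL.
while True:
    status = requests.get(f"{API_BASE}/jobs/{job['id']}",
                          headers=HEADERS, timeout=30).json()
    if status["state"] in ("succeeded", "failed"):
        break
    time.sleep(5)
print(status.get("video_url", status["state"]))
```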

Sources

  1. Vidu Studio — accessed 2026-04-20
  2. U-ViT paper (arXiv) — accessed 2026-04-20