
UnityVideo : Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation


Jiehui Huang1 · Yuechen Zhang2 · Xu He3 · Yuan Gao4 · Zhi Cen4 · Bin Xia2 ·
Yan Zhou4 · Xin Tao4 · Pengfei Wan4 · Jiaya Jia1,✉

1HKUST · 2CUHK · 3Tsinghua University · 4Kling Team, Kuaishou Technology

✉ Corresponding Author


📢 Code will be released soon! Stay tuned! 🚀


📢 News

  • [2025.12.15] 🎉 Part of the OpenUni dataset is now open-sourced! Check it out on 🤗 Hugging Face
  • [2025.12.08] 🔥 arXiv paper released!

📖 Introduction

UnityVideo is a unified generalist framework for multi-modal, multi-task video generation and understanding that enables:

  • 🎨 Text-to-Video Generation: Create high-quality videos from text descriptions
  • 🎮 Controllable Generation: Fine-grained control over video generation with various modalities
  • 🔍 Modality Estimation: Estimate depth, surface normals, and other modalities from video
  • 🌟 Zero-Shot Generalization: Strong generalization to novel objects and styles without additional training

Our unified architecture achieves state-of-the-art performance across multiple video generation benchmarks while maintaining efficiency and scalability.
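
Since the code has not been released yet, the eventual interface is unknown; the snippet below is only a hypothetical sketch of how a single unified multi-task checkpoint could be invoked. The unityvideo package, the pipeline class, and every argument name are placeholder assumptions, not an official API.

from unityvideo import UnityVideoPipeline  # hypothetical package and class

# Hypothetical: checkpoint name and pipeline API are placeholders.
pipe = UnityVideoPipeline.from_pretrained("JackAILab/UnityVideo")

# One checkpoint, multiple tasks, switched by a task argument (all assumed):
video = pipe(task="t2v", prompt="a red kite drifting over a stormy sea")
depth = pipe(task="estimate", video=video, modality="depth")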


🔥 Highlights

  • Unified Framework: Single model handles multiple video understanding tasks
  • Multi-Modal Support: Seamlessly processes text, image, and video inputs
  • World-Aware Generation: Enhanced physical understanding and consistency
  • Flexible Control: Support for various control signals (depth, edge, pose, etc.)
  • High Quality: State-of-the-art visual quality and temporal consistency
  • Efficient Training: Joint multi-task learning improves data efficiency

🎯 Method

UnityVideo employs a unified multi-modal multi-task learning framework that consists of:

  1. Multi-Modal Encoder: Processes diverse input modalities (text, image, video)
  2. Unified Transformer Backbone: Shared representation learning across tasks
  3. Task-Specific Heads: Specialized decoders for different generation and estimation tasks
  4. Joint Training Strategy: Simultaneous optimization across all tasks

This architecture enables knowledge sharing and improves generalization across different video understanding tasks.
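
Until the official code is released, the snippet below is a minimal, self-contained PyTorch sketch of this general pattern: modality-specific projections into a shared token space, a shared Transformer backbone, task-specific heads, and a summed joint loss. It is not the UnityVideo implementation; every dimension, task name, and layer choice is an illustrative assumption.

import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Projects per-modality features into one shared token space (assumed design)."""
    def __init__(self, modality_dims, d_model=512):
        super().__init__()
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in modality_dims.items()})

    def forward(self, inputs):
        # inputs: {modality: (batch, tokens, dim)} -> (batch, all_tokens, d_model)
        return torch.cat([self.proj[m](x) for m, x in inputs.items()], dim=1)

class UnifiedModelSketch(nn.Module):
    """Shared backbone with task-specific output heads (placeholder structure)."""
    def __init__(self, modality_dims, task_dims, d_model=512):
        super().__init__()
        self.encoder = MultiModalEncoder(modality_dims, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # shared across tasks
        self.heads = nn.ModuleDict(
            {t: nn.Linear(d_model, d) for t, d in task_dims.items()})

    def forward(self, inputs, task):
        return self.heads[task](self.backbone(self.encoder(inputs)))

# Joint training step: per-task losses are summed so every task's gradient
# updates the shared backbone (dummy dimensions and random data throughout).
modality_dims = {"text": 768, "video": 1024}
task_dims = {"generation": 1024, "depth": 1}
model = UnifiedModelSketch(modality_dims, task_dims)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

batch = {"text": torch.randn(2, 16, 768), "video": torch.randn(2, 64, 1024)}
targets = {"generation": torch.randn(2, 80, 1024), "depth": torch.randn(2, 80, 1)}
loss = sum(nn.functional.mse_loss(model(batch, t), targets[t]) for t in task_dims)
opt.zero_grad()
loss.backward()
opt.step()

Summing the per-task losses in a single optimizer step is what lets every task's gradient shape the shared backbone, which is the data-efficiency benefit of joint multi-task training noted in the Highlights above.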


📊 Results Gallery

🎬 Text-to-Video Generation

More examples coming soon.

🎮 Controllable Generation

More examples coming soon.

🔍 Modality Estimation

More examples coming soon.

🗓️ TODO List

  • Release training code
  • Release inference code
  • Release pretrained models
  • Add Gradio demo, Colab notebook, and more usage examples
  • Release data
  • Release arXiv paper

⚖️ License

This repository is released under the Apache-2.0 license as found in the LICENSE file.

🚀 Stay Tuned for Updates!

Watch this repository to get notified when we release the code!


📚 Citation

If you find this work useful for your research, please cite:

@article{huang2025unityvideo,
  title={UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation},
  author={Huang, Jiehui and Zhang, Yuechen and He, Xu and Gao, Yuan and Cen, Zhi and Xia, Bin and Zhou, Yan and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.07831},
  year={2025}
}
