
UnityVideo : Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation


Jiehui Huang1 · Yuechen Zhang2 · Xu He3 · Yuan Gao4 · Zhi Cen4 · Bin Xia2 ·
Yan Zhou4 · Xin Tao4 · Pengfei Wan4 · Jiaya Jia1,✉

1HKUST · 2CUHK · 3Tsinghua University · 4Kling Team, Kuaishou Technology

✉ Corresponding Author


📢 Code will be released soon! Stay tuned! 🚀


📢 News

  • [2025.12.15] 🎉 Part of the OpenUni dataset is now open-sourced! Check it out on 🤗 Hugging Face
  • [2025.12.08] 🔥 arXiv paper released!

📖 Introduction

UnityVideo is a unified generalist framework for multi-modal, multi-task video generation and understanding that enables:

  • 🎨 Text-to-Video Generation: Create high-quality videos from text descriptions
  • 🎮 Controllable Generation: Fine-grained control over video generation with various modalities
  • 🔍 Modality Estimation: Estimate depth, surface normals, and other modalities from video
  • 🌟 Zero-Shot Generalization: Strong generalization to novel objects and styles without additional training

Our unified architecture achieves state-of-the-art performance across multiple video generation benchmarks while maintaining efficiency and scalability.
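
Since the code has not been released yet, the eventual interface is unknown; the snippet below is only a hypothetical sketch of how a single unified multi-task checkpoint could be invoked. The unityvideo package, the pipeline class, and every argument name are placeholder assumptions, not an official API.

from unityvideo import UnityVideoPipeline  # hypothetical package and class

# Hypothetical: checkpoint name and pipeline API are placeholders.
pipe = UnityVideoPipeline.from_pretrained("JackAILab/UnityVideo")

# One checkpoint, multiple tasks, switched by a task argument (all assumed):
video = pipe(task="t2v", prompt="a red kite drifting over a stormy sea")
depth = pipe(task="estimate", video=video, modality="depth")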


🔥 Highlights

  • Unified Framework: Single model handles multiple video understanding tasks
  • Multi-Modal Support: Seamlessly processes text, image, and video inputs
  • World-Aware Generation: Enhanced physical understanding and consistency
  • Flexible Control: Support for various control signals (depth, edge, pose, etc.)
  • High Quality: State-of-the-art visual quality and temporal consistency
  • Efficient Training: Joint multi-task learning improves data efficiency

🎯 Method

UnityVideo employs a unified multi-modal multi-task learning framework that consists of:

  1. Multi-Modal Encoder: Processes diverse input modalities (text, image, video)
  2. Unified Transformer Backbone: Shared representation learning across tasks
  3. Task-Specific Heads: Specialized decoders for different generation and estimation tasks
  4. Joint Training Strategy: Simultaneous optimization across all tasks

This architecture enables knowledge sharing and improves generalization across different video understanding tasks.
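
Until the official code is released, the snippet below is a minimal, self-contained PyTorch sketch of this general pattern: modality-specific projections into a shared token space, a shared Transformer backbone, task-specific heads, and a summed joint loss. It is not the UnityVideo implementation; every dimension, task name, and layer choice is an illustrative assumption.

import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Projects per-modality features into one shared token space (assumed design)."""
    def __init__(self, modality_dims, d_model=512):
        super().__init__()
        self.proj = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in modality_dims.items()})

    def forward(self, inputs):
        # inputs: {modality: (batch, tokens, dim)} -> (batch, all_tokens, d_model)
        return torch.cat([self.proj[m](x) for m, x in inputs.items()], dim=1)

class UnifiedModelSketch(nn.Module):
    """Shared backbone with task-specific output heads (placeholder structure)."""
    def __init__(self, modality_dims, task_dims, d_model=512):
        super().__init__()
        self.encoder = MultiModalEncoder(modality_dims, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # shared across tasks
        self.heads = nn.ModuleDict(
            {t: nn.Linear(d_model, d) for t, d in task_dims.items()})

    def forward(self, inputs, task):
        return self.heads[task](self.backbone(self.encoder(inputs)))

# Joint training step: per-task losses are summed so every task's gradient
# updates the shared backbone (dummy dimensions and random data throughout).
modality_dims = {"text": 768, "video": 1024}
task_dims = {"generation": 1024, "depth": 1}
model = UnifiedModelSketch(modality_dims, task_dims)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

batch = {"text": torch.randn(2, 16, 768), "video": torch.randn(2, 64, 1024)}
targets = {"generation": torch.randn(2, 80, 1024), "depth": torch.randn(2, 80, 1)}
loss = sum(nn.functional.mse_loss(model(batch, t), targets[t]) for t in task_dims)
opt.zero_grad()
loss.backward()
opt.step()

Summing the per-task losses in a single optimizer step is what lets every task's gradient shape the shared backbone, which is the data-efficiency benefit of joint multi-task training noted in the Highlights above.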


📊 Results Gallery

🎬 Text-to-Video Generation

More examples coming soon.

🎮 Controllable Generation

More examples coming soon.

🔍 Modality Estimation

More examples coming soon.

🗓️ TODO List

  • Release training code
  • Release inference code
  • Release pretrained models
  • Add Gradio demo, Colab notebook, and more usage examples
  • Release data
  • Release arXiv paper

⚖️ License

This repository is released under the Apache-2.0 license as found in the LICENSE file.

🚀 Stay Tuned for Updates!

Watch this repository to get notified when we release the code!


📚 Citation

If you find this work useful for your research, please cite:

@article{huang2025unityvideo,
  title={UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation},
  author={Huang, Jiehui and Zhang, Yuechen and He, Xu and Gao, Yuan and Cen, Zhi and Xia, Bin and Zhou, Yan and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.07831},
  year={2025}
}
