TokenFlow🚀: Unified Image Tokenizer for Multimodal Understanding and Generation


🌿 Introduction

We present TokenFlow, a unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. TokenFlow introduces an innovative dual-codebook architecture that decouples semantic- and pixel-level feature learning while maintaining their alignment through a shared mapping mechanism.
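
To make the shared-mapping idea concrete, below is a minimal PyTorch sketch of a dual-codebook quantizer in which a single index, chosen from a combined distance over both codebooks, selects the semantic and pixel codes jointly. The class name, dimensions, and the distance weighting `w_sem` are illustrative assumptions, not the paper's exact formulation; please refer to the paper for details.

```python
# Minimal sketch of a dual-codebook quantizer with a shared mapping.
# Names, dimensions, and the distance weighting are illustrative assumptions.
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    def __init__(self, num_codes=8192, sem_dim=768, pix_dim=256, w_sem=0.5):
        super().__init__()
        self.sem_codebook = nn.Embedding(num_codes, sem_dim)  # semantic-level codes
        self.pix_codebook = nn.Embedding(num_codes, pix_dim)  # pixel-level codes
        self.w_sem = w_sem                                     # weight on the semantic distance

    def forward(self, sem_feat, pix_feat):
        # sem_feat: (N, sem_dim), pix_feat: (N, pix_dim) from the two encoder branches
        d_sem = torch.cdist(sem_feat, self.sem_codebook.weight)  # (N, num_codes)
        d_pix = torch.cdist(pix_feat, self.pix_codebook.weight)  # (N, num_codes)
        # Shared mapping: one index per token, chosen from a combined distance,
        # so semantic and pixel quantization stay aligned.
        idx = (self.w_sem * d_sem + (1.0 - self.w_sem) * d_pix).argmin(dim=-1)
        z_sem = self.sem_codebook(idx)  # quantized semantic feature
        z_pix = self.pix_codebook(idx)  # quantized pixel feature
        return z_sem, z_pix, idx
```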

*(Figure: radar chart of benchmark results.)*

TokenFlow excels in both multimodal understanding and image generation. For multimodal understanding, it surpasses flagship models such as LLaVA-1.5 and EMU3 by a large margin. For text-to-image generation, it achieves performance comparable to SDXL at 256×256 resolution.

*(Figure: teaser of understanding and generation results.)*

📰 News

2025.02.27: TokenFlow has been accepted to CVPR 2025.

2024.12.9: Code and checkpoints are released.

2024.12.5: 🎉🎉🎉 TokenFlow is released! 🎉🎉🎉 See our project page and paper.

⚙️ Getting Started

See GETTING_STARTED.md for detailed instructions on training and evaluating (1) the TokenFlow tokenizer, (2) the multimodal understanding model, and (3) the text-to-image generation model.

🤗 Checkpoints

Text-to-Image Model

| Model Size | Tokenizer Weight | Model Weight  |
|------------|------------------|---------------|
| 7B         | TokenFlow        | TokenFlow-t2i |

Multimodal Understanding Model

| Language Backbone | Tokenizer Weight | Model Weight                            |
|-------------------|------------------|-----------------------------------------|
| Qwen-2.5-14B      | TokenFlow-XL     | TokenFlow-llava-qwen2.5-14B-finetuning  |
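
If you use the Hugging Face-hosted weights, a minimal sketch of fetching them with `huggingface_hub` is shown below. The `repo_id` values are placeholders; use the repository names linked in the tables above.

```python
# Illustrative download of the released weights via huggingface_hub.
# The repo_id values below are placeholders -- substitute the actual
# repositories linked in the checkpoint tables above.
from huggingface_hub import snapshot_download

tokenizer_dir = snapshot_download(repo_id="<org>/TokenFlow")      # tokenizer weights (placeholder)
t2i_dir = snapshot_download(repo_id="<org>/TokenFlow-t2i")        # text-to-image weights (placeholder)
print(tokenizer_dir, t2i_dir)
```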

📑 Open-source Plan

- Release the checkpoints of the tokenizer, text-to-image model, and multimodal understanding model.
- Release the training & inference code for the tokenizer.
- Release the training & inference code for text-to-image generation.
- Release the training & inference code for multimodal understanding.

Acknowledgement

We thank the authors of VAR, LlamaGen, and LLaVA for their great work.

📄 Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{qu2024tokenflow,
  title={Tokenflow: Unified image tokenizer for multimodal understanding and generation},
  author={Qu, Liao and Zhang, Huichao and Liu, Yiheng and Wang, Xu and Jiang, Yi and Gao, Yiming and Ye, Hu and Du, Daniel K and Yuan, Zehuan and Wu, Xinglong},
  journal={arXiv preprint arXiv:2412.03069},
  year={2024}
}

🔥 Open positions

We are hiring interns and full-time researchers at ByteDance, with a focus on multimodal understanding and generation (preferred base: Hangzhou, Beijing, and Shenzhen). If you are interested, please contact quliao1117@gmail.com.
