TokenFlow🚀: Unified Image Tokenizer for Multimodal Understanding and Generation


🌿 Introduction

We present TokenFlow, a unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. TokenFlow introduces an innovative dual-codebook architecture that decouples semantic- and pixel-level feature learning while maintaining their alignment through a shared mapping mechanism.
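
To make the shared-mapping idea concrete, below is a minimal PyTorch sketch of a dual-codebook quantizer in which a single index, chosen from a combined distance over both codebooks, selects the semantic and pixel codes jointly. The class name, dimensions, and the distance weighting `w_sem` are illustrative assumptions, not the paper's exact formulation; please refer to the paper for details.

```python
# Minimal sketch of a dual-codebook quantizer with a shared mapping.
# Names, dimensions, and the distance weighting are illustrative assumptions.
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    def __init__(self, num_codes=8192, sem_dim=768, pix_dim=256, w_sem=0.5):
        super().__init__()
        self.sem_codebook = nn.Embedding(num_codes, sem_dim)  # semantic-level codes
        self.pix_codebook = nn.Embedding(num_codes, pix_dim)  # pixel-level codes
        self.w_sem = w_sem                                     # weight on the semantic distance

    def forward(self, sem_feat, pix_feat):
        # sem_feat: (N, sem_dim), pix_feat: (N, pix_dim) from the two encoder branches
        d_sem = torch.cdist(sem_feat, self.sem_codebook.weight)  # (N, num_codes)
        d_pix = torch.cdist(pix_feat, self.pix_codebook.weight)  # (N, num_codes)
        # Shared mapping: one index per token, chosen from a combined distance,
        # so semantic and pixel quantization stay aligned.
        idx = (self.w_sem * d_sem + (1.0 - self.w_sem) * d_pix).argmin(dim=-1)
        z_sem = self.sem_codebook(idx)  # quantized semantic feature
        z_pix = self.pix_codebook(idx)  # quantized pixel feature
        return z_sem, z_pix, idx
```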

*(Figure: radar chart of benchmark results.)*

TokenFlow excels in both multimodal understanding and image generation. For multimodal understanding, it surpasses flagship models such as LLaVA-1.5 and EMU3 by a large margin. For text-to-image generation, it achieves performance comparable to SDXL at 256×256 resolution.

*(Figure: teaser of understanding and generation results.)*

📰 News

2025.02.27: TokenFlow has been accepted to CVPR 2025.

2024.12.9: Code and checkpoints are released.

2024.12.5: 🎉🎉🎉 TokenFlow is released! 🎉🎉🎉 See our project page and paper.

⚙️ Getting Started

See GETTING_STARTED.md for detailed instructions on training and evaluating (1) the TokenFlow tokenizer, (2) the multimodal understanding model, and (3) the text-to-image generation model.

🤗 Checkpoints

Text-to-Image Model

| Model Size | Tokenizer Weight | Model Weight  |
|------------|------------------|---------------|
| 7B         | TokenFlow        | TokenFlow-t2i |

Multimodal Understanding Model

| Language Backbone | Tokenizer Weight | Model Weight                            |
|-------------------|------------------|-----------------------------------------|
| Qwen-2.5-14B      | TokenFlow-XL     | TokenFlow-llava-qwen2.5-14B-finetuning  |
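
If you use the Hugging Face-hosted weights, a minimal sketch of fetching them with `huggingface_hub` is shown below. The `repo_id` values are placeholders; use the repository names linked in the tables above.

```python
# Illustrative download of the released weights via huggingface_hub.
# The repo_id values below are placeholders -- substitute the actual
# repositories linked in the checkpoint tables above.
from huggingface_hub import snapshot_download

tokenizer_dir = snapshot_download(repo_id="<org>/TokenFlow")      # tokenizer weights (placeholder)
t2i_dir = snapshot_download(repo_id="<org>/TokenFlow-t2i")        # text-to-image weights (placeholder)
print(tokenizer_dir, t2i_dir)
```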

📑 Open-source Plan

- Release the checkpoints of the tokenizer, text-to-image model, and multimodal understanding model.
- Release the training & inference code for the tokenizer.
- Release the training & inference code for text-to-image generation.
- Release the training & inference code for multimodal understanding.

Acknowledgement

We thank the authors of VAR, LlamaGen, and LLaVA for their great work.

📄 Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{qu2024tokenflow,
  title={Tokenflow: Unified image tokenizer for multimodal understanding and generation},
  author={Qu, Liao and Zhang, Huichao and Liu, Yiheng and Wang, Xu and Jiang, Yi and Gao, Yiming and Ye, Hu and Du, Daniel K and Yuan, Zehuan and Wu, Xinglong},
  journal={arXiv preprint arXiv:2412.03069},
  year={2024}
}

🔥 Open positions

We are hiring interns and full-time researchers at ByteDance, with a focus on multimodal understanding and generation (preferred base: Hangzhou, Beijing, and Shenzhen). If you are interested, please contact quliao1117@gmail.com.
