📄 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
🌐 Ming-UniVision is a groundbreaking multimodal large language model (MLLM) that unifies vision understanding, generation, and editing within a single autoregressive next-token prediction (NTP) framework, powered by MingTok — the first continuous, unified visual tokenizer. By eliminating discrete quantization and leveraging a shared continuous latent space, Ming-UniVision enables seamless, end-to-end multimodal reasoning across diverse tasks. Trained on high-fidelity continuous visual representations, Ming-UniVision supports multi-round, in-context vision-language interactions, such as iterative question answering, image generation, and semantic editing — all without needing to decode intermediate states into pixels. This enables efficient, coherent, and human-like multimodal dialogue with consistent feature dynamics throughout.
- 🌐 First NTP MLLM with Continuous Unified Vision Representations: Ming-UniVision unifies vision and language via next-token prediction using continuous visual tokens — no discrete quantization, a fully autoregressive generative paradigm, and support for both understanding and generation in a shared latent space (a conceptual sketch follows this list).
- 🖼️ First Continuous Unified Visual Tokenizer: MingTok-Vision enables both understanding and generation in a single continuous space, preserving semantic and perceptual quality.
- ⚡ 3.5× Faster Training Convergence: Shared representation reduces conflict between tasks, enabling faster, more stable joint training.
- 🔄 Multi-Round In-Context Vision Tasks: Perform iterative reasoning, generation, and editing in one latent space — no image decoding needed mid-process.
- 🔗 Single Space, Unified Workflow: All modalities and tasks share one coherent feature space — simpler training, efficient inference, true autoregressive fusion.
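To make the unified next-token-prediction formulation above more concrete, here is a deliberately simplified conceptual sketch, not the actual Ming-UniVision code: text tokens are predicted with a classification head while continuous visual tokens are predicted in the same autoregressive sequence. The module names, sizes, and the plain regression head for visual tokens are illustrative assumptions; the real system builds on MingTok latents and a pretrained LLM.

```python
# Conceptual sketch only: unified autoregressive next-token prediction over
# text tokens (discrete) and visual tokens (continuous) in one shared sequence.
# Names, sizes, and the regression-style vision head are assumptions, not the
# actual Ming-UniVision implementation.
import torch
import torch.nn as nn

class UnifiedNTPSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=1024, vis_dim=1024):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.vis_proj = nn.Linear(vis_dim, d_model)   # continuous visual tokens enter the same sequence
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the causal LLM
        self.text_head = nn.Linear(d_model, vocab_size)  # next text token: classification
        self.vis_head = nn.Linear(d_model, vis_dim)      # next visual token: regression over a continuous latent

    def forward(self, text_ids, vis_tokens):
        # One interleaved causal sequence: [text embeddings, projected continuous visual tokens].
        seq = torch.cat([self.text_embed(text_ids), self.vis_proj(vis_tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        h = self.backbone(seq, mask=causal)
        return self.text_head(h), self.vis_head(h)

# Toy shapes: 8 text tokens followed by 256 continuous visual tokens of dimension 1024.
model = UnifiedNTPSketch()
text_logits, next_visual = model(torch.randint(0, 32000, (1, 8)), torch.randn(1, 256, 1024))
```

The point of the sketch is only that both modalities share one causal sequence and one hidden space, so no codebook or discrete quantization step is needed.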
[2025.10.09] 📄 Technical Report Released!
The full technical report is now available on arXiv:
👉 arXiv:2510.06590 | Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
Dive into the architecture, unified continuous tokenizer, and end-to-end autoregressive framework that power our system.
[2025.10.02] 🔥 We’re live!
We’re thrilled to announce the release of Ming-UniVision and MingTok-Vision — the first joint autoregressive vision-language system with unified continuous visual tokenization! ✨ It enables seamless multimodal reasoning, generation, and editing in a single latent space.
🚀 Faster training, richer semantics, and true end-to-end autoregression — no quantization, no compromises. 👉 Check out our blog post to learn how we’re redefining unified vision-language intelligence.
MingTok-Vision achieves strong image reconstruction capability and Ming-UniVision enables unified multimodal understanding and generation within a single continuous latent space.
MingTok-Vision achieves competitive reconstruction quality with high PSNR and low rFID, demonstrating its ability to preserve both perceptual fidelity and semantic structure in a continuous representation.
Tokenizer | Res. | # Tokens | rFID ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
---|---|---|---|---|---|---|
Specialized tokenizers | ||||||
SD-VAE | 256 | 1024 | 1.06 | 28.62 | 0.86 | - |
GigaTok | 256 | 256 | 0.51 | 21.32 | 0.69 | 0.21 |
VA-VAE | 256 | 256 | 0.26 | 28.59 | 0.80 | 0.09 |
HieraTok | 256 | 256 | 1.04 | 23.90 | 0.72 | 0.09 |
DC-AE | 512 | 64 | 0.22 | 26.15 | 0.71 | 0.08 |
MAE-Tok | 512 | 128 | 0.62 | - | - | - |
TexTok | 512 | 256 | 0.73 | 24.45 | 0.66 | 0.19 |
Unified tokenizers | ||||||
UniTok | 256 | 256 | 0.38 | - | - | - |
TokenFlow | 384 | 729 | 0.63 | 22.77 | 0.73 | - |
MingTok-Vision | 512 | 256 | 0.54 | 30.77 | 0.62 | 0.14 |
MingTok-Vision † | 512 | 256 | 0.38 | 31.09 | 0.64 | 0.12 |
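For reference, PSNR in the table above is the standard peak signal-to-noise ratio between an input image and its reconstruction. Below is a minimal PyTorch sketch of that computation; it is illustrative only and not the exact evaluation pipeline used to produce these numbers.

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()

# Example: compare an original image tensor with a (synthetic) reconstruction.
original = torch.rand(3, 512, 512)
reconstruction = (original + 0.01 * torch.randn_like(original)).clamp(0, 1)
print(f"PSNR: {psnr(original, reconstruction):.2f} dB")
```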
Ming-UniVision achieves competitive performance on multimodal understanding benchmarks, showing that continuous latent tokens can effectively support high-level vision-language reasoning without discrete quantization.
Model | MMB ↑ | MMS ↑ | MMMU ↑ | MathV ↑ | Hall ↑ | AI2D ↑ | MM-Vet ↑ | OCRBench ↑ | MME ↑ |
---|---|---|---|---|---|---|---|---|---|
Understanding Only | |||||||||
Emu3-Chat | 58.5 | - | 31.6 | - | - | - | 37.2 | 687 | - |
Qwen2.5-VL-3B | 79.1 | 55.9 | 53.1 | 62.3 | 46.3 | 81.6 | - | 797 | 2157 |
Qwen2.5-VL-7B | 83.5 | 63.9 | 58.6 | 68.2 | 52.9 | 83.9 | 67.1 | 864 | 2347 |
InternVL2.5-4B | 81.1 | 58.3 | 52.3 | 60.5 | 46.3 | 81.4 | 60.6 | 828 | 2338 |
InternVL2.5-8B | 84.6 | 62.8 | 56.0 | 64.4 | 50.1 | 84.5 | 62.8 | 822 | 2344 |
DeepSeek-VL2 | 79.6 | 61.3 | 51.1 | 62.8 | - | 81.4 | - | 811 | 2253 |
Unified model, Separate representation | |||||||||
Janus-Pro-7B | 79.2 | - | 41.0 | - | - | - | 50.0 | - | - |
LMFusion | - | - | 41.7 | - | - | - | - | - | 1603 |
MetaQuery-L | 78.6 | - | 53.1 | - | - | - | 63.2 | - | - |
Show-o2-7B | 79.3 | 56.6 | 48.9 | - | - | 78.6 | - | - | - |
BLIP3-o 4B | 78.6 | - | 46.6 | - | - | - | 60.1 | - | 2161 |
BAGEL | 85.0 | - | 55.3 | 73.1 | - | - | 67.2 | - | 2388 |
Unified model, Unified representation | |||||||||
VILA-U | - | - | - | - | - | - | 33.5 | - | 1402 |
TokenFlow-XL | 76.8 | - | 43.2 | - | - | - | 48.2 | - | 1922 |
UniTok | - | - | - | - | - | - | 33.9 | - | 1448 |
Harmon-1.5B | 65.5 | - | 38.9 | - | - | - | - | - | 1476 |
TokLIP | 67.6 | - | 43.1 | - | - | - | 29.8 | - | - |
Ming-UniVision-16B-A3B (Ours) | 78.5 | 63.7 | 40.3 | 66.6 | 47.8 | 82.8 | 64.2 | 724 | 2023 |
Ming-UniVision achieves top performance among unified-representation models in text-to-image generation (GenEval categories and DPG-Bench), demonstrating superior object composition and spatial reasoning capabilities.
Method | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color Attri. ↑ | Overall ↑ | DPG-Bench ↑ |
---|---|---|---|---|---|---|---|---|
Generation Only | ||||||||
LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 | - |
PixArt-α | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | - |
SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | - |
DALL-E 2 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52 | - |
Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 |
SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 |
DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | 83.50 |
SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 84.08 |
Unified model, Separate representation | ||||||||
Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 | - |
Ming-Lite-Uni | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 | - |
Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 | 82.63 |
Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 84.19 |
Show-o2-7B | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | 86.14 |
MetaQuery-L† | - | - | - | - | - | - | 0.78 | 81.10 |
Blip3-o 4B | - | - | - | - | - | - | 0.81 | 79.36 |
BAGEL | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | - |
Unified model, Unified representation | ||||||||
Harmon-1.5B | 0.99 | 0.86 | 0.66 | 0.85 | 0.74 | 0.48 | 0.79 | - |
TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 | 73.38 |
Ming-UniVision-16B-A3B (Ours) | 1.00 | 0.93 | 0.59 | 0.93 | 0.92 | 0.70 | 0.85 | 82.12 |
Model | Hugging Face | ModelScope |
---|---|---|
Ming-UniVision-16B-A3B | Download | Download |
MingTok-Vision | Download | Download |
🔗 Both models are publicly available for research. Visit the respective pages for model details, inference examples, and integration guides.
First, clone the repository and install the required dependencies:
git clone https://github.com/inclusionAI/Ming-UniVision.git
cd Ming-UniVision
conda create -n ming python=3.10 -y
conda activate ming
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu128
pip install flash_attn
modelscope download --model inclusionAI/Ming-UniVision-16B-A3B --local_dir ./models/Ming-UniVision-16B-A3B
modelscope download --model inclusionAI/MingTok-Vision --local_dir ./models/MingTok-Vision
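Alternatively, the weights can be fetched from Hugging Face with huggingface-cli; the repo IDs below are assumed to mirror the ModelScope ones under the inclusionAI organization, so check the model cards linked above if they differ:

```bash
# Assumed Hugging Face repo IDs; verify against the model pages before downloading.
huggingface-cli download inclusionAI/Ming-UniVision-16B-A3B --local-dir ./models/Ming-UniVision-16B-A3B
huggingface-cli download inclusionAI/MingTok-Vision --local-dir ./models/MingTok-Vision
```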
Then launch the demo with the precision that fits your GPU memory:
# >=44GB VRAM
python app.py
# >=22GB VRAM
python app.py --dtype int8
# >=14GB VRAM
python app.py --dtype int4
Here's a simple demo to reconstruct an image using the continuous latent space of MingTok-Vision:
# test_infer_recon_image.py
import torch
from PIL import Image
import torchvision.transforms as T
from mingtok.modeling_mingtok import MingTok
from mingtok.utils import CenterCropProcessor

if __name__ == "__main__":
    # Load the pretrained tokenizer and move it to the GPU in eval mode
    mingtok_model = MingTok.from_pretrained("./models/MingTok-Vision")
    mingtok_model = mingtok_model.cuda().eval()

    img_path = "mingtok/asset/mingtok.png"
    save_path = "mingtok/asset/mingtok_recon.png"

    # Load and preprocess the image (512x512 center crop, normalized to [-1, 1])
    image = Image.open(img_path).convert("RGB")
    processor = CenterCropProcessor(image_size=512, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
    image = processor(image).cuda().unsqueeze(0)

    # Encode to the continuous latent space and decode back to pixels
    with torch.no_grad():
        out = mingtok_model.forward_enc_dec(image)

    # Denormalize, clamp to [0, 1], and save
    output_mean = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
    output_std = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
    output_image = (out * output_std + output_mean)[0].clamp(0, 1)
    output_image = T.ToPILImage()(output_image)
    output_image.save(save_path)
    print(f"Reconstructed image saved to {save_path}")
Use MingUniVisionInfer for unified tasks, including image generation, image understanding, and single- and multi-round editing.
from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("./models/Ming-UniVision-16B-A3B")

# single round generation
image_gen_prompt = "Please generate the corresponding image based on the description. A cute girl."
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": image_gen_prompt},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, output_image_prefix="a_cute_girl")
model.reset_inner_state()

# single round understanding
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Please describe the picture in detail."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()

# multi-round editing
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Given the edit instruction: Change the color of her cloth to red, please identify the editing region"},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_0")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Change the color of her cloth to red."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_1")

messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Refine the image for better clarity."},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512, for_edit=True, output_image_prefix="edit_round_2")
model.reset_inner_state()

# single round text-based conversation (Chinese prompt: "Please describe the habits of parrots in detail.")
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "请详细介绍鹦鹉的习性。"},
    ],
}]
output_text = model.generate(messages, max_new_tokens=512)
print(output_text)
model.reset_inner_state()
📌 Tips:
- Image generation: Use descriptive prompts + output_image_prefix to save output.
- Image understanding: Include "image" and "text" in the same message.
- Image editing: Chain multiple generate(..., for_edit=True) calls with unique output_image_prefix names.
- Multi-turn interactions are supported via internal state; call model.reset_inner_state() to reset (see the two-turn sketch after these tips).
- Input types: "text" and "image" — flexible order, mixed inputs allowed.
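As a concrete multi-turn example, a single MingUniVisionInfer session can run an understanding turn followed by a generation turn without resetting state. The sketch below only reuses the message format and the generate / reset_inner_state calls shown above; the prompts and output names are illustrative, and per the limitations below, alternating understanding and generation turns is not what the current checkpoint was optimized for.

```python
from mingunivisioninfer import MingUniVisionInfer

model = MingUniVisionInfer("./models/Ming-UniVision-16B-A3B")

# Turn 1: image understanding; the conversation state is kept internally.
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "image", "image": "a_cute_girl.png"},
        {"type": "text", "text": "Please describe the picture in detail."},
    ],
}]
print(model.generate(messages, max_new_tokens=512))

# Turn 2: generation in the same session, conditioned on the previous turn.
# The prompt and output prefix are illustrative.
messages = [{
    "role": "HUMAN",
    "content": [
        {"type": "text", "text": "Please generate the corresponding image based on the description. The same girl, now wearing a red dress."},
    ],
}]
model.generate(messages, max_new_tokens=512, output_image_prefix="girl_red_dress")

# Clear the internal conversation state before starting an unrelated task.
model.reset_inner_state()
```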
📝 Note (Model Limitations):
- The current model was trained with only two-turn conversations, and has not been optimized for alternating rounds of image understanding and generation, although it may generalize to more than two turns during inference. As a result, performance may be limited in complex, multi-modal dialogue scenarios requiring deep contextual reasoning across turns.
- This open-sourced version was trained using mixed-resolution strategies: high resolution for image understanding, but lower resolution for image editing and generation. Additionally, large-scale interleaved image-text data was not included during pretraining.
- Due to these factors, image editing quality and consistency may be suboptimal compared to fully end-to-end, high-resolution multimodal models. We are actively working on improved versions with unified resolution training and richer interleaved data.
Note: We tested the examples on NVIDIA H800 (80 GB) / H20 (96 GB) GPUs with CUDA 12.4.
If you find our work useful in your research or applications, please consider citing:
@article{huang2025mingunivision,
  title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
  author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
  journal={arXiv preprint arXiv:2510.06590},
  year={2025}
}