OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
- [2025/09/22] 🔥 After a year of community evaluation, our work has been accepted by NeurIPS 2025. Congratulations!
- [2025/05/26] 🔥 Our [OmniCharacter], built on the MMEvol and OpenOmni series, has been accepted to the main track of ACL 2025. You're all welcome to give it a try!
- [2025/05/15] 🔥 Two papers based on our findings (LLaMA-Omni2 and OmniCharacter) have been accepted to the ACL 2025 main track. We warmly welcome everyone to use our work.
- [2025/05/05] 🔥 Our gate fusion technique for more accurate speech content generation has been adopted by LLaMA-Omni2.
- [2025/02/12] 🔥 Added some missing files and fixed known bugs.
- [2025/01/13] 🔥 OpenOmni is here! We release the code, models, and data.
- [2025/01/09] 🔥 After two months of company audit, we release the paper.
- [2024/11/14] 🔥 We submitted the paper for peer review on OpenReview.
- [2024/09/15] 🔥 We wrote the first line of the OpenOmni project: a pioneering, fully open-sourced, end-to-end omnimodal LLM.
- Setup
- Model
- Preparation
- Train
- Evaluation
- Example
- Citation
Please follow the instructions below to install the required packages.
- Clone this repository
git clone https://github.com/RainBowLuoCS/OpenOmni.git
cd OpenOmni
- Install packages
conda create -n openomni python=3.10 -y
conda activate openomni
pip install --upgrade pip # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
- Install additional packages for training
pip install flash-attn --no-build-isolation
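Optionally, you can verify the environment before moving on. This check is not part of the official setup; it only confirms that PyTorch sees a GPU and that flash-attn imports cleanly:

```python
# Optional sanity check (not part of the official setup).
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())   # should be True on a GPU machine
print("flash-attn version:", flash_attn.__version__)  # import fails if the build step went wrong
```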
After downloading the weights, configure the paths properly. Two open-sourced speech tokenizers with different vocabulary sizes are needed for speech discretization and reconstruction: CosyVoice for the 6K CTC mode and GLM4Voice for the 16K AR mode.
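To make the two modes concrete, the sketch below illustrates what "discrete speech units" means in general: the tokenizer converts a waveform into a sequence of codebook indices, and reconstruction maps such a sequence back to audio. The sizes and unit IDs here are placeholders taken from the README's "6K"/"16K" description, not OpenOmni's actual tokenizer API.

```python
# Illustrative only: shows the idea of discrete speech units, not OpenOmni's code.
CTC_VOCAB_SIZE = 6000   # approximate "6K" codebook (CosyVoice-style units, assumed size)
AR_VOCAB_SIZE = 16000   # approximate "16K" codebook (GLM4Voice-style units, assumed size)

# A discretized utterance is just a sequence of codebook indices; reconstruction
# maps such a sequence back to a waveform with the same tokenizer.
example_units = [12, 407, 5998]  # hypothetical unit IDs for a short clip
assert all(0 <= u < CTC_VOCAB_SIZE for u in example_units)
print(f"{len(example_units)} speech units, codebook size {CTC_VOCAB_SIZE}")
```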
Fast inference for omnimodal input (speech, text, image, and video):
python inference.py
Fast interaction for omnimodal input (speech, text, image, and video):
python demo.py
Here are the pretrained weights and instruction-tuning weights:
| Stage | Model | Speech Projector | Image Projector | IT Data | Download |
|---|---|---|---|---|---|
| 1-1 | OpenOMNI-Qwen2-7B-Stage1-1 | ckpt | ckpt | openomni_stage1-1.json | ckpt |
| 2-1 | OpenOMNI-Qwen2-7B-Stage2-1 | ckpt | ckpt | openomni_stage2-1.json | ckpt |
| 2-2 | OpenOMNI-Qwen2-7B-Stage2-2 | ckpt | ckpt | openomni_stage2-2.json | ckpt |
| 3-1 | OpenOMNI-Qwen2-7B-Stage3-1 | ckpt | ckpt | openomni_stage3-1.json | ckpt |
| 3-2 | OpenOMNI-Qwen2-7B-Stage3-2 | ckpt | ckpt | openomni_stage3-2.json | ckpt |
Please follow MMEvol to prepare the corresponding image-text datasets. Here we only provide the details of the speech-text datasets.
The following is the data directory tree of OpenOmni:
datasets
├── json # data recipes
│   ├── openomni_stage1-1.json # speech2text pretraining
│   ├── openomni_stage2-1.json # image2text pretraining
│   ├── openomni_stage2-2.json # image2text instruction tuning
│   ├── openomni_stage3-1.json # text2speech pretraining
│   └── openomni_stage3-2.json # text2speech emotional injection
├── asr # classic bilingual speech corpora
│   ├── AISHELL-4
│   ├── LibriSpeech
│   └── WeNetSpeech
├── audio_en # synthetic English speech corpus for questions
├── audio_llava # synthetic bilingual speech corpus for answers
├── audio_zh # synthetic Chinese speech corpus for questions
├── audio_unit # synthetic bilingual speech corpus for answers
├── audio_prefer # synthetic emotional bilingual speech corpus for answers (preferred)
├── audio_reject # synthetic emotional bilingual speech corpus for answers (rejected)
├── audio_ultrachat # synthetic bilingual speech corpus for answers
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
......
- All files/paths starting with "audio" are self-synthesized.
- The DPO data contains approximately 9k "prefer"/"reject" pairs, covering 9 types of emotions (a hypothetical entry is sketched below).
More details about data curation can be found in our paper.
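For orientation, a preference pair for the emotional DPO stage might look roughly like the following. The field names and paths are hypothetical placeholders for illustration, not the released schema; consult openomni_stage3-2.json for the actual format.

```python
# Hypothetical illustration of one emotional-speech preference pair.
# Field names and paths are made up for clarity; see openomni_stage3-2.json
# for the real schema used by the repository.
example_pair = {
    "prompt": "what a nice day.",                       # text the model should speak
    "emotion": "happy",                                 # one of the 9 emotion types
    "chosen": "datasets/audio_prefer/happy_0001.wav",   # preferred (emotion-consistent) speech
    "rejected": "datasets/audio_reject/happy_0001.wav"  # rejected (mismatched) speech
}
print(example_pair)
```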
Please download MMEvol, AISHELL-4, LibriSpeech, WeNetSpeech, and the OpenOmni data, and organize them following Preparation before training. Make sure the corresponding training script is set up correctly (data path, weight path, and hyper-parameters).
bash scripts/train/llama3/speech2text_pretrain.sh
bash scripts/train/qwen2/speech2text_pretrain.sh
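Before launching a stage, it can help to confirm that the data recipe resolves to actual files. The snippet below is an optional helper, assuming the recipe file is a JSON list of samples; adjust it to the real schema as needed.

```python
# Optional pre-flight check (assumes the recipe file is a JSON list of samples).
import json

with open("datasets/json/openomni_stage1-1.json") as f:
    samples = json.load(f)

print(f"{len(samples)} samples in the speech2text pretraining recipe")
print(samples[0])  # inspect one entry to confirm its paths point inside datasets/
```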
Please make sure you download and organize the data following Preparation before training, and that the corresponding training script is set up correctly (data path, weight path, and hyper-parameters).
bash scripts/train/llama3/image2text_pretrain.sh
bash scripts/train/qwen2/image2text_pretrain.sh
Please make sure you download and organize the data following Preparation before training, and that the corresponding training script is set up correctly (data path, weight path, and hyper-parameters).
bash scripts/train/llama3/image2text_finetune.sh
bash scripts/train/qwen2/image2text_finetune.sh
Please make sure you download and organize the data following Preparation before training, and that the corresponding training script is set up correctly (data path, weight path, and hyper-parameters).
bash scripts/train/llama3/text2speech_pretrain.sh
bash scripts/train/qwen2/text2speech_pretrain.sh
Please make sure you download and organize the data following Preparation before training, and that the corresponding training script is set up correctly (data path, weight path, and hyper-parameters).
bash scripts/train/llama3/text2speech_dpo.sh
bash scripts/train/qwen2/text2speech_dpo.sh
datasets
├── json # data recipes
│   ├── aishell2_eval.jsonl # AISHELL-2 evaluation
│   ├── librispeech_eval.jsonl # LibriSpeech evaluation
│   ├── wenetspeech_eval.json # WeNetSpeech evaluation
│   └── openomni_emotion_val.json # emotion evaluation
├── OmniBench # OmniBench
│   ├── mmdata
│   ├── dataset
│   └── eval.json
└── Ov-Odyssey # Ov-Odyssey Bench
    ├── av_odyssey_part1.parquet
    ├── av_odyssey_part2.parquet
    ├── av_odyssey_part3.parquet
    ├── av_odyssey_part4.parquet
    └── av_odyssey_part5.parquet
Make sure the corresponding evaluation script is set up correctly (data path, weight path, and hyper-parameters).
python openomni/eval/llama3/asr_eavl.py
python openomni/eval/qwen2/asr_eavl.py
| Model | LibriSpeech-test-clean | LibriSpeech-test-other | AIShell2-dev | AIShell2-test | WeNetSpeech-testnet | WeNetSpeech-testmeeting |
|---|---|---|---|---|---|---|
| VITA | 8.1 | 18.4 | 12.2 | 16.5 | | |
| EMOVA | 4.0 | 8.6 | 10.6 | 10.3 | | |
| MINI-OMNI | 4.5 | 9.7 | | | | |
| Freeze-Omni | 3.29 | 7.4 | 8.57 | 10.09 | | |
| ours | 2.57 | 5.6 | 6.81 | 6.87 | 7.63 | |
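The numbers above are recognition error rates (typically WER for the English sets and CER for the Chinese sets; lower is better). If you want to score your own transcripts the same way, the jiwer package is one common choice; this is a generic sketch, not the repository's scoring code:

```python
# Generic WER/CER scoring example using jiwer (pip install jiwer).
# Not the repository's official scoring script.
import jiwer

reference = "she sells seashells by the seashore"
hypothesis = "she sells sea shells by the sea shore"

print("WER:", jiwer.wer(reference, hypothesis))  # word error rate
print("CER:", jiwer.cer(reference, hypothesis))  # character error rate (used for Chinese)
```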
Refer to MMEvol for the detailed OpenCompass vision-language evaluation.
# run on all datasets
./script/run_inference.sh OpenOmni-Qwen "MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK" all
# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh OpenOmni-Qwen MME all
# MMMU_DEV_VAL
./script/run_inference.sh OpenOmni-Qwen MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh OpenOmni-Qwen MathVista_MINI all
.....
Please download OmniBench and run the following commands
python openomni/eval/llama3/omni_eavl.py
python openomni/eval/qwen2/omni_eavl.py
Please download Ov-Odyssey and run the following commands
python openomni/eval/llama3/ov_odyssey_eavl.py
python openomni/eval/qwen2/ov_odyssey_eavl.py
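If you want to inspect the Ov-Odyssey shards before running the evaluation script, they can be loaded with pandas. This is only a convenience sketch, assuming pandas and pyarrow are installed:

```python
# Optional: peek at the Ov-Odyssey benchmark shards (requires pandas + pyarrow).
import glob
import pandas as pd

parts = sorted(glob.glob("datasets/Ov-Odyssey/av_odyssey_part*.parquet"))
df = pd.concat(pd.read_parquet(p) for p in parts)
print(len(df), "evaluation items")
print(df.columns.tolist())  # column names depend on the benchmark release
```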
For text-to-speech evaluation, run:
python openomni/eval/llama3/t2s_eavl.py
python openomni/eval/qwen2/t2s_eavl.py
For emotional text-to-speech evaluation, run:
python openomni/eval/llama3/et2s_eavl.py
python openomni/eval/qwen2/et2s_eavl.py
| Chinese tongue twister | Demo |
|---|---|
| 四是四，十是十，十四是十四，四十是四十。 | zh_0.webm |
| 黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。 | zh_1.webm |
| 吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。 | zh_2.webm |
| 八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。 | zh_3.webm |
| 红凤凰，黄凤凰，粉红凤凰，花凤凰。 | zh_4.webm |
| 牛郎年年恋刘娘，刘娘念念恋牛郎。 | zh_5.webm |
| English tongue twister | Demo |
|---|---|
| She sells seashells by the seashore. | en_0.webm |
| Peter Piper picked a peck of pickled peppers. | en_1.webm |
| Six slippery snails slid slowly seaward. | en_2.webm |
| Six sleek swans swam swiftly southwards. | en_3.webm |
| I saw Susie sitting in a shoeshine shop. | en_4.webm |
| Can you can a can as a canner can can a can? | en_5.webm |
| Emotion | English text | Demo | Chinese text | Demo |
|---|---|---|---|---|
| Sad | I am so sad. | sad_en.mp4 | 我真的很难过. | sad.mp4 |
| Angry | Why are you doing this to me? | angry_en.mp4 | 你为什么要这样，我真的很生气. | angry.mp4 |
| Happy | What a nice day. | happy_en.mp4 | 今天天气真好. | happy.mp4 |
| Fearful | I am very scared. | fearful_en.mp4 | 我真有点害怕. | fearful.mp4 |
demo.mov
If you find this repo useful for your research, please consider citing our papers:
@article{luo2025openomni,
title={OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis},
author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
journal={arXiv preprint arXiv:2501.04561},
year={2025}
}
@article{luo2024mmevol,
title={MMEvol: Empowering multimodal large language models with Evol-Instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={ACL 2025},
year={2024}
}
@article{zhang2025omnicharacter,
title={OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction},
author={Zhang, Haonan and Luo, Run and Liu, Xiong and Wu, Yuchuan and Lin, Ting-En and Zeng, Pengpeng and Qu, Qiang and Fang, Feiteng and Yang, Min and Gao, Lianli and others},
journal={ACL 2025},
year={2025}
}
If you have any questions, please consider contacting the following people for help:
- Run Luo: r.luo@siat.ac.cn
- Haonan Zhang: zchiowal@gmail.com
- LLaVA and LLaMA-Omni: the codebases we built upon. Thanks for their brilliant contributions to the community! We just can't wait for you to use OpenOmni.
- VLMEvalKit: the amazing open-sourced suite for evaluating various LMMs!
- CosyVoice: the amazing open-sourced speech tokenizer for speech discretization and reconstruction with a 6K vocabulary!
- GLM4Voice: the amazing open-sourced speech tokenizer for speech discretization and reconstruction with a 16K vocabulary!