OneLLM: One Framework to Align All Modalities with Language

[Project Page] [Paper] [HF Demo🤗] [Modelscope Demo🤖] [Model🤗] [Data]

News

  • 2024.02.27 OneLLM is accepted by CVPR 2024!🎉
  • 2023.12.01 Release model weights and inference code.

Install

  1. Clone the repo into a local folder.
git clone https://github.com/csuhan/OneLLM

cd OneLLM
  2. Install packages.
conda create -n onellm python=3.9 -y
conda activate onellm

pip install -r requirements.txt

# install pointnet
cd model/lib/pointnet2
python setup.py install
  3. Install Apex. (Optional)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Models

A preview model is available on Hugging Face: csuhan/OneLLM-7B.

Demo

Hugging Face Demo: csuhan/OneLLM.

Local Demo: Assuming you have downloaded the weights to ${WEIGHTS_DIR}, run the following command to start a Gradio demo locally.

python demos/multi_turn_mm.py --gpu_ids 0 --tokenizer_path config/llama2/tokenizer.model --llama_config config/llama2/7B.json --pretrained_path ${WEIGHTS_DIR}/consolidated.00-of-01.pth

CLI Demo:

python demos/cli.py --image_path ${IMAGE_PATH} --gpu_ids 0 --tokenizer_path config/llama2/tokenizer.model --llama_config config/llama2/7B.json --pretrained_path ${WEIGHTS_DIR}/consolidated.00-of-01.pth
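The CLI command above can also be assembled programmatically, which helps when scripting runs over many images. The following is only a sketch: the helper name `build_cli_demo_command` and the example paths `/data/onellm` and `examples/cat.jpg` are hypothetical, and the flags are copied from the command above rather than taken from the script's argument parser.

```python
import shlex
from pathlib import PurePosixPath


def build_cli_demo_command(weights_dir, image_path, gpu_ids="0"):
    """Assemble the demos/cli.py argument list shown in this README.

    Hypothetical helper: it only mirrors the documented flags and does
    not check that the checkpoint or image actually exist.
    """
    checkpoint = PurePosixPath(weights_dir) / "consolidated.00-of-01.pth"
    return [
        "python", "demos/cli.py",
        "--image_path", str(image_path),
        "--gpu_ids", str(gpu_ids),
        "--tokenizer_path", "config/llama2/tokenizer.model",
        "--llama_config", "config/llama2/7B.json",
        "--pretrained_path", str(checkpoint),
    ]


# Example paths are placeholders, not files shipped with the repo.
cmd = build_cli_demo_command("/data/onellm", "examples/cat.jpg")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.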

Data

Please check Data.md for more details.

Evaluation

Please check Evaluation.md for more details.

Training

Image-Text Pretraining

Single Node 8-GPU Training: exps/image_text_pretrain_8gpu.sh

torchrun --nproc_per_node=8 main_pretrain.py \
--epochs 1 --dataset image \
--batch_size 40 --accum_iter 16 \
--model_parallel_size 1 \
--data_parallel sdp \
--save_consolidated \
--llama_type onellm \
--llama_ckpt_dir ${LLAMA_7B_PATH} \
--llama_config config/llama2/7B.json \
--tokenizer_path config/llama2/tokenizer.model \
--auto_resume \
--weight_decay 0.1 --output_dir ${OUTPUT_DIR} \
--warmup_iters 2000 --lr_decay_iters 200000 --lr 5e-5 --min_lr 5e-6 --clip_grad 2 \
--save_freq 1000 \
2>&1 | tee -a ${OUTPUT_DIR}/output.log
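To get a feel for the `--warmup_iters`, `--lr_decay_iters`, `--lr` and `--min_lr` values in the command above, here is a minimal sketch of a learning-rate schedule. It assumes linear warmup followed by cosine decay to `min_lr`; that shape is an assumption made for illustration, and the schedule actually implemented in `main_pretrain.py` may differ. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(it, base_lr=5e-5, min_lr=5e-6, warmup_iters=2000, lr_decay_iters=200000):
    """Learning rate at iteration `it` under an ASSUMED warmup + cosine schedule."""
    if it < warmup_iters:
        # Linear warmup from 0 up to base_lr.
        return base_lr * it / warmup_iters
    if it >= lr_decay_iters:
        # After the decay window, hold at the floor value.
        return min_lr
    # Cosine decay from base_lr down to min_lr over the remaining iterations.
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 101000, 200000):
    print(step, lr_at(step))
```

Under this assumed shape, the rate peaks at `5e-5` once warmup ends at iteration 2000 and settles at `5e-6` from iteration 200000 onward.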

Multi-Node DDP Training:

To launch multi-node DDP training, run the script on all N nodes at the same time. The following is an example script for one node:

MASTER_ADDR=IP_ADDRESS_OF_NODE_1
NNODES=N
MASTER_PORT=29500
NPROC_PER_NODE=8

# Node rank: 0 on the first node, up to N-1 on the last node
RANK=0

torchrun \
--nnodes=$NNODES \
--nproc_per_node=$NPROC_PER_NODE \
--node_rank=$RANK \
--master_port=$MASTER_PORT \
--master_addr=$MASTER_ADDR \
main_pretrain.py \
--epochs 1 --dataset image \
--batch_size 40 --accum_iter 16 \
--model_parallel_size 1 \
--data_parallel sdp \
--save_consolidated \
--llama_type onellm \
--llama_ckpt_dir ${LLAMA_7B_PATH} \
--llama_config config/llama2/7B.json \
--tokenizer_path config/llama2/tokenizer.model \
--auto_resume \
--weight_decay 0.1 --output_dir ${OUTPUT_DIR} \
--warmup_iters 2000 --lr_decay_iters 200000 --lr 5e-5 --min_lr 5e-6 --clip_grad 2 \
--save_freq 1000 \
2>&1 | tee -a ${OUTPUT_DIR}/output.log
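The relationship between `--nnodes`, `--nproc_per_node` and `--node_rank` can be sketched as follows. This assumes a static launch with the same number of processes on every node, where each process's global rank follows from its node rank and local rank; the function names are hypothetical and are not part of the repository or of `torchrun`.

```python
def world_size(nnodes, nproc_per_node=8):
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, local_rank, nproc_per_node=8):
    """Global rank of a process, ASSUMING equal process counts per node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    if node_rank < 0:
        raise ValueError("node_rank must be non-negative")
    return node_rank * nproc_per_node + local_rank


# Two nodes with 8 GPUs each: node ranks are 0 and 1.
print(world_size(2))        # total processes
print(global_rank(0, 0))    # first process on the first node
print(global_rank(1, 3))    # fourth process on the second node
```

This is why the node rank must be distinct on every node and must stay below the number of nodes.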

Multi-Node SLURM Training: exps/image_text_pretrain_slurm.sh

#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH -n 16
#SBATCH -N 2
#SBATCH --cpus-per-task=16

srun python -u main_pretrain.py \
--epochs 1 --dataset image \
--batch_size 40 --accum_iter 8 \
--model_parallel_size 1 \
--data_parallel sdp \
--save_consolidated \
--llama_type onellm \
--llama_ckpt_dir ${LLAMA_7B_PATH} \
--llama_config config/llama2/7B.json \
--tokenizer_path config/llama2/tokenizer.model \
--auto_resume \
--weight_decay 0.1 --output_dir ${OUTPUT_DIR} \
--warmup_iters 2000 --lr_decay_iters 200000 --lr 5e-5 --min_lr 5e-6 --clip_grad 2 \
--save_freq 1000 \
2>&1 | tee -a ${OUTPUT_DIR}/output.log
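The single-node script uses `--accum_iter 16` on 8 GPUs, while the SLURM script uses `--accum_iter 8` on 2 nodes with 8 GPUs each. A quick calculation suggests these settings keep the number of samples per optimizer step the same. The sketch below assumes `--batch_size` is the per-GPU batch size, which the README does not state explicitly; the function name is hypothetical.

```python
def effective_batch_size(per_gpu_batch, accum_iter, num_gpus):
    """Samples per optimizer step, ASSUMING --batch_size is per GPU."""
    return per_gpu_batch * accum_iter * num_gpus


# Single node: 8 GPUs, gradient accumulation over 16 iterations.
single_node = effective_batch_size(per_gpu_batch=40, accum_iter=16, num_gpus=8)

# SLURM: 2 nodes x 8 GPUs = 16 processes, accumulation over 8 iterations.
slurm = effective_batch_size(per_gpu_batch=40, accum_iter=8, num_gpus=16)

print(single_node, slurm)
```

When changing the GPU count, scaling `--accum_iter` inversely is one way to keep this product fixed.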

Multimodal-Text Pretraining

Stage II Pretraining: Assuming you have the pretrained ${IMAGE_TEXT_MODEL}, run exps/multimodal_text_pretrain_stage2.sh for video-audio-point-text pretraining.

Stage III Pretraining: Assuming you have the pretrained ${STAGE2_MODEL}, run exps/multimodal_text_pretrain_stage3.sh for depth-normal-imu-fmri-text pretraining.

Instruction Tuning

Assuming you have the pretrained ${STAGE3_MODEL}, run exps/multimodal_text_finetune.sh for multimodal instruction tuning.
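Each training stage starts from the checkpoint produced by the previous one. The sketch below records that ordering as data and reports which stages still lack their starting checkpoint. The `Stage` class, the stage labels and the helper `missing_prerequisites` are hypothetical; the script paths and variable names are the ones used in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str        # descriptive label (hypothetical)
    script: str      # launch script named in this README
    requires: str    # variable holding the checkpoint this stage starts from


STAGES = [
    Stage("Image-text pretraining", "exps/image_text_pretrain_8gpu.sh", "LLAMA_7B_PATH"),
    Stage("Stage II pretraining", "exps/multimodal_text_pretrain_stage2.sh", "IMAGE_TEXT_MODEL"),
    Stage("Stage III pretraining", "exps/multimodal_text_pretrain_stage3.sh", "STAGE2_MODEL"),
    Stage("Instruction tuning", "exps/multimodal_text_finetune.sh", "STAGE3_MODEL"),
]


def missing_prerequisites(env):
    """Return names of stages whose starting checkpoint variable is unset."""
    return [stage.name for stage in STAGES if not env.get(stage.requires)]


# Example: only the base LLaMA weights are available so far (placeholder path).
print(missing_prerequisites({"LLAMA_7B_PATH": "/path/to/llama"}))
```

Passing `os.environ` as `env` would run the same check against the current shell environment.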

Citation

@InProceedings{han2023onellm,
  title={OneLLM: One Framework to Align All Modalities with Language},
  author={Han, Jiaming and Gong, Kaixiong and Zhang, Yiyuan and Wang, Jiaqi and Zhang, Kaipeng and Lin, Dahua and Qiao, Yu and Gao, Peng and Yue, Xiangyu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

Acknowledgement

LLaMA, LLaMA-Adapter, LLaMA2-Accessory, Meta-Transformer, ChatBridge

License

This project is built on Llama 2; please refer to the Llama 2 Community License.
