
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

📄 Paper | 💻 Github | 🤗 RACRO-Models | 🤗 RACRO-Demo

This repository provides training, inference, and evaluation instructions for the paper:

Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning
Yunhao Gou*, Kai Chen*, Zhili Liu*, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang†
*Equal contribution †Corresponding Author
Arxiv Preprint, 2025

⚡ News

📖 Introduction

This paper introduces Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO), a novel framework that enables scalable and modular multimodal reasoning by aligning visual perception with a powerful text-only reasoner. RACRO addresses the key challenge of generating image captions that are both faithful and sufficiently informative for downstream reasoning. It leverages a reasoning-guided reinforcement learning strategy to train the visual extractor, using reward signals derived from the performance of a fixed, high-capacity text-only LLM. This decoupled design avoids costly retraining of vision-language alignments and allows seamless plug-and-play upgrades to more advanced reasoners. Experiments on multimodal math and science benchmarks show that RACRO achieves state-of-the-art performance among open models.
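
To make the caption-reward idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation; the function names, prompt format, and exact-match reward are assumptions) of how a sampled caption can be scored by whether a frozen text-only reasoner recovers the ground-truth answer from it:

# Minimal sketch of a reasoning-guided caption reward (illustrative only).
# `reasoner` stands in for a frozen text-only LLM: it maps a prompt string to a response.
from typing import Callable, List
import re

def cro_reward(caption: str, question: str, ground_truth: str,
               reasoner: Callable[[str], str]) -> float:
    """Reward a caption by whether the reasoner can solve the question from it."""
    prompt = f"Caption: {caption}\nQuestion: {question}\nPut the final answer in \\boxed{{}}."
    response = reasoner(prompt)
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == ground_truth.strip() else 0.0

def score_caption_group(captions: List[str], question: str, ground_truth: str,
                        reasoner: Callable[[str], str]) -> List[float]:
    """Rewards for a group of sampled captions, as used in a GRPO-style update."""
    return [cro_reward(c, question, ground_truth, reasoner) for c in captions]

if __name__ == "__main__":
    # Stub reasoner so the sketch runs end-to-end without loading a model.
    stub = lambda prompt: "Adding the two numbers gives \\boxed{5}."
    print(score_caption_group(["caption A", "caption B"], "What is 2 + 3?", "5", stub))

Because the reasoner only consumes text, it stays frozen during training and can be swapped for a stronger one at inference time without retraining the vision-language alignment.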

📈 Results

📦 Contents

🎯 Model Zoos

| Model | Dataset | 🤗 Huggingface | Base Model | Reasoner |
| --- | --- | --- | --- | --- |
| RACRO-3B-CRO | ViRL39K | TBD | Qwen2.5-VL-3B-Instruct | DS-R1-Distilled-7B |
| RACRO-3B-CRO-GRPO | ViRL39K | TBD | Qwen2.5-VL-3B-Instruct | DS-R1-Distilled-7B |
| RACRO-7B-CRO | ViRL39K | [Link] | Qwen2.5-VL-7B-Instruct | DS-R1-Distilled-7B |
| RACRO-7B-CRO-GRPO | ViRL39K | [Link] | Qwen2.5-VL-7B-Instruct | DS-R1-Distilled-7B |
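
If you want to fetch a released checkpoint ahead of time, the Hugging Face Hub client can download it locally. The snippet below is an optional convenience sketch; it uses the RACRO-7B-CRO-GRPO repo id referenced in the Quick Start, so swap in the model you need:

# Optional: pre-download a released checkpoint from the Hugging Face Hub.
# Requires `pip install huggingface_hub`; pass the repo id of the model you want.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="KaiChen1998/RACRO-7B-CRO-GRPO")
print("Checkpoint downloaded to:", local_dir)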

🔧 Installation

# Create conda environment
conda create -n racro python==3.11
conda activate racro

# Install dependencies
cd verl-main
pip install -r requirements.txt

# Install flash-attention (see: https://github.com/Dao-AILab/flash-attention/releases)
# Follow the instructions for your CUDA version

# Install verl-main
pip install -e .

# Install VLMEvalKit
cd ../VLMEvalKit
pip install -e .
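
As an optional sanity check (not part of the official setup), you can confirm that the key dependencies import correctly before moving on:

# Optional sanity check: report versions of the main dependencies.
# Package names below are the usual ones; adjust if your environment differs.
import importlib

for name in ("torch", "transformers", "vllm", "flash_attn"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'version unknown')}")
    except ImportError as err:
        print(f"{name}: NOT importable ({err})")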

🔥 Quick Start with vLLM

from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

########################
# === Configuration ===
########################
IMAGE_PATH = "./assets/images/demo_example.jpg"
QUESTION = "When the canister is momentarily stopped by the spring, by what distance $d$ is the spring compressed?"

MLLM_MODEL_PATH = "KaiChen1998/RACRO-7B-CRO-GRPO"
LLM_MODEL_PATH = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # feel free to use more advanced reasoners!

########################
# === Prompts ===
########################
SYSTEM_PROMPT_CAP = "You are given an image and a relevant question. Based on the query, please describe the image in details. Do not try to answer the question."
SYSTEM_PROMPT_LLM = "You are a helpful assistant."

CAPTION_PROMPT = "Question: {}\nPlease describe the image. DO NOT try to answer the question!"
LLM_PROMPT = """In the following text, you will receive a detailed caption of an image and a relevant question. In addition, you will be provided with a tentative model response. You goal is to answer the question using these information.

### The detailed caption of the provided image: {}

### Note that the caption might contain incorrect solutions, do not be misguided by them.

### A problem to be solved: {}

### A tentative model response: {}

### Note that the above tentative response might be inaccurate (due to calculation errors, incorrect logic/reasoning and so on), under such a case, please ignore it and give your own solutions. However, if you do not have enough evidence to show it is wrong, please output the tentative response."""

########################
# === Initialize Models ===
########################
processor = AutoProcessor.from_pretrained(MLLM_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL_PATH)

mllm = LLM(model=MLLM_MODEL_PATH, tensor_parallel_size=1, gpu_memory_utilization=0.8,
           device='cuda:0', dtype="bfloat16", limit_mm_per_prompt={"image": 1})

llm = LLM(model=LLM_MODEL_PATH, tensor_parallel_size=1, gpu_memory_utilization=0.8,
          device='cuda:1', dtype="bfloat16")

mllm_sampling = SamplingParams(temperature=0, max_tokens=8192)
llm_sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)

########################
# === Build Prompts ===
########################
def build_messages(image_path, question):
    cap_msgs = [
        {"role": "system", "content": SYSTEM_PROMPT_CAP},
        {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": CAPTION_PROMPT.format(question)}]}
    ]
    qa_msgs = [
        {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": question + " Please think step by step. The final answer MUST BE put in \\boxed{}."}]}
    ]
    return cap_msgs, qa_msgs

# === Run Captioning and QA ===
def run_mllm(image_tensor, cap_prompt, qa_prompt):
    cap_output = mllm.generate([{"multi_modal_data": {"image": image_tensor}, "prompt": cap_prompt[0]}], sampling_params=mllm_sampling)
    qa_output = mllm.generate([{"multi_modal_data": {"image": image_tensor}, "prompt": qa_prompt[0]}], sampling_params=mllm_sampling)
    return cap_output[0].outputs[0].text, qa_output[0].outputs[0].text

# === Final Reasoning Step ===
def run_llm_reasoning(caption, question, answer):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT_LLM},
        {"role": "user", "content": LLM_PROMPT.format(caption, question, answer)}
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    output = llm.generate([{"prompt": prompt}], sampling_params=llm_sampling)
    return output[0].outputs[0].text

########################
# === Pipeline ===
########################
cap_msgs, qa_msgs = build_messages(IMAGE_PATH, QUESTION)
cap_prompt = processor.apply_chat_template([cap_msgs], tokenize=False, add_generation_prompt=True)
qa_prompt = processor.apply_chat_template([qa_msgs], tokenize=False, add_generation_prompt=True)

image_tensor, _ = process_vision_info(cap_msgs)
caption_text, tentative_answer = run_mllm(image_tensor, cap_prompt, qa_prompt)
final_answer = run_llm_reasoning(caption_text, QUESTION, tentative_answer)

print("Final Answer:\n", final_answer)

🏋️‍♂️ Training

  1. Prepare training data. Download the ViRL39K dataset and preprocess it:

python verl-main/examples/data_preprocess/virl39k_pre.py \
  --src-parquet /cache/data/datasets/ViRL39K/39Krelease.parquet \
  --tgt-dir /cache/data/huggingface_datasets/virl39k_hf_no_deepscaler

python verl-main/examples/data_preprocess/virl39k.py \
  --src-hf-dataset /cache/data/huggingface_datasets/virl39k_hf_no_deepscaler/ \
  --tgt-parquet /cache/data/huggingface_datasets/virl39k_no_deepscaler_caption.parquet

  2. Launch training. We provide an example that trains Qwen2.5-VL-3B-Instruct using DeepSeek-R1-Distilled-7B as the reasoner:

bash verl-main/examples/grpo_trainer/captioner3b_7b.sh

  3. Convert checkpoints to HuggingFace format:

bash verl-main/scripts/convert2hf.sh

🔍 Inference and Evaluation

  1. Set environment variables:

DATASET=MathVista_MINI
MODEL_ROOT=/cache/data/huggingface_models/
MLLM_NAME=Qwen2.5-VL-3B-Instruct
LLM_NAME=DeepSeek-R1-Distill-Qwen-7B

  2. Generate Tentative QA Response:

python VLMEvalKit/run.py --data ${DATASET} \
  --model MLLM --model_path ${MODEL_ROOT}/${MLLM_NAME} \
  --tensor_parallel_size 1 --function_type qa_cot \
  --system_prompt "You are a helpful assistant." \
  --work-dir VLMEvalKit/outputs --api_nproc 32 \
  --suffix _model_${MLLM_NAME}_prompt_qa_cot

  3. Generate Query-Conditioned Captions:

python VLMEvalKit/run.py --data ${DATASET} \
  --model MLLM --model_path ${MODEL_ROOT}/${MLLM_NAME} \
  --tensor_parallel_size 1 --function_type query_cond_3 \
  --system_prompt "You are given an image and a relevant question. Based on the query, please describe the image in detail. Do not try to answer the question." \
  --mode infer --work-dir VLMEvalKit/outputs \
  --suffix _model_${MLLM_NAME}_prompt_query_cond_3

  4. LLM Reasoning:

QA_FILE=VLMEvalKit/outputs/MLLM/MLLM_${DATASET}_model_${MLLM_NAME}_prompt_qa_cot.csv
CAPTION_FILE=VLMEvalKit/outputs/MLLM/MLLM_${DATASET}_model_${MLLM_NAME}_prompt_query_cond_3.csv

python VLMEvalKit/run.py --data ${DATASET} \
  --model LLM --model_path ${MODEL_ROOT}/${LLM_NAME} \
  --tensor_parallel_size 4 --function_type joint \
  --work-dir VLMEvalKit/outputs --api_nproc 32 \
  --qa-file ${QA_FILE} \
  --caption-file ${CAPTION_FILE} \
  --suffix _model_${LLM_NAME}_cap_qa_${MLLM_NAME}_prompt_joint
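
For a quick look at the intermediate prediction files produced above, you can load them with pandas. This is only an illustrative sketch: the exact columns depend on VLMEvalKit's CSV format, and the paths assume the variable values set in step 1.

# Inspect the tentative-answer and caption CSVs written by steps 2-3.
# Paths assume the environment variables from step 1; adjust to your run.
import pandas as pd

qa_file = "VLMEvalKit/outputs/MLLM/MLLM_MathVista_MINI_model_Qwen2.5-VL-3B-Instruct_prompt_qa_cot.csv"
cap_file = "VLMEvalKit/outputs/MLLM/MLLM_MathVista_MINI_model_Qwen2.5-VL-3B-Instruct_prompt_query_cond_3.csv"

for path in (qa_file, cap_file):
    df = pd.read_csv(path)
    print(path, "->", len(df), "rows; columns:", df.columns.tolist())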

🤝 Acknowledgements

Citation

@article{gou2025perceptual,
  author    = {Gou, Yunhao and Chen, Kai and Liu, Zhili and Hong, Lanqing and Jin, Xin and Li, Zhenguo and Kwok, James T. and Zhang, Yu}, 
  title     = {Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning},
  journal   = {arXiv preprint arXiv:2506.04559},
  year      = {2025},
}
