Visual Spatial Tuning

Paper | Project Page | Weights

We introduce Visual Spatial Tuning (VST), a comprehensive framework designed to cultivate Vision-Language Models (VLMs) with human-like visuospatial abilities—from spatial perception to advanced reasoning.

Teaser Image


🔥 News

  • Training code has been updated and verified; please see Train. Training is very efficient thanks to data packing (a brief sketch of the idea follows).
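
For context, data packing concatenates several short samples into one training sequence so that little compute is wasted on padding. A minimal sketch of the idea in plain Python (the function name and max_len value are illustrative assumptions, not the repo's actual implementation):

# Illustrative greedy packing: group tokenized samples into packs of at most
# `max_len` tokens so each batch carries little padding. Names are hypothetical.
def pack_samples(samples, max_len=8192):
    """samples: list of token-id lists; returns a list of packed groups."""
    packs, current, current_len = [], [], 0
    for ids in sorted(samples, key=len, reverse=True):
        if current and current_len + len(ids) > max_len:
            packs.append(current)
            current, current_len = [], 0
        current.append(ids)
        current_len += len(ids)
    if current:
        packs.append(current)
    return packs

# Example: three samples of 5k/4k/3k tokens fit into two 8k packs instead of
# three padded sequences.
packs = pack_samples([[0] * 3000, [1] * 4000, [2] * 5000], max_len=8192)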

💡 Key Highlights

  • VST-P: 4.1M samples across 19 skills, spanning single images, multi-image scenarios, and videos—boosting spatial perception in VLMs.
  • VST-R: 135K curated samples that teach models to reason in space, including step-by-step reasoning and rule-based data for reinforcement learning.
  • Progressive Training Pipeline: Start with supervised fine-tuning to build foundational spatial perception, then reinforce spatial reasoning abilities via RL. VST achieves state-of-the-art results on spatial benchmarks (34.8% on MMSI-Bench, 61.2% on VSIBench) without compromising general capabilities.
  • Vision-Language-Action Models Enhanced: The VST paradigm significantly strengthens robotic learning.


📊 Dataset Overview

Dataset Image

🖼️ VST-Perception (VST-P)

  • 4.1M samples across 19 tasks for supervised fine-tuning.
  • Covers three primary vision scenarios: single-image, multi-image, and video.
  • VLMs tuned on VST-P show strong improvements in spatial perception:
    • ~20% boost on CVBench-3D
    • ~5% increase on BLINK
    • ~16% gain on VSIBench

🧠 VST-Reasoning (VST-R)

  • 135K samples, split into:
    • Reasoning steps (CoT): Teach models how to reason spatially.
    • Rule-checkable data: Used in online RL to further enhance reasoning skills (a minimal example of such a rule check is sketched after this list).
  • VLMs tuned on VST-R demonstrate:
    • 8.9% improvement on MMSI-Bench
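
To make "rule-checkable" concrete, below is a minimal sketch of the kind of verifiable reward such data enables during online RL. The answer format assumes the <think> ... </think> prompt from the cookbook below; the function name and numeric tolerance are illustrative assumptions, not the actual reward used in training.

import re

def rule_based_reward(response: str, ground_truth: str, rel_tol: float = 0.10) -> float:
    """Strip the <think>...</think> trace, then check the final answer against a verifiable label."""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    try:
        # Numeric answers (e.g. distances or sizes): accept within a relative tolerance.
        pred, gt = float(answer), float(ground_truth)
        return 1.0 if abs(pred - gt) <= rel_tol * abs(gt) else 0.0
    except ValueError:
        # Categorical answers (e.g. multiple choice, left/right): exact match after normalization.
        return 1.0 if answer.lower() == ground_truth.lower() else 0.0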

🏷️ Model Card

Model Name  🤗 HuggingFace
VST-3B-SFT  rayruiyang/VST-3B-SFT
VST-3B-RL   rayruiyang/VST-3B-RL
VST-7B-SFT  rayruiyang/VST-7B-SFT
VST-7B-RL   rayruiyang/VST-7B-RL
Click to see performance 📈

📈 Spatial & General Benchmarks

Models      CV3D  SR    MMSI  BLINK  VSI   MMStar  MMB   RealworldQA  MMMU  OCRB  AI2D
VST-3B-SFT  84.4  54.1  30.2  59.1   57.9  58.0    80.9  68.4         45.2  83.7  82.5
VST-3B-RL   84.2  56.5  31.3  57.2   57.7  58.9    80.5  68.5         49.8  80.9  82.4
VST-7B-SFT  85.5  54.6  32.0  62.1   60.6  63.1    83.3  72.2         50.6  85.5  84.9
VST-7B-RL   86.5  60.1  34.8  62.6   61.2  63.5    83.0  68.5         49.4  86.1  83.5

📈 VSIBench

Methods     Avg.  Obj. Count  Abs. Dist.  Obj. Size  Room Size  Rel. Dist.  Rel. Dir.  Route Plan  Appr. Order
VST-3B-SFT  57.9  69.3        45.4        71.8       62.4       59.0        46.0       38.7        70.2
VST-3B-RL   57.7  66.6        45.0        72.8       60.9       59.9        47.6       40.7        68.3
VST-7B-SFT  60.6  72.0        44.4        74.3       68.3       59.7        55.8       44.9        65.2
VST-7B-RL   61.2  71.6        43.8        75.5       69.2       60.0        55.6       44.3        69.2

📈 SUN RGBD 3D Object Detection

Methods             AP@15
Seed1.5-VL          33.5
Gemini-2.0-Pro      32.5
Gemini Robotics-ER  48.3
VST-3B-SFT          37.3
VST-3B-RL           40.1
VST-7B-SFT          41.6
VST-7B-RL           44.2

⚡ Getting Started

pip install transformers
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils

Cookbook

Using 🤗 Transformers to Chat

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

THINK_SYSTEM_PROMPT = "You are a helpful assistant. You should first think about the reasoning process in the mind and then provide the user with the answer. The reasoning process is enclosed within <think> </think> tags, i.e. <think> reasoning process here </think> answer here."
think_mesg = {
                "role": "system",
                "content": [{"type": "text", "text": THINK_SYSTEM_PROMPT}],
            }

enable_thinking = False

model_path = "rayruiyang/VST-7B-RL"

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained(model_path, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/train2017/000000075668.jpg",
            },
            {"type": "text", "text": "Consider the real-world 3D locations of the objects. Is the 'no motorcycle' sign directly above the red bus?"},
        ],
    }
]

if enable_thinking:
    messages.insert(0, think_mesg)


# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
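
Video inputs follow the same pattern; only the message content changes. A minimal sketch (the video path, fps, and max_pixels values below are placeholders; the message schema follows qwen_vl_utils conventions):

# Video chat: swap the image entry for a video entry; apply_chat_template,
# process_vision_info, and generate are used exactly as above.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/your_video.mp4",  # placeholder path
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "How many chairs are in the room, and which one is closest to the door?"},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Decode the generated ids exactly as in the image example above.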

Train

git clone https://github.com/Yangr116/VST
cd VST
# install veomni
git clone -b v0.1.3 https://github.com/ByteDance-Seed/VeOmni.git third_party/VeOmni
cd third_party/VeOmni
pip install -e .
# install requirements
cd ../..
pip install -r requirements.txt
# install flash-attn (recommended)
pip install flash-attn --no-build-isolation

NOTE: we use torch 2.5.1+cu124; other torch versions should also work.

Please follow docs/train.md to prepare data and train models.

Evaluation

Please see docs/evaluation.md.

📜 License

This project is licensed under the Apache License. See the LICENSE file for details.

The VST-3B model is fine-tuned from Qwen2.5VL-3B, so it is also subject to the Qwen2.5VL-3B LICENSE.

Acknowledgement

Thanks to the following projects: Qwen2.5VL, VeOmni, EasyR1, and VLMEvalKit.

If you find VST useful for your research or applications, please ⭐ star the repo or cite our work:

@article{vst,
  title={Visual Spatial Tuning},
  author={Rui Yang and Ziyu Zhu and Yanwei Li and Jingjia Huang and Shen Yan and Siyuan Zhou and Zhe Liu and Xiangtai Li and Shuangye Li and Wenqian Wang and Yi Lin and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2511.05491},
  year={2025}
}