Benchmark and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models

A frontier collection and survey of vision-language model papers, models, and GitHub repositories
🧭 The Evolution of VLM Architectures
VLM design has gone through three distinct architectural eras in just five years — and Era 3 has split into two parallel branches. Early models bridged a frozen vision encoder to a frozen language model with a learnable connector (CLIP, BLIP, Flamingo). The 2023–2025 generation made a pretrained LLM the trunk and treated vision as a bolt-on adapter (LLaVA, Qwen2.5-VL, GPT-4V). The latest 2025–2026 generation drops the bridge entirely and trains a single transformer from scratch on mixed-modality data — but it forks along the output axis:
- Era 3a: Native Multimodal Input → Text Out. Image, video, and (sometimes) audio enter a single early-fused token stream, but generation is still autoregressive text. This is the design used by today's general-purpose flagships: Qwen3.5 / Qwen3.6, Gemma 4, Gemini 3, GPT-5.4, Phi-4-Reasoning-Vision, Claude Opus 4.6.
- Era 3b: Omni-Modal Unified I/O. The same fused trunk plus dedicated image decoder (VAE / MMDiT / flow-matching) and/or audio codec decoder heads, so the model can also generate images and speech. This is the design used by unified models: BAGEL, Qwen3.5-Omni, InternVL-U, Emu3 / Emu3.5, ERNIE 5.0, DeepSeek-Janus-Pro.
Reading the diagram (left → right). Era 1 uses a two-tower design with a learnable cross-modal bridge (e.g. Q-Former) into a frozen LM, with text-only output. Era 2 puts a pretrained LLM at the center; an MLP/Resampler projects visual features into the LLM's token-embedding space, and the LLM does all the reasoning, still with text-only output. Era 3a drops the bridge: image, video, audio, and text share a single tokenizer/embedding space and flow through one transformer trained from scratch, but the output is still autoregressive text. Era 3b keeps that fused trunk and adds decoder heads (image VAE/MMDiT, audio codec) so the model can natively output text, image, and/or speech. Era 3a and Era 3b coexist; the choice is essentially "how much do you want non-text generation?"
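To make the Era 2 "bolt-on adapter" idea concrete, here is a minimal, dependency-free sketch of a two-layer MLP connector that maps vision-encoder patch features to LLM-width embeddings and prepends them to the text embeddings. The dimensions, the random weights, and the helper names (`make_linear`, `project_patches`) are illustrative assumptions, not the implementation of any model listed here.

```python
import math
import random

VISION_DIM = 8   # hypothetical vision-encoder feature width
LLM_DIM = 16     # hypothetical LLM embedding width


def make_linear(in_dim, out_dim, rng):
    """Create a dense layer as (weights, bias) with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    """Apply a dense layer to one feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def project_patches(patches, layer1, layer2):
    """Map each patch vector (VISION_DIM) to an LLM-width vector (LLM_DIM)."""
    return [linear([gelu(h) for h in linear(p, layer1)], layer2) for p in patches]


rng = random.Random(0)
layer1 = make_linear(VISION_DIM, LLM_DIM, rng)
layer2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Stand-ins for real encoder outputs and real text-token embeddings.
patches = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeds = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]

# Era 2 style: projected visual tokens are prepended to the text sequence,
# and the (unchanged) LLM consumes the combined sequence.
visual_embeds = project_patches(patches, layer1, layer2)
llm_input = visual_embeds + text_embeds
print(len(llm_input), len(llm_input[0]))  # 7 16
```

In an Era 3a design there would be no separate connector: all modalities would be embedded by the same jointly trained model from the start.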
Below we compile papers, models, and GitHub repositories that cover:
- State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we'll keep adding new models and benchmarks).
- Evaluate: VLM benchmarks with links to the corresponding works.
- Post-training/Alignment: the newest work on VLM alignment, including RL and SFT.
- Applications: applications of VLMs in embodied AI, robotics, etc.
- Contribute: surveys, perspectives, and datasets on the above topics.
Progressive research reports
We track new VLMs, benchmarks, and post-training methods that haven't yet been folded into the main tables in dated mini-surveys:
- 📰 2026-06-02 (latest): Mamoda2.5 (AR-Diffusion DiT-MoE, 95.9× faster editing), VLM3 (native 3D learners), AlphaGRPO (RL for unified-model generation), Stage-wise Preference Optimization, FastOCR / WindowQuant (KV-cache efficiency), Fast-dDrive / CLOVER / CoWorld-VLA (driving VLA), Lost in Fog (reasoning-as-safety-signal), LiteGUI (SFT-free GUI agents), Health-Conditioned VLA, POLAR, TOC-Bench / VGenST-Bench (video), HalluCXR (medical). 16 new entries since May 16.
- 📰 2026-05-16: LensVLM (Apple), Nemotron 3 Nano Omni (NVIDIA), LLaDA2.0-Uni, PLaMo 2.1-VL, S-GRPO / Faithful GRPO / GRPO-TTA / OpenSearch-VL, MindVLA-U1 (surpasses human driving), VLADriver-RAG, Green-VLA, Anticipation-VLA, VLA Foundry, LAMO, ScreenExplorer, VideoZeroBench, Video-Oasis, MedThinkVQA, data curation at 87× less compute. 34 new entries since April 28.
- 📰 2026-04-28: Qwen3.6-27B & Qwen3.6-35B-A3B, Claude Mythos (gated), S1-VL, GLM-5V-Turbo, FreshPER / GMPO / ARPO / GRPO-VPS, QUOTA, Fast-dVLM, VLA-World, SpanVLA, VLA-Forget, R-VLM, UILoop, WebForge, WorldMark, Video-MME-v2, CrossMath, BabyVision, SlowBA. 30 new entries since April 13.
- 📰 2026-04-13: LFM2.5-VL-450M, EXAONE 4.5, Gemma 4, Granite 4.0 3B Vision, InternVL-U, GLM-4.6V, Vero, MolmoWeb, UniDriveVLA, QAPruner, Firebolt-VL, CoME-VL, and more.
- 📰 2026-03-25: GPT-5.4, Phi-4-Reasoning-Vision-15B, Gemini 3.0, Qwen3.5, Claude Opus 4.6, Molmo2, and more.
Welcome to contribute and discuss!
🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
@InProceedings{Li_2025_CVPR,
author = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao},
title = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2025},
pages = {1587-1606}
}

| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
|---|---|---|---|---|---|---|
| LensVLM (Apple / Duke) | 05/07/2026 | Decoder-only | Rendered text-as-image + selective context expansion | 9B | Rendered-image ViT | Qwen3.5-9B-Base |
| Nemotron 3 Nano Omni (NVIDIA) | 04/28/2026 | Hybrid MoE (omni-modal: vision + audio + text) | Vision + audio + text joint training | 30B total · 3B active | Dynamic-res ViT + Conv3D temporal | Nemotron 3 |
| LLaDA2.0-Uni (Huawei / InclusionAI) | 04/28/2026 | Discrete diffusion LLM (dLLM) + MoE | Multimodal understanding + generation | MoE-based | SigLIP-VQ semantic tokenizer | LLaDA2.0 |
| PLaMo 2.1-VL (Preferred Networks) | 04/21/2026 | Decoder-only | Japanese-focused multimodal | 2B / 8B | ViT | PLaMo 2.1 |
| Qwen3.6-27B (Alibaba) | 04/22/2026 | Decoder-only / natively multimodal input (thinking + non-thinking) | Multimodal pretraining + agentic mid-training | 27B dense | Native multimodal ViT | Qwen3.6 |
| Qwen3.6-35B-A3B (Alibaba) | 04/15/2026 | MoE / natively multimodal input | Multimodal pretraining + agentic SFT/RL | 35B total · 3B active | Native multimodal ViT | Qwen3.6 |
| LFM2.5-VL-450M (Liquid AI) | 04/11/2026 | Liquid Foundation Model | Undisclosed | 450M | Non-overlapping tile ViT | LFM2.5 |
| EXAONE 4.5 (LG AI Research) | 04/09/2026 | Unified VL | Undisclosed | 33B | Proprietary vision encoder | EXAONE 4.5 |
| Claude Mythos (Anthropic, gated preview) | 04/07/2026 | Decoder-only (frontier; Project Glasswing gated) | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Gemma 4 (Google) | 04/02/2026 | Decoder-only / MoE | Undisclosed (140+ languages) | E2B / E4B / 26B MoE / 31B Dense | Native multimodal | Gemini 3 |
| Granite 4.0 3B Vision (IBM) | 04/01/2026 | Decoder-only | Enterprise document corpora | 3B | Undisclosed | Granite 4.0 |
| GLM-5V-Turbo (Zhipu / Z.AI) | 04/01/2026 | Natively multimodal (vision-coding) with Multi-Token Prediction | 30+ task joint RL | Undisclosed | CogViT | GLM-5 |
| InternVL-U (Shanghai AI Lab) | 03/10/2026 | Unified (MLLM + MMDiT) | Multimodal understanding + generation | 4B | InternViT | InternVL |
| GPT-5.4 / GPT-5.4 Thinking (OpenAI) | 03/06/2026 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Phi-4-Reasoning-Vision-15B (Microsoft) | 03/04/2026 | Decoder-only | Curated synthetic + filtered data | 15B | High-res dynamic-resolution ViT | Phi-4 |
| Gemini 3.0 (Google) | 03/2026 | Unified Model | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Qwen3.5 (Alibaba) | 02/16/2026 | Unified VL (early fusion) | Trillions of multimodal tokens | 0.8B–397B (MoE, 17B active) | ViT (native) | Qwen3.5 |
| Claude Opus 4.6 (Anthropic) | 02/2026 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| ERNIE 5.0 (Baidu) | 02/05/2026 | Unified Model (Visual, Text, Audio) | Unified Modality Dataset | - | CNN–ViT (Understanding) / Next-Frame-and-Scale Prediction (Generation) | Unified Autoregressive Transformer |
| Molmo2 (Allen AI) | 01/15/2026 | Decoder-only | 7 new video + 2 multi-image datasets (9.19M videos) | 4B / 7B / 8B | Bi-directional attention ViT | Qwen 3 / OLMo |
| Gemini 3 | 11/18/2025 | Unified Model | Undisclosed | - | - | - |
| Emu3.5 | 10/30/2025 | Decoder-only | Unified Modality Dataset | - | SigLIP | Qwen3 |
| DeepSeek-OCR | 10/20/2025 | Encoder-Decoder | 70% OCR, 20% general vision, 10% text-only | 3B | DeepEncoder | DeepSeek-3B |
| Qwen3-VL | 10/11/2025 | Decoder-only | - | 8B/4B | ViT | Qwen3 |
| Qwen3-VL-MoE | 09/25/2025 | Decoder-only | - | 235B-A22B | ViT | Qwen3 |
| Qwen3-Omni (Visual/Audio/Text) | 09/21/2025 | - | Video/Audio/Image | 30B | ViT | Qwen3-Omni-MoE-Thinker |
| LLaVA-Onevision-1.5 | 09/15/2025 | - | Mid-Training-85M & SFT | 8B | Qwen2VLImageProcessor | Qwen3 |
| InternVL3.5 | 08/25/2025 | Decoder-only | multimodal & text-only | 30B/38B/241B | InternViT-300M/6B | Qwen3 / GPT-OSS |
| SkyWork-Unipic-1.5B | 07/29/2025 | - | image/video.. | - | - | - |
| Grok 4 | 07/09/2025 | - | image/video.. | 1-2 Trillion | - | - |
| Kwai Keye-VL (Kuaishou) | 07/02/2025 | Decoder-only | image/video.. | 8B | ViT | Qwen3-8B |
| OmniGen2 | 06/23/2025 | Decoder-only & VAE | LLaVA-OneVision/ SAM-LLaVA.. | - | ViT | Qwen2.5-VL |
| Gemini-2.5-Pro | 06/17/2025 | - | - | - | - | - |
| GPT-o3/o4-mini | 06/10/2025 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Mimo-VL (Xiaomi) | 06/04/2025 | Decoder-only | 24 Trillion MLLM tokens | 7B | Qwen2.5-ViT | Mimo-7B-base |
| BAGEL (Bytedance) | 05/20/2025 | Unified Model | Video/Image/Text | 7B | [SigLIP2-so400m/14](https://arxiv.org/abs/2502.14786) | Qwen2.5 |
| BLIP3-o | 05/14/2025 | Decoder-only | (BLIP3-o 60K) GPT-4o Generated Image Generation Data | 4/8B | ViT | Qwen2.5-VL |
| InternVL-3 | 04/14/2025 | Decoder-only | 200 Billion Tokens | 1/2/8/9/14/38/78B | ViT-300M/6B | InternLM2.5/Qwen2.5 |
| LLaMA4-Scout/Maverick | 04/04/2025 | Decoder-only | 40/20 Trillion Tokens | 17B | MetaClip | LLaMA4 |
| Qwen2.5-Omni | 03/26/2025 | Decoder-only | Video/Audio/Image/Text | 7B | Qwen2-Audio/Qwen2.5-VL ViT | End-to-End Mini-Omni |
| Qwen2.5-VL | 01/28/2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
| GLM-4.6V (Zhipu / Z.AI) | 12/2025 | Decoder-only | Undisclosed | 106B / 9B (Flash) | Undisclosed | GLM-4.6 |
| Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
| Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
| SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
| DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-Janus-Pro |
| Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
| xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |

2. 🗂️ Benchmarks and Evaluation
2.1. Datasets for Training VLMs

| Dataset | Task | Size |
|---|---|---|
| 20/20 Vision: Data Curation (DatologyAI) (05/2026) | Data Curation for VLM Training | +11.7pp avg improvement; 17–87× less compute |
| MolmoWebMix (Allen AI) (04/2026) | Web Agent Training Trajectories | 100K+ synthetic + 30K human demos |
| Vero-600K (04/2026) | Broad Visual Reasoning RL Training | 600K samples from 59 datasets, 6 task categories |
| BigEarthNet.txt (03/2026) | Multi-sensor Earth Observation Image-Text | 464K images, 9.6M text annotations |
| OmniScience (02/2026) | Scientific Image Understanding | 1.5M figure-caption-context triplets |
| MaD-Mix (02/2026) | Multi-modal Data Mixture Optimization | Framework (0.5B–7B scale) |
| OVID (2026) | Open Video Pre-training | 10M hours, 300M frame-caption pairs |
| Molmo2 Video Datasets (01/2026) | Video Captions, QA, Tracking, Pointing | 9.19M videos (7 video + 2 multi-image datasets) |
| MMFineReason (01/30/2026) | Reasoning | 1.8M |
| FineVision (09/04/2025) | Mixed Domain | 24.3M / 4.48 TB |

2.2. Datasets and Evaluation for VLM
🧮 Visual Math (+ Visual Math Reasoning)

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MathVision | Visual Math | MC / Answer Match | Human | 3.04 | Repo |
| MathVista | Visual Math | MC / Answer Match | Human | 6 | Repo |
| MathVerse | Visual Math | MC | Human | 4.6 | Repo |
| VisNumBench | Visual Number Reasoning | MC | Python Program generated/Web Collection/Real life photos | 1.91 | Repo |

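The "MC" and "Answer Match" entries in the Eval Protocol column can be illustrated with a small scoring sketch. This is a simplified assumption of how such protocols are commonly scored, not the official evaluation script of any benchmark listed here; the extraction rule (take the last standalone letter A–E) and the normalization steps are deliberately naive.

```python
import re


def extract_choice(response):
    """Return the last standalone option letter (A-E) in a response, or None.

    Naive on purpose: taking the last match avoids treating a leading
    article "A" as the answer, but real harnesses use stricter parsing.
    """
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None


def normalize(text):
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    text = text.strip().lower().rstrip(".")
    return re.sub(r"\s+", " ", text)


def answer_match(prediction, gold, tol=1e-6):
    """Compare numerically when both sides parse as numbers, else as strings."""
    p, g = normalize(prediction), normalize(gold)
    try:
        return abs(float(p) - float(g)) <= tol
    except ValueError:
        return p == g


def score(items):
    """Accuracy over items tagged with protocol 'mc' or 'match'."""
    correct = 0
    for item in items:
        if item["protocol"] == "mc":
            correct += extract_choice(item["response"]) == item["gold"]
        else:
            correct += answer_match(item["response"], item["gold"])
    return correct / len(items)


# Hypothetical model responses, for illustration only.
items = [
    {"protocol": "mc", "response": "The answer is B.", "gold": "B"},
    {"protocol": "mc", "response": "A triangle has three sides, so (C)", "gold": "C"},
    {"protocol": "match", "response": " 3.50 ", "gold": "3.5"},
    {"protocol": "match", "response": "Paris", "gold": "paris."},
    {"protocol": "match", "response": "7", "gold": "8"},
]
print(score(items))  # 0.8
```

Benchmarks marked "LLM Eval" replace the string comparison with a judge model, which this sketch does not cover.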
💬 Benchmark for Unified Models

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| ROVER | Reciprocal Cross-Modal Reasoning | Visual Gen + Verbal Gen Eval | Human | 1.3 (1,876 images) | Paper |
| RealUnify | Math, World knowledge, Image Gen | Direct & StepWise Eval (Sec 3.3) | Script & Human verification | 1.0 | Repo |
| Uni-MMMU | Science, Code, Image Gen | DreamSim (Image Gen Eval) & String Matching (Understanding Eval) | - | 1.0 | Repo |


| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VideoZeroBench | Spatio-temporal Evidence Verification for Long-Video QA | 5-level progressive evidence tightening | Human | 0.5 (500 questions, 13 domains) | Paper |
| Video-Oasis | Diagnostic Meta-benchmark for Video Understanding | Measures % solvable without visual/temporal context | Meta-analysis | - | Paper |
| LoVR | Long Video Retrieval in Multimodal Contexts | Retrieval accuracy | - | - | Paper |
| MMOU | Omni-modal Long Video Understanding | MC | Human | 15 (9,038 videos) | Paper |
| Video-MMMU | Knowledge Acquisition from Professional Videos | MC + Knowledge Gain | Expert | 0.9 (300 videos) | Paper |
| MMVU | Expert-Level Multi-Discipline Video Understanding | MC | Expert | 3 (27 subjects) | Paper |
| VideoHallu | Video Understanding | LLM Eval | Human | 3.2 | Repo |
| Video SimpleQA | Video Understanding | LLM Eval | Human | 2.03 | Repo |
| MovieChat | Video Understanding | LLM Eval | Human | 1 | Repo |
| Perception‑Test | Video Understanding | MC | Crowd | 11.6 | Repo |
| VideoMME | Video Understanding | MC | Experts | 2.7 | Site |
| EgoSchema | Video Understanding | MC | Synth / Human | 5 | Site |
| Inst‑IT‑Bench | Fine‑grained Image & Video | MC & LLM | Human / Synth | 2 | Repo |

💬 Multimodal Conversation

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VisionArena | Multimodal Conversation | Pairwise Pref | Human | 23 | Repo |

🧠 Multimodal General Intelligence

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| OmniEarth | Geospatial / Remote Sensing VLM Eval | MC + Open VQA | Human (verified) | 44.2 (9,275 images, 28 tasks) | Paper |
| MultiHaystack | Multimodal Retrieval & Reasoning | Retrieval + QA | Human | 0.75 (46K+ candidates) | Paper |
| DatBench | Discriminative, Faithful VLM Eval | MC (format-aware) | Synth | - | Paper |
| MMLU | General MM | MC | Human | 15.9 | Repo |
| MMStar | General MM | MC | Human | 1.5 | Site |
| NaturalBench | General MM | Yes/No, MC | Human | 10 | HF |
| PHYSBENCH | Visual Math Reasoning | MC | Grad STEM | 0.10 | Repo |

🔎 Visual Reasoning / VQA (+ Multilingual & OCR)

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| EMMA | Visual Reasoning | MC | Human + Synth | 2.8 | Repo |
| MMTBENCH | Visual Reasoning & QA | MC | AI Experts | 30.1 | Repo |
| MM‑Vet | OCR / Visual Reasoning | LLM Eval | Human | 0.2 | Repo |
| MM‑En/CN | Multilingual MM Understanding | MC | Human | 3.2 | Repo |
| GQA | Visual Reasoning & QA | Answer Match | Seed + Synth | 22 | Site |
| VCR | Visual Reasoning & QA | MC | MTurks | 290 | Site |
| VQAv2 | Visual Reasoning & QA | Yes/No, Ans Match | MTurks | 1100 | Repo |
| MMMU | Visual Reasoning & QA | Ans Match, MC | College | 11.5 | Site |
| MMMU-Pro | Visual Reasoning & QA | Ans Match, MC | College | 5.19 | Site |
| R1‑Onevision | Visual Reasoning & QA | MC | Human | 155 | Repo |
| VLM²‑Bench | Visual Reasoning & QA | Ans Match, MC | Human | 3 | Site |
| VisualWebInstruct | Visual Reasoning & QA | LLM Eval | Web | 0.9 | Site |

📝 Visual Text / Document Understanding (+ Charts)

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| TableVision | Spatially Grounded Table Reasoning | 3-level Cognitive Eval | Human | 6.8 (13 sub-categories) | Paper |
| TextVQA | Visual Text Understanding | Ans Match | Expert | 28.6 | Repo |
| DocVQA | Document VQA | Ans Match | Crowd | 50 | Site |
| ChartQA | Chart Graphic Understanding | Ans Match | Crowd / Synth | 32.7 | Repo |

🌄 Text‑to‑Image Generation

| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MSCOCO‑30K | Text‑to‑Image | BLEU, ROUGE, Sim | MTurks | 30 | Site |
| GenAI‑Bench | Text‑to‑Image | Human Rating | Human | 80 | HF |

🚨 Hallucination Detection / Control
2.3. Benchmark Datasets, Simulators, and Generative Models for Embodied VLM

| Benchmark | Domain | Type | Project |
|---|---|---|---|
| Drive-Bench | Embodied AI | Autonomous Driving | Website |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
| AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
| ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
| ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| WebArena | Web Agent | Simulator | Website, Github Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
| Genesis | Embodied AI | Generative Model, World Model | Github Repo |
| EMMOE | Embodied AI | Generative Model, World Model | Paper |
| RoboGen | Embodied AI | Generative Model, World Model | Website |
| UnrealZoo | Embodied AI (Tracking, Navigation, Multi Agent) | Simulator | Website |

3.1. RL Alignment for VLM

| Title | Year | Paper | RL | Code |
|---|---|---|---|---|
| OpenSearch-VL: Multi-Turn Fatal-Aware GRPO for Multimodal Search Agents | 05/07/2026 | Paper | Fatal-aware GRPO; handles tool-call failures in agentic multi-turn RL | - |
| GRPO-TTA: Test-Time Visual Tuning via GRPO-Driven RL | 05/05/2026 | Paper | GRPO for test-time visual encoder tuning; no ground-truth labels needed | - |
| S-GRPO: Unified Post-Training for Large VLMs | 04/2026 | Paper | Supervised GRPO; injects ground-truth trajectories to solve cold-start | - |
| Faithful GRPO (FGRPO): Constrained Policy Optimization for Visual Spatial Reasoning | 04/2026 | Paper | Lagrangian-constrained GRPO; inconsistency 24.5% → 1.7% | - |
| Vero: An Open RL Recipe for General Visual Reasoning | 04/2026 | Paper | Task-routed rewards; GRPO-based | Code |
| wDPO: Winsorized Direct Preference Optimization for Robust Alignment | 03/2026 | Paper | wDPO | - |
| f-GRPO and Beyond: Divergence-Based RL for General LLM Alignment | 02/2026 | Paper | f-GRPO / f-HAL | - |
| From Sight to Insight: Improving Visual Reasoning of MLLMs via Reinforcement Learning | 01/2026 | Paper | GRPO (6 reward functions) | - |
| SaFeR-VLM: Safety-Aware Reinforcement Learning for Multimodal Reasoning | 2026 (ICLR) | Paper | GRPO + safety reward | - |
| SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning | 11/2025 | Paper | Dual-Reward (Thinking + Judging) | - |
| GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA | 10/2025 | Paper | GIFT (convex MSE loss) | - |
| Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning | 10/12/2025 | Paper | GRPO | - |
| Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 09/29/2025 | Paper | GRPO | - |
| Vision-SR1: Self-rewarding vision-language model via reasoning decomposition | 08/26/2025 | Paper | GRPO | - |
| Group Sequence Policy Optimization | 06/24/2025 | Paper | GSPO | - |
| Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | 05/20/2025 | Paper | GRPO | - |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | 2025/04/10 | Paper | GRPO | Code |
| OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | 2025/03/21 | Paper | GRPO | Code |
| Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | 2025/03/10 | Paper | GRPO | Code |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025 | Paper | DPO | Code |
| Multimodal Open R1/R1-Multimodal-Journey | 2025 | - | GRPO | Code |
| R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025 | Paper | GRPO | Code |
| Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning | 2025 | - | PPO/REINFORCE++/GRPO | Code |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
| Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
| All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning | 2025 | Paper | Online RL | - |
| Video-R1: Reinforcing Video Reasoning in MLLMs | 2025 | Paper | GRPO | Code |

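Since GRPO appears in most rows of the RL column, a short sketch of its core step may help: several responses are sampled for the same prompt, and each response's advantage is its reward standardized against the group, so no learned value network is needed. The function name, the `eps` constant, and the use of population standard deviation are assumptions for illustration; the papers listed here add their own variations (constraints, extra rewards, clipping) on top.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption here) to avoid division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards (1.0 = correct answer) for four samples.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct responses get positive advantages and incorrect ones negative.
print([round(a, 3) for a in advantages])

# A group with identical rewards yields zero advantage for every response,
# so it contributes no learning signal.
print(group_relative_advantages([1.0, 1.0]))
```

In a full training loop these advantages would weight the policy-gradient term for every token of the corresponding response.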

| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Why Does RL Generalize Better Than SFT? A Data-Centric Perspective (DC-SFT) | 02/2026 | Paper | - | - |
| The Synergy Dilemma of Long-CoT SFT and RL | 2026 (TMLR) | Paper | - | - |
| Layer-wise Analysis of Supervised Fine-Tuning | 04/2026 | Paper | - | - |
| AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of VLMs | 2026/03 | Paper | - | - |
| CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models | 2026/03 | Paper | - | - |
| MERGETUNE: Continued Fine-Tuning of Vision-Language Models | 2026/01 (ICLR 2026) | Paper | - | - |
| Mask Fine-Tuning (MFT): Unlocking Hidden Capabilities in Vision-Language Models | 2025/12 | Paper | - | - |
| Image-LoRA: Towards Minimal Fine-Tuning of VLMs | 2025/12 | Paper | - | - |
| Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning | 2025/12 | Paper | - | - |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | 2025/04/21 | Paper | Website | Code |
| OMNICAPTIONER: One Captioner to Rule Them All | 2025/04/09 | Paper | Website | Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
| LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
| Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
| VisionArena: 230K Real World User-VLM Conversations with Preference Labels | 2024 | Paper | - | Code |

3.3. VLM Alignment github

| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models | 2026/03 | Paper | - | - |
| MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation | 2026/02 | Paper | - | - |
| Multimodal Prompt Optimizer (MPO): Joint Optimization of Multimodal Prompts | 2025/10 | Paper | - | - |
| Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies | 2025/03 | Paper | - | - |
| In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | 2025/04/30 | Paper | Website | Code |


| Title | Year | Paper Link |
|---|---|---|
| Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | 📄 Paper |
| Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
| ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
| Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
| VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method | 2025 | 📄 Paper |
| MP-GUI: Modality Perception with MLLMs for GUI Understanding | 2025 | 📄 Paper |

4.2. Generative Visual Media Applications

| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Spurious Correlation in Multimodal LLMs | 2025 | 📄 Paper | - | - |
| WeGen: A Unified Model for Interactive Multimodal Generation as We Chat | 2025 | 📄 Paper | - | 💾 Code |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | 2025 | 📄 Paper | 🌍 Website | 💾 Code |

4.3. Robotics and Embodied AI

| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | - |
| Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | - |
| Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌍 Website | - |
| VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | - |
| Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌍 Website | - |
| DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code |
| Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | - |
| Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌍 Website | - |
| Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| Integrated Task and Motion Planning | 2020 | 📄 Paper | - | - |
| Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌍 Website | - |
| Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌍 Website | - |
| LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | - |
| Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌍 Website | - |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code & Dataset |
| Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌍 Website | - |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | - |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Unified Video Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Anticipation-VLA: Long-Horizon Embodied Tasks via Anticipation-Based Subgoal Generation | 05/02/2026 | 📄 Paper | - | - |
| Green-VLA: Staged VLA for Generalist Robots | 05/2026 | 📄 Paper | 🌍 Website | - |
| VLA Foundry: Unified Framework for Training VLAs | 04/2026 | 📄 Paper | - | - |
| OmniVLA-RL: Spatial Understanding + Online RL for VLA | 04/2026 | 📄 Paper | - | - |
| DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation | 03/2026 | 📄 Paper | - | - |
| NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models | 03/2026 | 📄 Paper | - | - |
| Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control | 02/2026 | 📄 Paper | - | - |
| ST4VLA: Spatial Guided Training for Vision-Language-Action Models | 02/2026 | 📄 Paper | - | - |


| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌍 Website | - |
| Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | - |
| Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | - |
| RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌍 Website | - |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌍 Website | - |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌍 Website | - |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Masked World Models for Visual Control | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |

| Title |
Year |
Paper |
Website |
Code |
| ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings |
2022 |
📄 Paper |
- |
- |
| LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation |
2024 |
📄 Paper |
- |
- |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action |
2022 |
📄 Paper |
🌍 Website |
- |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation |
2024 |
📄 Paper |
🌍 Website |
- |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation |
2024 |
📄 Paper |
- |
- |
| Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning |
2023 |
📄 Paper |
🌍 Website |
- |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments |
2025 |
📄 Paper |
- |
- |
| Navigation World Models |
2024 |
📄 Paper |
🌍 Website |
- |
4.3.3. Human-robot Interaction
| Title |
Year |
Paper |
Website |
Code |
| MUTEX: Learning Unified Policies from Multimodal Task Specifications |
2023 |
📄 Paper |
🌍 Website |
- |
| LaMI: Large Language Models for Multi-Modal Human-Robot Interaction |
2024 |
📄 Paper |
🌍 Website |
- |
| VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models |
2024 |
📄 Paper |
- |
- |
4.3.4. Autonomous Driving
| Title |
Year |
Paper |
Website |
Code |
| MindVLA-U1: Unified Streaming VLA for Autonomous Driving (reported to surpass human-level driving) |
05/12/2026 |
📄 Paper |
- |
- |
| VLADriver-RAG: Retrieval-Augmented VLA for Autonomous Driving |
05/08/2026 |
📄 Paper |
- |
- |
| OneDrive: Unified Heterogeneous Decoding for Driving VLMs |
04/2026 |
📄 Paper |
- |
- |
| UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving |
04/2026 |
📄 Paper |
- |
- |
| AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving |
03/2026 |
📄 Paper |
- |
- |
| DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe Autonomous Driving |
03/2026 |
📄 Paper |
- |
- |
| HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving |
02/2026 |
📄 Paper |
- |
- |
| OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model |
03/2025 |
📄 Paper |
- |
- |
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives |
01/07/2025 |
📄 Paper |
🌍 Website |
- |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models |
2024 |
📄 Paper |
🌍 Website |
- |
| GPT-Driver: Learning to Drive with GPT |
2023 |
📄 Paper |
- |
- |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving |
2023 |
📄 Paper |
🌍 Website |
- |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving |
2023 |
📄 Paper |
- |
- |
| Referring Multi-Object Tracking |
2023 |
📄 Paper |
- |
💾 Code |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision |
2023 |
📄 Paper |
- |
💾 Code |
| MotionLM: Multi-Agent Motion Forecasting as Language Modeling |
2023 |
📄 Paper |
- |
- |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models |
2023 |
📄 Paper |
🌍 Website |
- |
| VLP: Vision Language Planning for Autonomous Driving |
2024 |
📄 Paper |
- |
- |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model |
2023 |
📄 Paper |
- |
- |
| Title |
Year |
Paper |
Website |
Code |
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis |
2024 |
📄 Paper |
- |
💾 Code |
| LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application |
2024 |
📄 Paper |
- |
- |
| Pretrained Language Models as Visual Planners for Human Assistance |
2023 |
📄 Paper |
- |
- |
| Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research |
2024 |
📄 Paper |
- |
- |
| Image and Data Mining in Reticular Chemistry Using GPT-4V |
2023 |
📄 Paper |
- |
- |
| Title |
Year |
Paper |
Website |
Code |
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis |
2023 |
📄 Paper |
- |
- |
| CogAgent: A Visual Language Model for GUI Agents |
2023 |
📄 Paper |
- |
💾 Code |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models |
2024 |
📄 Paper |
- |
💾 Code |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent |
2024 |
📄 Paper |
- |
💾 Code |
| ScreenAgent: A Vision Language Model-driven Computer Control Agent |
2024 |
📄 Paper |
- |
💾 Code |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation |
2024 |
📄 Paper |
- |
💾 Code |
| LAMO: Scalable Lightweight GUI Agents via Multi-Role Orchestration |
04/2026 |
📄 Paper |
- |
- |
| ScreenExplorer: Autonomous GUI Exploration via Curiosity-Driven VLM Agents |
2026 (ICLR) |
📄 Paper |
- |
- |
| InfiGUIAgent: Generalist GUI Agent with Native Reasoning and Reflection |
2026 (EACL) |
📄 Paper |
- |
- |
| MolmoWeb: Open Visual Web Agent and Open Data for the Open Web |
04/2026 |
📄 Paper |
🌍 Website |
💾 Code |
| Title |
Year |
Paper |
Website |
Code |
| X-World: Accessibility, Vision, and Autonomy Meet |
2021 |
📄 Paper |
- |
- |
| Context-Aware Image Descriptions for Web Accessibility |
2024 |
📄 Paper |
- |
- |
| Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models |
2024 |
📄 Paper |
- |
- |
| Title |
Year |
Paper |
Website |
Code |
| Medical Thinking with Multiple Images (MedThinkVQA) |
04/2026 |
📄 Paper |
- |
- |
| MedVRAG: Iterative Multimodal RAG for Medical QA |
04/2026 |
📄 Paper |
- |
- |
| GMAI-VL: General Medical AI Vision-Language Model |
2026 (AAAI) |
📄 Paper |
- |
- |
| CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework |
03/2026 |
📄 Paper |
- |
- |
| MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images |
02/2026 |
📄 Paper |
- |
- |
| Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning |
12/2025 |
📄 Paper |
- |
💾 Code |
| Frontiers in Intelligent Colonoscopy |
02/2025 |
📄 Paper |
- |
💾 Code |
| VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge |
2024 |
📄 Paper |
- |
💾 Code |
| Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology |
2024 |
📄 Paper |
- |
- |
| M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization |
2023 |
📄 Paper |
- |
- |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text |
2022 |
📄 Paper |
- |
💾 Code |
| Med-Flamingo: A Multimodal Medical Few-Shot Learner |
2023 |
📄 Paper |
- |
💾 Code |
| Title |
Year |
Paper |
Website |
Code |
| Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy |
2024 |
📄 Paper |
- |
- |
| Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence |
2024 |
📄 Paper |
- |
- |
| Harnessing Large Vision and Language Models in Agriculture: A Review |
2024 |
📄 Paper |
- |
- |
| A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping |
2024 |
📄 Paper |
- |
- |
| Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models |
2024 |
📄 Paper |
- |
💾 Code |
| DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images |
2024 |
📄 Paper |
- |
- |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models |
2024 |
📄 Paper |
- |
💾 Code |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps |
2024 |
📄 Paper |
- |
💾 Code |
| He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation |
2021 |
📄 Paper |
- |
- |
| UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling |
2024 |
📄 Paper |
- |
- |
| Title |
Year |
Paper |
Website |
Code |
| MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs |
11/2025 |
📄 Paper |
🌍 ICML 2026 |
💾 Code |
| VL-Calibration: Decoupled Confidence Calibration for VLM Reasoning |
04/2026 |
📄 Paper |
- |
- |
| Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models |
04/2026 |
📄 Paper |
- |
- |
| VLMs Need Words: Vision Language Models Ignore Visual Detail in Favor of Semantic Anchors |
04/2026 |
📄 Paper |
- |
- |
| HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token |
03/2026 |
📄 Paper |
🌍 ACL |
- |
| Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs |
01/2026 |
📄 Paper |
- |
- |
| Object Hallucination in Image Captioning |
2018 |
📄 Paper |
- |
- |
| Evaluating Object Hallucination in Large Vision-Language Models |
2023 |
📄 Paper |
- |
💾 Code |
| Detecting and Preventing Hallucinations in Large Vision Language Models |
2023 |
📄 Paper |
- |
- |
| HallE-Control: Controlling Object Hallucination in Large Multimodal Models |
2023 |
📄 Paper |
- |
💾 Code |
| Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs |
2024 |
📄 Paper |
- |
💾 Code |
| BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models |
2024 |
📄 Paper |
🌍 Website |
- |
| HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models |
2023 |
📄 Paper |
- |
💾 Code |
| AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models |
2024 |
📄 Paper |
🌍 Website |
- |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning |
2023 |
📄 Paper |
- |
💾 Code |
| Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models |
2024 |
📄 Paper |
- |
💾 Code |
| AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation |
2023 |
📄 Paper |
- |
💾 Code |
| Title |
Year |
Paper |
Website |
Code |
| SaFeR-VLM: Safety into Multimodal Reasoning via Reinforcement Learning |
2026 (ICLR) |
📄 Paper |
- |
- |
| HoliSafe: Holistic Safety Evaluation for Vision-Language Models |
2026 (ICLR) |
📄 Paper |
- |
- |
| JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models |
2024 |
📄 Paper |
🌍 Website |
- |
| Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments |
2023 |
📄 Paper |
- |
- |
| SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models |
2024 |
📄 Paper |
- |
- |
| JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks |
2024 |
📄 Paper |
- |
- |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models |
2024 |
📄 Paper |
- |
💾 Code |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models |
2024 |
📄 Paper |
- |
- |
| Jailbreaking Attack against Multimodal Large Language Model |
2024 |
📄 Paper |
- |
- |
| Embodied Red Teaming for Auditing Robotic Foundation Models |
2025 |
📄 Paper |
🌍 Website |
💾 Code |
| Safety Guardrails for LLM-Enabled Robots |
2025 |
📄 Paper |
- |
- |
| Title |
Year |
Paper |
Website |
Code |
| Hallucination of Multimodal Large Language Models: A Survey |
2024 |
📄 Paper |
- |
- |
| Bias and Fairness in Large Language Models: A Survey |
2023 |
📄 Paper |
- |
- |
| Fairness and Bias in Multimodal AI: A Survey |
2024 |
📄 Paper |
- |
- |
| Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models |
2023 |
📄 Paper |
- |
- |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks |
2024 |
📄 Paper |
- |
- |
| FairCLIP: Harnessing Fairness in Vision-Language Learning |
2024 |
📄 Paper |
- |
- |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models |
2024 |
📄 Paper |
- |
- |
| Benchmarking Vision Language Models for Cultural Understanding |
2024 |
📄 Paper |
- |
- |
5.4.1 Multi-modality Alignment
| Title |
Year |
Paper |
Website |
Code |
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding |
2024 |
📄 Paper |
- |
- |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement |
2024 |
📄 Paper |
- |
- |
| Assessing and Learning Alignment of Unimodal Vision and Language Models |
2024 |
📄 Paper |
🌍 Website |
- |
| Extending Multi-modal Contrastive Representations |
2023 |
📄 Paper |
- |
💾 Code |
| OneLLM: One Framework to Align All Modalities with Language |
2023 |
📄 Paper |
- |
💾 Code |
| What You See is What You Read? Improving Text-Image Alignment Evaluation |
2023 |
📄 Paper |
🌍 Website |
💾 Code |
| Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
5.4.2 Commonsense and Physics Alignment
| Title |
Year |
Paper |
Website |
Code |
| VBench: Comprehensive Benchmark Suite for Video Generative Models |
2023 |
📄 Paper |
🌍 Website |
💾 Code |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
| PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding |
2025 |
📄 Paper |
🌍 Website |
💾 Code |
| VideoPhy: Evaluating Physical Commonsense for Video Generation |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
| WorldSimBench: Towards Video Generation Models as World Simulators |
2024 |
📄 Paper |
🌍 Website |
- |
| WorldModelBench: Judging Video Generation Models As World Models |
2025 |
📄 Paper |
🌍 Website |
💾 Code |
| VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
| WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation |
2025 |
📄 Paper |
- |
💾 Code |
| Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency |
2025 |
📄 Paper |
- |
💾 Code |
| Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding |
2025 |
📄 Paper |
- |
- |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
| Do generative video models understand physical principles? |
2025 |
📄 Paper |
🌍 Website |
💾 Code |
| PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
| How Far is Video Generation from World Model: A Physical Law Perspective |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought |
2025 |
📄 Paper |
- |
- |
| VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness |
2025 |
📄 Paper |
🌍 Website |
💾 Code |
5.5 Efficient Training and Fine-Tuning
| Title |
Year |
Paper |
Website |
Code |
| MODIX: Training-Free Multimodal Information-Driven Positional Index Scaling |
04/2026 |
📄 Paper |
- |
- |
| QAPruner: Quantization-Aware Vision Token Pruning for MLLMs |
04/2026 |
📄 Paper |
- |
- |
| Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation |
04/2026 |
📄 Paper |
- |
- |
| CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning |
04/2026 |
📄 Paper |
- |
- |
| LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules |
02/2026 |
📄 Paper |
- |
- |
| GRACE: Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs |
01/2026 |
📄 Paper |
- |
- |
| VLMQ: Post-Training Quantization for Large Vision-Language Models |
2026 (ICLR) |
📄 Paper |
- |
- |
| VILA: On Pre-training for Visual Language Models |
2023 |
📄 Paper |
- |
- |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
2021 |
📄 Paper |
- |
- |
| LoRA: Low-Rank Adaptation of Large Language Models |
2021 |
📄 Paper |
- |
💾 Code |
| QLoRA: Efficient Finetuning of Quantized LLMs |
2023 |
📄 Paper |
- |
- |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |
2022 |
📄 Paper |
- |
💾 Code |
| RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback |
2023 |
📄 Paper |
- |
- |
5.6 Scarcity of High-quality Datasets
| Title |
Year |
Paper |
Website |
Code |
| A Prescription for Better VLMs through Data Curation Alone (20/20 Vision) |
05/2026 |
📄 Paper |
- |
- |
| A Survey on Bridging VLMs and Synthetic Data |
2025 |
📄 Paper |
- |
💾 Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning |
2024 |
📄 Paper |
🌍 Website |
💾 Code |
| SLIP: Self-supervision meets Language-Image Pre-training |
2021 |
📄 Paper |
- |
💾 Code |
| Synthetic Vision: Training Vision-Language Models to Understand Physics |
2024 |
📄 Paper |
- |
- |
| Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings |
2024 |
📄 Paper |
- |
- |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data |
2024 |
📄 Paper |
- |
- |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation |
2024 |
📄 Paper |
- |
- |