Om AI Lab

Open Multimodal AGI Research

Building the foundational brains for the physical world.

🌌 About Us

At Om AI Lab, we believe the future of AI extends far beyond pure text. We are dedicated to building the "brains" for next-generation systems by focusing on the intersection of Spatial Intelligence, Visual Reasoning, and Embodied Agents.

Our research spans open-vocabulary perception, reinforced vision-language models, and real-time inference. We aim to bridge the critical gap between high-level logical reasoning and fine-grained visual action, building models that don't just "see" the world but intuitively understand and interact with it.

🚢 Flagship VLX Model Series

📹 VLX-Flow: A real-time VLM for streaming video understanding.
🔍 VLX-Seek: Fine-grained visual perception and grounding for physical AI.
🚗 VLX-Go: Efficient general-purpose embodied navigation in the wild.

🚀 Core Research Tracks

🧠 Reinforced & Advanced Visual Reasoning

Models that think, reason, and understand the visual world at a granular level.

🌟 VLM-R1: Solving Visual Understanding with Reinforced VLMs. (Highly active)
🔎 ZoomEye: Enhancing Multimodal LLMs with human-like zooming capabilities through tree-based image exploration.
🌍 ImageRAG: Enhancing ultrahigh-resolution remote sensing imagery analysis.

👁️ Real-Time Perception & Open-World Visual Detection

Foundational spatial understanding optimized for edge and on-premise speeds.

⚡ OmDet: Real-time, highly accurate, open-vocabulary end-to-end object detection.
🔍 VLM-FO1: Bridging the gap between high-level reasoning and fine-grained perception in Vision-Language Models.
📐 GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training.

🤖 Multimodal Agents & Embodied AI

Action-oriented intelligence for physical and virtual environments.

🛠️ OmAgent: A comprehensive framework to build multimodal language agents for fast prototyping and production.
🎯 OmTrackVLA: Open and reproducible research for tracking Vision-Language-Action (VLA) models.

📊 Benchmarks & Evaluation

Rigorous standards for the open-source multimodal community.

📏 OVDEval: A comprehensive evaluation benchmark for Open-Vocabulary Detection.
📝 VL-CheckList: Evaluating Vision & Language Pretraining Models with Objects, Attributes, and Relations.
