👁️ Om AI Lab

Open Multimodal AGI Research

Website | Hugging Face | X (Twitter)

Pioneering the next generation of multimodal AI models for Spatial Intelligence and Embodied AI.


🌌 About Us

At Om AI Lab, we believe the future of AI extends far beyond pure text. We are dedicated to building the "brains" for next-generation systems by focusing on the intersection of Spatial Intelligence, Visual Reasoning, and Embodied Agents.

Our research spans open-vocabulary perception, reinforced vision-language models, and real-time inference. We aim to bridge the critical gap between high-level logical reasoning and fine-grained visual action, building models that don't just "see" the world but intuitively understand and interact with it.


🚀 Core Research Tracks

🧠 Reinforced & Advanced VLMs

Models that think, reason, and understand the visual world at a granular level.

  • 🌟 VLM-R1: Solving Visual Understanding with Reinforced VLMs. (Highly active)
  • 🔍 VLM-FO1: Bridging the gap between high-level reasoning and fine-grained perception in Vision-Language Models.
  • 🔎 ZoomEye: Enhancing Multimodal LLMs with human-like zooming capabilities through tree-based image exploration.

👁️ Real-Time Perception & Open-World Detection

Foundational spatial understanding optimized for edge and on-premise speeds.

  • OmDet: Real-time, highly accurate, open-vocabulary end-to-end object detection.
  • 📐 GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training.
  • 🌍 ImageRAG: Enhancing ultrahigh-resolution remote sensing imagery analysis.

🤖 Multimodal Agents & Embodied AI

Action-oriented intelligence for physical and virtual environments.

  • 🛠️ OmAgent: A comprehensive framework to build multimodal language agents for fast prototyping and production.
  • 🎯 OpenTrackVLA: Open and reproducible research for tracking Vision-Language-Action (VLA) models.

📊 Benchmarks & Evaluation

Rigorous standards for the open-source multimodal community.

  • 📏 OVDEval: A comprehensive evaluation benchmark for Open-Vocabulary Detection.
  • 📝 VL-CheckList: Evaluating Vision & Language Pretraining Models with Objects, Attributes, and Relations.

Building the foundational brains for the physical world.

Join us in exploring the spatial frontier.

Pinned

  1. VLM-R1 (Public)

    Solve Visual Understanding with Reinforced VLMs

    Python · 6k stars · 380 forks

  2. OmDet (Public)

    Real-time and accurate open-vocabulary end-to-end object detection

    Python · 1.4k stars · 116 forks

  3. OmAgent (Public)

    [EMNLP-2024] Build multimodal language agents for fast prototype and production

    Python · 2.6k stars · 289 forks

  4. VLM-FO1 (Public)

    VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

    Python · 309 stars · 14 forks

  5. OpenTrackVLA (Public)

    Open & Reproducible Research for Tracking VLAs

    Python · 207 stars · 11 forks

  6. ZoomEye (Public)

    [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

    Python · 81 stars · 9 forks
