FastVLM: Efficient Vision Encoding for Vision Language Models

apple/ml-fastvlm 17 Dec 2024

At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency.

2,960
10.19 stars / hour

Continuous Thought Machines

SakanaAI/continuous-thought-machines 8 May 2025

The CTM has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals; and (2) neural synchronization employed as a latent representation.

Computational Efficiency, Question Answering

623
4.51 stars / hour

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

jiuhaichen/blip3o 14 May 2025

Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.

Image Generation

94
2.71 stars / hour

Generating Physically Stable and Buildable LEGO Designs from Text

AvaLovelace1/LegoGPT 8 May 2025

Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts.

3D Generation, Large Language Model +1

959
2.54 stars / hour

Flow-GRPO: Training Flow Matching Models via Online RL

yifan123/flow_grpo 8 May 2025

We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models.

Denoising, Diversity +3

497
2.41 stars / hour

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

LeapLabTHU/Absolute-Zero-Reasoner 6 May 2025

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards.

Mathematical Reasoning

950
2.21 stars / hour

HealthBench: Evaluating Large Language Models Towards Improved Human Health

openai/simple-evals 13 May 2025

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare.

Instruction Following, Multiple-choice

3,285
1.70 stars / hour

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

hitsz-tmg/awesome-large-multimodal-reasoning-models 8 May 2025

Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning.

Multimodal Reasoning

257
1.44 stars / hour

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

opendrivelab/univla 9 May 2025

Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding.

Vision-Language-Action

206
1.35 stars / hour

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

vita-mllm/vita-audio 6 May 2025

Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios.

Automatic Speech Recognition, Automatic Speech Recognition (ASR) +7

259
1.28 stars / hour