At any given operational resolution, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, both of which lower overall latency.
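To make the second axis concrete, here is a minimal, hypothetical sketch of visual-token reduction: average-pooling a ViT-style patch grid 2x2 before handing tokens to the LLM. The function name `pool_visual_tokens`, the pooling choice, and the dimensions are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: shrink the number of visual tokens a VLM feeds its LLM.
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens: torch.Tensor, grid: int, stride: int = 2) -> torch.Tensor:
    """Average-pool a (batch, grid*grid, dim) patch grid, cutting tokens by stride^2."""
    b, n, d = tokens.shape
    assert n == grid * grid, "tokens must form a square patch grid"
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)    # (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=stride, stride=stride)  # (B, D, H/s, W/s)
    return x.flatten(2).transpose(1, 2)                     # (B, N/s^2, D)

tokens = torch.randn(1, 24 * 24, 1024)            # e.g. 576 patch tokens from a ViT
print(pool_visual_tokens(tokens, grid=24).shape)  # torch.Size([1, 144, 1024])
```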
The CTM has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals; and (2) neural synchronization employed as a latent representation.
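The following is an illustrative sketch of those two ideas, not the authors' implementation: each neuron applies its own private weights to a rolling history of incoming signals, and pairwise synchronization of neuron activations over internal ticks is flattened into a latent vector. All sizes and the tanh nonlinearity are assumptions.

```python
# Sketch of (1) neuron-level temporal processing and (2) synchronization-as-latent.
import torch

n_neurons, history = 8, 16
torch.manual_seed(0)
per_neuron_w = torch.randn(n_neurons, history)  # unique weights for each neuron

def step(signal_history: torch.Tensor) -> torch.Tensor:
    """signal_history: (n_neurons, history) -> post-activations (n_neurons,)."""
    return torch.tanh((per_neuron_w * signal_history).sum(dim=-1))

# Unroll internal "ticks", recording each neuron's activation trace.
hist = torch.randn(n_neurons, history)
trace = []
for _ in range(32):
    act = step(hist)
    hist = torch.cat([hist[:, 1:], act.unsqueeze(1)], dim=1)  # slide the window
    trace.append(act)

traces = torch.stack(trace, dim=1)              # (n_neurons, ticks)
sync = torch.corrcoef(traces)                   # pairwise synchronization matrix
iu = torch.triu_indices(n_neurons, n_neurons, offset=1)
latent = sync[iu[0], iu[1]]                     # upper triangle as the latent
print(latent.shape)                             # torch.Size([28])
```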
Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.
Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts.
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models.
Ranked #1 on Text-to-Image Generation on GenEval.
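For intuition on the RL signal involved, below is a minimal sketch of GRPO-style group-relative advantages, the kind of normalized reward Flow-GRPO applies to samples from a flow-matching model. The function name and the reward values are illustrative, not from the paper.

```python
# Sketch: GRPO computes advantages relative to a group of samples per prompt.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (groups, samples_per_group) -> mean/std-normalized advantages."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, a group of 4 generated images scored by e.g. a GenEval-style checker.
rewards = torch.tensor([[0.2, 0.9, 0.5, 0.4]])
print(group_advantages(rewards))  # higher-reward samples get positive advantage
```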
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards.
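A hedged sketch of what "verifiable reward" means in practice: a programmatic check of the final answer against a reference, with no learned reward model. The boxed-answer format and function name are assumptions for illustration.

```python
# Sketch of an outcome-based verifiable reward: 1.0 iff the answer checks out.
import re

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward from a deterministic answer check (assumed \\boxed{} format)."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == reference.strip() else 0.0

print(verifiable_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
```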
We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare.
Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm: they integrate modalities such as text, images, audio, and video to support complex reasoning, aiming for comprehensive perception, precise understanding, and deep reasoning.
Learned from internet-scale videos, the generalist policy can be deployed across diverse robots through efficient latent action decoding.
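One way to read "latent action decoding", sketched below under assumptions: the shared policy emits an embodiment-agnostic latent action, and a small per-robot head decodes it into that robot's action space. Class names, dimensions, and the two example embodiments are hypothetical.

```python
# Sketch: per-embodiment decoders map a shared latent action to concrete actions.
import torch
import torch.nn as nn

LATENT_DIM = 32  # assumed size of the policy's latent action

class LatentActionDecoder(nn.Module):
    def __init__(self, action_dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(),
                                  nn.Linear(64, action_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.head(z)

decoders = {"7dof_arm": LatentActionDecoder(7), "mobile_base": LatentActionDecoder(3)}
z = torch.randn(1, LATENT_DIM)        # latent action from the shared generalist policy
print(decoders["7dof_arm"](z).shape)  # torch.Size([1, 7])
```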
Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces first-audio latency in streaming scenarios.
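As a rough illustration of multi-token prediction from one forward pass: k lightweight heads read the same backbone hidden state, and each predicts one future audio token. The head structure, sizes, and vocabulary are assumptions, not the paper's MCTP module.

```python
# Sketch: k parallel heads emit k audio tokens from a single backbone pass.
import torch
import torch.nn as nn

hidden_dim, vocab, k = 512, 4096, 4
heads = nn.ModuleList(nn.Linear(hidden_dim, vocab) for _ in range(k))

def predict_k_tokens(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, hidden_dim) -> (batch, k) audio-token ids in one pass."""
    logits = torch.stack([head(hidden) for head in heads], dim=1)  # (B, k, vocab)
    return logits.argmax(dim=-1)

hidden = torch.randn(2, hidden_dim)    # one forward pass of the backbone
print(predict_k_tokens(hidden).shape)  # torch.Size([2, 4])
```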