multimodal-large-language-models

Here are 381 public repositories matching this topic...

BradyFU / Awesome-Multimodal-Large-Language-Models

✨✨Latest Advances on Multimodal Large Language Models

multi-modality instruction-following in-context-learning large-language-models chain-of-thought instruction-tuning visual-instruction-tuning large-vision-language-model multimodal-instruction-tuning large-vision-language-models multimodal-large-language-models multimodal-in-context-learning multimodal-chain-of-thought

Updated Dec 23, 2025

X-PLUG / MobileAgent

Mobile-Agent: The Powerful GUI Agent Family

android agent app gui automation mobile copilot multimodal mobile-agents mllm multimodal-large-language-models multimodal-agent

Updated Dec 2, 2025
Python

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

svg vlm llm multimodal-large-language-models

Updated Nov 7, 2025
Python

ictnlp / LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

speech-to-text speech-to-speech large-language-models multimodal-large-language-models speech-language-model speech-interaction

Updated May 19, 2025
Python

VITA-MLLM / VITA

✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

multimodal-large-language-models large-multimodal-models omni-modal-video-understanding omni-language-model omni-model

Updated Mar 28, 2025
Python

X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

multimodal table-understanding document-understanding mllm multimodal-large-language-models chart-understanding

Updated May 30, 2025
Python

sherlockchou86 / VideoPipe

A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)

Updated Nov 5, 2025
C++

cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

computer-vision chatbot representation-learning clip dino large-language-models llms instruction-tuning mllm multimodal-large-language-models

Updated Nov 7, 2025
Python

YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

text-to-image image-editting large-language-models multimodal-large-language-models

Updated Feb 1, 2025
Jupyter Notebook

ByteDance-Seed / Seed1.5-VL

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

cookbook large-language-model vision-language-model multimodal-large-language-models

Updated Jun 14, 2025
Jupyter Notebook

AIDC-AI / Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

chatbot multimodality multimodal vision-language-model multimodal-large-language-models vision-language-learning qwen llama3

Updated Sep 22, 2025
Python

Henry-23 / VideoChat

Real-time voice-interactive digital human with customizable appearance and voice, voice cloning support, and conversation latency (initial packet delay) as low as 3 s.