Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Agent S: an open agentic framework that uses computers like a human
Mobile-Agent: The Powerful GUI Agent Family
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
[CVPR'25] Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
OpenEMMA: a permissively licensed, open-source "reproduction" of Waymo's EMMA model
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
[NeurIPS 2025] 4KAgent: Agentic Any Image to 4K Super-Resolution. An intelligent computer vision agent that restores any image to high-quality 4K.
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing
Fully Open Framework for Democratized Multimodal Training