Commit 10f047b
Add Qwen3-VL and Qwen3-VL-MoE multimodal model support
This commit introduces comprehensive support for Qwen3-VL vision-language
models, including both the dense variant and the Mixture-of-Experts (MoE)
architecture with DeepStack fusion capabilities.
## Overview
Qwen3-VL is Alibaba's family of advanced multimodal models, capable of
understanding and reasoning about images alongside text. This implementation
enables running these models for vision-language tasks including image
understanding, optical character recognition (OCR), visual question
answering, and document analysis.
## Architecture Implementation
### Core Architecture (llama-arch.cpp/h)
- **LLM_ARCH_QWEN3_VL**: Dense vision-language model architecture
- **LLM_ARCH_QWEN3_VL_MOE**: Mixture-of-Experts variant with expert routing
- Complete tensor mapping registration for both architectures
- Architecture-specific parameter handling and validation
### Model Loading (llama-model.cpp)
**Hyperparameter Loading**
- QWEN3_VL: Standard dense model configuration
* Uses full n_embd dimension throughout
* 36 layers for 4B parameter variant
- QWEN3_VL_MOE: Expert-based configuration
* Embedding split across 4 DeepStack channels (n_embd/4 per channel × 4 channels)
* 48 layers (30B-A3B) or 94 layers (235B-A22B)
* Expert feed-forward network dimensions (loading sketch below)
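A minimal loading sketch of the per-architecture split above, following llama.cpp's loader conventions; the case bodies are paraphrases of this list, not the commit's actual code:

```cpp
// hedged sketch: the arch enums are the ones this commit adds, the
// comments paraphrase the configuration listed above
switch (model.arch) {
    case LLM_ARCH_QWEN3_VL:
        // dense: full n_embd throughout; 36 layers for the 4B variant
        break;
    case LLM_ARCH_QWEN3_VL_MOE:
        // embedding split across 4 DeepStack channels (n_embd/4 each);
        // 48 layers (30B-A3B) or 94 layers (235B-A22B); expert FFN
        // dimensions are read from the model metadata
        break;
    default:
        break;
}
```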
**Multi-axis Rotary Position Embedding (M-RoPE)**
- Configured rope_sections = [24, 20, 20, 0]
* Temporal dimension: 24 dims
* Height dimension: 20 dims
* Width dimension: 20 dims
* Unused dimension: 0
- Enables spatial awareness for image patch processing
- Added debug logging to verify the M-RoPE configuration; the section-to-axis mapping is sketched below
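As a sanity check on the numbers above, a small stand-alone sketch (not the llama.cpp implementation) of how a rotary dimension index maps to the position axis that drives it:

```cpp
// sections mirror rope_sections = [24, 20, 20, 0]:
// dims 0-23 follow the temporal position, 24-43 the patch row,
// 44-63 the patch column; the fourth section is unused
enum mrope_axis { MROPE_TEMPORAL, MROPE_HEIGHT, MROPE_WIDTH, MROPE_UNUSED };

static mrope_axis mrope_axis_for_dim(int dim, const int sections[4]) {
    int end = 0;
    for (int s = 0; s < 3; ++s) {
        end += sections[s];
        if (dim < end) {
            return (mrope_axis) s;
        }
    }
    return MROPE_UNUSED;
}
```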
**Tensor Initialization**
- QWEN3_VL follows QWEN3 dense structure
* Token embeddings, output projection
* Per-layer: attention (Q/K/V/O), normalization, FFN
- QWEN3_VL_MOE includes expert-specific tensors
* Expert gate networks for routing
* Per-expert FFN weights (gate, down, up)
* Shared and expert-specific parameters (illustrative shapes below)
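An illustrative sketch of the expert tensor shapes, following llama.cpp's `create_tensor`/`LLM_TENSOR_*` conventions; this is a paraphrase of the list above, not the commit's exact code:

```cpp
// per-expert FFN weights live in 3-D tensors, one slab per expert;
// n_ff_exp is the expert FFN width, n_expert the expert count
layer.ffn_gate_inp  = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP,  "weight", i), {n_embd, n_expert}, 0);
layer.ffn_gate_exps = create_tensor(tn(LLM_TENSOR_FFN_GATE_EXPS, "weight", i), {n_embd, n_ff_exp, n_expert}, 0);
layer.ffn_down_exps = create_tensor(tn(LLM_TENSOR_FFN_DOWN_EXPS, "weight", i), {n_ff_exp, n_embd, n_expert}, 0);
layer.ffn_up_exps   = create_tensor(tn(LLM_TENSOR_FFN_UP_EXPS,   "weight", i), {n_embd, n_ff_exp, n_expert}, 0);
```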
### Graph Building (llama-graph.cpp/h)
**DeepStack Architecture for MoE**
The Qwen3-VL-MoE variant implements a novel DeepStack fusion mechanism:
1. **Channel Splitting**: Vision embeddings split into 3 processing channels
- ds0, ds1, ds2 (DeepStack channels 0, 1, 2)
- Each channel: n_embd/4 dimensions
2. **Per-layer Processing**: Independent expert selection per channel
- Token-level expert routing
- Gated mixture-of-experts computation
- Q/K normalization before attention
3. **Fusion Layers**: Learned merging at early transformer layers
- Fusion occurs at layers 0, 1, and 2
- DeepStack merger combines information across channels
- Only active when vision embeddings are present (text-only safe); see the sketch after this list
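Putting the three steps together, a paraphrased graph-building sketch; `build_layer` and `deepstack_features` are illustrative names, and the additive fusion shown is an assumption about how the merger output enters the residual stream:

```cpp
// fuse one DeepStack channel per early layer (0, 1, 2), and only when
// the batch actually carries vision embeddings; text-only batches skip
// the fusion entirely
for (int il = 0; il < n_layer; ++il) {
    cur = build_layer(cur, il); // Q/K-normalized attention + routed MoE FFN
    if (il < 3 && deepstack_features[il] != nullptr) {
        cur = ggml_add(ctx0, cur, deepstack_features[il]);
    }
}
```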
**Batch Processing**
- Enhanced position array handling for M-RoPE multi-dimensional positions
- Proper ubatch preparation distinguishing vision vs text tokens
- Conditional graph construction based on modality (position layout sketched below)
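A sketch of the multi-dimensional position layout M-RoPE expects, assuming the flattened `pos[axis * n_tokens + i]` convention used for similar multimodal models in llama.cpp; `t_of`/`row_of`/`col_of` are hypothetical helpers for illustration:

```cpp
#include <vector>
#include "llama.h" // for llama_pos

// four position rows per batch: temporal, height, width, unused
std::vector<llama_pos> pos(4 * n_tokens);
for (int i = 0; i < n_tokens; ++i) {
    pos[0 * n_tokens + i] = t_of(i);   // temporal / sequence index
    pos[1 * n_tokens + i] = row_of(i); // patch row (0 for text tokens)
    pos[2 * n_tokens + i] = col_of(i); // patch column (0 for text tokens)
    pos[3 * n_tokens + i] = 0;         // unused fourth section
}
```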
### Vision Processing (clip.cpp/clip-impl.h)
**PROJECTOR_TYPE_QWEN3VLMOE**
- New projector type for Qwen3-VL-MoE vision encoder
- Handles projection from vision encoder to language model space
**DeepStack Merger Implementation**
The merger is a learnable 2-layer MLP with normalization:
```
Input (3 channels)
→ LayerNorm(norm_w, norm_b)
→ Linear(fc1_w, fc1_b)
→ GELU activation
→ Linear(fc2_w, fc2_b)
→ Output (fused representation)
```
Components:
- `norm_w`, `norm_b`: Layer normalization parameters
- `fc1_w`, `fc1_b`: First linear projection
- `fc2_w`, `fc2_b`: Second linear projection (ggml-style sketch below)
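In ggml terms the merger can be sketched as below; the helper name, signature, and epsilon are illustrative, only the tensor roles come from the list above:

```cpp
static ggml_tensor * deepstack_merger(
        ggml_context * ctx, ggml_tensor * x,
        ggml_tensor * norm_w, ggml_tensor * norm_b,
        ggml_tensor * fc1_w,  ggml_tensor * fc1_b,
        ggml_tensor * fc2_w,  ggml_tensor * fc2_b) {
    // LayerNorm over the channel-concatenated input (eps is an assumed value)
    x = ggml_norm(ctx, x, 1e-6f);
    x = ggml_add(ctx, ggml_mul(ctx, x, norm_w), norm_b);
    // first projection + GELU
    x = ggml_gelu(ctx, ggml_add(ctx, ggml_mul_mat(ctx, fc1_w, x), fc1_b));
    // second projection back to the language-model width
    return ggml_add(ctx, ggml_mul_mat(ctx, fc2_w, x), fc2_b);
}
```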
**Spatial Operations**
- Fixed spatial merge for vision patch sequences (sketched after this list)
- Proper handling of patch grid dimensions
- Vision-text boundary management
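For intuition, a stand-alone sketch of a 2×2 spatial merge; the 2×2 merge factor is an assumption carried over from the Qwen-VL family, not stated in this commit:

```cpp
#include <algorithm>
#include <vector>

// merge an H x W grid of C-dim patch embeddings (row-major, H and W even)
// into (H/2) x (W/2) tokens of dim 4C by concatenating each 2x2 neighborhood
std::vector<float> spatial_merge_2x2(const std::vector<float> & in,
                                     int H, int W, int C) {
    std::vector<float> out((H / 2) * (W / 2) * 4 * C);
    for (int y = 0; y < H / 2; ++y) {
        for (int x = 0; x < W / 2; ++x) {
            float * dst = out.data() + (y * (W / 2) + x) * 4 * C;
            for (int dy = 0; dy < 2; ++dy)
            for (int dx = 0; dx < 2; ++dx) {
                const float * src = in.data() + ((2*y + dy) * W + (2*x + dx)) * C;
                std::copy(src, src + C, dst + (dy * 2 + dx) * C);
            }
        }
    }
    return out;
}
```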
**Safety Improvements**
- Removed illegal zero-tensor initialization for text-only inputs
- Conditional fusion: only processes when vision embeddings exist
- Prevents memory access violations in text-only inference
### Platform Support (llama-model-loader.cpp)
**Windows File Handle Limit**
- Increased the stdio stream limit to 2048 handles (from the default of 512)
- Critical for MoE models with many expert weight files
- Uses `_setmaxstdio()` on Windows platform
- Prevents "too many open files" errors during loading (see the sketch below)
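A minimal sketch of the bump; `_setmaxstdio` is the documented MSVC CRT call, while the wrapper function and its warning message are illustrative:

```cpp
#if defined(_WIN32)
#include <cstdio>

static void raise_stdio_limit() {
    // the MSVC CRT defaults to 512 simultaneously open stdio streams;
    // large MoE checkpoints can exceed that while their shards are open
    if (_setmaxstdio(2048) == -1) {
        fprintf(stderr, "warning: could not raise the stdio handle limit\n");
    }
}
#endif
```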
### Reference Patches (llama/patches/)
Included for transparency and reproducibility:
- `0033-qwen3vl-base-architecture.patch`
- `0034-qwen3vl-deepstack-implementation.patch`
- `0035-qwen3vl-memory-fix.patch`
- `0036-qwen3vl-layer-norm-bias.patch`
## Technical Specifications
### Qwen3-VL (Dense)
- **Type**: Standard transformer with integrated vision encoder
- **Layers**: 36 (4B parameter model)
- **Embedding**: Full n_embd dimension
- **Position Encoding**: M-RoPE with 4 dimensional sections
- **Use Cases**: General vision-language understanding
### Qwen3-VL-MoE (Mixture of Experts)
- **Type**: Sparse MoE with DeepStack fusion
- **Layers**: 48 (30B-A3B: 30B total, 3B active parameters) or 94 (235B-A22B: 235B total, 22B active)
- **Embedding**: 4-channel architecture (n_embd/4 per channel)
- **Experts**: Multiple expert networks per layer with learned routing
- **Fusion**: 3-layer early fusion (layers 0, 1, 2)
- **Use Cases**: High-quality vision understanding at improved efficiency
### DeepStack Fusion Mechanism
The multi-channel fusion enables:
1. **Parallel Processing**: Different aspects of vision processed independently
2. **Early Integration**: Information merged in early transformer layers
3. **Adaptive Routing**: Expert selection per channel and token
4. **Efficiency**: Sparse activation patterns reduce computation
## Capabilities Enabled
This implementation supports:
- **Multimodal Chat**: Conversational AI with image understanding
- **Image Captioning**: Detailed image descriptions
- **Visual Question Answering**: Answer questions about image content
- **Optical Character Recognition**: Extract text from images
- **Document Understanding**: Analyze documents, tables, charts
- **Image Analysis**: Detailed visual scene understanding
## References and Acknowledgments
This implementation is based on the outstanding work by the community:
**Primary Source Repository**
- Branch: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/
- Author: LETS-BEE
**Source Commits** (applied in llama/patches/):
1. Base Architecture
LETS-BEE/llama.cpp@9971912
2. DeepStack Implementation
LETS-BEE/llama.cpp@b913e89
3. Memory Access Fix
LETS-BEE/llama.cpp@de0e3d3
4. Layer Normalization Update
LETS-BEE/llama.cpp@e45aecb
**Related Discussions and Pull Requests**
- Upstream llama.cpp Discussion:
ggml-org/llama.cpp#16207 (comment)
- Upstream llama.cpp PR:
ggml-org/llama.cpp#16745
- Related Ollama PR:
ollama#12665
**Additional Context**
- OCR-related discussion:
ggml-org/llama.cpp#16764
## Testing
Tested with:
- Qwen3-VL 4B parameter models (dense)
- Qwen3-VL-MoE 30B-A3B models (MoE)
- Various image understanding tasks
- Text-only and multimodal inference modes
## Future Work
Potential enhancements:
- Additional model size variants
- Performance optimizations for DeepStack fusion
- Extended M-RoPE configuration options
- Enhanced vision preprocessing pipelines
---
Special thanks to the llama.cpp community and all contributors who made
this multimodal vision-language support possible.