
[Feature]: AttributeError: Model MllamaForConditionalGeneration does not support BitsAndBytes quantization yet #9714

@CyrusCY


Your current environment

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.31

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3

Model Input Dumps

vllm serve unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit --quantization bitsandbytes --load-format bitsandbytes --trust-remote-code --enforce-eager

Initializing an LLM engine (v0.6.3.post1) with config: model='unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit', speculative_config=None,
tokenizer='unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None,
rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None,
load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes,
enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit, num_scheduler_steps=1, chunked_prefill_enabled=False, multi_step_stream_outputs=True,
enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
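
For completeness, the same failure should also reproduce through the offline Python API, since the engine arguments below simply mirror the serve flags above (a minimal sketch, not a full script):

from vllm import LLM

# Sketch only: mirrors the `vllm serve` flags above, so it should hit the
# same BitsAndBytes loader path on v0.6.3.post1.
llm = LLM(
    model="unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    trust_remote_code=True,
    enforce_eager=True,
)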

🐛 Describe the bug

AttributeError: Model MllamaForConditionalGeneration does not support BitsAndBytes quantization yet

I was trying to serve the Llama-3.2-90B-Vision-Instruct-bnb-4bit model and it fails with the error above. I'm not sure where this issue is best raised: unsloth, transformers, or here.
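
From a quick look at the vLLM v0.6.x model loader, the error message seems to come from a per-model capability check inside vLLM itself rather than from unsloth or transformers, which suggests this repo is the right place. A rough sketch of that kind of gate (the attribute name is my assumption, not verbatim vLLM source):

def _check_bnb_support(model) -> None:
    # Assumed gate: the BitsAndBytes loader needs a mapping from fused/stacked
    # parameters back to the checkpoint layout; models that don't declare it
    # are rejected with the error reported above.
    if not hasattr(model, "bitsandbytes_stacked_params_mapping"):  # assumed attribute
        raise AttributeError(
            f"Model {type(model).__name__} does not support "
            "BitsAndBytes quantization yet")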

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
