Your current environment
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.31
Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
Model Input Dumps
vllm serve unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit --quantization bitsandbytes --load-format bitsandbytes --trust-remote-code --enforce-eager
Initializing an LLM engine (v0.6.3.post1) with config: model='unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit', speculative_config=None, tokenizer='unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit, num_scheduler_steps=1, chunked_prefill_enabled=False, multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
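For reference, the same configuration through vLLM's offline Python API should hit the same code path as the `vllm serve` command above (a sketch; the keyword arguments mirror the CLI flags):

```python
from vllm import LLM

# Same settings as the CLI invocation above, via the offline API.
llm = LLM(
    model="unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    trust_remote_code=True,
    enforce_eager=True,
)
```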
🐛 Describe the bug
AttributeError: Model MllamaForConditionalGeneration does not support BitsAndBytes quantization yet
I was trying the Llama-3.2-90B-Vision-Instruct-bnb-4bit model and hit the error above. I'm not sure where it is best to raise this issue: unsloth, transformers, or here.
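If it helps triage: the message looks like it comes from a capability check in vLLM's bitsandbytes model loader, which refuses architectures that have not declared bnb support. A minimal sketch of that kind of guard (names are my assumption from memory of the loader, not verbatim vLLM source):

```python
# Sketch (assumed, not verbatim vLLM source): the bitsandbytes loader
# checks for a per-model attribute describing how stacked/fused weights
# map to checkpoint tensors, and bails out if the model class lacks it.
def assert_bnb_supported(model_class) -> None:
    if not hasattr(model_class, "bitsandbytes_stacked_params_mapping"):
        raise AttributeError(
            f"Model {model_class.__name__} does not support "
            "BitsAndBytes quantization yet")
```

So this likely means `MllamaForConditionalGeneration` simply hasn't been wired up for bnb loading in vLLM yet, rather than a problem with the unsloth checkpoint itself.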
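As a possible interim workaround, the pre-quantized checkpoint can presumably be loaded with plain transformers instead of vLLM; a sketch (untested against this exact repo, class names per the transformers Mllama docs):

```python
# Hypothetical workaround: load the pre-quantized bnb-4bit checkpoint
# with transformers directly, bypassing vLLM (untested assumption).
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit"

# The checkpoint is already bitsandbytes-quantized, so no extra
# BitsAndBytesConfig should be needed; device_map places weights on GPU.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```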
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.