Why is the bitsandbytes model significantly slower than the AWQ model? #8743

### Your current environment

`vLLM 0.6.1.post2`

### 🐛 Describe the bug

I used a model from the hub that is already quantized with AWQ. I loaded it with the half (float16) data type, and inference is very fast. However, when I load the base model and let vLLM apply bitsandbytes quantization at load time, inference is significantly slower than with the pre-quantized AWQ model.
Any idea what causes the difference?
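
For reference, here is a minimal sketch of how the two setups could be compared. It assumes the AWQ model is loaded with `quantization="awq"` and `dtype="half"`, and the base model with `quantization="bitsandbytes"` and `load_format="bitsandbytes"`; the exact arguments used in this report were not recorded. The helper names `engine_kwargs`, `tokens_per_second`, and `vllm_generate` are hypothetical and only illustrate the measurement.

```python
import time
from typing import Callable, Sequence


def engine_kwargs(model: str, method: str) -> dict:
    """Build keyword arguments for vllm.LLM for one of the two setups (assumed settings)."""
    if method == "awq":
        # Pre-quantized AWQ checkpoint from the hub, run in half precision.
        return {"model": model, "quantization": "awq", "dtype": "half"}
    if method == "bitsandbytes":
        # Base checkpoint, quantized by vLLM with bitsandbytes while loading.
        return {
            "model": model,
            "quantization": "bitsandbytes",
            "load_format": "bitsandbytes",
        }
    raise ValueError(f"unknown method: {method}")


def tokens_per_second(
    generate: Callable[[Sequence[str]], Sequence[int]],
    prompts: Sequence[str],
) -> float:
    """Time one generate call; `generate` returns the new-token count per prompt."""
    start = time.perf_counter()
    counts = generate(prompts)
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


def vllm_generate(kwargs: dict, max_tokens: int = 128):
    """Wrap a vLLM engine as a `generate` callable (needs vllm and a GPU)."""
    # Imported lazily so the helpers above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)

    def generate(prompts: Sequence[str]) -> list:
        outputs = llm.generate(list(prompts), params)
        # Count only the generated tokens of the first completion per prompt.
        return [len(o.outputs[0].token_ids) for o in outputs]

    return generate
```

To compare, build one engine per setup in separate processes, for example `vllm_generate(engine_kwargs("<awq-model-id>", "awq"))` and `vllm_generate(engine_kwargs("<base-model-id>", "bitsandbytes"))`, and pass each result to `tokens_per_second` with the same prompts. The model ids are placeholders.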



