Motivation.
Currently vLLM generally has a tight coupling between the checkpoint format and the kernel used during model execution. This coupling causes issues as the diversity of hardware and kernels increases, and it is particularly challenging for quantized kernels (mixed-precision with sub-byte weights in particular). For performance, quantized linear layers frequently want to run hardware-specialized kernels, and for mixed-input these kernels commonly pre-pack the weights into a bespoke layout that closely matches the hardware they run on.
The goal is to separate the kernel implementation from the checkpoint format. This will require a more sophisticated way of describing the linear-layer operation, in addition to a more sophisticated way of describing packed layouts within vLLM. The result should make it easier to register a single kernel as a backend for multiple checkpoint formats. It will also require standardizing the calling structure of quantized linear layers in vLLM, as in the sketch below.
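As a rough sketch of what a standardized calling structure could look like (the class name and `apply` signature below are hypothetical illustrations, not an existing vLLM interface):

```python
from typing import Optional

import torch


class QuantizedLinearMethod:
    """Hypothetical uniform interface: every quantized-linear backend would
    expose the same apply() regardless of the checkpoint format that
    produced the weights."""

    def apply(
        self,
        layer: torch.nn.Module,           # holds the (possibly re-packed) weights
        x: torch.Tensor,                  # activations, [num_tokens, in_features]
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        raise NotImplementedError
```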
Proposed Change.
The high-level proposal is to separate out the `create_weights` logic, moving it from `QuantizeMethodBase` into `QuantizationConfig`, as `QuantizationConfig` is more closely tied to the serialization format. Then, create a `CompressedLinearDescriptor` that allows the `QuantizationConfig` to describe the computation that needs to take place, so that a kernel dispatcher can select the most appropriate kernel (one that `can_implement` the computation).
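A minimal sketch of this flow, under stated assumptions: `create_weights`, `can_implement`, `QuantizationConfig`, and `CompressedLinearDescriptor` come from the proposal itself, while everything else (the descriptor fields, `get_linear_descriptor`, `select_kernel`) is hypothetical naming for illustration:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class CompressedLinearDescriptor:
    """Kernel-agnostic description of the linear computation a checkpoint
    format requires. The fields here are illustrative, not exhaustive."""
    weight_dtype: torch.dtype        # e.g. a sub-byte quantized weight type
    activation_dtype: torch.dtype    # e.g. torch.float16 for mixed-input
    group_size: Optional[int]        # quantization group size; None = per-channel
    has_zero_points: bool
    in_features: int
    out_features: int


class QuantizationConfig:
    """Tied to the serialization format. Under this proposal it owns
    create_weights (moved out of QuantizeMethodBase) and can describe
    the computation the checkpoint requires."""

    def create_weights(self, layer: torch.nn.Module) -> None:
        raise NotImplementedError

    def get_linear_descriptor(self) -> CompressedLinearDescriptor:
        raise NotImplementedError  # hypothetical method name


class LinearKernelBase:
    """Base class for kernel backends."""

    @classmethod
    def can_implement(cls, desc: CompressedLinearDescriptor) -> bool:
        """Report whether this kernel supports the described computation
        on the current hardware."""
        raise NotImplementedError

    def process_weights(self, layer: torch.nn.Module) -> None:
        """Re-pack checkpoint-format weights into this kernel's layout."""
        raise NotImplementedError

    def apply(self, layer, x, bias=None) -> torch.Tensor:
        raise NotImplementedError


_KERNEL_REGISTRY: list = []  # populated by kernel backends at import time


def select_kernel(desc: CompressedLinearDescriptor):
    """Dispatcher: pick the first registered kernel that can_implement the
    described computation (a real dispatcher might also rank by speed)."""
    for kernel_cls in _KERNEL_REGISTRY:
        if kernel_cls.can_implement(desc):
            return kernel_cls
    raise ValueError(f"No registered kernel can implement: {desc}")
```

With a split along these lines, a new hardware backend would register one kernel class and immediately serve every checkpoint format whose descriptor it can implement.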
More details:
https://docs.google.com/document/d/1AfgGfF73H_hcXfw6ehYO_l1vHEItsopbxFoV1PvnGIQ/edit?usp=sharing
Feedback Period.
Until Oct 7. Preparatory work to help demonstrate the approach will begin before then.
CC List.
@dsikka @mgoin @robertgshaw2-neuralmagic @comaniac @alexm-neuralmagic @HanGuo97 @tlrmchlsmth @bnellnm
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.