[RFC]: QuantizationConfig and QuantizeMethodBase Refactor for Simplifying Kernel Integrations #8913

@LucasWilkinson

Motivation.

Currently, vLLM generally has a tight coupling between the checkpoint format and the kernel used during model execution. This coupling causes issues as the diversity of hardware and kernels increases, and it is particularly challenging for quantized kernels (mixed-precision kernels with sub-byte weights in particular). For performance, quantized layers will frequently want to run hardware-specialized kernels, and mixed-input kernels commonly pre-pack the weights into a bespoke layout that closely matches the hardware they run on.

The goal is to separate the kernel implementation from the checkpoint format. This will require a more sophisticated way of describing the linear layer operation, as well as a more sophisticated way of describing packed layouts within vLLM. The hope is that this makes it easier to register a kernel as a backend for multiple checkpoint formats. It will also require standardizing the calling structure of quantized linear layers in vLLM.

Proposed Change.

The high-level proposal is to separate out the create_weights logic, moving it from QuantizeMethodBase into QuantizationConfig, since QuantizationConfig is more closely tied to the serialization format. A new CompressedLinearDescriptor would then let the QuantizationConfig describe the computation that needs to take place, allowing a kernel dispatcher to select the most appropriate kernel (one that can_implement the computation).

More details:
https://docs.google.com/document/d/1AfgGfF73H_hcXfw6ehYO_l1vHEItsopbxFoV1PvnGIQ/edit?usp=sharing
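
To make the dispatch idea concrete, here is a minimal, self-contained sketch. It is not vLLM code and not the design from the linked document. Only the names QuantizationConfig, CompressedLinearDescriptor and can_implement come from this RFC; the descriptor fields, the describe_linear method, the two example kernels, the three example formats and the choose_kernel helper are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompressedLinearDescriptor:
    """What the linear layer must compute. All fields are hypothetical."""
    weight_bits: int
    group_size: Optional[int]  # None means per-channel scales
    has_zero_points: bool


class Kernel:
    """Hypothetical kernel interface: only can_implement is named in the RFC."""
    name = "base"

    @classmethod
    def can_implement(cls, desc: CompressedLinearDescriptor) -> bool:
        raise NotImplementedError


class FastInt4Kernel(Kernel):
    """Stand-in for a hardware-specialized kernel with narrow support."""
    name = "fast_int4"

    @classmethod
    def can_implement(cls, desc: CompressedLinearDescriptor) -> bool:
        return (desc.weight_bits == 4
                and desc.group_size in (None, 128)
                and not desc.has_zero_points)


class GenericKernel(Kernel):
    """Stand-in for a slower fallback kernel with broad support."""
    name = "generic"

    @classmethod
    def can_implement(cls, desc: CompressedLinearDescriptor) -> bool:
        return desc.weight_bits in (4, 8)


# Ordered from most to least preferred (an assumption of this sketch).
KERNELS_BY_PRIORITY = [FastInt4Kernel, GenericKernel]


def choose_kernel(desc: CompressedLinearDescriptor):
    """Return the first kernel that reports it can implement the descriptor."""
    for kernel in KERNELS_BY_PRIORITY:
        if kernel.can_implement(desc):
            return kernel
    raise ValueError(f"no kernel can implement {desc}")


class QuantizationConfig:
    """Checkpoint-format side: knows the serialization, not the kernel."""

    def describe_linear(self) -> CompressedLinearDescriptor:
        raise NotImplementedError


class FormatAConfig(QuantizationConfig):
    # Hypothetical format: symmetric 4-bit weights, groups of 128.
    def describe_linear(self) -> CompressedLinearDescriptor:
        return CompressedLinearDescriptor(4, 128, has_zero_points=False)


class FormatBConfig(QuantizationConfig):
    # A different hypothetical format that stores the same kind of weights.
    def describe_linear(self) -> CompressedLinearDescriptor:
        return CompressedLinearDescriptor(4, 128, has_zero_points=False)


class FormatCConfig(QuantizationConfig):
    # Hypothetical asymmetric 4-bit format (uses zero points).
    def describe_linear(self) -> CompressedLinearDescriptor:
        return CompressedLinearDescriptor(4, 128, has_zero_points=True)


if __name__ == "__main__":
    for config in (FormatAConfig(), FormatBConfig(), FormatCConfig()):
        kernel = choose_kernel(config.describe_linear())
        print(type(config).__name__, "->", kernel.name)
```

In this sketch, two different checkpoint formats that describe the same computation end up on the same kernel, and neither config class refers to a kernel by name. That is the decoupling the proposal aims for; how the real dispatcher ranks kernels is left open here.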

Feedback Period.

Until Oct 7. Preparatory work to help demonstrate the proposal will begin before then.

CC List.

@dsikka @mgoin @robertgshaw2-neuralmagic @comaniac @alexm-neuralmagic @HanGuo97 @tlrmchlsmth @bnellnm

Any Other Things.

No response

Labels

RFC, keep-open, unstale