Motivation.
Currently vLLM generally has a tight coupling between the checkpoint format and the kernel used during model execution. This coupling causes issues as the diversity of hardware and kernels increases, and it is particularly challenging for quantized kernels (mixed-precision with sub-byte weights in particular). For performance, quantized linear layers frequently want to run hardware-specialized kernels, and for mixed-input these kernels commonly pre-pack the weights into a bespoke layout that closely matches the hardware they run on.
The goal is to separate the kernel implementation from the checkpoint format. This will require a more sophisticated way of describing the linear-layer operation, in addition to a more sophisticated way of describing packed layouts within vLLM. The result should make it easier to register a single kernel as a backend for multiple checkpoint formats. It will also require standardizing the calling structure of quantized linear layers in vLLM, as in the sketch below.
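As a rough sketch of what a standardized calling structure could look like (the class name and `apply` signature below are hypothetical illustrations, not an existing vLLM interface):

```python
from typing import Optional

import torch


class QuantizedLinearMethod:
    """Hypothetical uniform interface: every quantized-linear backend would
    expose the same apply() regardless of the checkpoint format that
    produced the weights."""

    def apply(
        self,
        layer: torch.nn.Module,           # holds the (possibly re-packed) weights
        x: torch.Tensor,                  # activations, [num_tokens, in_features]
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        raise NotImplementedError
```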
Proposed Change.
The high-level proposal is to separate out the `create_weights` logic, moving it from `QuantizeMethodBase` into `QuantizationConfig`, as `QuantizationConfig` is more closely tied to the serialization format. Then, create a `CompressedLinearDescriptor` that allows the `QuantizationConfig` to describe the computation that needs to take place, so that a kernel dispatcher can select the most appropriate kernel (one that `can_implement` the computation).
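A minimal sketch of this flow, under stated assumptions: `create_weights`, `can_implement`, `QuantizationConfig`, and `CompressedLinearDescriptor` come from the proposal itself, while everything else (the descriptor fields, `get_linear_descriptor`, `select_kernel`) is hypothetical naming for illustration:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class CompressedLinearDescriptor:
    """Kernel-agnostic description of the linear computation a checkpoint
    format requires. The fields here are illustrative, not exhaustive."""
    weight_dtype: torch.dtype        # e.g. a sub-byte quantized weight type
    activation_dtype: torch.dtype    # e.g. torch.float16 for mixed-input
    group_size: Optional[int]        # quantization group size; None = per-channel
    has_zero_points: bool
    in_features: int
    out_features: int


class QuantizationConfig:
    """Tied to the serialization format. Under this proposal it owns
    create_weights (moved out of QuantizeMethodBase) and can describe
    the computation the checkpoint requires."""

    def create_weights(self, layer: torch.nn.Module) -> None:
        raise NotImplementedError

    def get_linear_descriptor(self) -> CompressedLinearDescriptor:
        raise NotImplementedError  # hypothetical method name


class LinearKernelBase:
    """Base class for kernel backends."""

    @classmethod
    def can_implement(cls, desc: CompressedLinearDescriptor) -> bool:
        """Report whether this kernel supports the described computation
        on the current hardware."""
        raise NotImplementedError

    def process_weights(self, layer: torch.nn.Module) -> None:
        """Re-pack checkpoint-format weights into this kernel's layout."""
        raise NotImplementedError

    def apply(self, layer, x, bias=None) -> torch.Tensor:
        raise NotImplementedError


_KERNEL_REGISTRY: list = []  # populated by kernel backends at import time


def select_kernel(desc: CompressedLinearDescriptor):
    """Dispatcher: pick the first registered kernel that can_implement the
    described computation (a real dispatcher might also rank by speed)."""
    for kernel_cls in _KERNEL_REGISTRY:
        if kernel_cls.can_implement(desc):
            return kernel_cls
    raise ValueError(f"No registered kernel can implement: {desc}")
```

With a split along these lines, a new hardware backend would register one kernel class and immediately serve every checkpoint format whose descriptor it can implement.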
More details:
https://docs.google.com/document/d/1AfgGfF73H_hcXfw6ehYO_l1vHEItsopbxFoV1PvnGIQ/edit?usp=sharing
Feedback Period.
Until Oct 7. Preparatory work to help demonstrate the approach will begin before then.
CC List.
@dsikka @mgoin @robertgshaw2-neuralmagic @comaniac @alexm-neuralmagic @HanGuo97 @tlrmchlsmth @bnellnm
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.