[Feature]: Extend QuantFP8 to support per-token-group quantization #24185

### 🚀 The feature, motivation and pitch

Currently, per-token-group FP8 quantization is handled by a custom CUDA kernel, `per_token_group_quant_fp8`, with a Triton kernel as a fallback. We should fold this functionality into `QuantFP8`. That would make it easier to dispatch between the CUDA, Triton, and torch implementations, enable automatic Inductor fusion, and simplify custom op fusion.
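For reference, here is a rough sketch of what per-token-group quantization computes, written in plain Python so it runs without any dependencies. It is not the vLLM implementation: the function name `per_token_group_scale` is hypothetical, the clamp bound assumes the FP8 E4M3 format (largest finite value 448), and the final cast to an FP8 dtype is omitted. A torch path inside `QuantFP8` would presumably do the same arithmetic on tensors.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed target format)


def per_token_group_scale(x, group_size, eps=1e-10):
    """Hypothetical reference: scale each group of `group_size` values per token.

    `x` is a list of rows (one row per token). Returns (scaled, scales), where
    `scaled` has the same shape as `x` and `scales` holds one scale per group.
    The cast of `scaled` to an actual FP8 dtype is intentionally left out.
    """
    scaled, scales = [], []
    for row in x:
        if len(row) % group_size != 0:
            raise ValueError("hidden size must be divisible by group_size")
        row_scaled, row_scales = [], []
        for start in range(0, len(row), group_size):
            group = row[start:start + group_size]
            # One scale per (token, group): map the group's absmax to the FP8 max.
            amax = max(max(abs(v) for v in group), eps)
            scale = amax / FP8_E4M3_MAX
            row_scales.append(scale)
            # Divide by the scale and clamp to the representable range.
            row_scaled.extend(
                max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in group
            )
        scaled.append(row_scaled)
        scales.append(row_scales)
    return scaled, scales


if __name__ == "__main__":
    q, s = per_token_group_scale([[1.0, -2.0, 4.0, 0.5]], group_size=2)
    print(q, s)
```

Dequantization multiplies each scaled value by its group's scale. The scales have shape `[num_tokens, hidden_size // group_size]`, which is the main difference from per-token or per-tensor quantization.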

### Alternatives

_No response_

### Additional context

This is related and complementary to #20711.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
