[Bug]: Corrupted output with GPTQ Marlin kernel

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:27:36) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1020-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               4
On-line CPU(s) list:                  0-3
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7R32
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   2
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             5599.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            64 KiB (2 instances)
L1i cache:                            64 KiB (2 instances)
L2 cache:                             1 MiB (2 instances)
L3 cache:                             8 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-3
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-ml-py==12.560.30
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.47.0
[pip3] triton==3.1.0
[conda] blas                      1.0                         mkl  
[conda] cuda-cudart               12.4.127                      0    nvidia
[conda] cuda-cupti                12.4.127                      0    nvidia
[conda] cuda-libraries            12.4.1                        0    nvidia
[conda] cuda-nvrtc                12.4.127                      0    nvidia
[conda] cuda-nvtx                 12.4.127                      0    nvidia
[conda] cuda-opencl               12.6.77                       0    nvidia
[conda] cuda-runtime              12.4.1                        0    nvidia
[conda] cuda-version              12.6                          3    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libcublas                 12.4.5.8                      0    nvidia
[conda] libcufft                  11.2.1.3                      0    nvidia
[conda] libcufile                 1.11.1.6                      0    nvidia
[conda] libcurand                 10.3.7.77                     0    nvidia
[conda] libcusolver               11.6.1.9                      0    nvidia
[conda] libcusparse               12.3.1.170                    0    nvidia
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] libnpp                    12.2.5.30                     0    nvidia
[conda] libnvfatbin               12.6.77                       0    nvidia
[conda] libnvjitlink              12.4.127                      0    nvidia
[conda] libnvjpeg                 12.3.1.117                    0    nvidia
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-service               2.4.0           py312h5eee18b_1  
[conda] mkl_fft                   1.3.11          py312h5eee18b_0  
[conda] mkl_random                1.2.8           py312h526ad5a_0  
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] pytorch                   2.5.1           py3.12_cuda12.4_cudnn9.1.0_0    pytorch
[conda] pytorch-cuda              12.4                 hc786d27_7    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torchaudio                2.5.1               py312_cu124    pytorch
[conda] torchtriton               3.1.0                     py312    pytorch
[conda] torchvision               0.20.1              py312_cu124    pytorch
[conda] transformers              4.47.0                   pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post2.dev369+g88693683
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-3	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/home/ubuntu/miniconda3/envs/optimum-quanto/lib/python3.12/site-packages/cv2/../../lib64:/usr/local/cuda-12.4/lib64
CUDA_MODULE_LOADING=LAZY
```

</details>


### Model Input Dumps

_No response_

### 🐛 Describe the bug

@dacorvo discovered a potential issue (https://github.com/huggingface/optimum-quanto/issues/332) with the GPTQ Marlin kernel where the outputs become corrupted for certain shape combinations. The corrupted values appear at different indices after each invocation.

I took some time to investigate the issue, and found two suspicious race conditions in the kernel with Compute Sanitizer:

```
========= Error: Race reported between Read access at void marlin::Marlin<__half, (long)1125899907892224, (int)256, (int)4, (int)16, (int)4, (int)4, (bool)0, (bool)0, (int)8, (bool)0>(const int4 *, const int4 *, int4 *, int4 *, const int4 *, const int4 *, const int *, int, int, int, int, int *, bool)::[lambda(int, int) (instance 2)]::operator ()(int, int) const+0x2df0 in [REDACTED]/vllm/csrc/quantization/gptq_marlin/gptq_marlin.cu:1044
=========     and Write access at marlin::cp_async4(void *, const void *)+0x3380 in [REDACTED]/vllm/csrc/quantization/gptq_marlin/marlin.cuh:73 [1830656 hazards]
========= 
========= Error: Race reported between Read access at void marlin::Marlin<__half, (long)1125899907892224, (int)256, (int)4, (int)16, (int)4, (int)4, (bool)0, (bool)0, (int)8, (bool)0>(const int4 *, const int4 *, int4 *, int4 *, const int4 *, const int4 *, const int *, int, int, int, int, int *, bool)::[lambda(int, int) (instance 2)]::operator ()(int, int) const+0x4a10 in [REDACTED]/vllm/csrc/quantization/gptq_marlin/gptq_marlin.cu:1044
=========     and Write access at marlin::cp_async4(void *, const void *)+0x4f30 in [REDACTED]/vllm/csrc/quantization/gptq_marlin/marlin.cuh:73 [1740800 hazards]
========= 
========= Error: Race reported between Read access at void marlin::ldsm4<__half>(marlin::ScalarType<T1>::FragA &, const void *)+0x5930 in [REDACTED]/vllm/csrc/quantization/gptq_marlin/gptq_marlin.cu:131
=========     and Write access at void marlin::Marlin<__half, (long)1125899907892224, (int)256, (int)4, (int)16, (int)4, (int)4, (bool)0, (bool)0, (int)8, (bool)0>(const int4 *, const int4 *, int4 *, int4 *, const int4 *, const int4 *, const int *, int, int, int, int, int *, bool)::[lambda() (instance 5)]::operator ()() const+0x5a80 in [REDACTED]/vllm/csrc/quantization/gptq_marlin/gptq_marlin.cu:1351 [59904 hazards]
========= 
```
I believe these can be related to the issue as the other Marlin kernels work for the same shapes and don't trigger race condition errors.

Here's a small reproducer, I tested it on AWS with A10G GPUs:
https://github.com/ahadnagy/vllm/blob/d2d7def73a8e9c3843b32fdc9adc8e71605b397c/tests/kernels/test_marlin_gemm.py#L619-L686

Also, I did some parameter space exploration to see at which shapes this error starts to manifest itself:
![plot_1](https://github.com/user-attachments/assets/21a29844-e591-4c7d-8ef5-63455dfc97d7)

Any suggestions are welcome as it's a really hard-to-debug issue, and can't really go further without deep-diving into the kernel.

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bug]: Corrupted output with GPTQ Marlin kernel #11205

Your current environment

Model Input Dumps

🐛 Describe the bug

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: Corrupted output with GPTQ Marlin kernel #11205

Description

Your current environment

Model Input Dumps

🐛 Describe the bug

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions