[Bug]: ROCm build fails due to compilation error in moe_wna16.cu #14669

@tjtanaa

Your current environment

The output of `python collect_env.py`
INFO 03-12 09:10:06 [__init__.py:256] Automatically detected platform rocm.
Collecting environment information...
PyTorch version: 2.7.0a0+git6c0e746
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42133-1b9c17779

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.12.9 (main, Feb  5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-116-generic-x86_64-with-glibc2.35
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42133
MIOpen runtime version: 3.3.0
Is XNNPACK available: True

PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/opt/rocm/lib:/usr/local/lib:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Building the latest vLLM commit on ROCm fails with the following compilation error.

[Image: screenshot of the compilation error in moe_wna16.cu]

The error comes from compiling CUDA-only kernels, which were introduced in commit 90e88ab.
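One common way to keep a CUDA-only kernel out of a ROCm build is a preprocessor guard around it. The following is only a minimal sketch of that idea, not the change made in 90e88ab or any actual fix: it assumes the ROCm build defines a macro named `USE_ROCM`, and the constant `kMoeWna16Compiled` and the helper `moe_wna16_status` are hypothetical names used for illustration.

```cpp
#include <string>

// ASSUMPTION: the ROCm build defines a macro named USE_ROCM. The macro name
// and every identifier below are illustrative, not taken from commit 90e88ab.
#ifndef USE_ROCM
// CUDA-only path: the real kernel (for example, the contents of
// moe_wna16.cu) would be compiled and registered inside this branch.
constexpr bool kMoeWna16Compiled = true;
#else
// ROCm path: the CUDA-only code is preprocessed away, so the ROCm
// compiler never has to parse it.
constexpr bool kMoeWna16Compiled = false;
#endif

// Hypothetical helper reporting whether the kernel was built into this binary.
std::string moe_wna16_status() {
  return kMoeWna16Compiled ? "compiled" : "skipped on ROCm";
}
```

Compiled without `-DUSE_ROCM`, the first branch is taken; with `-DUSE_ROCM`, the guarded code is dropped before compilation. Excluding the source file from the ROCm target in the build system would be an alternative to guarding it in the source.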

Labels: bug, stale