[Bug]: Unable to run Qwen3 on Turing GPUs after upgrading to torch 2.7.0

### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```text
python models/collect_env.py
INFO 05-05 01:03:32 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.7.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 14.2.1 20240910
Clang version: Could not collect
CMake version: version 3.31.5
Libc version: glibc-2.40

Python version: 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.12.10-arch1-1-x86_64-with-glibc2.40
Is CUDA available: True
CUDA runtime version: 12.6.85
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 565.77
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               36
On-line CPU(s) list:                  0-35
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU E5-2696 v3 @ 2.30GHz
CPU family:                           6
Model:                                63
Thread(s) per core:                   2
Core(s) per socket:                   18
Socket(s):                            1
Stepping:                             2
CPU(s) scaling MHz:                   38%
CPU max MHz:                          3800.0000
CPU min MHz:                          1200.0000
BogoMIPS:                             4609.12
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
L1d cache:                            576 KiB (18 instances)
L1i cache:                            576 KiB (18 instances)
L2 cache:                             4.5 MiB (18 instances)
L3 cache:                             45 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-35
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.51.3
[pip3] triton==3.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev451+ga92842454 (git sha: a92842454)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV2     PHB     PHB     0-35    0               N/A
GPU1    NV2      X      PHB     PHB     0-35    0               N/A
GPU2    PHB     PHB      X      NV2     0-35    0               N/A
GPU3    PHB     PHB     NV2      X      0-35    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
CUDA_PATH=/opt/cuda
OMP_NUM_THREADS=8
MKL_NUM_THREADS=8
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
```

</details>


### 🐛 Describe the bug

Turing GPUs (2080 Ti) don't support bf16, so I have to use fp16. After upgrading torch to 2.7.0, I can no longer launch vLLM with dense Qwen3 or Qwen3Moe models.
1c2bc7ead019cdf5b04b2f1d07b00982352f85ef is the last working commit; 2c4f59afc3d50fda805c4ad94c9d9be168cded0b breaks it.
Launch command:
```
vllm serve --dtype float16 --enable-chunked-prefill --enable-prefix-caching --gpu-memory-utilization 0.95 -tp 4 Qwen/Qwen3-30B-A3B --max-model-len 32768 --max-seq-len-to-capture 32768 --served-model-name Qwen3-30B-A3B --enable-reasoning --reasoning-parser qwen3
```
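
For context on the dtype choice above, here is a minimal sketch of how the fp16 fallback could be decided from the GPU's compute capability. It assumes bf16 needs compute capability 8.0 or newer (the RTX 2080 Ti is 7.5). `pick_dtype` and `detect_capability` are hypothetical helper names for illustration, not part of vLLM.

```python
from typing import Optional, Tuple


def pick_dtype(major: int, minor: int) -> str:
    """Return a value for --dtype given a CUDA compute capability.

    Assumption: bf16 is only usable on compute capability >= 8.0,
    so older GPUs such as Turing (7.5) fall back to float16.
    """
    return "bfloat16" if (major, minor) >= (8, 0) else "float16"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Best-effort query of GPU 0; returns None if torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

On the machine in this report, `detect_capability()` would be expected to report a 7.x capability, so `pick_dtype` selects `float16`, which matches the `--dtype float16` flag in the launch command.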

The following lines in the log look like the culprit:
```
Unsupported conversion from f16 to f16
LLVM ERROR: Unsupported rounding mode for conversion.
......
/home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:254:0: error: Failures have been detected while processing an MLIR pass pipeline
/home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:254:0: note: Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
```
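
Since the tensor-parallel workers write to stderr at the same time, the diagnostics end up interleaved. The following is a rough sketch for pulling the recognizable pieces back out of such a log. The marker strings are copied from the excerpt above; `summarize` and the regular expression are hypothetical helpers, not anything shipped with vLLM or Triton.

```python
import re

# Marker strings copied verbatim from the log excerpt.
MARKERS = (
    "Unsupported conversion from f16 to f16",
    "LLVM ERROR: Unsupported rounding mode for conversion.",
    "ConvertTritonGPUToLLVM",
)

# Matches "<path>.py:<line>:<col>" source locations as printed in MLIR diagnostics.
LOCATION = re.compile(r"[\w./-]+\.py:\d+:\d+")


def summarize(log_text: str) -> dict:
    """Count known error markers and collect distinct file:line:col locations."""
    counts = {marker: log_text.count(marker) for marker in MARKERS}
    locations = sorted(
        {m.group(0).rsplit("/", 1)[-1] for m in LOCATION.finditer(log_text)}
    )
    return {"markers": counts, "locations": locations}
```

Running `summarize(open("vllm.log").read())` on a saved log (the file name is an assumption) gives a quick view of how many workers hit the conversion error and which kernel source line the diagnostics point at.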

<details>
<summary>Full vllm log</summary>

```
INFO 05-05 00:40:29 [__init__.py:239] Automatically detected platform cuda.
INFO 05-05 00:40:33 [api_server.py:1042] vLLM API server version 0.8.5.dev451+ga92842454
INFO 05-05 00:40:33 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='./models/Qwen3-30B-A3B', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='./models/Qwen3-30B-A3B', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='float16', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=32768, quantization=None, enforce_eager=False, max_seq_len_to_capture=32768, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, served_model_name=['Qwen3-30B-A3B'], disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, pt_load_map_location='cpu', guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=True, reasoning_parser='qwen3', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.95, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=True, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=True, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, kv_events_config=None, additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x748a875d0f40>)
WARNING 05-05 00:40:33 [config.py:3034] Casting torch.bfloat16 to torch.float16.
INFO 05-05 00:40:40 [config.py:748] This model supports multiple tasks: {'classify', 'score', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 05-05 00:40:40 [arg_utils.py:1539] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
INFO 05-05 00:40:40 [config.py:1811] Defaulting to use mp for distributed inference
INFO 05-05 00:40:40 [config.py:2053] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-05 00:40:40 [api_server.py:246] Started engine process with PID 2328
INFO 05-05 00:40:43 [__init__.py:239] Automatically detected platform cuda.
INFO 05-05 00:40:46 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.dev451+ga92842454) with config: model='./models/Qwen3-30B-A3B', speculative_config=None, tokenizer='./models/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='qwen3'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen3-30B-A3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 05-05 00:40:47 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-05 00:40:47 [cuda.py:289] Using XFormers backend.
INFO 05-05 00:40:50 [__init__.py:239] Automatically detected platform cuda.
INFO 05-05 00:40:50 [__init__.py:239] Automatically detected platform cuda.
INFO 05-05 00:40:50 [__init__.py:239] Automatically detected platform cuda.
(VllmWorkerProcess pid=2355) INFO 05-05 00:40:53 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2356) INFO 05-05 00:40:53 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2355) INFO 05-05 00:40:53 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=2355) INFO 05-05 00:40:53 [cuda.py:289] Using XFormers backend.
(VllmWorkerProcess pid=2356) INFO 05-05 00:40:53 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=2356) INFO 05-05 00:40:53 [cuda.py:289] Using XFormers backend.
(VllmWorkerProcess pid=2354) INFO 05-05 00:40:53 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2354) INFO 05-05 00:40:53 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=2354) INFO 05-05 00:40:53 [cuda.py:289] Using XFormers backend.
(VllmWorkerProcess pid=2356) INFO 05-05 00:40:55 [utils.py:1056] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2356) INFO 05-05 00:40:55 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=2355) INFO 05-05 00:40:55 [utils.py:1056] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2355) INFO 05-05 00:40:55 [pynccl.py:69] vLLM is using nccl==2.26.2
INFO 05-05 00:40:55 [utils.py:1056] Found nccl from library libnccl.so.2
INFO 05-05 00:40:55 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=2354) INFO 05-05 00:40:55 [utils.py:1056] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2354) INFO 05-05 00:40:55 [pynccl.py:69] vLLM is using nccl==2.26.2
WARNING 05-05 00:40:55 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2356) WARNING 05-05 00:40:55 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2354) WARNING 05-05 00:40:55 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2355) WARNING 05-05 00:40:55 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 05-05 00:40:55 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_7644c59a'), local_subscribe_addr='ipc:///tmp/6fdbe6e1-a71e-4087-95e0-b7b5c189ff9c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=2355) INFO 05-05 00:40:55 [parallel_state.py:1004] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorkerProcess pid=2354) INFO 05-05 00:40:55 [parallel_state.py:1004] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 05-05 00:40:55 [parallel_state.py:1004] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorkerProcess pid=2356) INFO 05-05 00:40:55 [parallel_state.py:1004] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
INFO 05-05 00:40:55 [model_runner.py:1161] Starting to load model ./models/Qwen3-30B-A3B...
(VllmWorkerProcess pid=2355) INFO 05-05 00:40:55 [model_runner.py:1161] Starting to load model ./models/Qwen3-30B-A3B...
(VllmWorkerProcess pid=2356) INFO 05-05 00:40:55 [model_runner.py:1161] Starting to load model ./models/Qwen3-30B-A3B...
(VllmWorkerProcess pid=2354) INFO 05-05 00:40:55 [model_runner.py:1161] Starting to load model ./models/Qwen3-30B-A3B...
Loading safetensors checkpoint shards:   0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   6% Completed | 1/16 [00:02<00:36,  2.44s/it]
Loading safetensors checkpoint shards:  12% Completed | 2/16 [00:04<00:34,  2.44s/it]
Loading safetensors checkpoint shards:  19% Completed | 3/16 [00:07<00:30,  2.38s/it]
Loading safetensors checkpoint shards:  25% Completed | 4/16 [00:09<00:28,  2.37s/it]
Loading safetensors checkpoint shards:  31% Completed | 5/16 [00:11<00:25,  2.35s/it]
Loading safetensors checkpoint shards:  38% Completed | 6/16 [00:12<00:17,  1.76s/it]
Loading safetensors checkpoint shards:  44% Completed | 7/16 [00:15<00:18,  2.02s/it]
Loading safetensors checkpoint shards:  50% Completed | 8/16 [00:17<00:17,  2.15s/it]
Loading safetensors checkpoint shards:  56% Completed | 9/16 [00:19<00:15,  2.23s/it]
Loading safetensors checkpoint shards:  62% Completed | 10/16 [00:22<00:13,  2.27s/it]
Loading safetensors checkpoint shards:  69% Completed | 11/16 [00:24<00:11,  2.32s/it]
Loading safetensors checkpoint shards:  75% Completed | 12/16 [00:27<00:09,  2.33s/it]
Loading safetensors checkpoint shards:  81% Completed | 13/16 [00:29<00:07,  2.37s/it]
Loading safetensors checkpoint shards:  88% Completed | 14/16 [00:31<00:04,  2.36s/it]
Loading safetensors checkpoint shards:  94% Completed | 15/16 [00:34<00:02,  2.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:36<00:00,  2.42s/it]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:36<00:00,  2.30s/it]

(VllmWorkerProcess pid=2356) INFO 05-05 00:41:32 [loader.py:459] Loading weights took 36.76 seconds
INFO 05-05 00:41:32 [loader.py:459] Loading weights took 36.76 seconds
(VllmWorkerProcess pid=2354) INFO 05-05 00:41:32 [loader.py:459] Loading weights took 36.77 seconds
(VllmWorkerProcess pid=2355) INFO 05-05 00:41:32 [loader.py:459] Loading weights took 36.78 seconds
(VllmWorkerProcess pid=2356) INFO 05-05 00:41:32 [model_runner.py:1193] Model loading took 14.2464 GiB and 36.986922 seconds
INFO 05-05 00:41:32 [model_runner.py:1193] Model loading took 14.2464 GiB and 36.996060 seconds
(VllmWorkerProcess pid=2354) INFO 05-05 00:41:32 [model_runner.py:1193] Model loading took 14.2464 GiB and 37.000365 seconds
(VllmWorkerProcess pid=2355) INFO 05-05 00:41:32 [model_runner.py:1193] Model loading took 14.2464 GiB and 37.008113 seconds
WARNING 05-05 00:41:34 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_GeForce_RTX_2080_Ti.json
(VllmWorkerProcess pid=2355) WARNING 05-05 00:41:34 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_GeForce_RTX_2080_Ti.json
(VllmWorkerProcess pid=2354) WARNING 05-05 00:41:34 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_GeForce_RTX_2080_Ti.json
(VllmWorkerProcess pid=2356) WARNING 05-05 00:41:34 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_GeForce_RTX_2080_Ti.json
Unsupported conversion from f16 to f16
LLVM ERROR: Unsupported rounding mode for conversion.
Unsupported conversion from f16 to f16
LLVM ERROR: Unsupported rounding mode for conversion.
Unsupported conversion from f16 to f16
LLVM ERROR: Unsupported rounding mode for conversion.
#blocked = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [2, 16], warpsPerCTA = [4, 1], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [4, 8], warpsPerCTA = [1, 4], order = [0, 1]}>
#blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [8, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
#blocked3 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 8], warpsPerCTA = [4, 1], order = [1, 0]}>
#shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
#shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
#smem = ##blockedttg = .shared_memory
#modulettg attributes. {blocked<{sizePerThread = [4, 4], threadsPerWarp = [2, 16], warpsPerCTA = [4, 1], order = [1, 0]}>"
t#tblockedg1. = nu#mttg-.cblocked<{sizePerThread = [8, 1], threadsPerWarp = [4, 8], warpsPerCTA = [1, 4], order = [0, 1]}>t
a#sblocked"2 =  = 1# : ttgi.32blocked<{sizePerThread = [1, 8], threadsPerWarp = [8, 4], warpsPerCTA = [4, 1], order = [1, 0]}>,
"#tblockedt3g = .n#uttgm.-blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 8], warpsPerCTA = [4, 1], order = [1, 0]}>w
a#rsharedp = s"# = ttg4. : swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>i#
32blocked#,  = sharedttg.target1 =  = "#cttgu.dswizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>a
:##7ttgsmem5. = "blocked<{sizePerThread = [4, 4], threadsPerWarp = [2, 16], warpsPerCTA = [4, 1], order = [1, 0]}>#
, ttg#".blockedtshared_memory1t
 = gmodule.# attributestttg {h."rblocked<{sizePerThread = [8, 1], threadsPerWarp = [4, 8], warpsPerCTA = [1, 4], order = [0, 1]}>te
ta#gdblocked.s2n- = upm#e-ttgrc.-tablocked<{sizePerThread = [1, 8], threadsPerWarp = [8, 4], warpsPerCTA = [4, 1], order = [1, 0]}>ws
a"#r = blockedp13" :  =  = i#3232ttg : , .i"blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 8], warpsPerCTA = [4, 1], order = [1, 0]}>32ttg.n
}u# mshared{- =
w#  attgtt.funcr. pswizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>publics
 "#@ = fused_moe_kernelshared4(%arg01 : :  = i!#32ttttg, ..ptr<f16>ttg.targetswizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}> { =
tt.divisibility"# = csmem16u =  : d#iattg32:.7}shared_memory5,
"%module, arg1" attributes: t {!t"ttg.threat.dtptr<f16>sg {-.tt.divisibilitypn = eu16rm : --iwc32at}ra, ps%""arg2 =  = : 321! :  : ttii.3232ptr<f16>},  { "tt.divisibility{t
=
t16g  . : tt.funcn ipublic32u }m@, -fused_moe_kernel%(warg3%a: arg0r!: ptt!s.tt".ptr<f32> = ptr<f16> { {tt.divisibility = 16 : i32}, %arg44tt.divisibility:  :  = !i16tt32 :
., ptr<i32>ittg.target {32 = tt.divisibility}" = , c16% : uarg1id: 32a!}:tt, 7.%5ptr<f16>arg5" {: , tt.divisibility!" = ttt16.ptr<i32>t :  {gitt.divisibility.32 = t}16h,  :
r%iearg2a32: d}!stt, -.%pptr<f16>arg6 {e: tt.divisibilityr = !-16ttw : .aiptr<i32>r32 {ptt.divisibility}" = ,  = 16%32 : arg3i : : 32i!}32tt, }.% ptr<f32>arg7{:  {
itt.divisibility  32 = tt.func {16 tt.divisibility : public = i 1632@ : }fused_moe_kerneli, (32%%}arg4arg0, : : %!arg8!tt: tt.i.ptr<i32>32ptr<f16> { {tt.divisibilitytt.divis
ibility =  { = 16tt.divisibility16 :  =  : i16i32 : 32}i}, 32, %}%arg5, arg9: !%: ttarg1i.: 32ptr<i32>! { {tttt.divisibilitytt.divisibility. =  = ptr<f16>1616 { :  : tt.divi
sibilityii = 323216}} : , , i%%32arg10arg6}: : , i!%32ttarg2 {.: tt.divisibilityptr<i32>! =  {tt16tt.divisibility. :  = ptr<f16>i16 {32 : tt.divisibility}i = , 3216%}arg11 :
 , : i%i32arg732}:  {, itt.divisibility = 16 : %32iarg3 {32: tt.divisibility}! = , tt16%. : arg12ptr<f32>i:  {32itt.divisibility}32 = ,  {16%tt.divisibility : arg8 = i: 1632
i : }32i,  {32%tt.divisibility}arg4 = , : 16%! : arg13tti: .32iptr<i32>}32 {,  {tt.divisibility%tt.divisibility = arg9 = 16: 16 : i : i32i32 {32}tt.divisibility} = , , 16%%
: arg5arg14i: : 32!i}tt32, . {%ptr<i32>tt.divisibilityarg10 { = : tt.divisibility16i =  : 3216i { : 32tt.divisibilityi} = 32, 16}%,  : arg15%i: arg632: i}!32, tt {%tt.divisi
bilityarg11. = : ptr<i32>16i :  {itt.divisibility3232 { = }tt.divisibility16,  =  : %16i : 32arg16i}: 32, }%i, arg732%:  {arg12itt.divisibility:  = i1632 : 32 {i {tt.divisib
ility32tt.divisibility = } = 16, 16% :  : arg17i: i32i32}32,  {%}tt.divisibilityarg13,  = : %16iarg8 : : 32ii {3232tt.divisibility} { = , 16tt.divisibility% :  = arg18i16: 3
2i : }32, i {%32tt.divisibilityarg14} = : , 16i% : 32arg9i {tt.divisibility = 16:  : 32ii}3232,  {}%tt.divisibilityarg19,  = : %16iarg15 : 32: i {i32tt.divisibility32} =  {1
6 : , tt.divisibilityi% = 32arg1016} : : )ii3232 attributes} { {, tt.divisibilitynoinline% =  = arg1616false:  : }ii32 32}{ {,
tt.divisibility% =     arg1116%:  : icsti32 = 32 {arith.constant}tt.divisibility ,  = dense<%16arg17 : : 0.000000e+00ii>3232 : } {tensor<, tt.divisibility64% = xarg121664: x
 : f32ii, 3232# {}blockedtt.divisibility, > = %
16arg18     : : %iic-1_i643232 = } {arith.constant, tt.divisibility -1% =  : arg1316i:  : 64ii
3232     {}%tt.divisibility, c0_i32 = % = 16arg19arith.constant : :  ii03232 : } {i, tt.divisibility32% =
arg1416    :  : %iic1_i323232 = } {arith.constant)tt.divisibility  = 116 attributes :  :  {inoinline32i =
32false    }}%,  c63_i32%{ = arg15
arith.constant    :  %i63cst32 :  =  {iarith.constanttt.divisibility32 =
16dense<     : %i0.000000e+00c31_i3232> = } : arith.constant, tensor< %6431arg16x : : 64iix32
    32f32% {, c32_i32tt.divisibility# =  = blockedarith.constant16> :
i32    32 : %}ic-1_i64, 32 = %
arg17arith.constant     : %-1icst_0 : 32 = i {arith.constant64tt.divisibility
 = dense<16    0.000000e+00 : %i>c0_i3232 :  = }tensor<arith.constant, 32 %x0arg1864 : : xiif163232,  {
#tt.divisibility    blocked = %116c1_i32> =
arith.constant      : %1icst_1 : 32 = i}arith.constant32,
 %    dense<arg19%0.000000e+00c63_i32: >i =  : 32arith.constanttensor< { 64tt.divisibility63x =  : 3216ix : 32f16i
, 32    #}%blocked)c31_i322 = >arith.constant
 attributes      {31% : cst_2i = noinline32arith.constant =
 false    dense<}%0.000000e+00 c32_i32>{ =  :
arith.constanttensor<     64%32xcst : 64 = ixarith.constant32f16
,     dense<#%blocked0.000000e+00cst_03> = > : arith.constant
tensor<     64dense<%x0.000000e+00c8_i3264> = x : arith.constanttensor<32x64f32 x, 8f16# : , blockedi#>32blocked

1        >%%
c-1_i64c64_i32     =  = %arith.constantarith.constantcst_1   = -164arith.constant :  :  iidense<64320.000000e+00

>         : %tensor<%64cst_3c0_i32x = 32 = arith.constantxarith.constant f16 dense<, 08# : >blockedi : 232tensor<>
64
    x    %1%c1_i32xcst_2 = i = arith.constant32arith.constant,   #1dense<blocked : 0.000000e+002i>>32 :
tensor<
    64    %x%cst_464c63_i32 = x = arith.constantf16arith.constant ,  dense<#6332blocked : >3i : >32tensor<

32        x%%64c8_i32c31_i32x =  = iarith.constantarith.constant32  , 831 : # : iblockedi32132
>

    %    %c64_i32%c32_i32 = cst_5 = arith.constant = arith.constantarith.constant   64dense<32 : 32 : i>i32 : 32
tensor<
    64    %x%cst_332cst_0 = x = arith.constantiarith.constant 32 dense<, dense<8#0.000000e+00>blocked> : 2 : tensor<>tensor<64
32x    x1%0 = 64xtt.get_program_idxi f16x32,  , #:#blocked blockedi1232>>


            %%%1cst_1cst_4 =  =  = arith.addiarith.constantarith.constant   %dense<dense<arg90.000000e+0032,>>  :  : %tensor<tensor<c63_i3232x64 64x:32x xiif163232, ,
##    blockedblocked%21>2>
 =
    arith.divsi    % %cst_2%cst_51 =  = ,arith.constantarith.constant  % dense<c64_i32dense< 0.000000e+0032:>> :  tensor<i : 6432tensor<x
6464    x32x%xf163i, 32 = #arith.addi, blocked #3%blocked>arg7
2    ,> %
%c8_i32    c63_i32 = % arith.constant0: =   tt.get_program_id8i  : 32xi
 32    :
 %    i4%32 = c64_i32
arith.divsi  =     %arith.constant%3 1,64 =   : arith.addi%ic64_i32 32 %
:arg9     ,%i cst_332%
c63_i32     % = :5arith.constant =   arith.muliidense< 328%
>4     : ,%tensor< 264% = xc8_i32arith.divsi  1:%x i1i32,32,
#%    blockedc64_i32%2 6:> =
arith.divsii 32    %
%0    cst_4,% =  3arith.constant% =  5arith.addidense<  32:%>  : arg7i32tensor<,
32     x%%647c63_i32x =  iarith.muli:32  , i%#326blocked
,    1 %>%4
c8_i32 =      arith.divsi%: cst_5 % = i3arith.constant,32
dense<%    32c64_i32%> 8 : : = tensor< arith.subi64i x32%32
2x    ,i% 325%,  = 7#arith.muli blocked :2% >4i
,32
%%    0c8_i32% =  9tt.get_program_id: =   arith.minsixi  32%:
8     ,i% 326%
 = c8_i32    arith.divsi % :1%  = 0iarith.addi,32
%%    arg95%, 10 : = % arith.remsic63_i32i  32%:
0     ,i% 327%
 = 5    arith.muli % :%2 6 = i,arith.divsi32
%%    c8_i321% ,11:   = i%32arith.remsic64_i32
      %:%10 8,i =  32arith.subi%
 9    % %2:3,   = i%arith.addi327
 %:    arg7 %,i12 32 = %
arith.addic63_i32     % %9:7  = ,iarith.minsi 32 %
%118     ,%: 4 % = ic8_i32arith.divsi32
:%     %3i13,32 =
arith.divsi%     c64_i32%% 1010: = , arith.remsi i %32%90
 ,:      %%i5532 =
arith.muli:      %i%14432 = ,
tt.load      %%%c8_i3211arg6 =  arith.remsi:   :%i 1032!,
tt     .%%ptr<i32>96
  = :    arith.divsi % i15%32 = 0
,arith.muli      %%%12125 = ,  arith.addi:%  c64_i32%i 732:,
      i%%32117
 =      arith.muli:%  16i% = 326
arith.cmpi,      %%sge13c8_i32 = , arith.divsi : % %15i,1032 ,
%     14%% 98:  =  :arith.subii  32i%
322
,cf.cond_br      %%%14716 =  ,tt.load:   ^bb1%i,arg632
^bb2:
 %  !9^bb1tt = :.arith.minsi  // pred:  ptr<i32>^bb0%

8        ,tt.return%
15%   = c8_i32^bb2arith.muli : :  // pred: % ^bb012i
32,
 %    %17%c64_i32 = 10 tt.make_range = : {arith.remsiend   = i%64320 :
,i     32%%, 165start =   = arith.cmpi:0   : sgeii,3232
}    %% 1511: = , arith.remsi  tensor<%%641410x ,i: 32 %, i9#32 ttg
:.    slice<{dim = 1, parent = #blocked3}> cf.cond_br>i
32%    16
%,    18 % = ^bb112tt.make_range, =  {arith.addi end ^bb2 = %64
7 :   ,i^bb1 32:%,   // pred: 11start^bb0  =
:0      : itt.returni32
32
  }    ^bb2 %::13  // pred:   = ^bb0tensor<arith.divsi
64     x%%10i17,32 =  , tt.make_range%#9 {ttg end.: = slice<{dim = 1, parent = #blocked2}> 64i> : 32
i
32%    , 19%start = 14 = tt.make_range = 0 {tt.load : end i = %3264arg6} :  i: 32 :, tensor<start 64 = !x0tti : .32iptr<i32>, 32
}#     ttg%:.15 slice<{dim = 1, parent = #blocked3}> = tensor<>arith.muli64
 x    %i%123218,,  =  #tt.make_range%ttg {c64_i32.end slice<{dim = 0, parent = #blocked3}> = >:64
  :     ii%323220
,  =     starttt.make_range% =  {160end =  :  = arith.cmpii64 32 : sge}i, 32 :, % start15tensor< = ,640 x : %ii143232 , }:#  ttg:i. 32slice<{dim = 1, parent = #blocked2}>ten
sor<
>64
xcf.cond_br    i %32%19, 16 = #,tt.make_rangettg  {.^bb1endslice<{dim = 0, parent = #blocked1}>, = > 64
^bb2 :
i%  3221^bb1,  = :startarith.extsi  // pred:  =  ^bb00%
 : 17    i tt.return32:
 }tensor<   64^bb2:x: itensor<  // pred: 3264^bb0x,
i#    32ttg%, 17.# = slice<{dim = 1, parent = #blocked3}>ttgtt.make_range>.  {slice<{dim = 0, parent = #blocked3}>toend>  =
tensor<64    64 : %xi20i32 = 64, tt.make_range, start { = #end0ttg =  : .64islice<{dim = 1, parent = #blocked3}> : 32>i}
32     , :%start 22 = tensor< = 064arith.extsi : x ii32%3218},   #::ttg  .tensor<tensor<slice<{dim = 1, parent = #blocked3}>6464>xx
ii    3232%, , 18## = ttgttgtt.make_range.. {slice<{dim = 1, parent = #blocked2}>slice<{dim = 0, parent = #blocked1}>>end>  =
to64      : %tensor<i216432 = x, arith.extsiistart 64 = %, 017# :  ttgi:.32 slice<{dim = 1, parent = #blocked2}>}tensor<> 64
:x     i%tensor<322364,  = x#arith.extsiittg 32.%, slice<{dim = 1, parent = #blocked3}>20#> ttg :.to slice<{dim = 1, parent = #blocked2}> tensor<>tensor<64
64x    ix%32i19, 64 = #, tt.make_rangettg#.ttg {slice<{dim = 0, parent = #blocked1}>.end> = slice<{dim = 1, parent = #blocked3}> 64>to :
 i    tensor<32%64, 22xstart = i = arith.extsi640 ,  : %#i18ttg32 .}:slice<{dim = 0, parent = #blocked1}>  >:tensor<
64tensor<    x64%ix2432i = , 32arith.extsi#,  ttg#%.ttg15slice<{dim = 1, parent = #blocked2}> .>:slice<{dim = 0, parent = #blocked3}>  >ito
32      tensor<%to6420 x = iitt.make_range6464 {
,     end#% = ttg2564. =  : slice<{dim = 1, parent = #blocked2}>tt.splati> 32
%,     24start%  = 230: =  :  arith.extsiii 3264%}20   ->::   tensor<tensor<tensor<646464xxxiii643232, , , ###ttgttgttg...slice<{dim = 0, parent = #blocked1}>slice<{dim = 1,
 parent = #blocked3}>slice<{dim = 0, parent = #blocked1}>>>>

         %to%21 26 = tensor< = arith.extsi64tt.splat x %i%176424 ,  :#: ttg tensor<.i64slice<{dim = 0, parent = #blocked1}>64x> i
->32     , %tensor<#2464ttg = x.arith.extsiislice<{dim = 1, parent = #blocked3}> 64%>, 15 # tottg: . tensor<slice<{dim = 1, parent = #blocked2}>i64>32
x     ito%6427 ,  = i#arith.addi64ttg
.%    slice<{dim = 1, parent = #blocked3}>25%>,25
  =     %tt.splat%21 22% =  24arith.extsi : : % tensor<18i64 64x: i ->64tensor< , 64tensor<#x64ttgix.32islice<{dim = 1, parent = #blocked3}>, 64>#,
ttg#    .ttg%slice<{dim = 1, parent = #blocked2}>.28>slice<{dim = 1, parent = #blocked3}> = > arith.addi
to      %%tensor<262664, = x tt.splati% 64%22, 24 # :ttg: . tensor<slice<{dim = 1, parent = #blocked2}>i>64
x64    i %6423, -> = # arith.extsitensor<ttg 64.%xslice<{dim = 1, parent = #blocked2}>20i> 64
:,      #%tensor<ttg2964. = xslice<{dim = 1, parent = #blocked2}>tt.splati> 32
%,     arg4#% ttg27:. =  slice<{dim = 0, parent = #blocked1}>arith.addi!> tt %.to25ptr<i32> ,tensor<  64->%x 21itensor< 64:64,  x#tensor<ttg!64.ttslice<{dim = 0, parent = #b
locked1}>x.>iptr<i32>
64,     , #%#ttg24ttg. = .slice<{dim = 1, parent = #blocked3}>arith.extsislice<{dim = 1, parent = #blocked3}>> >
%
    15    % %30: = 28 tt.splat = i 32arith.addi%  arg4to%  26:i ,64!
tt%    .22%ptr<i32> 25 : = ->  tt.splattensor< tensor<64%64x24x i!:64tt , .i#ptr<i32>64ttg,  .#->slice<{dim = 1, parent = #blocked2}>ttg >.tensor<
slice<{dim = 1, parent = #blocked2}>64    >x%
i29    64 = %, tt.splat31#  = ttg%tt.addptr.arg4  slice<{dim = 1, parent = #blocked3}>%:29> ,
! tt    %.%27ptr<i32>26   = :->tt.splat   tensor<%tensor<642464x x!:!tt tt.i.ptr<i32>64ptr<i32>,  , #->#ttg ttg.tensor<.slice<{dim = 1, parent = #blocked3}>64slice<{dim = 1,
 parent = #blocked3}>>x
>i    ,64% , 30tensor<# = 64ttgtt.splatx. islice<{dim = 1, parent = #blocked2}>%64>arg4,
 #    :ttg%. 27slice<{dim = 1, parent = #blocked3}>! = >ttarith.addi
.     ptr<i32>%% 2532->, =   tt.addptrtensor<% 6421%x30 !,:tt  .%tensor<ptr<i32>2864,  x#:ittg 64.tensor<, slice<{dim = 1, parent = #blocked2}>64#>xttg
.!    slice<{dim = 1, parent = #blocked3}>tt%>.31
ptr<i32> = ,     tt.addptr%# 28ttg% = .arith.addi29slice<{dim = 1, parent = #blocked2}> ,>% ,26% ,27tensor<  64%:x22 i tensor<64:64,  x#tensor<!ttg64tt.x.slice<{dim = 1, par
ent = #blocked2}>iptr<i32>>64,
, #    #ttg%ttg.33.slice<{dim = 1, parent = #blocked3}> = slice<{dim = 1, parent = #blocked2}>>tt.load>,
 %    tensor<3164%x29i =  64tt.splat:,   #%tensor<ttgarg464. xslice<{dim = 1, parent = #blocked3}>:!> tt
!.    ttptr<i32>%., 32ptr<i32># =  ttgtt.addptr->.  slice<{dim = 1, parent = #blocked3}>%tensor<>6430
x,    ! %tt%34.28 = ptr<i32> tt.load, :  #%tensor<ttg3264.xslice<{dim = 1, parent = #blocked3}> !>:tt
 .    tensor<ptr<i32>%64, 30x = #!tt.splatttgtt ..%slice<{dim = 1, parent = #blocked2}>ptr<i32>arg4>,  ,#: ttg tensor<.!64slice<{dim = 1, parent = #blocked2}>ttx>.i
ptr<i32>64     , %->#35 ttg = tensor<.tt.splatslice<{dim = 1, parent = #blocked2}>64 >x%
arg10!     tt:%. 33ptr<i32>i = , 32tt.load#  ttg->%. 31slice<{dim = 1, parent = #blocked2}>tensor<>64
:x     itensor<%326431, x = #!tt.addptrttttg ..%ptr<i32>slice<{dim = 1, parent = #blocked3}>29, >,#
 ttg    %.%27slice<{dim = 1, parent = #blocked3}>36 > = :
tt.splat     tensor< %64%34xarg10 = ! tttt.load:.  ptr<i32>%i, 3232# ttg ->.: slice<{dim = 1, parent = #blocked3}>tensor< >64tensor<,x64 ix32tensor<!, 64tt#x.ttgiptr<i32>.64
, slice<{dim = 1, parent = #blocked2}>, #>#ttg
ttg..    slice<{dim = 1, parent = #blocked2}>slice<{dim = 1, parent = #blocked3}>%>>37

 =         arith.cmpi%% 35slt32 = , = tt.splat tt.addptr % %33%arg10,30  ,:%  35%i 2832:   :->tensor<  64tensor<tensor<x6464ixx32!i, tt32#., ttgptr<i32>#., ttgslice<{dim = 1
, parent = #blocked3}>#.>ttgslice<{dim = 1, parent = #blocked3}>
.>    slice<{dim = 1, parent = #blocked2}>
%>    38,% =  36arith.cmpitensor< =  64tt.splatsltx ,i% arg1064% , 34:#, ttgi .32%slice<{dim = 1, parent = #blocked2}> 36>->
 :    tensor< %64tensor<33x64 = ixtt.load32i , 32%#, 31ttg#.ttgslice<{dim = 1, parent = #blocked2}>. >slice<{dim = 1, parent = #blocked2}>:
>
tensor<%    6437%x = 39!arith.cmpi = tt tt.addptrslt. ,ptr<i32>% , arg5%#,33ttg ,.% slice<{dim = 1, parent = #blocked3}>12% >35:
      :!% tt34tensor<. = 64ptr<i32>tt.loadx, i 32%i, 3232#
ttg     .:%slice<{dim = 1, parent = #blocked3}> 40>tensor< =
64tt.loadx     !%%tt3839. = ptr<i32>arith.cmpi , : # sltttg!,.tt slice<{dim = 1, parent = #blocked2}>.%>ptr<i32>34

,         %%%3541 = 36 = tt.splat arith.extsi : % %arg10tensor<40 64 :x: i i32i32, 32 # ->ttgto . tensor<slice<{dim = 1, parent = #blocked2}>i64>64x

i        32%%39, 42 = # = tt.addptrttgarith.cmpi . %slice<{dim = 1, parent = #blocked3}>eqarg5>,,
 %    %41%12,36   = %:tt.splatc-1_i64   !%:ttarg10 . iptr<i32>:64,
     iicf.cond_br3232
%->    42 %,tensor<40 64^bb3x = ,itt.load 32 ^bb4, %
#39  ttg ^bb3.::slice<{dim = 1, parent = #blocked2}>   // pred: >!^bb2
tt
    .    %ptr<i32>%37
43 =      = arith.cmpi%arith.muli 41 slt = %,arith.extsi13  ,%% 3340%,  c64_i32:%  35i: 32 : i to32tensor<
64i    x64%i44
32 =     , tt.splat%# 42ttg% = .43arith.cmpislice<{dim = 1, parent = #blocked3}>  >:eq
 ,     i%%323841  = ,->arith.cmpi   %slttensor<c-1_i6464, x :i% 3234i,, 64#
ttg%    .36cf.cond_brslice<{dim = 0, parent = #blocked3}>  >:%
 42    tensor<,%64 45x^bb3i = ,32arith.addi ,  ^bb4#%
ttg44  .,slice<{dim = 1, parent = #blocked2}>^bb3 >:%
  // pred: 19    ^bb2 %
:39     % = tensor<43tt.addptr64 =  xarith.muli%i arg532%,, 13 #,%ttg 12.% slice<{dim = 0, parent = #blocked3}>c64_i32:>
:!     tt%i.4632ptr<i32> =
,tt.expand_dims      %i%443233 =
 {tt.splat    axis % = %40143 =  :  itt.load:32  }%i 3932:   :tensor<-> 64 x!tensor<i64tt32x., iptr<i32>#32
ttg,     .#%slice<{dim = 1, parent = #blocked3}>ttg41>. =  slice<{dim = 0, parent = #blocked3}>arith.extsi->>
%tensor<    4064% x45:1 =  arith.addixi i32%32 44, to,#  blockedi%36419>

    :    % %42tensor<47 = 64 = arith.cmpixtt.splat i eq32%,, arg14 # %ttg:41. ,slice<{dim = 0, parent = #blocked3}>i >%32
c-1_i64      ->%: 46 tensor< = i64tt.expand_dims64x
1%    x33cf.cond_bri { 32axis%,  = #421blocked,3 :  >i^bb3
32,    } % ^bb448:
 =    arith.mulitensor<^bb3 64:%x  // pred: 47i^bb2,32
 ,     %#%46ttg43 . = :slice<{dim = 1, parent = #blocked3}>arith.muli > tensor< %64->13x ,tensor<1 64x%xic64_i32132 x, :i# 32blockedi, 332#>
blocked
    3    %>%
4449     =  = %tt.splattt.splat47   = %%tt.splat43arg2  % :arg14:   i:!32 tt i.->32ptr<f16>   tensor<->->64  xtensor<tensor<i646432xx, 11#xxttgi!32.tt, slice<{dim = 0, paren
t = #blocked3}>.#>ptr<f16>blocked
, 3    #>%blocked
345    > = %
arith.addi48      = %%arith.muli5044  = ,%tt.addptr 47 %,19%  49%,:46   tensor<:%64 48xtensor< i64:32 x, tensor<1#64ttgxx.1islice<{dim = 0, parent = #blocked3}>x32>!,
tt#    .blocked%ptr<f16>346, > = #
tt.expand_dimsblocked     3%%>4933, =  { tt.splataxistensor<  = 64%1xarg2 : 1 ix:32i }32!,  tt:#.blocked ptr<f16>3tensor< 64>->x
 i    tensor<%326451, x = #1tt.expand_dimsttgx .!%slice<{dim = 1, parent = #blocked3}>tt45>. { ptr<f16>axis->,  =  #0tensor<blocked : 643ix>321
}x     i%:3250 ,  = tensor<#tt.addptr64blocked x3%i>49
32,    ,  %%#4748ttg =  .tt.splat:slice<{dim = 0, parent = #blocked3}>  >tensor<% 64arg14->x  1:tensor<x 1!ixtt3264. xptr<f16>->i,  32#tensor<, blocked#643blockedx>31,>x
itensor<    3264%, x52#1 = blockedxtt.broadcast3i >32%
50,      #%:blocked48 3 = tensor<>arith.muli64
x     1%%x5147! = ,tttt.expand_dims . %%ptr<f16>4645,   {#:axisblocked  = 3tensor<0>64 :  xi->132 x}tensor< i:6432 x, tensor<64#64xblockedx!3itt>32.
, ptr<f16>    #, %ttg#49.blockedslice<{dim = 0, parent = #blocked3}> = 3>tt.splat>
->%     arg2%tensor< 531: = x tt.broadcast64! xtt%i.5132ptr<f16> ,  :# ->blockedtensor< 31tensor<x>6464
xx    1i%x3252!,  = tt#tt.broadcastblocked. 3ptr<f16>%>, 50 # ->blocked: 3 tensor<>tensor<64
64x    x64%1x50xi = !32tt.addptrtt,  .#%ptr<f16>blocked49, 3,#> blocked
%3    48>%  54:-> =   tensor<tt.addptrtensor<6464 xx1%64x52x!,!tt tt.%.ptr<f16>53ptr<f16>,  , :## blockedblockedtensor<3364>>x,
64     xtensor<%!6453ttx = .tt.broadcast1ptr<f16> x, %i#5132blocked , 3:># blocked,tensor<3 1>tensor<x
6464    xx%64i51x32 = i, tt.expand_dims32# , blocked%#345blocked> {3 axis>-> =
 0    tensor< : %64i55x32 = }64tt.expand_dims x :i% 3237tensor<, 64 {#xaxisblockedi = 3321>,  :
#i    ttg32%.}54slice<{dim = 0, parent = #blocked3}>  = >:tt.addptr   ->tensor<% 6452tensor<x,1i x1%64, 53x# i:ttg32 ., tensor<slice<{dim = 1, parent = #blocked3}>#64>blocke
dx 364->>x
!tensor<tt    .64%ptr<f16>x52, 1 = #xtt.broadcastblockedi 31%>, 50,#  blocked:tensor<3 64>tensor<x
6464    xx%1i56x32 = !, tt.splattt# .blocked%ptr<f16>3arg7, > #
:blocked     3%i>5532  =  ->tt.expand_dims->   tensor<%tensor<64371x {x64axis64x = x!1itt : .32iptr<f16>, 32, #}#blocked blocked3:3> >
tensor<
    64    %x%57i53 = 1 = arith.cmpi, tt.broadcast # sltttg%,.51 slice<{dim = 1, parent = #blocked3}> %>:51  ,->tensor<  1%tensor<x566464 xx:1i x32itensor<, 11#, blockedx#364
blocked>x3 i>->32
 ,     tensor<#%64blocked56x3 = 64>tt.splatx
i%    32arg7%,  58#: = blocked tt.broadcast3 i>%32
55      ->%: 54 tensor< = tensor<1tt.addptr64x x64%1x52xi,i32 1, %, #53# blockedblocked:33 >>tensor<
 64    x->%64 57xtensor< = !64arith.cmpittx .64sltptr<f16>x,, i #1%blocked, 513#,>blocked ,3% >56tensor<
 64    :x% 6459tensor<x = 1itt.broadcastx32 64, %x57#i blocked32:3,  >#tensor<
blocked1    3x%>6455
x =     itt.expand_dims%1 58, % = #37tt.broadcastblocked { 3axis%> = 55 1 -> : : i tensor<32tensor<64}64x x64:1x xitensor<i1641, x, #i#blocked1blocked3, 3>#>
ttg     .->%slice<{dim = 1, parent = #blocked3}> 60>tensor< =  64arith.andi->x  64%tensor<x6458ix,11 , x%#i59blocked1 3, :>#
blockedtensor<    364%>x59
64 =     xtt.broadcast%i 561% = , 57#tt.splat blocked :3% >arg7tensor<
 1    :xtt.store 64 ix%32i54 1,->,   #%tensor<blockedcst_213,x> 64 %x->60i 32 tensor<, :64# xblockedtensor<64364x>xi
641    x, %!#57ttblocked = .3arith.cmpiptr<f16>> ,
slt#    ,blocked% 360%> = 51
arith.andi,      tt.return%%
5856  , ^bb4 ::%   // pred: tensor<59^bb21
x:    64 %xtensor<61i64 = 32xarith.muli, 64 #x%blockedi1331,>,
#%    blockedc64_i32%3 58>: =
 tt.broadcast    i tt.store32 %
%55    54 %,:62   = %tensor<arith.extsicst_264 ,%x 611% x60:i  1i:, 32 # tensor<blockedto643 x>i64 64x->
!     tttensor<%.6463ptr<f16>x = , 64tt.splat#x blockedi%3162>,
#:    blocked tt.return3i
>64
 ^bb4    ->: %  // pred: tensor<59^bb264 =
xtt.broadcast    i %64%61, 57 = # arith.mulittg: . %tensor<slice<{dim = 0, parent = #blocked1}>131>,x
     64%%xc64_i3264i  = 1:arith.addi,   #i%blocked32633,
>      %%->2362   = tensor<:arith.extsi64  x%tensor<646164x xi:i1 64, i, #32#blocked ttg3to.> slice<{dim = 0, parent = #blocked1}>
i>    64
%
    60    % = %6563arith.andi =  =  arith.extsitt.splat%  58%%,arg762   %::59   ii:3264   tensor<to64-> x i64tensor<64x64
xi    i1%64, 66, # = #blockedtt.splatttg3 >.%
slice<{dim = 0, parent = #blocked1}>65    > tt.store
:      %i%646454 =  ,arith.addi->   %%tensor<cst_26364,,x  i%%646023,  #:  ttgtensor<:.64 slice<{dim = 0, parent = #blocked1}>xtensor<>i64
64x    , %64#67xttg = !.arith.remsittslice<{dim = 0, parent = #blocked1}> .>%ptr<f16>
64,     ,# blocked%%66 :365 > = tensor<
arith.extsi64     xtt.return%i
arg764   , ^bb4:#: ttg  // pred: i.^bb232slice<{dim = 0, parent = #blocked1}>
 >    to
%     61i% = 6468arith.muli
 =      tt.expand_dims%% 1366% = 34tt.splat { ,axis%  = 65%1 c64_i32 : : i :32i }64i  32:->
      tensor<tensor<%646462xxi = i32arith.extsi64,  , #%#ttg61.ttgslice<{dim = 1, parent = #blocked2}> .>:slice<{dim = 0, parent = #blocked1}>  >->i
 32    tensor< %64to67x  = 1iarith.remsix64 i
%32    64, %,#63 blocked = %2tt.splat66>
%:    62 % tensor<69:64 =  xtt.expand_dimsi i64%6433 ,  {->#axis ttg = tensor<.164slice<{dim = 0, parent = #blocked1}> : x>ii
3264    }, % #68:ttg =  .tt.expand_dimstensor<slice<{dim = 0, parent = #blocked1}> 64>%x
34i     {32%axis, 64 = # = 1ttgarith.addi : . islice<{dim = 1, parent = #blocked3}>%32>63} , -> : % tensor<23tensor<64 64x:x1 ixtensor<32i64, 32x#, ittg#64.blocked, slice<{d
im = 1, parent = #blocked2}>3#>>ttg
 .    ->slice<{dim = 0, parent = #blocked1}>% >tensor<7064
 = x    arith.divsi1% x65%i = 6832arith.extsi,,   #%%blockedcst_3arg72  >::
      tensor<i%643269x  = 1tt.expand_dimstox  i%i3233, 64 {#
axisblocked     = 2%1>66 :
 =     itt.splat%32 71}% =  65tt.splat:   :tensor<%64 arg11xi i64:32  , i->#32 ttg tensor<->.slice<{dim = 1, parent = #blocked3}>64 >xtensor< i64->64x , 1tensor<#x64ttgix.32
1slice<{dim = 0, parent = #blocked1}>, x>#i
blocked32    2, %>#67
blocked =     3arith.remsi%> 72
% =     64arith.muli%, 70 % = %70arith.divsi,66   %%:6871 , tensor< :64% xcst_3tensor<i 6464:x,  1#tensor<xttg64i.x32slice<{dim = 0, parent = #blocked1}>1, >x#
iblocked    322%, >68#
 = blocked    tt.expand_dims2%> 73
% =     %34tt.make_range71 { { = axisendtt.splat =  =  132% :  : arg11ii 3232:},   starti: = 32 0 tensor< : ->64i xtensor<32i64}32x , 1:#x ttgitensor<.3232slice<{dim = 1, pa
rent = #blocked2}>, x#>iblocked 322->, >
#tensor<    ttg64%.x72slice<{dim = 0, parent = #blocked2}>1 = >xarith.muli
i     32%%, 7074#, = blocked tt.expand_dims2% >71%
 73    : {% axis69tensor< =  = 640tt.expand_dimsx :  1i%x3233i} {32 axis, : = # 1blockedtensor< : 232>xi
i32    32}%,  73#: = ttg tt.make_range.tensor< {slice<{dim = 0, parent = #blocked2}>64end>x =  i32->32 :  , itensor<#321ttg, x.start32slice<{dim = 1, parent = #blocked3}> =
x>0i  : 32->i, 32# }blockedtensor< 264:>x
1tensor<    x32%ix7532i = , 32tt.broadcast#,  blocked#%3ttg72.> slice<{dim = 0, parent = #blocked2}>
:>
%tensor<    %706474 = x = arith.divsi1tt.expand_dims x %i%683273,,  { #axis%blocked = cst_320 > : :i  32->tensor<} 64tensor< x64:1xx 32itensor<x3232i, x32#i, 32blocked#, 2bl
ocked#>2ttg
>.
slice<{dim = 0, parent = #blocked2}>%    >71%  = 76->tt.splat =   tt.broadcasttensor<% 1arg11%x 7432 :x: i i32tensor<32, 1 #x->blocked322x> i
tensor<32    64%, x75#1 = blockedx2tt.broadcasti> 32 %, ->72#  blockedtensor<:264 x>tensor<32

......

a    l%i2z = e-arith.divsit o%-1l,l v%mc64_i32,  :c ain/home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:
254:032o/home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:254:0:
n: i    c%aerror: 3lerror:  = iarith.addizFailures have been detected while processing an MLIR pass pipeline eFailures have been detected while processing an MLIR pass pipel
ine
%{
arg7 ,  m/home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:254:0/home/sgsdxzy/micromamba/envs/vllm-dev/li
b/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:254:0%a: : c63_i32x- i:note: note: t eir32Pipeline failed while executing [`ConvertTritonGPUToLL
VM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`Pipeline failed while executing [`ConvertTrito
nGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`a


t    i%o4n = sarith.divsi= 1%03 ,m a%xc64_i32- n:u mi-32r
e    w%r5i = tarith.mulie s%=4-,1  %rc8_i32e g:i oin32-
s    i%m6p = liarith.divsif y%=0n,o r%m5a l:  tie32s
t    -%c7o = narith.muliv e%r6g,e n%cc8_i32e= f:a i32l
s    e% 8t = oarith.subip- d%o2wn,= tr%u7e} ,: i 32c
s    e%,9  = sarith.minsiym b%o8l,-d c%ec8_i32,  :en ai32b
l    e%-10l = iarith.remsin e%-0i,n f%o5) ":,
i      32disable_threading
:     false%,11
 =       arith.remsiverify_each %: 10true,
     }%
9  }
:#-}
i32
    %12 = arith.addi %7, %11 : i32
/home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:254:0    : %13 = error: arith.divsi %Failures have been
 detected while processing an MLIR pass pipeline10
, %9/home/sgsdxzy/micromamba/envs/vllm-dev/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py:254:0 : : i32
note:     %14Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above
with Triton project.` =
tt.load %arg6 : !tt.ptr<i32>
    %15 = arith.muli %12, %c64_i32 : i32
    %16 = arith.cmpi sge, %15, %14 : i32
    cf.cond_br %16, ^bb1, ^bb2
  ^bb1:  // pred: ^bb0
    tt.return
  ^bb2:  // pred: ^bb0
    %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %19 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked3}>>
    %20 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
    %21 = arith.extsi %17 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> to tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %22 = arith.extsi %18 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %23 = arith.extsi %20 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
    %24 = arith.extsi %15 : i32 to i64
    %25 = tt.splat %24 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %26 = tt.splat %24 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %27 = arith.addi %25, %21 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %28 = arith.addi %26, %22 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %29 = tt.splat %arg4 : !tt.ptr<i32> -> tensor<64x!tt.ptr<i32>, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %30 = tt.splat %arg4 : !tt.ptr<i32> -> tensor<64x!tt.ptr<i32>, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %31 = tt.addptr %29, %27 : tensor<64x!tt.ptr<i32>, #ttg.slice<{dim = 1, parent = #blocked3}>>, tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %32 = tt.addptr %30, %28 : tensor<64x!tt.ptr<i32>, #ttg.slice<{dim = 1, parent = #blocked2}>>, tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %33 = tt.load %31 : tensor<64x!tt.ptr<i32>, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %34 = tt.load %32 : tensor<64x!tt.ptr<i32>, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %35 = tt.splat %arg10 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %36 = tt.splat %arg10 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %37 = arith.cmpi slt, %33, %35 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked3}>>
    %38 = arith.cmpi slt, %34, %36 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
    %39 = tt.addptr %arg5, %12 : !tt.ptr<i32>, i32
    %40 = tt.load %39 : !tt.ptr<i32>
    %41 = arith.extsi %40 : i32 to i64
    %42 = arith.cmpi eq, %41, %c-1_i64 : i64
    cf.cond_br %42, ^bb3, ^bb4
  ^bb3:  // pred: ^bb2
    %43 = arith.muli %13, %c64_i32 : i32
    %44 = tt.splat %43 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked3}>>
    %45 = arith.addi %44, %19 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked3}>>
    %46 = tt.expand_dims %33 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<64x1xi32, #blocked3>
    %47 = tt.splat %arg14 : i32 -> tensor<64x1xi32, #blocked3>
    %48 = arith.muli %47, %46 : tensor<64x1xi32, #blocked3>
    %49 = tt.splat %arg2 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>, #blocked3>
    %50 = tt.addptr %49, %48 : tensor<64x1x!tt.ptr<f16>, #blocked3>, tensor<64x1xi32, #blocked3>
    %51 = tt.expand_dims %45 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked3}>> -> tensor<1x64xi32, #blocked3>
    %52 = tt.broadcast %50 : tensor<64x1x!tt.ptr<f16>, #blocked3> -> tensor<64x64x!tt.ptr<f16>, #blocked3>
    %53 = tt.broadcast %51 : tensor<1x64xi32, #blocked3> -> tensor<64x64xi32, #blocked3>
    %54 = tt.addptr %52, %53 : tensor<64x64x!tt.ptr<f16>, #blocked3>, tensor<64x64xi32, #blocked3>
    %55 = tt.expand_dims %37 {axis = 1 : i32} : tensor<64xi1, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<64x1xi1, #blocked3>
    %56 = tt.splat %arg7 : i32 -> tensor<1x64xi32, #blocked3>
    %57 = arith.cmpi slt, %51, %56 : tensor<1x64xi32, #blocked3>
    %58 = tt.broadcast %55 : tensor<64x1xi1, #blocked3> -> tensor<64x64xi1, #blocked3>
    %59 = tt.broadcast %57 : tensor<1x64xi1, #blocked3> -> tensor<64x64xi1, #blocked3>
    %60 = arith.andi %58, %59 : tensor<64x64xi1, #blocked3>
    tt.store %54, %cst_2, %60 : tensor<64x64x!tt.ptr<f16>, #blocked3>
    tt.return
  ^bb4:  // pred: ^bb2
    %61 = arith.muli %13, %c64_i32 : i32
    %62 = arith.extsi %61 : i32 to i64

......
```

</details>

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: Unable to run Qwen3 on Turing GPUs after upgrading to torch 2.7.0 #17639
