[Bug]: An illegal memory access was encountered with DeepSeek-R1 on 8xH200

### Your current environment

- Installed vLLM (version 0.7.3 according to the log below) with `uv pip install vllm`
- Tried to serve the model on 8x H200 with `vllm serve "deepseek-ai/DeepSeek-R1" -tp 8 --max-model-len 38768 --max-num-batched-tokens 38768 --gpu-memory-utilization 0.9 --trust-remote-code --port 1234`
- Error: `RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered`
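
One way to narrow this down might be to repeat the same launch with synchronous CUDA kernel launches and without CUDA graph capture, so the traceback points closer to the failing kernel. The sketch below only builds the command line and environment for such a rerun. `build_debug_launch` is a hypothetical helper name, and pairing `CUDA_LAUNCH_BLOCKING=1` with `--enforce-eager` is an assumption about what is useful here, not something confirmed to change the outcome.

```python
import os
import shlex

# The exact command from the report above.
BASE_ARGS = [
    "vllm", "serve", "deepseek-ai/DeepSeek-R1",
    "-tp", "8",
    "--max-model-len", "38768",
    "--max-num-batched-tokens", "38768",
    "--gpu-memory-utilization", "0.9",
    "--trust-remote-code",
    "--port", "1234",
]


def build_debug_launch(base_env=None):
    """Return (argv, env) for a debug rerun of the failing command.

    Hypothetical helper: it does not start the server itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make CUDA kernel launches synchronous so the error is raised
    # at the launch that caused it instead of at a later sync point.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Assumption: disabling CUDA graph capture helps tell whether the
    # crash depends on the capture step (enforce_eager=False in the log).
    argv = BASE_ARGS + ["--enforce-eager"]
    return argv, env


if __name__ == "__main__":
    argv, env = build_debug_launch()
    # Print the equivalent shell command for copy and paste.
    print("CUDA_LAUNCH_BLOCKING=1 " + shlex.join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env)` to start the server. Synchronous launches slow everything down considerably, so this is meant for a one-off diagnostic run only.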


### 🐛 Describe the bug

```
INFO 03-17 15:26:45 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:26:45 api_server.py:912] vLLM API server version 0.7.3
INFO 03-17 15:26:45 api_server.py:913] args: Namespace(subparser='serve', model_tag='deepseek-ai/DeepSeek-R1', config='', host=None, port=1234, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='deepseek-ai/DeepSeek-R1', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=38768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=38768, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, 
limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x14978979afc0>)
INFO 03-17 15:26:46 api_server.py:209] Started engine process with PID 579746
INFO 03-17 15:26:47 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-17 15:26:53 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:26:54 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-17 15:26:57 config.py:549] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 03-17 15:26:58 config.py:1382] Defaulting to use mp for distributed inference
WARNING 03-17 15:26:58 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-17 15:26:58 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=38768.
WARNING 03-17 15:26:58 fp8.py:53] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 03-17 15:26:58 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-17 15:27:02 config.py:549] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 03-17 15:27:04 config.py:1382] Defaulting to use mp for distributed inference
WARNING 03-17 15:27:04 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-17 15:27:04 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=38768.
WARNING 03-17 15:27:04 fp8.py:53] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 03-17 15:27:04 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-17 15:27:04 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='deepseek-ai/DeepSeek-R1', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=38768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-ai/DeepSeek-R1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 03-17 15:27:04 multiproc_worker_utils.py:300] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-17 15:27:04 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-17 15:27:05 cuda.py:160] Using Triton MLA backend.
WARNING 03-17 15:27:05 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580056) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580053) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580055) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580050) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-17 15:27:16 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:16 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:16 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:18 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:18 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580052) WARNING 03-17 15:27:18 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580054) WARNING 03-17 15:27:18 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:18 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580051) WARNING 03-17 15:27:18 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 03-17 15:27:33 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_f84afa44'), local_subscribe_port=41739, remote_subscribe_port=None)
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580050) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580051) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580053) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580056) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580055) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580052) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580054) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
INFO 03-17 15:27:33 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:33 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:33 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/163 [00:00<?, ?it/s]
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   1% Completed | 1/163 [00:00<00:38,  4.23it/s]
Loading safetensors checkpoint shards:   1% Completed | 2/163 [00:00<00:47,  3.40it/s]
Loading safetensors checkpoint shards:   2% Completed | 3/163 [00:00<00:50,  3.17it/s]
Loading safetensors checkpoint shards:   2% Completed | 4/163 [00:01<00:51,  3.06it/s]
Loading safetensors checkpoint shards:   3% Completed | 5/163 [00:01<00:52,  3.03it/s]
Loading safetensors checkpoint shards:   4% Completed | 6/163 [00:01<00:53,  2.93it/s]
Loading safetensors checkpoint shards:   4% Completed | 7/163 [00:02<00:52,  2.95it/s]
Loading safetensors checkpoint shards:   5% Completed | 8/163 [00:02<00:52,  2.95it/s]
Loading safetensors checkpoint shards:   6% Completed | 9/163 [00:02<00:53,  2.90it/s]
Loading safetensors checkpoint shards:   6% Completed | 10/163 [00:03<00:52,  2.93it/s]
Loading safetensors checkpoint shards:   7% Completed | 11/163 [00:03<00:51,  2.97it/s]
Loading safetensors checkpoint shards:   7% Completed | 12/163 [00:03<00:41,  3.64it/s]
Loading safetensors checkpoint shards:   8% Completed | 13/163 [00:04<00:40,  3.70it/s]
Loading safetensors checkpoint shards:   9% Completed | 14/163 [00:04<00:41,  3.59it/s]
Loading safetensors checkpoint shards:   9% Completed | 15/163 [00:04<00:43,  3.43it/s]
Loading safetensors checkpoint shards:  10% Completed | 16/163 [00:04<00:42,  3.42it/s]
Loading safetensors checkpoint shards:  10% Completed | 17/163 [00:05<00:42,  3.41it/s]
Loading safetensors checkpoint shards:  11% Completed | 18/163 [00:05<00:42,  3.41it/s]
Loading safetensors checkpoint shards:  12% Completed | 19/163 [00:05<00:41,  3.45it/s]
Loading safetensors checkpoint shards:  12% Completed | 20/163 [00:06<00:41,  3.42it/s]
Loading safetensors checkpoint shards:  13% Completed | 21/163 [00:06<00:41,  3.42it/s]
Loading safetensors checkpoint shards:  13% Completed | 22/163 [00:06<00:41,  3.40it/s]
Loading safetensors checkpoint shards:  14% Completed | 23/163 [00:07<00:41,  3.38it/s]
Loading safetensors checkpoint shards:  15% Completed | 24/163 [00:07<00:40,  3.40it/s]
Loading safetensors checkpoint shards:  15% Completed | 25/163 [00:07<00:41,  3.36it/s]
Loading safetensors checkpoint shards:  16% Completed | 26/163 [00:07<00:40,  3.36it/s]
Loading safetensors checkpoint shards:  17% Completed | 27/163 [00:08<00:40,  3.39it/s]
Loading safetensors checkpoint shards:  17% Completed | 28/163 [00:08<00:40,  3.35it/s]
Loading safetensors checkpoint shards:  18% Completed | 29/163 [00:08<00:39,  3.38it/s]
Loading safetensors checkpoint shards:  18% Completed | 30/163 [00:09<00:39,  3.41it/s]
Loading safetensors checkpoint shards:  19% Completed | 31/163 [00:09<00:39,  3.37it/s]
Loading safetensors checkpoint shards:  20% Completed | 32/163 [00:09<00:38,  3.40it/s]
Loading safetensors checkpoint shards:  20% Completed | 33/163 [00:10<00:40,  3.20it/s]
Loading safetensors checkpoint shards:  21% Completed | 34/163 [00:10<00:35,  3.62it/s]
Loading safetensors checkpoint shards:  21% Completed | 35/163 [00:10<00:36,  3.53it/s]
Loading safetensors checkpoint shards:  22% Completed | 36/163 [00:10<00:38,  3.29it/s]
Loading safetensors checkpoint shards:  23% Completed | 37/163 [00:11<00:40,  3.09it/s]
Loading safetensors checkpoint shards:  23% Completed | 38/163 [00:11<00:41,  2.99it/s]
Loading safetensors checkpoint shards:  24% Completed | 39/163 [00:11<00:43,  2.83it/s]
Loading safetensors checkpoint shards:  25% Completed | 40/163 [00:12<00:44,  2.76it/s]
Loading safetensors checkpoint shards:  25% Completed | 41/163 [00:12<00:44,  2.72it/s]
Loading safetensors checkpoint shards:  26% Completed | 42/163 [00:13<00:45,  2.65it/s]
Loading safetensors checkpoint shards:  26% Completed | 43/163 [00:13<00:44,  2.68it/s]
Loading safetensors checkpoint shards:  27% Completed | 44/163 [00:13<00:44,  2.65it/s]
Loading safetensors checkpoint shards:  28% Completed | 45/163 [00:14<00:44,  2.67it/s]
Loading safetensors checkpoint shards:  28% Completed | 46/163 [00:14<00:43,  2.69it/s]
Loading safetensors checkpoint shards:  29% Completed | 47/163 [00:15<00:43,  2.65it/s]
Loading safetensors checkpoint shards:  29% Completed | 48/163 [00:15<00:42,  2.68it/s]
Loading safetensors checkpoint shards:  30% Completed | 49/163 [00:15<00:42,  2.70it/s]
Loading safetensors checkpoint shards:  31% Completed | 50/163 [00:16<00:42,  2.66it/s]
Loading safetensors checkpoint shards:  31% Completed | 51/163 [00:16<00:41,  2.71it/s]
Loading safetensors checkpoint shards:  32% Completed | 52/163 [00:16<00:40,  2.76it/s]
Loading safetensors checkpoint shards:  33% Completed | 53/163 [00:17<00:40,  2.74it/s]
Loading safetensors checkpoint shards:  33% Completed | 54/163 [00:17<00:38,  2.80it/s]
Loading safetensors checkpoint shards:  34% Completed | 55/163 [00:17<00:39,  2.75it/s]
Loading safetensors checkpoint shards:  34% Completed | 56/163 [00:18<00:33,  3.18it/s]
Loading safetensors checkpoint shards:  35% Completed | 57/163 [00:18<00:33,  3.20it/s]
Loading safetensors checkpoint shards:  36% Completed | 58/163 [00:18<00:34,  3.07it/s]
Loading safetensors checkpoint shards:  36% Completed | 59/163 [00:19<00:35,  2.93it/s]
Loading safetensors checkpoint shards:  37% Completed | 60/163 [00:19<00:36,  2.85it/s]
Loading safetensors checkpoint shards:  37% Completed | 61/163 [00:19<00:36,  2.76it/s]
Loading safetensors checkpoint shards:  38% Completed | 62/163 [00:20<00:37,  2.72it/s]
Loading safetensors checkpoint shards:  39% Completed | 63/163 [00:20<00:37,  2.69it/s]
Loading safetensors checkpoint shards:  39% Completed | 64/163 [00:21<00:37,  2.64it/s]
Loading safetensors checkpoint shards:  40% Completed | 65/163 [00:21<00:36,  2.68it/s]
Loading safetensors checkpoint shards:  40% Completed | 66/163 [00:21<00:36,  2.66it/s]
Loading safetensors checkpoint shards:  41% Completed | 67/163 [00:22<00:35,  2.68it/s]
Loading safetensors checkpoint shards:  42% Completed | 68/163 [00:22<00:35,  2.70it/s]
Loading safetensors checkpoint shards:  42% Completed | 69/163 [00:22<00:35,  2.65it/s]
Loading safetensors checkpoint shards:  43% Completed | 70/163 [00:23<00:34,  2.68it/s]
Loading safetensors checkpoint shards:  44% Completed | 71/163 [00:23<00:33,  2.71it/s]
Loading safetensors checkpoint shards:  44% Completed | 72/163 [00:24<00:34,  2.65it/s]
Loading safetensors checkpoint shards:  45% Completed | 73/163 [00:24<00:33,  2.68it/s]
Loading safetensors checkpoint shards:  45% Completed | 74/163 [00:24<00:32,  2.73it/s]
Loading safetensors checkpoint shards:  46% Completed | 75/163 [00:25<00:32,  2.70it/s]
Loading safetensors checkpoint shards:  47% Completed | 76/163 [00:25<00:31,  2.73it/s]
Loading safetensors checkpoint shards:  47% Completed | 77/163 [00:25<00:31,  2.71it/s]
Loading safetensors checkpoint shards:  48% Completed | 78/163 [00:26<00:27,  3.14it/s]
Loading safetensors checkpoint shards:  48% Completed | 79/163 [00:26<00:26,  3.16it/s]
Loading safetensors checkpoint shards:  49% Completed | 80/163 [00:26<00:27,  3.03it/s]
Loading safetensors checkpoint shards:  50% Completed | 81/163 [00:27<00:28,  2.90it/s]
Loading safetensors checkpoint shards:  50% Completed | 82/163 [00:27<00:28,  2.82it/s]
Loading safetensors checkpoint shards:  51% Completed | 83/163 [00:27<00:29,  2.76it/s]
Loading safetensors checkpoint shards:  52% Completed | 84/163 [00:28<00:29,  2.72it/s]
Loading safetensors checkpoint shards:  52% Completed | 85/163 [00:28<00:28,  2.69it/s]
Loading safetensors checkpoint shards:  53% Completed | 86/163 [00:29<00:29,  2.64it/s]
Loading safetensors checkpoint shards:  53% Completed | 87/163 [00:29<00:28,  2.68it/s]
Loading safetensors checkpoint shards:  54% Completed | 88/163 [00:29<00:28,  2.65it/s]
Loading safetensors checkpoint shards:  55% Completed | 89/163 [00:30<00:27,  2.67it/s]
Loading safetensors checkpoint shards:  55% Completed | 90/163 [00:30<00:27,  2.69it/s]
Loading safetensors checkpoint shards:  56% Completed | 91/163 [00:30<00:27,  2.65it/s]
Loading safetensors checkpoint shards:  56% Completed | 92/163 [00:31<00:26,  2.68it/s]
Loading safetensors checkpoint shards:  57% Completed | 93/163 [00:31<00:25,  2.71it/s]
Loading safetensors checkpoint shards:  58% Completed | 94/163 [00:32<00:25,  2.68it/s]
Loading safetensors checkpoint shards:  58% Completed | 95/163 [00:32<00:24,  2.74it/s]
Loading safetensors checkpoint shards:  59% Completed | 96/163 [00:32<00:23,  2.80it/s]
Loading safetensors checkpoint shards:  60% Completed | 97/163 [00:33<00:24,  2.74it/s]
Loading safetensors checkpoint shards:  60% Completed | 98/163 [00:33<00:23,  2.77it/s]
Loading safetensors checkpoint shards:  61% Completed | 99/163 [00:33<00:23,  2.73it/s]
Loading safetensors checkpoint shards:  61% Completed | 100/163 [00:34<00:19,  3.16it/s]
Loading safetensors checkpoint shards:  62% Completed | 101/163 [00:34<00:19,  3.17it/s]
Loading safetensors checkpoint shards:  63% Completed | 102/163 [00:34<00:20,  3.04it/s]
Loading safetensors checkpoint shards:  63% Completed | 103/163 [00:35<00:20,  2.93it/s]
Loading safetensors checkpoint shards:  64% Completed | 104/163 [00:35<00:20,  2.86it/s]
Loading safetensors checkpoint shards:  64% Completed | 105/163 [00:35<00:20,  2.77it/s]
Loading safetensors checkpoint shards:  65% Completed | 106/163 [00:36<00:20,  2.74it/s]
Loading safetensors checkpoint shards:  66% Completed | 107/163 [00:36<00:20,  2.69it/s]
Loading safetensors checkpoint shards:  66% Completed | 108/163 [00:37<00:20,  2.63it/s]
Loading safetensors checkpoint shards:  67% Completed | 109/163 [00:37<00:20,  2.66it/s]
Loading safetensors checkpoint shards:  67% Completed | 110/163 [00:37<00:20,  2.63it/s]
Loading safetensors checkpoint shards:  68% Completed | 111/163 [00:38<00:19,  2.65it/s]
Loading safetensors checkpoint shards:  69% Completed | 112/163 [00:38<00:19,  2.68it/s]
Loading safetensors checkpoint shards:  69% Completed | 113/163 [00:38<00:18,  2.64it/s]
Loading safetensors checkpoint shards:  70% Completed | 114/163 [00:39<00:18,  2.66it/s]
Loading safetensors checkpoint shards:  71% Completed | 115/163 [00:39<00:17,  2.69it/s]
Loading safetensors checkpoint shards:  71% Completed | 116/163 [00:40<00:17,  2.64it/s]
Loading safetensors checkpoint shards:  72% Completed | 117/163 [00:40<00:17,  2.66it/s]
Loading safetensors checkpoint shards:  72% Completed | 118/163 [00:40<00:16,  2.69it/s]
Loading safetensors checkpoint shards:  73% Completed | 119/163 [00:41<00:16,  2.66it/s]
Loading safetensors checkpoint shards:  74% Completed | 120/163 [00:41<00:15,  2.69it/s]
Loading safetensors checkpoint shards:  74% Completed | 121/163 [00:41<00:15,  2.65it/s]
Loading safetensors checkpoint shards:  75% Completed | 122/163 [00:42<00:13,  3.09it/s]
Loading safetensors checkpoint shards:  75% Completed | 123/163 [00:42<00:12,  3.11it/s]
Loading safetensors checkpoint shards:  76% Completed | 124/163 [00:42<00:13,  3.00it/s]
Loading safetensors checkpoint shards:  77% Completed | 125/163 [00:43<00:13,  2.88it/s]
Loading safetensors checkpoint shards:  77% Completed | 126/163 [00:43<00:13,  2.83it/s]
Loading safetensors checkpoint shards:  78% Completed | 127/163 [00:43<00:13,  2.75it/s]
Loading safetensors checkpoint shards:  79% Completed | 128/163 [00:44<00:12,  2.72it/s]
Loading safetensors checkpoint shards:  79% Completed | 129/163 [00:44<00:12,  2.71it/s]
Loading safetensors checkpoint shards:  80% Completed | 130/163 [00:45<00:12,  2.65it/s]
Loading safetensors checkpoint shards:  80% Completed | 131/163 [00:45<00:11,  2.69it/s]
Loading safetensors checkpoint shards:  81% Completed | 132/163 [00:45<00:11,  2.67it/s]
Loading safetensors checkpoint shards:  82% Completed | 133/163 [00:46<00:11,  2.69it/s]
Loading safetensors checkpoint shards:  82% Completed | 134/163 [00:46<00:10,  2.71it/s]
Loading safetensors checkpoint shards:  83% Completed | 135/163 [00:46<00:10,  2.66it/s]
Loading safetensors checkpoint shards:  83% Completed | 136/163 [00:47<00:10,  2.69it/s]
Loading safetensors checkpoint shards:  84% Completed | 137/163 [00:47<00:09,  2.72it/s]
Loading safetensors checkpoint shards:  85% Completed | 138/163 [00:48<00:09,  2.68it/s]
Loading safetensors checkpoint shards:  85% Completed | 139/163 [00:48<00:08,  2.71it/s]
Loading safetensors checkpoint shards:  86% Completed | 140/163 [00:48<00:08,  2.74it/s]
Loading safetensors checkpoint shards:  87% Completed | 141/163 [00:49<00:07,  2.91it/s]
Loading safetensors checkpoint shards:  87% Completed | 142/163 [00:49<00:07,  2.92it/s]
Loading safetensors checkpoint shards:  88% Completed | 143/163 [00:49<00:07,  2.85it/s]
Loading safetensors checkpoint shards:  88% Completed | 144/163 [00:50<00:06,  2.78it/s]
Loading safetensors checkpoint shards:  89% Completed | 145/163 [00:50<00:06,  2.76it/s]
Loading safetensors checkpoint shards:  90% Completed | 146/163 [00:50<00:06,  2.70it/s]
(VllmWorkerProcess pid=580055) INFO 03-17 15:28:25 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  90% Completed | 147/163 [00:51<00:05,  2.69it/s]
(VllmWorkerProcess pid=580056) INFO 03-17 15:28:25 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  91% Completed | 148/163 [00:51<00:05,  2.68it/s]
Loading safetensors checkpoint shards:  91% Completed | 149/163 [00:52<00:05,  2.64it/s]
Loading safetensors checkpoint shards:  92% Completed | 150/163 [00:52<00:04,  2.70it/s]
(VllmWorkerProcess pid=580053) INFO 03-17 15:28:27 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  93% Completed | 151/163 [00:52<00:04,  2.68it/s]
Loading safetensors checkpoint shards:  93% Completed | 152/163 [00:53<00:04,  2.71it/s]
Loading safetensors checkpoint shards:  94% Completed | 153/163 [00:53<00:03,  2.75it/s]
Loading safetensors checkpoint shards:  94% Completed | 154/163 [00:53<00:03,  2.70it/s]
(VllmWorkerProcess pid=580052) INFO 03-17 15:28:28 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  95% Completed | 155/163 [00:54<00:02,  2.74it/s]
Loading safetensors checkpoint shards:  96% Completed | 156/163 [00:54<00:02,  2.76it/s]
Loading safetensors checkpoint shards:  96% Completed | 157/163 [00:54<00:02,  2.71it/s]
Loading safetensors checkpoint shards:  97% Completed | 158/163 [00:55<00:01,  2.74it/s]
Loading safetensors checkpoint shards:  98% Completed | 159/163 [00:55<00:01,  2.76it/s]
Loading safetensors checkpoint shards:  98% Completed | 160/163 [00:55<00:01,  2.88it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:56<00:00,  2.91it/s]

INFO 03-17 15:28:31 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580050) INFO 03-17 15:28:32 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580054) INFO 03-17 15:28:34 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580051) INFO 03-17 15:28:35 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580052) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580055) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580053) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580056) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580054) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580050) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580051) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580055) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580056) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580053) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580054) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580052) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580051) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580050) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] Traceback (most recent call last):
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 236, in _run_worker_process
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/utils.py", line 2196, in run_method
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self.model_runner.profile_run()
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1235, in profile_run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1346, in _dummy_run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1724, in execute_model
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 677, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 633, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 560, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 162, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     final_hidden_states = self.experts(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                           ^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 586, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     final_hidden_states = self.quant_method.apply(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                           ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 664, in apply
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     topk_weights, topk_ids = FusedMoE.select_experts(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                              ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 558, in select_experts
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     topk_weights, topk_ids = grouped_topk(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                              ^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 922, in grouped_topk
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     @torch.compile(dynamic=True, backend=current_platform.simple_compile_backend)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1100, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return compiled_fn(full_args)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 321, in runtime_wrapper
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     all_outs = call_func_at_runtime_with_args(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 124, in call_func_at_runtime_with_args
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     out = normalize_as_list(f(args))
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                             ^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 667, in inner_fn
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     outs = compiled_fn(args)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 488, in wrapper
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return compiled_fn(runtime_args)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 1478, in __call__
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self.current_callable(inputs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_inductor/utils.py", line 1977, in run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return model(new_inputs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/tmp/torchinductor_dalistarh/ob/cobjy4b4neiv6uxnagpsrkuxnjmxs774i2sopv3lpxy66mnnzf3t.py", line 352, in call
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     triton_poi_fused_add_sigmoid_0.run(arg1_1, arg2_1, buf0, triton_poi_fused_add_sigmoid_0_xnumel, grid=grid(triton_poi_fused_add_sigmoid_0_xnumel), stream=stream7)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 879, in run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return launcher(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "<string>", line 13, in launcher
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 365, in __call__
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self.launch(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
```

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: An illegal memory access was encountered with DeepSeek-R1 on 8xH200 #14965

Your current environment

🐛 Describe the bug

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: An illegal memory access was encountered with DeepSeek-R1 on 8xH200 #14965

Description

Your current environment

🐛 Describe the bug

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions