[Bug]: An illegal memory access was encountered with DeepSeek-R1 on 8xH200 #14965

@eldarkurtic

Your current environment

  • Installed vLLM with `uv pip install vllm`
  • Tried to serve the model with `vllm serve "deepseek-ai/DeepSeek-R1" -tp 8 --max-model-len 38768 --max-num-batched-tokens 38768 --gpu-memory-utilization 0.9 --trust-remote-code --port 1234`
  • Error: `RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered`
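
For anyone trying to reproduce this, here is a minimal sketch of how the same invocation could be re-run with two diagnostic toggles. These are assumptions on my side, not something tried in this report and not a confirmed fix: `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous so the failing kernel is more likely to be reported at its call site, and `--enforce-eager` skips CUDA graph capture. The helper names `build_debug_command` and `as_shell_line` are hypothetical; the script only prints the command and does not launch vLLM.

```python
import shlex

# Flags copied from the `vllm serve` invocation in this report.
BASE_ARGS = [
    "vllm", "serve", "deepseek-ai/DeepSeek-R1",
    "-tp", "8",
    "--max-model-len", "38768",
    "--max-num-batched-tokens", "38768",
    "--gpu-memory-utilization", "0.9",
    "--trust-remote-code",
    "--port", "1234",
]


def build_debug_command(enforce_eager=True, launch_blocking=True):
    """Return (env_overrides, argv) for a diagnostic re-run.

    Assumption: both toggles are debugging suggestions only; neither is
    part of the original report.
    """
    env = {}
    argv = list(BASE_ARGS)
    if launch_blocking:
        # Synchronous kernel launches: slower, but the error surfaces
        # closer to the kernel that actually faulted.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    if enforce_eager:
        # Disable CUDA graph capture (the log shows enforce_eager=False).
        argv.append("--enforce-eager")
    return env, argv


def as_shell_line(env, argv):
    """Render the environment overrides and argv as one shell command line."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}".strip()


if __name__ == "__main__":
    print(as_shell_line(*build_debug_command()))
```

Calling `build_debug_command(enforce_eager=False, launch_blocking=False)` gives back the original command unchanged, so each toggle can be tried on its own to see whether the failure moves or disappears.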

🐛 Describe the bug

INFO 03-17 15:26:45 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:26:45 api_server.py:912] vLLM API server version 0.7.3
INFO 03-17 15:26:45 api_server.py:913] args: Namespace(subparser='serve', model_tag='deepseek-ai/DeepSeek-R1', config='', host=None, port=1234, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='deepseek-ai/DeepSeek-R1', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=38768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=38768, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, 
limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x14978979afc0>)
INFO 03-17 15:26:46 api_server.py:209] Started engine process with PID 579746
INFO 03-17 15:26:47 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-17 15:26:53 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:26:54 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-17 15:26:57 config.py:549] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 03-17 15:26:58 config.py:1382] Defaulting to use mp for distributed inference
WARNING 03-17 15:26:58 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-17 15:26:58 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=38768.
WARNING 03-17 15:26:58 fp8.py:53] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 03-17 15:26:58 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-17 15:27:02 config.py:549] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 03-17 15:27:04 config.py:1382] Defaulting to use mp for distributed inference
WARNING 03-17 15:27:04 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 03-17 15:27:04 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=38768.
WARNING 03-17 15:27:04 fp8.py:53] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 03-17 15:27:04 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-17 15:27:04 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='deepseek-ai/DeepSeek-R1', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=38768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-ai/DeepSeek-R1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 03-17 15:27:04 multiproc_worker_utils.py:300] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-17 15:27:04 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-17 15:27:05 cuda.py:160] Using Triton MLA backend.
WARNING 03-17 15:27:05 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:11 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:13 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:13 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580056) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580053) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580055) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580050) WARNING 03-17 15:27:13 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-17 15:27:16 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:16 __init__.py:207] Automatically detected platform cuda.
INFO 03-17 15:27:16 __init__.py:207] Automatically detected platform cuda.
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:18 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:18 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:18 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580052) WARNING 03-17 15:27:18 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580054) WARNING 03-17 15:27:18 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:18 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580051) WARNING 03-17 15:27:18 triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:30 utils.py:916] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:30 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 03-17 15:27:32 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/dalistarh/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 03-17 15:27:33 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_f84afa44'), local_subscribe_port=41739, remote_subscribe_port=None)
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:33 model_runner.py:1110] Starting to load model deepseek-ai/DeepSeek-R1...
(VllmWorkerProcess pid=580050) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580051) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580053) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580056) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580055) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580052) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580054) WARNING 03-17 15:27:33 utils.py:168] The model class DeepseekV3ForCausalLM has not defined `packed_modules_mapping`, this may lead to incorrect mapping of quantized or ignored modules
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:33 cuda.py:160] Using Triton MLA backend.
INFO 03-17 15:27:33 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580053) INFO 03-17 15:27:33 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580055) INFO 03-17 15:27:33 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580056) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/163 [00:00<?, ?it/s]
(VllmWorkerProcess pid=580054) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580052) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580050) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=580051) INFO 03-17 15:27:34 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   1% Completed | 1/163 [00:00<00:38,  4.23it/s]
Loading safetensors checkpoint shards:   1% Completed | 2/163 [00:00<00:47,  3.40it/s]
Loading safetensors checkpoint shards:   2% Completed | 3/163 [00:00<00:50,  3.17it/s]
Loading safetensors checkpoint shards:   2% Completed | 4/163 [00:01<00:51,  3.06it/s]
Loading safetensors checkpoint shards:   3% Completed | 5/163 [00:01<00:52,  3.03it/s]
Loading safetensors checkpoint shards:   4% Completed | 6/163 [00:01<00:53,  2.93it/s]
Loading safetensors checkpoint shards:   4% Completed | 7/163 [00:02<00:52,  2.95it/s]
Loading safetensors checkpoint shards:   5% Completed | 8/163 [00:02<00:52,  2.95it/s]
Loading safetensors checkpoint shards:   6% Completed | 9/163 [00:02<00:53,  2.90it/s]
Loading safetensors checkpoint shards:   6% Completed | 10/163 [00:03<00:52,  2.93it/s]
Loading safetensors checkpoint shards:   7% Completed | 11/163 [00:03<00:51,  2.97it/s]
Loading safetensors checkpoint shards:   7% Completed | 12/163 [00:03<00:41,  3.64it/s]
Loading safetensors checkpoint shards:   8% Completed | 13/163 [00:04<00:40,  3.70it/s]
Loading safetensors checkpoint shards:   9% Completed | 14/163 [00:04<00:41,  3.59it/s]
Loading safetensors checkpoint shards:   9% Completed | 15/163 [00:04<00:43,  3.43it/s]
Loading safetensors checkpoint shards:  10% Completed | 16/163 [00:04<00:42,  3.42it/s]
Loading safetensors checkpoint shards:  10% Completed | 17/163 [00:05<00:42,  3.41it/s]
Loading safetensors checkpoint shards:  11% Completed | 18/163 [00:05<00:42,  3.41it/s]
Loading safetensors checkpoint shards:  12% Completed | 19/163 [00:05<00:41,  3.45it/s]
Loading safetensors checkpoint shards:  12% Completed | 20/163 [00:06<00:41,  3.42it/s]
Loading safetensors checkpoint shards:  13% Completed | 21/163 [00:06<00:41,  3.42it/s]
Loading safetensors checkpoint shards:  13% Completed | 22/163 [00:06<00:41,  3.40it/s]
Loading safetensors checkpoint shards:  14% Completed | 23/163 [00:07<00:41,  3.38it/s]
Loading safetensors checkpoint shards:  15% Completed | 24/163 [00:07<00:40,  3.40it/s]
Loading safetensors checkpoint shards:  15% Completed | 25/163 [00:07<00:41,  3.36it/s]
Loading safetensors checkpoint shards:  16% Completed | 26/163 [00:07<00:40,  3.36it/s]
Loading safetensors checkpoint shards:  17% Completed | 27/163 [00:08<00:40,  3.39it/s]
Loading safetensors checkpoint shards:  17% Completed | 28/163 [00:08<00:40,  3.35it/s]
Loading safetensors checkpoint shards:  18% Completed | 29/163 [00:08<00:39,  3.38it/s]
Loading safetensors checkpoint shards:  18% Completed | 30/163 [00:09<00:39,  3.41it/s]
Loading safetensors checkpoint shards:  19% Completed | 31/163 [00:09<00:39,  3.37it/s]
Loading safetensors checkpoint shards:  20% Completed | 32/163 [00:09<00:38,  3.40it/s]
Loading safetensors checkpoint shards:  20% Completed | 33/163 [00:10<00:40,  3.20it/s]
Loading safetensors checkpoint shards:  21% Completed | 34/163 [00:10<00:35,  3.62it/s]
Loading safetensors checkpoint shards:  21% Completed | 35/163 [00:10<00:36,  3.53it/s]
Loading safetensors checkpoint shards:  22% Completed | 36/163 [00:10<00:38,  3.29it/s]
Loading safetensors checkpoint shards:  23% Completed | 37/163 [00:11<00:40,  3.09it/s]
Loading safetensors checkpoint shards:  23% Completed | 38/163 [00:11<00:41,  2.99it/s]
Loading safetensors checkpoint shards:  24% Completed | 39/163 [00:11<00:43,  2.83it/s]
Loading safetensors checkpoint shards:  25% Completed | 40/163 [00:12<00:44,  2.76it/s]
Loading safetensors checkpoint shards:  25% Completed | 41/163 [00:12<00:44,  2.72it/s]
Loading safetensors checkpoint shards:  26% Completed | 42/163 [00:13<00:45,  2.65it/s]
Loading safetensors checkpoint shards:  26% Completed | 43/163 [00:13<00:44,  2.68it/s]
Loading safetensors checkpoint shards:  27% Completed | 44/163 [00:13<00:44,  2.65it/s]
Loading safetensors checkpoint shards:  28% Completed | 45/163 [00:14<00:44,  2.67it/s]
Loading safetensors checkpoint shards:  28% Completed | 46/163 [00:14<00:43,  2.69it/s]
Loading safetensors checkpoint shards:  29% Completed | 47/163 [00:15<00:43,  2.65it/s]
Loading safetensors checkpoint shards:  29% Completed | 48/163 [00:15<00:42,  2.68it/s]
Loading safetensors checkpoint shards:  30% Completed | 49/163 [00:15<00:42,  2.70it/s]
Loading safetensors checkpoint shards:  31% Completed | 50/163 [00:16<00:42,  2.66it/s]
Loading safetensors checkpoint shards:  31% Completed | 51/163 [00:16<00:41,  2.71it/s]
Loading safetensors checkpoint shards:  32% Completed | 52/163 [00:16<00:40,  2.76it/s]
Loading safetensors checkpoint shards:  33% Completed | 53/163 [00:17<00:40,  2.74it/s]
Loading safetensors checkpoint shards:  33% Completed | 54/163 [00:17<00:38,  2.80it/s]
Loading safetensors checkpoint shards:  34% Completed | 55/163 [00:17<00:39,  2.75it/s]
Loading safetensors checkpoint shards:  34% Completed | 56/163 [00:18<00:33,  3.18it/s]
Loading safetensors checkpoint shards:  35% Completed | 57/163 [00:18<00:33,  3.20it/s]
Loading safetensors checkpoint shards:  36% Completed | 58/163 [00:18<00:34,  3.07it/s]
Loading safetensors checkpoint shards:  36% Completed | 59/163 [00:19<00:35,  2.93it/s]
Loading safetensors checkpoint shards:  37% Completed | 60/163 [00:19<00:36,  2.85it/s]
Loading safetensors checkpoint shards:  37% Completed | 61/163 [00:19<00:36,  2.76it/s]
Loading safetensors checkpoint shards:  38% Completed | 62/163 [00:20<00:37,  2.72it/s]
Loading safetensors checkpoint shards:  39% Completed | 63/163 [00:20<00:37,  2.69it/s]
Loading safetensors checkpoint shards:  39% Completed | 64/163 [00:21<00:37,  2.64it/s]
Loading safetensors checkpoint shards:  40% Completed | 65/163 [00:21<00:36,  2.68it/s]
Loading safetensors checkpoint shards:  40% Completed | 66/163 [00:21<00:36,  2.66it/s]
Loading safetensors checkpoint shards:  41% Completed | 67/163 [00:22<00:35,  2.68it/s]
Loading safetensors checkpoint shards:  42% Completed | 68/163 [00:22<00:35,  2.70it/s]
Loading safetensors checkpoint shards:  42% Completed | 69/163 [00:22<00:35,  2.65it/s]
Loading safetensors checkpoint shards:  43% Completed | 70/163 [00:23<00:34,  2.68it/s]
Loading safetensors checkpoint shards:  44% Completed | 71/163 [00:23<00:33,  2.71it/s]
Loading safetensors checkpoint shards:  44% Completed | 72/163 [00:24<00:34,  2.65it/s]
Loading safetensors checkpoint shards:  45% Completed | 73/163 [00:24<00:33,  2.68it/s]
Loading safetensors checkpoint shards:  45% Completed | 74/163 [00:24<00:32,  2.73it/s]
Loading safetensors checkpoint shards:  46% Completed | 75/163 [00:25<00:32,  2.70it/s]
Loading safetensors checkpoint shards:  47% Completed | 76/163 [00:25<00:31,  2.73it/s]
Loading safetensors checkpoint shards:  47% Completed | 77/163 [00:25<00:31,  2.71it/s]
Loading safetensors checkpoint shards:  48% Completed | 78/163 [00:26<00:27,  3.14it/s]
Loading safetensors checkpoint shards:  48% Completed | 79/163 [00:26<00:26,  3.16it/s]
Loading safetensors checkpoint shards:  49% Completed | 80/163 [00:26<00:27,  3.03it/s]
Loading safetensors checkpoint shards:  50% Completed | 81/163 [00:27<00:28,  2.90it/s]
Loading safetensors checkpoint shards:  50% Completed | 82/163 [00:27<00:28,  2.82it/s]
Loading safetensors checkpoint shards:  51% Completed | 83/163 [00:27<00:29,  2.76it/s]
Loading safetensors checkpoint shards:  52% Completed | 84/163 [00:28<00:29,  2.72it/s]
Loading safetensors checkpoint shards:  52% Completed | 85/163 [00:28<00:28,  2.69it/s]
Loading safetensors checkpoint shards:  53% Completed | 86/163 [00:29<00:29,  2.64it/s]
Loading safetensors checkpoint shards:  53% Completed | 87/163 [00:29<00:28,  2.68it/s]
Loading safetensors checkpoint shards:  54% Completed | 88/163 [00:29<00:28,  2.65it/s]
Loading safetensors checkpoint shards:  55% Completed | 89/163 [00:30<00:27,  2.67it/s]
Loading safetensors checkpoint shards:  55% Completed | 90/163 [00:30<00:27,  2.69it/s]
Loading safetensors checkpoint shards:  56% Completed | 91/163 [00:30<00:27,  2.65it/s]
Loading safetensors checkpoint shards:  56% Completed | 92/163 [00:31<00:26,  2.68it/s]
Loading safetensors checkpoint shards:  57% Completed | 93/163 [00:31<00:25,  2.71it/s]
Loading safetensors checkpoint shards:  58% Completed | 94/163 [00:32<00:25,  2.68it/s]
Loading safetensors checkpoint shards:  58% Completed | 95/163 [00:32<00:24,  2.74it/s]
Loading safetensors checkpoint shards:  59% Completed | 96/163 [00:32<00:23,  2.80it/s]
Loading safetensors checkpoint shards:  60% Completed | 97/163 [00:33<00:24,  2.74it/s]
Loading safetensors checkpoint shards:  60% Completed | 98/163 [00:33<00:23,  2.77it/s]
Loading safetensors checkpoint shards:  61% Completed | 99/163 [00:33<00:23,  2.73it/s]
Loading safetensors checkpoint shards:  61% Completed | 100/163 [00:34<00:19,  3.16it/s]
Loading safetensors checkpoint shards:  62% Completed | 101/163 [00:34<00:19,  3.17it/s]
Loading safetensors checkpoint shards:  63% Completed | 102/163 [00:34<00:20,  3.04it/s]
Loading safetensors checkpoint shards:  63% Completed | 103/163 [00:35<00:20,  2.93it/s]
Loading safetensors checkpoint shards:  64% Completed | 104/163 [00:35<00:20,  2.86it/s]
Loading safetensors checkpoint shards:  64% Completed | 105/163 [00:35<00:20,  2.77it/s]
Loading safetensors checkpoint shards:  65% Completed | 106/163 [00:36<00:20,  2.74it/s]
Loading safetensors checkpoint shards:  66% Completed | 107/163 [00:36<00:20,  2.69it/s]
Loading safetensors checkpoint shards:  66% Completed | 108/163 [00:37<00:20,  2.63it/s]
Loading safetensors checkpoint shards:  67% Completed | 109/163 [00:37<00:20,  2.66it/s]
Loading safetensors checkpoint shards:  67% Completed | 110/163 [00:37<00:20,  2.63it/s]
Loading safetensors checkpoint shards:  68% Completed | 111/163 [00:38<00:19,  2.65it/s]
Loading safetensors checkpoint shards:  69% Completed | 112/163 [00:38<00:19,  2.68it/s]
Loading safetensors checkpoint shards:  69% Completed | 113/163 [00:38<00:18,  2.64it/s]
Loading safetensors checkpoint shards:  70% Completed | 114/163 [00:39<00:18,  2.66it/s]
Loading safetensors checkpoint shards:  71% Completed | 115/163 [00:39<00:17,  2.69it/s]
Loading safetensors checkpoint shards:  71% Completed | 116/163 [00:40<00:17,  2.64it/s]
Loading safetensors checkpoint shards:  72% Completed | 117/163 [00:40<00:17,  2.66it/s]
Loading safetensors checkpoint shards:  72% Completed | 118/163 [00:40<00:16,  2.69it/s]
Loading safetensors checkpoint shards:  73% Completed | 119/163 [00:41<00:16,  2.66it/s]
Loading safetensors checkpoint shards:  74% Completed | 120/163 [00:41<00:15,  2.69it/s]
Loading safetensors checkpoint shards:  74% Completed | 121/163 [00:41<00:15,  2.65it/s]
Loading safetensors checkpoint shards:  75% Completed | 122/163 [00:42<00:13,  3.09it/s]
Loading safetensors checkpoint shards:  75% Completed | 123/163 [00:42<00:12,  3.11it/s]
Loading safetensors checkpoint shards:  76% Completed | 124/163 [00:42<00:13,  3.00it/s]
Loading safetensors checkpoint shards:  77% Completed | 125/163 [00:43<00:13,  2.88it/s]
Loading safetensors checkpoint shards:  77% Completed | 126/163 [00:43<00:13,  2.83it/s]
Loading safetensors checkpoint shards:  78% Completed | 127/163 [00:43<00:13,  2.75it/s]
Loading safetensors checkpoint shards:  79% Completed | 128/163 [00:44<00:12,  2.72it/s]
Loading safetensors checkpoint shards:  79% Completed | 129/163 [00:44<00:12,  2.71it/s]
Loading safetensors checkpoint shards:  80% Completed | 130/163 [00:45<00:12,  2.65it/s]
Loading safetensors checkpoint shards:  80% Completed | 131/163 [00:45<00:11,  2.69it/s]
Loading safetensors checkpoint shards:  81% Completed | 132/163 [00:45<00:11,  2.67it/s]
Loading safetensors checkpoint shards:  82% Completed | 133/163 [00:46<00:11,  2.69it/s]
Loading safetensors checkpoint shards:  82% Completed | 134/163 [00:46<00:10,  2.71it/s]
Loading safetensors checkpoint shards:  83% Completed | 135/163 [00:46<00:10,  2.66it/s]
Loading safetensors checkpoint shards:  83% Completed | 136/163 [00:47<00:10,  2.69it/s]
Loading safetensors checkpoint shards:  84% Completed | 137/163 [00:47<00:09,  2.72it/s]
Loading safetensors checkpoint shards:  85% Completed | 138/163 [00:48<00:09,  2.68it/s]
Loading safetensors checkpoint shards:  85% Completed | 139/163 [00:48<00:08,  2.71it/s]
Loading safetensors checkpoint shards:  86% Completed | 140/163 [00:48<00:08,  2.74it/s]
Loading safetensors checkpoint shards:  87% Completed | 141/163 [00:49<00:07,  2.91it/s]
Loading safetensors checkpoint shards:  87% Completed | 142/163 [00:49<00:07,  2.92it/s]
Loading safetensors checkpoint shards:  88% Completed | 143/163 [00:49<00:07,  2.85it/s]
Loading safetensors checkpoint shards:  88% Completed | 144/163 [00:50<00:06,  2.78it/s]
Loading safetensors checkpoint shards:  89% Completed | 145/163 [00:50<00:06,  2.76it/s]
Loading safetensors checkpoint shards:  90% Completed | 146/163 [00:50<00:06,  2.70it/s]
(VllmWorkerProcess pid=580055) INFO 03-17 15:28:25 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  90% Completed | 147/163 [00:51<00:05,  2.69it/s]
(VllmWorkerProcess pid=580056) INFO 03-17 15:28:25 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  91% Completed | 148/163 [00:51<00:05,  2.68it/s]
Loading safetensors checkpoint shards:  91% Completed | 149/163 [00:52<00:05,  2.64it/s]
Loading safetensors checkpoint shards:  92% Completed | 150/163 [00:52<00:04,  2.70it/s]
(VllmWorkerProcess pid=580053) INFO 03-17 15:28:27 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  93% Completed | 151/163 [00:52<00:04,  2.68it/s]
Loading safetensors checkpoint shards:  93% Completed | 152/163 [00:53<00:04,  2.71it/s]
Loading safetensors checkpoint shards:  94% Completed | 153/163 [00:53<00:03,  2.75it/s]
Loading safetensors checkpoint shards:  94% Completed | 154/163 [00:53<00:03,  2.70it/s]
(VllmWorkerProcess pid=580052) INFO 03-17 15:28:28 model_runner.py:1115] Loading model weights took 83.8786 GB
Loading safetensors checkpoint shards:  95% Completed | 155/163 [00:54<00:02,  2.74it/s]
Loading safetensors checkpoint shards:  96% Completed | 156/163 [00:54<00:02,  2.76it/s]
Loading safetensors checkpoint shards:  96% Completed | 157/163 [00:54<00:02,  2.71it/s]
Loading safetensors checkpoint shards:  97% Completed | 158/163 [00:55<00:01,  2.74it/s]
Loading safetensors checkpoint shards:  98% Completed | 159/163 [00:55<00:01,  2.76it/s]
Loading safetensors checkpoint shards:  98% Completed | 160/163 [00:55<00:01,  2.88it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:56<00:00,  2.91it/s]

INFO 03-17 15:28:31 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580050) INFO 03-17 15:28:32 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580054) INFO 03-17 15:28:34 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580051) INFO 03-17 15:28:35 model_runner.py:1115] Loading model weights took 83.8786 GB
(VllmWorkerProcess pid=580052) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580055) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580053) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580056) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580054) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580050) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580051) INFO 03-17 15:28:38 fp8_utils.py:432] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
(VllmWorkerProcess pid=580055) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580056) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580053) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580054) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580052) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580051) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580050) INFO 03-17 15:28:41 fused_moe.py:800] Using configuration from /home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json for MoE layer.
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] Traceback (most recent call last):
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 236, in _run_worker_process
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/utils.py", line 2196, in run_method
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self.model_runner.profile_run()
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1235, in profile_run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1346, in _dummy_run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1724, in execute_model
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 677, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 633, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 560, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 162, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     final_hidden_states = self.experts(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                           ^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 586, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     final_hidden_states = self.quant_method.apply(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                           ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 664, in apply
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     topk_weights, topk_ids = FusedMoE.select_experts(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                              ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 558, in select_experts
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     topk_weights, topk_ids = grouped_topk(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                              ^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 922, in grouped_topk
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     @torch.compile(dynamic=True, backend=current_platform.simple_compile_backend)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return fn(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1100, in forward
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return compiled_fn(full_args)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 321, in runtime_wrapper
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     all_outs = call_func_at_runtime_with_args(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 124, in call_func_at_runtime_with_args
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     out = normalize_as_list(f(args))
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]                             ^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 667, in inner_fn
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     outs = compiled_fn(args)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 488, in wrapper
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return compiled_fn(runtime_args)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 1478, in __call__
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return self.current_callable(inputs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_inductor/utils.py", line 1977, in run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return model(new_inputs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/tmp/torchinductor_dalistarh/ob/cobjy4b4neiv6uxnagpsrkuxnjmxs774i2sopv3lpxy66mnnzf3t.py", line 352, in call
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     triton_poi_fused_add_sigmoid_0.run(arg1_1, arg2_1, buf0, triton_poi_fused_add_sigmoid_0_xnumel, grid=grid(triton_poi_fused_add_sigmoid_0_xnumel), stream=stream7)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 879, in run
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     return launcher(
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]            ^^^^^^^^^
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "<string>", line 13, in launcher
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]   File "/home/dalistarh/test/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 365, in __call__
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self.launch(*args, **kwargs)
(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
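
The traceback above is hard to read because every line carries the worker and logger prefix. Below is a minimal sketch, not part of vLLM, of how one might strip that prefix and recover a plain Python traceback for a single worker. The regular expression is an assumption inferred only from the line format visible in this log, and the helper names `strip_prefix` and `worker_traceback` are hypothetical.

```python
import re

# Prefix seen on the log lines in this report, e.g.
# "(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] "
# The "(VllmWorkerProcess pid=...)" part is optional: lines from the main
# process in this log do not carry it. This pattern is an assumption based
# on the lines shown here, not a documented vLLM log format.
PREFIX = re.compile(
    r"^(?:\(VllmWorkerProcess pid=(?P<pid>\d+)\) )?"
    r"(?P<level>[A-Z]+) \d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"[\w.]+:\d+\] ?"
)


def strip_prefix(line):
    """Return (pid, level, payload); pid and level are None if nothing matched."""
    m = PREFIX.match(line)
    if m is None:
        return None, None, line
    return m.group("pid"), m.group("level"), line[m.end():]


def worker_traceback(lines, pid):
    """Collect the ERROR payload lines of one worker pid, in original order."""
    out = []
    for line in lines:
        p, level, payload = strip_prefix(line.rstrip("\n"))
        if p == pid and level == "ERROR":
            out.append(payload)
    return out


# A few lines copied from the log above.
SAMPLE = [
    "(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] Traceback (most recent call last):",
    "(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242]     self.launch(*args, **kwargs)",
    "(VllmWorkerProcess pid=580056) ERROR 03-17 15:28:43 multiproc_worker_utils.py:242] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered",
    "INFO 03-17 15:28:31 model_runner.py:1115] Loading model weights took 83.8786 GB",
]

print("\n".join(worker_traceback(SAMPLE, "580056")))
```

Feeding the full log through `worker_traceback(lines, "580056")` should yield the same frames as above without the prefix, which makes it easier to compare against tracebacks from the other workers.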

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
