Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0
Clang version : Could not collect
CMake version : version 3.22.1
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.11 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:09:17) [GCC 11.2.0] (64-bit runtime)
Python platform : Linux-5.19.0-46-generic-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.9.41
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA L20
GPU 1: NVIDIA L20
GPU 2: NVIDIA L20
GPU 3: NVIDIA L20
Nvidia driver version : 575.51.03
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6430
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 8
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 3 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 128 MiB (64 instances)
L3 cache: 120 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.23.0
[pip3] transformers==4.56.2
[pip3] triton==3.4.0
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.13.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.27.3 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] pyzmq 27.1.0 pypi_0 pypi
[conda] torch 2.8.0 pypi_0 pypi
[conda] torchaudio 2.8.0 pypi_0 pypi
[conda] torchvision 0.23.0 pypi_0 pypi
[conda] transformers 4.56.2 pypi_0 pypi
[conda] triton 3.4.0 pypi_0 pypi
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.10.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE SYS SYS SYS 0-31,64-95 0 N/A
GPU1 NODE X SYS SYS SYS 0-31,64-95 0 N/A
GPU2 SYS SYS X NODE NODE 32-63,96-127 1 N/A
GPU3 SYS SYS NODE X NODE 32-63,96-127 1 N/A
NIC0 SYS SYS NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
==============================
Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
When running the throughput benchmark on the vLLM V1 engine with pipeline parallelism (--pipeline-parallel-size > 1), the run crashes with a CUDA illegal memory access.
Description
Environment:
- vLLM version: 0.10.2 (V1 engine)
- GPUs: 4× L20 (48GB)
- Model: codellama/CodeLlama-34b-hf
- Same issue also reproduced on Qwen/Qwen2.5-32B-Instruct → not model-specific.
Command
vllm bench throughput \
--model codellama/CodeLlama-34b-hf \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
-pp 4 \
--gpu-memory-utilization 0.8
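To narrow down where the illegal access originates, the same benchmark can be rerun with synchronous kernel launches and NCCL debug logging (the watchdog traceback below notes that the asynchronous error makes the reported stack unreliable). This is only a debugging sketch; CUDA_LAUNCH_BLOCKING and NCCL_DEBUG are standard CUDA/NCCL environment variables, not vLLM flags:
# same command as above, with debugging environment variables set
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
vllm bench throughput \
--model codellama/CodeLlama-34b-hf \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
-pp 4 \
--gpu-memory-utilization 0.8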
Notes
- Reproduces consistently with different models (CodeLlama-34B, Qwen2.5-32B).
- Issue is specific to v1 engine + pipeline parallelism.
- Using the same configuration with tensor parallelism (tp > 1, pp = 1) runs without issues; see the comparison sketch after this list.
- Likely related to PP communication.
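For comparison, a minimal sketch of the tensor-parallel run that completes without error on the same machine (using -tp 4 as an example; per the note above, any tp > 1 with pp = 1 works):
# same benchmark, tensor parallelism instead of pipeline parallelism
vllm bench throughput \
--model codellama/CodeLlama-34b-hf \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
-tp 4 \
--gpu-memory-utilization 0.8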
Error Log
......
(Worker_PP1 pid=2729458) INFO 09-25 00:15:07 [default_loader.py:268] Loading weights took 4.05 seconds
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:04<00:00, 1.61it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:04<00:00, 1.67it/s]
(Worker_PP0 pid=2729457)
(Worker_PP0 pid=2729457) INFO 09-25 00:15:07 [default_loader.py:268] Loading weights took 4.19 seconds
(Worker_PP2 pid=2729459) INFO 09-25 00:15:08 [default_loader.py:268] Loading weights took 4.24 seconds
(Worker_PP3 pid=2729460) INFO 09-25 00:15:08 [gpu_model_runner.py:2392] Model loading took 15.9614 GiB and 4.191021 seconds
(Worker_PP1 pid=2729458) INFO 09-25 00:15:08 [gpu_model_runner.py:2392] Model loading took 15.4731 GiB and 4.205874 seconds
(Worker_PP0 pid=2729457) INFO 09-25 00:15:08 [gpu_model_runner.py:2392] Model loading took 15.9613 GiB and 4.346570 seconds
(Worker_PP2 pid=2729459) INFO 09-25 00:15:08 [gpu_model_runner.py:2392] Model loading took 15.4731 GiB and 4.426911 seconds
(Worker_PP3 pid=2729460) INFO 09-25 00:15:11 [backends.py:539] Using cache directory: /home/zhanghb/.cache/vllm/torch_compile_cache/9bee0c48a6/rank_3_0/backbone for vLLM's torch.compile
(Worker_PP3 pid=2729460) INFO 09-25 00:15:11 [backends.py:550] Dynamo bytecode transform time: 2.15 s
(Worker_PP1 pid=2729458) INFO 09-25 00:15:11 [backends.py:539] Using cache directory: /home/zhanghb/.cache/vllm/torch_compile_cache/9bee0c48a6/rank_1_0/backbone for vLLM's torch.compile
(Worker_PP1 pid=2729458) INFO 09-25 00:15:11 [backends.py:550] Dynamo bytecode transform time: 2.22 s
(Worker_PP2 pid=2729459) INFO 09-25 00:15:11 [backends.py:539] Using cache directory: /home/zhanghb/.cache/vllm/torch_compile_cache/9bee0c48a6/rank_2_0/backbone for vLLM's torch.compile
(Worker_PP2 pid=2729459) INFO 09-25 00:15:11 [backends.py:550] Dynamo bytecode transform time: 2.23 s
(Worker_PP0 pid=2729457) INFO 09-25 00:15:11 [backends.py:539] Using cache directory: /home/zhanghb/.cache/vllm/torch_compile_cache/61975149e0/rank_0_0/backbone for vLLM's torch.compile
(Worker_PP0 pid=2729457) INFO 09-25 00:15:11 [backends.py:550] Dynamo bytecode transform time: 2.25 s
(Worker_PP2 pid=2729459) INFO 09-25 00:15:11 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 0.551 s
(Worker_PP3 pid=2729460) INFO 09-25 00:15:11 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 0.650 s
(Worker_PP0 pid=2729457) INFO 09-25 00:15:11 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 0.557 s
(Worker_PP1 pid=2729458) INFO 09-25 00:15:11 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 0.581 s
(Worker_PP2 pid=2729459) INFO 09-25 00:15:12 [monitor.py:34] torch.compile takes 2.23 s in total
(Worker_PP0 pid=2729457) INFO 09-25 00:15:12 [monitor.py:34] torch.compile takes 2.25 s in total
(Worker_PP3 pid=2729460) INFO 09-25 00:15:12 [monitor.py:34] torch.compile takes 2.15 s in total
(Worker_PP1 pid=2729458) INFO 09-25 00:15:12 [monitor.py:34] torch.compile takes 2.22 s in total
(Worker_PP2 pid=2729459) INFO 09-25 00:15:14 [gpu_worker.py:298] Available KV cache memory: 18.22 GiB
(Worker_PP0 pid=2729457) INFO 09-25 00:15:14 [gpu_worker.py:298] Available KV cache memory: 18.02 GiB
(Worker_PP1 pid=2729458) INFO 09-25 00:15:14 [gpu_worker.py:298] Available KV cache memory: 18.25 GiB
(Worker_PP3 pid=2729460) INFO 09-25 00:15:14 [gpu_worker.py:298] Available KV cache memory: 17.77 GiB
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:864] GPU KV cache size: 393,728 tokens
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:868] Maximum concurrency for 16,384 tokens per request: 24.03x
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:864] GPU KV cache size: 398,768 tokens
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:868] Maximum concurrency for 16,384 tokens per request: 24.34x
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:864] GPU KV cache size: 398,080 tokens
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:868] Maximum concurrency for 16,384 tokens per request: 24.30x
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:864] GPU KV cache size: 388,272 tokens
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:14 [kv_cache_utils.py:868] Maximum concurrency for 16,384 tokens per request: 23.70x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████| 67/67 [00:04<00:00, 16.44it/s]
(Worker_PP1 pid=2729458) INFO 09-25 00:15:19 [gpu_model_runner.py:3118] Graph capturing finished in 5 secs, took 0.81 GiB
(Worker_PP1 pid=2729458) INFO 09-25 00:15:19 [gpu_worker.py:391] Free memory on device (44.2/44.53 GiB) on startup. Desired GPU memory utilization is (0.8, 35.62 GiB). Actual usage is 15.47 GiB for weight, 1.77 GiB for peak activation, 0.13 GiB for non-torch memory, and 0.81 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=18576896409` to fit into requested memory, or `--kv-cache-memory=27786631168` to fully utilize gpu memory. Current kv cache memory in use is 19600306585 bytes.
(Worker_PP0 pid=2729457) INFO 09-25 00:15:19 [gpu_model_runner.py:3118] Graph capturing finished in 5 secs, took 0.81 GiB
(Worker_PP0 pid=2729457) INFO 09-25 00:15:19 [gpu_worker.py:391] Free memory on device (44.2/44.53 GiB) on startup. Desired GPU memory utilization is (0.8, 35.62 GiB). Actual usage is 15.96 GiB for weight, 1.52 GiB for peak activation, 0.12 GiB for non-torch memory, and 0.81 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=18329432473` to fit into requested memory, or `--kv-cache-memory=27539167232` to fully utilize gpu memory. Current kv cache memory in use is 19352842649 bytes.
(Worker_PP2 pid=2729459) INFO 09-25 00:15:19 [gpu_model_runner.py:3118] Graph capturing finished in 5 secs, took 0.81 GiB
(Worker_PP2 pid=2729459) INFO 09-25 00:15:19 [gpu_worker.py:391] Free memory on device (44.2/44.53 GiB) on startup. Desired GPU memory utilization is (0.8, 35.62 GiB). Actual usage is 15.47 GiB for weight, 1.77 GiB for peak activation, 0.16 GiB for non-torch memory, and 0.81 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=18543341977` to fit into requested memory, or `--kv-cache-memory=27753076736` to fully utilize gpu memory. Current kv cache memory in use is 19566752153 bytes.
(Worker_PP3 pid=2729460) INFO 09-25 00:15:19 [gpu_model_runner.py:3118] Graph capturing finished in 5 secs, took 0.81 GiB
(Worker_PP3 pid=2729460) INFO 09-25 00:15:19 [gpu_worker.py:391] Free memory on device (44.2/44.53 GiB) on startup. Desired GPU memory utilization is (0.8, 35.62 GiB). Actual usage is 15.96 GiB for weight, 1.77 GiB for peak activation, 0.12 GiB for non-torch memory, and 0.81 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=18054689177` to fit into requested memory, or `--kv-cache-memory=27264423936` to fully utilize gpu memory. Current kv cache memory in use is 19084390809 bytes.
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:19 [core.py:218] init engine (profile, create kv cache, warmup model) took 10.95 seconds
(EngineCore_DP0 pid=2729309) INFO 09-25 00:15:19 [core.py:145] Batch queue is enabled with size 4
INFO 09-25 00:15:28 [llm.py:295] Supported_tasks: ['generate']
INFO 09-25 00:15:28 [__init__.py:36] No IOProcessor plugins requested by the model
Adding requests: 0%| | 0/1000 [00:00<?, ?it/s](Worker_PP0 pid=2729457) /home/zhanghb/.venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:523: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
(Worker_PP0 pid=2729457) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
Adding requests: 14%|███████▏ | 141/1000 [00:00<00:00, 1409.06it/s](Worker_PP1 pid=2729458) /home/zhanghb/.venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:523: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
(Worker_PP1 pid=2729458) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
Adding requests: 31%|████████████████ | 314/1000 [00:00<00:00, 1590.50it/s](Worker_PP2 pid=2729459) /home/zhanghb/.venv/lib/python3.12/site-packages/vllm/distributed/parallel_state.py:523: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
(Worker_PP2 pid=2729459) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
Adding requests: 100%|██████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1592.37it/s]
[detokenizer.py:245] Encountered invalid prefix detokenization error for request 509, resetting decode stream.
Processed prompts: 42%|▍| 423/1000 [01:53<00:53, 10.76it/s, est. speed input: 1038.55 toks/s, output: 433.32 [rank0]:[E925 00:17:22.110851246 ProcessGroupNCCL.cpp:2068] [PG ID 4 PG GUID 19 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f635c97eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111c7 (0x7f635cd0c1c7 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f62ffec4640 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f62ffed3e28 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7f62ffed6f48 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7f62ffed8ec2 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f62e3843253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f635d63cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f635d6cea40 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 4 PG GUID 19 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f635c97eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111c7 (0x7f635cd0c1c7 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f62ffec4640 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f62ffed3e28 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7f62ffed6f48 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7f62ffed8ec2 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f62e3843253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f635d63cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f635d6cea40 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f635c97eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe1c1a1 (0x7f62ffeb01a1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9468e6 (0x7f62ff9da8e6 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7f62e3843253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7f635d63cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x126a40 (0x7f635d6cea40 in /lib/x86_64-linux-gnu/libc.so.6)
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:22 [multiproc_executor.py:149] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
(Worker_PP1 pid=2729458) INFO 09-25 00:17:22 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_PP2 pid=2729459) INFO 09-25 00:17:22 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_PP3 pid=2729460) INFO 09-25 00:17:22 [multiproc_executor.py:546] Parent process exited, terminating worker
[rank3]:[W925 00:17:22.668143884 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=87, addr=[broker.mmdcszpulsarmmdata.wx.com]:38158, remote=[mmsche.wx.com]:48749): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fbaf297eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7fbad69ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d6a8cd (0x7fbad69f08cd in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b47a (0x7fbad69f147a in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7fbad69ec19e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7fba95ed1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fba79843253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fbaf362fac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7fbaf36c1a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[W925 00:17:22.674155157 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank2]:[W925 00:17:22.670231834 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=87, addr=[sandbox.btsvr.wx.com]:37802, remote=[tianjin.btsvr.wx.com]:48749): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f050ac12eb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f054c5ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d6a8cd (0x7f054c5f08cd in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b47a (0x7f054c5f147a in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7f054c5ec19e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f050bad1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f04ef243253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f05690f6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f0569188a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[W925 00:17:22.677506860 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank1]:[W925 00:17:23.009269413 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=87, addr=[broker.mmdcszpulsarmmdata.wx.com]:40242, remote=[mmsche.wx.com]:48749): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2c56f7eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f2c3afef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d6a8cd (0x7f2c3aff08cd in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b47a (0x7f2c3aff147a in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7f2c3afec19e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f2bfa4d1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f2bdde43253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f2c57cc8ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f2c57d5aa40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[W925 00:17:23.020097890 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank3]:[W925 00:17:23.674361628 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[sandbox.btsvr.wx.com]:38158, remote=[mmsche.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fbaf297eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7fbad69ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7fbad69efd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7fbad69f186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7fbad69ec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7fba95ed1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fba79843253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fbaf362fac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7fbaf36c1a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[W925 00:17:23.678587674 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank2]:[W925 00:17:23.677648143 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[broker.mmdcszpulsarmmdata.wx.com]:37802, remote=[tianjin.btsvr.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f050ac12eb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f054c5ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7f054c5efd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7f054c5f186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7f054c5ec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f050bad1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f04ef243253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f05690f6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f0569188a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[W925 00:17:23.681460208 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W925 00:17:24.020331150 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[sandbox.btsvr.wx.com]:40242, remote=[mmsche.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2c56f7eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f2c3afef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7f2c3afefd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7f2c3aff186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7f2c3afec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f2bfa4d1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f2bdde43253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f2c57cc8ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f2c57d5aa40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[W925 00:17:24.024483009 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank3]:[W925 00:17:24.678774897 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[broker.mmdcszpulsarmmdata.wx.com]:38158, remote=[mmsche.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fbaf297eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7fbad69ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7fbad69efd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7fbad69f186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7fbad69ec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7fba95ed1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fba79843253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fbaf362fac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7fbaf36c1a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[W925 00:17:24.683054852 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank2]:[W925 00:17:24.681600761 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[sandbox.btsvr.wx.com]:37802, remote=[tianjin.btsvr.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f050ac12eb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f054c5ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7f054c5efd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7f054c5f186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7f054c5ec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f050bad1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f04ef243253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f05690f6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f0569188a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[W925 00:17:24.685569373 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W925 00:17:25.024673860 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[broker.mmdcszpulsarmmdata.wx.com]:40242, remote=[mmsche.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2c56f7eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f2c3afef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7f2c3afefd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7f2c3aff186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7f2c3afec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f2bfa4d1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f2bdde43253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f2c57cc8ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f2c57d5aa40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[W925 00:17:25.030868578 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank3]:[W925 00:17:25.683262024 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[sandbox.btsvr.wx.com]:38158, remote=[mmsche.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fbaf297eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7fbad69ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7fbad69efd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7fbad69f186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7fbad69ec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7fba95ed1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fba79843253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fbaf362fac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7fbaf36c1a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[W925 00:17:25.687312044 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank2]:[W925 00:17:25.685726736 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[broker.mmdcszpulsarmmdata.wx.com]:37802, remote=[tianjin.btsvr.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f050ac12eb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f054c5ef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7f054c5efd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7f054c5f186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7f054c5ec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f050bad1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f04ef243253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f05690f6ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f0569188a40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[W925 00:17:25.689677397 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W925 00:17:26.031082680 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=87, addr=[sandbox.btsvr.wx.com]:40242, remote=[mmsche.wx.com]:48749): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2c56f7eeb0 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5d694d1 (0x7f2c3afef4d1 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5d69d62 (0x7f2c3afefd62 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5d6b86e (0x7f2c3aff186e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7f2c3afec18e in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7f2bfa4d1b18 in /home/zhanghb/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f2bdde43253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f2c57cc8ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a40 (0x7f2c57d5aa40 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[W925 00:17:26.035572242 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.2) with config: model='/home/zhanghb/CodeLlama-34b-hf', speculative_config=None, tokenizer='/home/zhanghb/CodeLlama-34b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/zhanghb/CodeLlama-34b-hf, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null},
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=674,prompt_token_ids_len=17,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=33, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([16658, 16659],),num_computed_tokens=0,lora_request=None), NewRequestData(req_id=675,prompt_token_ids_len=962,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=19, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([16660, 16661, 16662, 16663, 16664, 16665, 16666, 16667, 16668, 16669, 16670, 16671, 16672, 16673, 16674, 16675, 16676, 16677, 16678, 16679, 16680, 16681, 16682, 16683, 16684, 16685, 16686, 16687, 16688, 16689, 16690, 16691, 16692, 16693, 16694, 16695, 16696, 16697, 16698, 16699, 16700, 16701, 16702, 16703, 16704, 16705, 16706, 16707, 16708, 16709, 16710, 16711, 16712, 16713, 16714, 16715, 16716, 16717, 16718, 16719, 16720],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=['1', '5', '10', '12', '20', '26', '130', '131', '135', '139', '146', '152', '153', '156', '160', '162', '165', '257', '266', '286', '287', '298', '311', '331', '333', '337', '361', '371', '399', '408', '411', '412', '424', '429', '440', '442', '453', '455', '458', '468', '474', '477', '481', '484', '494', '509', '514', '528', '533', '540', '555', '556', '570', '575', '578', '582', '588', '594', '597', '615', '623', '629', '633', '634', '636', '649', '654', '655', '660', '661', '668'], resumed_from_preemption=[false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], new_token_ids=[[29974], [4275], [29894], [14226], [25957], [29880], [29954], [29871], [18232], [17156], [29889], [1283], [12177], [462], [1049], [4571], [29897], [29871], [29897], [278], [10780], [29896], [11590], [12], [1732], [526], [338], [29899], [29906], [24854], [15110], [580], [13], [29906], [7791], [13], [1375], [12932], [13011], [13], [29908], [29892], [1867], [2450], [1146], [610], [27489], [1988], [1090], [29974], [29954], [322], [474], [4323], [3255], [29906], [30152], [353], [29889], [13], [29902], [370], [29912], [13], [462], [13], [1307], [13], [3128], [29892], [278]], new_block_ids=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 
[[16650]], null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, [[16651]], null, null, [[16652]], null, null, null, null, [[16653]], null, null, null, null, [[16654]], null, [[16655]], null, null, [[16656]], null, null, null, null, null, null, null, null, null, null, null, null, [[16657]], null, null, null, null, null, null, null], num_computed_tokens=[445, 444, 431, 922, 1258, 859, 447, 435, 802, 425, 426, 479, 460, 454, 425, 1248, 459, 427, 425, 430, 745, 664, 463, 404, 408, 393, 388, 387, 430, 348, 307, 374, 720, 291, 273, 464, 265, 439, 237, 316, 272, 229, 262, 397, 198, 704, 244, 192, 150, 150, 128, 143, 95, 115, 91, 92, 196, 142, 91, 440, 156, 392, 44, 48, 41, 21, 170, 50, 154, 31, 152]), num_scheduled_tokens={131: 1, 528: 1, 474: 1, 556: 1, 399: 1, 453: 1, 570: 1, 311: 1, 20: 1, 668: 1, 165: 1, 629: 1, 26: 1, 160: 1, 333: 1, 5: 1, 156: 1, 540: 1, 623: 1, 660: 1, 266: 1, 10: 1, 257: 1, 298: 1, 484: 1, 411: 1, 509: 1, 578: 1, 337: 1, 674: 17, 458: 1, 555: 1, 582: 1, 588: 1, 331: 1, 597: 1, 12: 1, 575: 1, 371: 1, 153: 1, 655: 1, 412: 1, 1: 1, 636: 1, 287: 1, 455: 1, 675: 962, 634: 1, 649: 1, 139: 1, 654: 1, 130: 1, 633: 1, 661: 1, 442: 1, 286: 1, 533: 1, 594: 1, 424: 1, 429: 1, 494: 1, 162: 1, 408: 1, 361: 1, 615: 1, 440: 1, 477: 1, 481: 1, 135: 1, 514: 1, 146: 1, 152: 1, 468: 1}, total_num_scheduled_tokens=1050, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=['457', '656'], free_encoder_mm_hashes=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=256, num_waiting_reqs=320, step_counter=0, current_wave=0, kv_cache_usage=0.2595813071787686, prefix_cache_stats=PrefixCacheStats(reset=False, requests=2, queries=1420, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] Traceback (most recent call last):
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] engine_core.run_busy_loop()
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] self._process_engine_step()
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 353, in step_with_batch_queue
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] model_output = self.execute_model_with_error_logging(
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] raise err
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] return model_fn(scheduler_output)
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 354, in <lambda>
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] lambda _: future.result(), scheduler_output)
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] return self.__get_result()
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] raise self._exception
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/usr/lib/python3.12/concurrent/futures/thread.py", line 59, in run
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] result = self.fn(*self.args, **self.kwargs)
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 239, in get_response
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] status, result = w.worker_response_mq.dequeue(
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 507, in dequeue
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] with self.acquire_read(timeout, cancel) as buf:
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] return next(self.gen)
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 464, in acquire_read
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] raise RuntimeError("cancelled")
(EngineCore_DP0 pid=2729309) ERROR 09-25 00:17:26 [core.py:720] RuntimeError: cancelled
Traceback (most recent call last):
File "/home/zhanghb/.venv/bin/vllm", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
args.dispatch_function(args)
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
main(args)
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/benchmarks/throughput.py", line 633, in main
elapsed_time, request_outputs = run_vllm(
^^^^^^^^^
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/benchmarks/throughput.py", line 81, in run_vllm
outputs = llm.generate(prompts,
^^^^^^^^^^^^^^^^^^^^^
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 396, in generate
outputs = self._run_engine(use_tqdm=use_tqdm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1550, in _run_engine
step_outputs = self.llm_engine.step()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 248, in step
outputs = self.engine_core.get_output()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhanghb/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 670, in get_output
raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Processed prompts: 42%|▍| 424/1000 [01:58<02:41, 3.57it/s, est. speed input: 1041.51 toks/s, output: 433.40
/usr/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.