Open
Labels
bug (Something isn't working), module: distributed (For distributed feature issue)
Description
🐛 Describe the bug
The following test cases fail with "RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)":
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_complete_world_size
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_multiple_local_shards
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_new_group
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_partial_world_size
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_grid_sharding
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_multiple_local_shards
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_new_group
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_partial_world_size
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_with_rpc_names
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorFromLocalTensor::test_init_from_local_tensor
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorFromLocalShards::test_init_from_local_shards
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorFromLocalShards::test_init_from_local_shards_and_global_metadata
Error message
Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 864, in run_test
    getattr(self, test_name)()
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 718, in wrapper
    fn()
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3226, in wrapper
    method(*args, **kwargs)
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1919, in wrapper
    return fn(*args, **kwargs)
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 102, in wrapper
    func(self, *args, **kwargs)
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/home/sdp/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 999, in test_multiple_local_shards
    shard = remote_shard.to_here()
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
To execute this test, run the following from the base repo dir:
python test/distributed/_shard/sharded_tensor/test_sharded_tensor.py TestShardedTensorChunked.test_multiple_local_shards
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
Versions
Reproducer
# Please get the wheel from http://mengfeil-ubuntu.sh.intel.com/test/torch_whl/ww33_distributed/torch-2.9.0a0%2Bgit95ef9c6-cp310-cp310-linux_x86_64.whl
git clone -b libo/distrituted_shared_p4 https://github.com/libohao1201/pytorch.git
cd pytorch
pip install pytest expecttest zstandard
pip install -r requirements.txt
pytest -v test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_complete_world_size
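The pytest command above exercises only one of the twelve failing cases. As a convenience, the following is a minimal sketch of a helper that builds a single pytest invocation covering every case listed in this report. The names `TEST_FILE`, `FAILING`, `node_ids`, `build_command`, and `run` are hypothetical and not part of PyTorch; the sketch assumes it is launched from the base repo dir with the wheel above installed.

```python
import shlex
import subprocess
import sys

# Test file and failing cases, copied from the list in this report.
TEST_FILE = "test/distributed/_shard/sharded_tensor/test_sharded_tensor.py"

FAILING = {
    "TestShardedTensorChunked": [
        "test_complete_world_size",
        "test_multiple_local_shards",
        "test_new_group",
        "test_partial_world_size",
    ],
    "TestShardedTensorEnumerable": [
        "test_grid_sharding",
        "test_multiple_local_shards",
        "test_new_group",
        "test_partial_world_size",
        "test_with_rpc_names",
    ],
    "TestShardedTensorFromLocalTensor": [
        "test_init_from_local_tensor",
    ],
    "TestShardedTensorFromLocalShards": [
        "test_init_from_local_shards",
        "test_init_from_local_shards_and_global_metadata",
    ],
}


def node_ids():
    """Return pytest node IDs (file::Class::test) for every failing case."""
    return [
        f"{TEST_FILE}::{cls}::{name}"
        for cls, names in FAILING.items()
        for name in names
    ]


def build_command():
    """Build the pytest argv that runs all failing cases verbosely."""
    return [sys.executable, "-m", "pytest", "-v", *node_ids()]


def run():
    """Run the command from the current directory and return pytest's exit code."""
    return subprocess.run(build_command()).returncode
```

Calling `print(shlex.join(build_command()))` shows the full command line without executing it, and `run()` executes it; neither is invoked automatically when the module is imported.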