[distributed][sharded_tensor] test/distributed/_shard/sharded_tensor/test_sharded_tensor.py has 12 cases failing with "RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)" #2004

@libohao1201

🐛 Describe the bug

The following cases failed with "RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)".

test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_complete_world_size
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_multiple_local_shards
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_new_group
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_partial_world_size
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_grid_sharding
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_multiple_local_shards
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_new_group
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_partial_world_size
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorEnumerable::test_with_rpc_names
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorFromLocalTensor::test_init_from_local_tensor
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorFromLocalShards::test_init_from_local_shards
test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorFromLocalShards::test_init_from_local_shards_and_global_metadata

Error message

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 864, in run_test
    getattr(self, test_name)()
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 718, in wrapper
    fn()
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3226, in wrapper
    method(*args, **kwargs)
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1919, in wrapper
    return fn(*args, **kwargs)
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 102, in wrapper
    func(self, *args, **kwargs)
  File "/home/sdp/miniforge-pypy3/envs/shared_ut/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/home/sdp/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 999, in test_multiple_local_shards
    shard = remote_shard.to_here()
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)

To execute this test, run the following from the base repo dir:
    python test/distributed/_shard/sharded_tensor/test_sharded_tensor.py TestShardedTensorChunked.test_multiple_local_shards

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Versions

Reproducer

# Please get the wheel from http://mengfeil-ubuntu.sh.intel.com/test/torch_whl/ww33_distributed/torch-2.9.0a0%2Bgit95ef9c6-cp310-cp310-linux_x86_64.whl 
git clone -b libo/distrituted_shared_p4 https://github.com/libohao1201/pytorch.git
cd pytorch
pip install pytest expecttest zstandard
pip install -r requirements.txt

pytest -v test/distributed/_shard/sharded_tensor/test_sharded_tensor.py::TestShardedTensorChunked::test_complete_world_size
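The reproducer above runs only one of the failing cases. As a convenience, here is a minimal sketch that builds a single pytest invocation covering all 12 cases listed in this report. The helper names (`failing_node_ids`, `build_pytest_command`, `run_failing_cases`) are hypothetical and not part of the PyTorch test suite. It assumes it is run from the base repo directory with the wheel already installed.

```python
import subprocess
import sys

# Path of the test file, relative to the base repo directory (taken from this report).
TEST_FILE = "test/distributed/_shard/sharded_tensor/test_sharded_tensor.py"

# The 12 failing cases listed in this report, grouped by test class.
FAILING_CASES = {
    "TestShardedTensorChunked": [
        "test_complete_world_size",
        "test_multiple_local_shards",
        "test_new_group",
        "test_partial_world_size",
    ],
    "TestShardedTensorEnumerable": [
        "test_grid_sharding",
        "test_multiple_local_shards",
        "test_new_group",
        "test_partial_world_size",
        "test_with_rpc_names",
    ],
    "TestShardedTensorFromLocalTensor": [
        "test_init_from_local_tensor",
    ],
    "TestShardedTensorFromLocalShards": [
        "test_init_from_local_shards",
        "test_init_from_local_shards_and_global_metadata",
    ],
}


def failing_node_ids():
    """Return pytest node IDs (file::Class::test) for every failing case."""
    return [
        f"{TEST_FILE}::{cls}::{name}"
        for cls, names in FAILING_CASES.items()
        for name in names
    ]


def build_pytest_command():
    """Build the argv list for one verbose pytest run over all failing cases."""
    return [sys.executable, "-m", "pytest", "-v", *failing_node_ids()]


def run_failing_cases():
    """Run pytest on the failing cases and return its exit code (hypothetical helper)."""
    return subprocess.run(build_pytest_command(), check=False).returncode
```

Calling `run_failing_cases()` from the `pytorch` checkout should rerun the whole failing set in one go; `build_pytest_command()` can be printed first to inspect the exact command.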


Labels

bug, module: distributed
