Labels: bug (Something isn't working), module: distributed (For distributed feature issue)
Description
🐛 Describe the bug
Steps to reproduce: get the wheels from https://github.com/intel/torch-xpu-ops/actions/runs/16826215961, then run:
git clone -b distributed_2.9 https://github.com/daisyden/pytorch.git
cd pytorch
pip install pytest expecttest
pip install -r requirements.txt
pytest -v test/distributed/fsdp/test_fsdp_core.py::TestNoGradXPU::test_transformer_no_grad_mixed_precision_True_xpu
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/opt/conda/envs/test/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/opt/conda/envs/test/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 705, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 969, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1009, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 853, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 707, in wrapper
    fn()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3217, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/tmp/pytorch/test/distributed/fsdp/test_fsdp_core.py", line 434, in test_transformer_no_grad
    self.assertEqual(ref_output, no_grad_output)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4170, in assertEqual
    raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 68 / 184 (37.0%)
Greatest absolute difference: 0.01171875 at index (0, 1, 12) (up to 1e-05 allowed)
Greatest relative difference: 0.08544921875 at index (3, 0, 14) (up to 0.001 allowed)

To execute this test, run the following from the base repo dir:
    python test/distributed/fsdp/test_fsdp_core.py TestNoGradXPU.test_transformer_no_grad_mixed_precision_True_xpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 853, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 707, in wrapper
    fn()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3217, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/tmp/pytorch/test/distributed/fsdp/test_fsdp_core.py", line 434, in test_transformer_no_grad
    self.assertEqual(ref_output, no_grad_output)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4170, in assertEqual
    raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 68 / 184 (37.0%)
Greatest absolute difference: 0.01171875 at index (0, 1, 12) (up to 1e-05 allowed)
Greatest relative difference: 0.08544921875 at index (3, 0, 14) (up to 0.001 allowed)
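
For readers triaging the numbers above, here is a rough pure-Python sketch of the elementwise closeness criterion that, as far as I know, `torch.testing.assert_close` documents: `|actual - expected| <= atol + rtol * |expected|`. The tolerances and the greatest absolute difference are copied from the failure message. The helper name `is_close` and the reference value `1.0` are hypothetical, since the log does not show the actual tensor values, so treat this as an illustration of how far outside the tolerance the outputs are, not as a diagnosis.

```python
def is_close(actual: float, expected: float,
             rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Scalar closeness check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values copied from the failure message in this report.
RTOL = 1e-3                     # "up to 0.001 allowed"
ATOL = 1e-5                     # "up to 1e-05 allowed"
GREATEST_ABS_DIFF = 0.01171875  # at index (0, 1, 12)

# Hypothetical reference value; the real tensor entries are not in the log.
expected = 1.0
actual = expected + GREATEST_ABS_DIFF
print("passes at |expected| = 1.0:", is_close(actual, expected, RTOL, ATOL))

# Smallest |expected| at which a difference this large would still pass.
min_magnitude = (GREATEST_ABS_DIFF - ATOL) / RTOL
print("min |expected| needed to pass:", round(min_magnitude, 5))

# Share of mismatched elements reported by the assertion (68 of 184).
mismatch_pct = 100 * 68 / 184
print("mismatched elements (%):", round(mismatch_pct, 1))
```

Under these assumptions, a difference of 0.01171875 would only pass for reference values with magnitude around 11.7 or larger, which suggests the mismatch is well beyond the allowed tolerance for typical activations.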
Versions
PyTorch: https://github.com/daisyden/pytorch/tree/distributed_2.9