[distributed][fsdp] Accuracy gaps on xelink #1926

@zxd1997066

🐛 Describe the bug

Get the wheels from https://github.com/intel/torch-xpu-ops/actions/runs/16826215961, then run:
git clone -b distributed_2.9 https://github.com/daisyden/pytorch.git
cd pytorch
pip install pytest expecttest
pip install -r requirements.txt

pytest -v test/distributed/fsdp/test_fsdp_core.py::TestNoGradXPU::test_transformer_no_grad_mixed_precision_True_xpu
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/opt/conda/envs/test/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/opt/conda/envs/test/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 705, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 969, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1009, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 853, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 707, in wrapper
    fn()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3217, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/tmp/pytorch/test/distributed/fsdp/test_fsdp_core.py", line 434, in test_transformer_no_grad
    self.assertEqual(ref_output, no_grad_output)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4170, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 68 / 184 (37.0%)
Greatest absolute difference: 0.01171875 at index (0, 1, 12) (up to 1e-05 allowed)
Greatest relative difference: 0.08544921875 at index (3, 0, 14) (up to 0.001 allowed)

To execute this test, run the following from the base repo dir:
    python test/distributed/fsdp/test_fsdp_core.py TestNoGradXPU.test_transformer_no_grad_mixed_precision_True_xpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 853, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 707, in wrapper
    fn()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3217, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/tmp/pytorch/test/distributed/fsdp/test_fsdp_core.py", line 434, in test_transformer_no_grad
    self.assertEqual(ref_output, no_grad_output)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4170, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 68 / 184 (37.0%)
Greatest absolute difference: 0.01171875 at index (0, 1, 12) (up to 1e-05 allowed)
Greatest relative difference: 0.08544921875 at index (3, 0, 14) (up to 0.001 allowed)
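
To put the numbers above in context, here is a minimal sketch of the closeness rule that, as far as I understand, `assertEqual` applies through `torch.testing.assert_close`: `|actual - expected| <= atol + rtol * |expected|`. It uses plain Python floats, so it does not need an XPU or a PyTorch build. The helper names (`is_close`, `min_magnitude_to_pass`) and the element values in the usage lines are hypothetical. The log only reports differences and tolerances, not the underlying tensor values.

```python
# Sketch of the tolerance rule, assuming the documented assert_close formula:
#   |actual - expected| <= atol + rtol * |expected|
RTOL = 1e-3  # "up to 0.001 allowed" in the log
ATOL = 1e-5  # "up to 1e-05 allowed" in the log

GREATEST_ABS_DIFF = 0.01171875  # taken from the log


def is_close(actual: float, expected: float,
             rtol: float = RTOL, atol: float = ATOL) -> bool:
    """Return True if `actual` is within tolerance of `expected`."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def min_magnitude_to_pass(abs_diff: float,
                          rtol: float = RTOL, atol: float = ATOL) -> float:
    """Smallest |expected| for which `abs_diff` would still pass."""
    return max(0.0, (abs_diff - atol) / rtol)


if __name__ == "__main__":
    # Hypothetical element: expected value 1.0, off by the reported difference.
    print(is_close(1.0 + GREATEST_ABS_DIFF, 1.0))
    # Magnitude an element would need for this difference to be tolerated.
    print(min_magnitude_to_pass(GREATEST_ABS_DIFF))
```

Under these tolerances, an absolute difference of 0.01171875 would pass only for elements with magnitude of roughly 11.7 or more, so the reported relative difference of about 8.5% is well outside what the test allows with the tolerances shown.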

Versions

PyTorch: https://github.com/daisyden/pytorch/tree/distributed_2.9

Labels

bug, module: distributed