Commit 6cd0970

xwang233 authored and summerdo committed
[CUDA][Linalg] Patch crash of linalg.eigh when input matrix is ill-conditioned, in some cusolver version (pytorch#107082)
Related: pytorch#94772, pytorch#105359

I can locally reproduce this crash with pytorch 2.0.1 stable pip binary. The test already passes with the latest cuda 12.2 release.

Re: pytorch#94772 (comment)

> From discussion in triage review:
> - [x] we should add a test to prevent regressions
> - [x] properly document support wrt different CUDA versions
> - [x] possibly add support using MAGMA

Pull Request resolved: pytorch#107082
Approved by: https://github.com/lezcano
1 parent 3fa80b3 commit 6cd0970

3 files changed: +38 -2 lines changed


aten/src/ATen/cuda/Exceptions.h

Lines changed: 11 additions & 2 deletions
@@ -69,6 +69,13 @@ const char *cusparseGetErrorString(cusparseStatus_t status);

 namespace at::cuda::solver {
 C10_EXPORT const char* cusolverGetErrorMessage(cusolverStatus_t status);
+
+constexpr const char* _cusolver_backend_suggestion = \
+  "If you keep seeing this error, you may use " \
+  "`torch.backends.cuda.preferred_linalg_library()` to try " \
+  "linear algebra operators with other supported backends. " \
+  "See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library";
+
 } // namespace at::cuda::solver

 // When cuda < 11.5, cusolver raises CUSOLVER_STATUS_EXECUTION_FAILED when input contains nan.
@@ -85,13 +92,15 @@ C10_EXPORT const char* cusolverGetErrorMessage(cusolverStatus_t status);
           "cusolver error: ", \
           at::cuda::solver::cusolverGetErrorMessage(__err), \
           ", when calling `" #EXPR "`", \
-          ". This error may appear if the input matrix contains NaN."); \
+          ". This error may appear if the input matrix contains NaN. ", \
+          at::cuda::solver::_cusolver_backend_suggestion); \
     } else { \
       TORCH_CHECK( \
           __err == CUSOLVER_STATUS_SUCCESS, \
           "cusolver error: ", \
           at::cuda::solver::cusolverGetErrorMessage(__err), \
-          ", when calling `" #EXPR "`"); \
+          ", when calling `" #EXPR "`. ", \
+          at::cuda::solver::_cusolver_backend_suggestion); \
     } \
   } while (0)

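The suggestion string added to the error message points users at `torch.backends.cuda.preferred_linalg_library()`. A minimal sketch of how a caller might act on that hint is given here. It is not part of the commit: `run_with_backend_fallback` and its parameters are hypothetical names, and the sketch assumes the cusolver failure surfaces as a catchable `RuntimeError` whose message contains "cusolver error", as the `TORCH_CHECK` text in the diff suggests. The backend setter is passed in as an argument so the retry logic can be exercised without PyTorch or a GPU.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def run_with_backend_fallback(
    op: Callable[[], T],
    set_backend: Callable[[str], object],
    backends: Sequence[str] = ("cusolver", "magma"),
) -> T:
    """Hypothetical helper: run `op` under each backend until one succeeds."""
    if not backends:
        raise ValueError("at least one backend name is required")
    last_error: Optional[RuntimeError] = None
    try:
        for name in backends:
            set_backend(name)  # select the linear algebra backend to try
            try:
                return op()
            except RuntimeError as err:
                # Only swallow cusolver failures; re-raise anything else.
                if "cusolver error" not in str(err):
                    raise
                last_error = err
        # Every backend failed with a cusolver error: surface the last one.
        raise last_error
    finally:
        # Assumption: resetting to "default" is acceptable for the caller.
        set_backend("default")
```

With PyTorch installed, a call could look like `run_with_backend_fallback(lambda: torch.linalg.eigh(a), torch.backends.cuda.preferred_linalg_library)`. Note that the `finally` clause overrides any preference the caller had set earlier; saving and restoring the previous value would be a reasonable alternative.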

test/test_linalg.py

Lines changed: 20 additions & 0 deletions
@@ -988,6 +988,26 @@ def test_eigh_errors_and_warnings(self, device, dtype):
         with self.assertRaisesRegex(RuntimeError, "tensors to be on the same device"):
             torch.linalg.eigh(a, out=(out_w, out_v))

+    @skipCPUIfNoLapack
+    @dtypes(torch.float, torch.double)
+    @unittest.skipIf(_get_torch_cuda_version() < (12, 1), "Test is fixed on cuda 12.1 update 1.")
+    def test_eigh_svd_illcondition_matrix_input_should_not_crash(self, device, dtype):
+        # See https://github.com/pytorch/pytorch/issues/94772, https://github.com/pytorch/pytorch/issues/105359
+        # This test crashes with `cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED` on cuda 11.8,
+        # but passes on cuda 12.1 update 1 or later.
+        a = torch.ones(512, 512, dtype=dtype, device=device)
+        a[0, 0] = 1.0e-5
+        a[-1, -1] = 1.0e5
+
+        eigh_out = torch.linalg.eigh(a)
+        svd_out = torch.linalg.svd(a)
+
+        # Matrix input a is too ill-conditioned.
+        # We'll just compare the first two singular values/eigenvalues. They are 1.0e5 and 511.0
+        # The precision override with tolerance of 1.0 makes sense since ill-conditioned inputs are hard to converge
+        # to exact values.
+        self.assertEqual(eigh_out.eigenvalues.sort(descending=True).values[:2], [1.0e5, 511.0], atol=1.0, rtol=1.0e-2)
+        self.assertEqual(svd_out.S[:2], [1.0e5, 511.0], atol=1.0, rtol=1.0e-2)

     @skipCUDAIfNoMagma
     @skipCPUIfNoLapack
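The expected values `1.0e5` and `511.0` in the new test can be sanity-checked without a GPU. The sketch below is not part of the commit. It assumes the following observation about the test matrix: the span of the first basis vector, the normalized sum of the 510 middle basis vectors, and the last basis vector is mapped into itself, and every vector orthogonal to that span is mapped to zero, so the nonzero eigenvalues are those of a 3x3 symmetric matrix. The function names are hypothetical, and plain power iteration with deflation stands in for a real eigensolver.

```python
import math


def matvec(m, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def power_iteration(m, iters=200):
    """Return the dominant eigenvalue and unit eigenvector of symmetric m."""
    v = [1.0] * len(m)
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector.
    lam = sum(x * y for x, y in zip(v, matvec(m, v)))
    return lam, v


def top_two_eigenvalues(n=512, first=1.0e-5, last=1.0e5):
    """Top two eigenvalues of ones(n, n) with a[0, 0]=first and a[-1, -1]=last."""
    s = math.sqrt(n - 2)
    # Matrix restricted to span{e_first, normalized sum of middle e_i, e_last}.
    reduced = [
        [first, s, 1.0],
        [s, float(n - 2), s],
        [1.0, s, last],
    ]
    lam1, v1 = power_iteration(reduced)
    # Deflate the dominant eigenpair, then find the next one.
    deflated = [
        [reduced[i][j] - lam1 * v1[i] * v1[j] for j in range(3)]
        for i in range(3)
    ]
    lam2, _ = power_iteration(deflated)
    return lam1, lam2
```

Calling `top_two_eigenvalues()` should give values within the test's `atol=1.0` of `1.0e5` and `511.0`, which is consistent with the comment in the test that only the first two values are compared.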

torch/linalg/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -647,6 +647,13 @@
 :math:`\lambda_i` through the computation of
 :math:`\frac{1}{\min_{i \neq j} \lambda_i - \lambda_j}`.

+.. warning:: User may see pytorch crashes if running `eigh` on CUDA devices with CUDA versions before 12.1 update 1
+             with large ill-conditioned matrices as inputs.
+             Refer to :ref:`Linear Algebra Numerical Stability<Linear Algebra Stability>` for more details.
+             If this is the case, user may (1) tune their matrix inputs to be less ill-conditioned,
+             or (2) use :func:`torch.backends.cuda.preferred_linalg_library` to
+             try other supported backends.
+
 .. seealso::

         :func:`torch.linalg.eigvalsh` computes only the eigenvalues of a Hermitian matrix.
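The warning ties the crash to CUDA versions before 12.1 update 1. A rough sketch of how user code might check for an affected build follows. It is an illustration only: `parse_cuda_version` and `may_hit_eigh_crash` are hypothetical names, and the sketch assumes a "major.minor" version string such as the one reported by `torch.version.cuda` (which is `None` on builds without CUDA). A major.minor check cannot tell CUDA 12.1 apart from 12.1 update 1, so it mirrors the coarse `(12, 1)` threshold used in the test's `skipIf` rather than the exact boundary named in the warning.

```python
from typing import Optional, Tuple

# Threshold taken from the skipIf condition in the test added by this commit.
FIXED_IN: Tuple[int, int] = (12, 1)


def parse_cuda_version(version: Optional[str]) -> Optional[Tuple[int, int]]:
    """Turn a string like "11.8" into (11, 8); return None if there is no CUDA."""
    if version is None:
        return None
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def may_hit_eigh_crash(version: Optional[str]) -> bool:
    """True when the CUDA version predates the threshold used by the test."""
    parsed = parse_cuda_version(version)
    return parsed is not None and parsed < FIXED_IN
```

With PyTorch installed, `may_hit_eigh_crash(torch.version.cuda)` could be used to decide whether to switch backends up front with `torch.backends.cuda.preferred_linalg_library`.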
