"CUDA error, AssertionError: Tensor-likes are not close!" on model test for cuda-resnet101 #7618

atalman · 2023-05-23T15:01:26Z

🐛 Describe the bug

Similar to: #7143

When switching CI from CUDA 11.7 to CUDA 11.8, the unit tests on Linux fail:
#7616

2023-05-23T13:41:31.5794788Z =================================== FAILURES ===================================
2023-05-23T13:41:31.5795148Z __________________ test_classification_model[cuda-resnet101] ___________________
2023-05-23T13:41:31.5795459Z Traceback (most recent call last):
2023-05-23T13:41:31.5795737Z   File "/work/test/test_models.py", line 705, in test_classification_model
2023-05-23T13:41:31.5796046Z     _assert_expected(out.cpu(), model_name, prec=prec)
2023-05-23T13:41:31.5796368Z   File "/work/test/test_models.py", line 155, in _assert_expected
2023-05-23T13:41:31.5796725Z     torch.testing.assert_close(output, expected, rtol=rtol, atol=atol, check_dtype=False, check_device=False)
2023-05-23T13:41:31.5797207Z   File "/opt/conda/envs/ci/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
2023-05-23T13:41:31.5797525Z     raise error_metas[0].to_error(msg)
2023-05-23T13:41:31.5797814Z AssertionError: Tensor-likes are not close!
2023-05-23T13:41:31.5797970Z 
2023-05-23T13:41:31.5798060Z Mismatched elements: 1 / 50 (2.0%)
2023-05-23T13:41:31.5798341Z Greatest absolute difference: 5.10198974609375 at index (0, 22) (up to 0.2 allowed)
2023-05-23T13:41:31.5798665Z Greatest relative difference: 0.2689853608608246 at index (0, 22) (up to 0.2 allowed)
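
For readers decoding the last two log lines: the sketch below redoes the tolerance arithmetic in plain Python. It assumes the documented `torch.testing.assert_close` rule, where an element mismatches if `|actual - expected| > atol + rtol * |expected|`, and it assumes the reported relative difference is the absolute difference divided by `|expected|`. The helper name `is_close` is hypothetical; the numbers are copied from the log above.

```python
def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Element-wise closeness rule (assumed to match torch.testing.assert_close)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values reported in the CI log for index (0, 22).
abs_diff = 5.10198974609375
rel_diff = 0.2689853608608246
rtol = atol = 0.2  # "up to 0.2 allowed" for both tolerances

# Assumption: rel_diff = abs_diff / |expected|, so |expected| can be recovered.
expected_magnitude = abs_diff / rel_diff
allowed = atol + rtol * expected_magnitude

print(f"implied |expected|: {expected_magnitude:.4f}")
print(f"allowed difference: {allowed:.4f}")
print(f"observed difference: {abs_diff:.4f}")
print("passes" if abs_diff <= allowed else "fails")
```

Under these assumptions the failing element has a small expected magnitude compared with the rest of the output, so the combined tolerance at that index is tight.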

Versions

nightly 2.1.0

cc @pmeier @NicolasHug @ptrblck @malfet @ngimel


ptrblck · 2023-05-25T05:47:05Z

I'm able to reproduce the issue on an A100 using a seed other than 0, which is the value hard-coded in the test.
Based on the failure, it seems the test is seeded and its output is compared against a pre-defined result stored as a pkl file in the expect folder, as seen here.
I was also able to verify that both the model parameters and the inputs change with the seed:

diff --git a/test/test_models.py b/test/test_models.py
index 91aa66c667..d2b312ab15 100644
--- a/test/test_models.py
+++ b/test/test_models.py
@@ -674,7 +674,10 @@ def test_vitc_models(model_fn, dev):
 @pytest.mark.parametrize("model_fn", list_model_fns(models))
 @pytest.mark.parametrize("dev", cpu_and_gpu())
 def test_classification_model(model_fn, dev):
-    set_rng_seed(0)
+    import os
+    seed = int(os.getenv("TORCH_SEED"))
+    print("using seed {}".format(seed))
+    set_rng_seed(seed)
     defaults = {
         "num_classes": 50,
         "input_shape": (1, 3, 224, 224),

So far I would guess that the PRNG behavior changed between CUDA 11.7 and 11.8 on the A10G, but I still need to verify this on the actual device.
The same behavior is observed on an A40 (seed=0 passes, every other seed fails, as expected).
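
One practical note on the debugging patch above: `int(os.getenv("TORCH_SEED"))` raises a `TypeError` when the variable is unset. A minimal sketch of a more forgiving lookup follows; the helper name `seed_from_env` and the fallback to 0 (the seed hard-coded in the test) are assumptions for illustration, not part of the patch.

```python
import os


def seed_from_env(name: str = "TORCH_SEED", default: int = 0) -> int:
    """Read an integer seed from the environment, falling back to `default`."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        # Variable unset or empty: keep the test's original seed.
        return default
    return int(raw)


seed = seed_from_env()
print("using seed {}".format(seed))
```

With this in place, running the test without `TORCH_SEED` keeps the original behavior, while `TORCH_SEED=1 pytest ...` would exercise a different seed.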

I'll try to lease an A10G next to reproduce the actual failure with the default seed to check if my guess is correct.

ptrblck · 2023-05-26T07:26:11Z

My guess was wrong: while changing the seed lets the test fail, the reported errors are in a much larger range.
I was now able to reproduce the issue on an A10G, and it seems the numerical mismatch occurs when TF32 is allowed in cuDNN.
The output values span a large range:

tensor([[ 8.1592e+03, -2.7165e+04,  2.9925e+03,  1.6079e+04, -6.1412e+03,
         -5.4558e+03,  8.6438e+03,  1.0517e+04,  2.7873e+04,  3.0356e+03,
         -1.1014e+04,  1.9574e+04,  7.1062e+03, -3.5376e+03,  6.9987e+03,
         -6.3800e+03, -1.8092e+04,  1.6719e+04,  2.5773e+03, -2.6049e+03,
         -1.3284e+04, -7.9999e+03,  1.3866e+01,  8.8126e+02, -6.2183e+03,
         -8.9771e+03, -1.0583e+03, -1.0977e+04,  6.3043e+03, -7.0138e+03,
         -1.6880e+04,  6.6776e+03, -1.1648e+04,  3.6115e+03,  2.0045e+04,
          7.8362e+02, -2.1655e+04, -9.3831e+03,  2.6998e+04,  1.5136e+04,
         -3.8140e+03,  1.4637e+03, -1.5687e+04, -6.0253e+03, -3.6343e+03,
          1.8916e+03,  7.9858e+03,  2.1514e+03, -1.0606e+04,  7.2659e+03]],

so an absolute error of ~5 might be expected.
Disabling TF32 for this test lets it pass again.
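
To give a feel for why an absolute error of this size is plausible, here is a rough pure-Python sketch that rounds float32 values to TF32's 10 explicit mantissa bits. It is only an illustration under stated assumptions: real TF32 kernels round the inputs of convolutions and matmuls and accumulate in FP32, whereas this rounds a few of the output values printed above, and the helper `round_to_tf32` is hypothetical, not a PyTorch or cuDNN API.

```python
import struct


def round_to_tf32(x: float) -> float:
    """Convert to float32, then keep only 10 explicit mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # float32 stores 23 mantissa bits; TF32 keeps 10, so the low 13 are dropped.
    # Adding half of the dropped range rounds the magnitude to nearest
    # (ties away from zero), then the mask clears the dropped bits.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A few magnitudes taken from the output tensor above.
samples = [8159.2, -27165.0, 27873.0, 13.866]
for value in samples:
    rounded = round_to_tf32(value)
    print(f"{value:>10} -> {rounded:>10}  abs err {abs(rounded - value):.4f}")
```

For values between 16384 and 32768 the spacing of representable TF32 numbers is 16, so a single rounding step can already shift a value by several units, which is the same order as the mismatch reported in the test.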

Additionally, I've created internal cuBLAS and cuDNN logs and compared the outputs to a CPU implementation without seeing any disallowed mismatches.
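
For anyone who wants to try the workaround locally, the sketch below shows one way to switch a setting off for the duration of a block and restore it afterwards. The context manager `temporarily_set` is a hypothetical helper, not the code merged for this issue; `torch.backends.cudnn.allow_tf32` is the existing PyTorch flag that controls TF32 in cuDNN, and it appears only in the commented usage so the block runs without PyTorch installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, name, value):
    """Set obj.<name> to value inside the block, then restore the old value."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        setattr(obj, name, old)


# Stand-in object so the example runs without PyTorch installed.
fake_backend = SimpleNamespace(allow_tf32=True)
with temporarily_set(fake_backend, "allow_tf32", False):
    inside = fake_backend.allow_tf32  # False while the block is active
after = fake_backend.allow_tf32  # restored to True

# Assumed usage with PyTorch (model and x are placeholders):
# import torch
# with temporarily_set(torch.backends.cudnn, "allow_tf32", False):
#     out = model(x)
```

Restoring the previous value in `finally` keeps the change scoped to one test, so other tests in the same process still run with the default TF32 setting.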

atalman added module: tests cuda labels May 23, 2023

ptrblck mentioned this issue May 26, 2023

disable tf32 in cuDNN for classification models #7634

Merged

pmeier closed this as completed in #7634 Jun 1, 2023
