I'm able to reproduce the issue on an A100 using a seed other than 0, which is the value hard-coded in the test.
Based on the failure, it seems we seed the test and compare the result to a pre-defined result stored as a pkl file in the expect folder, as seen here.
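To make the pattern concrete, here is a minimal sketch of a seeded test compared against a pickled reference. This is not the actual test code: `compute_output` and `matches_expected` are hypothetical stand-ins, the "model" is just a list of random numbers, and the tolerances are assumptions.

```python
import math
import pickle
import random
import tempfile
from pathlib import Path


def compute_output(seed, n=5):
    # Hypothetical stand-in for "build the model and run it on random inputs":
    # both the weights and the inputs depend on the seed.
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(n)]
    inputs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [w * x for w, x in zip(weights, inputs)]


def matches_expected(output, expected_path, rel_tol=1e-5, abs_tol=1e-8):
    # Load the stored reference and compare element-wise within a tolerance
    # (tolerance values are assumptions, not the ones used by the real test).
    with open(expected_path, "rb") as f:
        expected = pickle.load(f)
    return len(output) == len(expected) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(output, expected)
    )


with tempfile.TemporaryDirectory() as tmp:
    expected_path = Path(tmp) / "expected.pkl"
    # The reference file is generated once with the hard-coded seed 0.
    with open(expected_path, "wb") as f:
        pickle.dump(compute_output(seed=0), f)

    same_seed_ok = matches_expected(compute_output(seed=0), expected_path)
    other_seed_ok = matches_expected(compute_output(seed=1), expected_path)

print(same_seed_ok, other_seed_ok)
```

With this setup any seed other than the one used to create the reference is expected to fail, so a failure with a different seed says nothing about the CUDA version by itself.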
I was also able to verify that indeed the model parameters as well as the inputs change based on the seed:
So far I would guess the PRNG behavior might have changed between 11.7 and 11.8 for A10G, but I still need to verify it on the actual device.
The same behavior is observed on an A40 (seed=0 passes, every other seed fails, as expected).
I'll try to lease an A10G next to reproduce the actual failure with the default seed to check if my guess is correct.
My guess was wrong: while changing the seed lets the test fail, the errors reported in that case are in a larger range.
I was now able to reproduce the issue on an A10G, and it seems the numerical mismatch occurs when TF32 is allowed in cuDNN.
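As a rough illustration of why allowing TF32 can produce a mismatch of this kind, here is a minimal pure-Python sketch that emulates TF32 inputs by keeping only the top 10 mantissa bits of each float32 value. The helper names are hypothetical, truncation is an assumption (the hardware rounding mode may differ), and accumulation is done in Python floats rather than on the GPU, so this only shows the order of magnitude of the effect.

```python
import struct


def to_float32(x):
    # Round a Python float (float64) to the nearest float32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x):
    # Keep the sign, the 8 exponent bits and the top 10 mantissa bits of the
    # float32 encoding; zero the low 13 mantissa bits.
    # Assumption: truncation instead of the hardware's actual rounding.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, cast):
    # Cast the inputs to the reduced format, then accumulate the products
    # in Python floats.
    return sum(cast(x) * cast(y) for x, y in zip(a, b))


a = [1.0 + i * 1e-4 for i in range(64)]
b = [1.0 - i * 1e-4 for i in range(64)]

fp32_result = dot(a, b, to_float32)
tf32_result = dot(a, b, to_tf32)
rel_err = abs(fp32_result - tf32_result) / abs(fp32_result)
print(rel_err)
```

With 10 mantissa bits each input carries a relative error of up to about 2^-10, which is far above a typical float32 test tolerance. In PyTorch the switch that controls this for cuDNN convolutions is `torch.backends.cudnn.allow_tf32`.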
The output values have a large range reported as:
🐛 Describe the bug
Similar to: #7143
When switching CI from CUDA 11.7 to CUDA 11.8, unit tests on Linux fail:
#7616
Versions
nightly 2.1.0
cc @pmeier @NicolasHug @ptrblck @malfet @ngimel