3.x Change CUDA install validation test to support both v11.4 and v11.x > 11.4 for ARM #4305
Codecov Report
@@ Coverage Diff @@
## develop #4305 +/- ##
========================================
Coverage 88.29% 88.29%
========================================
Files 158 158
Lines 13349 13349
========================================
Hits 11787 11787
Misses 1562 1562
efc80e9 to f1096fa
… 11.4 on ARM Starting with CUDA Toolkit v11.5, the samples, which contain the deviceQuery source code, were moved out of the CUDA install package and into their own GitHub repository. The structure of the sample directory tree also changed, requiring different paths for compiling and executing the deviceQuery utility.
9c5af13 to 46267a9
set -v
cuda_ver="{{ validate.CudaVersion.outputs.stdout }}"
if [ ${cuda_ver} \> '11.4' ]; then
  PATTERN=$(grep -F "default['cluster']['nvidia']['cuda_sample_version']" {{ CookbookDefaultFile }})
I couldn't find default['cluster']['nvidia']['cuda_sample_version'].
Is this related to another PR? If so, can you please reference it in the description?
That change is coming in a PR for parallelcluster-cookbook that has yet to be opened. So technically, with this PR, cuda_ver will be empty, so the logic here still more or less works, but it is definitely sloppy. I will fix this.
lukeseawalker
left a comment
Starting with NVIDIA CUDA Toolkit v11.5, the samples, which contain the deviceQuery source code, were moved out of the CUDA install package and into their own GitHub repository.
Does this mean that the samples are not in the CUDA bundle we use? If so, when and where are we downloading these samples?
- |
  set -v
  cuda_ver="{{ validate.CudaVersion.outputs.stdout }}"
  if [ ${cuda_ver} \> '11.4' ]; then
Note that this is a string comparison, so cuda_ver=011.5 will fail the test. That's OK if we are confident about the versioning format.
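To illustrate the concern about string comparison, here is a minimal sketch of a numeric-aware check. It is not part of this PR: the helper name `version_gt` is hypothetical, and the sketch assumes GNU `sort` with the `-V` (version sort) option is available on the image.

```shell
# version_gt A B: succeed only when dotted version A is strictly greater than B.
# Hypothetical helper for illustration; assumes GNU sort -V is installed.
version_gt() {
  # Equal versions are never "greater".
  [ "$1" != "$2" ] &&
    # After a version sort, the last line is the larger of the two versions.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example: a plain string comparison ranks '11.10' below '11.4',
# while the version sort compares the minor numbers 10 and 4 numerically.
cuda_ver="11.10"
if version_gt "${cuda_ver}" "11.4"; then
  echo "use external samples"
else
  echo "use embedded samples"
fi
```

With a helper like this, the condition no longer depends on every version component having the same number of digits.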
This will be retrieved from 'attributes/default.rb' in the cookbook recipes, so it is under our control. I can add some more processing to make sure it's normalized though if you like?
So the sample download will come from GitHub and is done as part of my yet-to-be-opened pull request for the cookbook package. I'm just trying to set up the test here so that it will work with either the embedded samples in 11.4 or the external samples in 11.7, or at least that is my goal. Let me know if there is a better way to do this.
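One possible alternative to branching on the version string, sketched here only as an assumption and not as what this PR does, is to probe for whichever sample directory actually exists. The function name `pick_sample_dir` and both directory arguments are hypothetical; the real locations would come from the cookbook.

```shell
# pick_sample_dir EMBEDDED_DIR EXTERNAL_DIR
# Print the first directory that exists, preferring the samples embedded in
# the CUDA install (11.4 layout) over the externally downloaded ones.
# Hypothetical helper for illustration; the caller supplies both paths.
pick_sample_dir() {
  if [ -d "$1" ]; then
    printf '%s\n' "$1"
  elif [ -d "$2" ]; then
    printf '%s\n' "$2"
  else
    echo "deviceQuery sources not found in '$1' or '$2'" >&2
    return 1
  fi
}
```

The caller would then change into the printed directory to build and run deviceQuery. The trade-off is that the test no longer confirms the expected layout for a given CUDA version; it only confirms that one of the two layouts is present.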
If the value is under our control, I'm good with this, thanks.
… 11.4 on ARM (aws#4305) Starting with CUDA Toolkit v11.5, the samples, which contain the deviceQuery source code, were moved out of the CUDA install package and into their own GitHub repository. The structure of the sample directory tree also changed, requiring different paths for compiling and executing the deviceQuery utility. Co-authored-by: Luca Carrogu <[email protected]>
Description of changes
This change will use the correct path to build and execute the deviceQuery utility during image validation on ARM architectures depending on which version of the CUDA toolkit is installed.
Starting with NVIDIA CUDA Toolkit v11.5, the samples, which contain the deviceQuery source code, were moved out of the CUDA install package and into their own GitHub repository. The structure of the sample directory tree also changed, requiring different paths for compiling and executing the deviceQuery utility.
Checklist
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.