Contributor

@davprat davprat commented Aug 30, 2022

Description of changes

This change uses the correct path to build and execute the deviceQuery utility during image validation on ARM architectures, depending on which version of the CUDA toolkit is installed.

Starting with NVIDIA CUDA Toolkit v11.5, the samples, which include the deviceQuery source code, were moved out of the CUDA install package and into their own GitHub repository. The structure of the sample directory tree also changed, so compiling and executing the deviceQuery utility requires different paths.
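
As a rough illustration of the version-dependent path selection described above, the sketch below picks a deviceQuery directory from the toolkit version. The function name `device_query_dir`, its two root arguments, and the sub-paths under them are assumptions made for illustration. They are not taken from this PR, so the real layout should be checked against the installed toolkit and the samples repository.

```shell
#!/bin/sh
# Hypothetical helper: print the directory expected to hold deviceQuery.
#   $1 = CUDA toolkit version, e.g. "11.4" or "11.7"
#   $2 = CUDA install root (bundled samples, toolkit up to 11.4)
#   $3 = checkout of the standalone samples repository (toolkit 11.5 and later)
# The sub-paths below are illustrative assumptions, not confirmed by the PR.
device_query_dir() {
    ver=$1
    cuda_root=$2
    samples_root=$3
    # sort -V (GNU coreutils) orders version strings numerically per field,
    # so the last line is the newer of the two versions.
    newest=$(printf '%s\n%s\n' "$ver" "11.4" | sort -V | tail -n 1)
    if [ "$ver" != "11.4" ] && [ "$newest" = "$ver" ]; then
        # Toolkit newer than 11.4: samples live in the separate repository.
        printf '%s\n' "$samples_root/Samples/1_Utilities/deviceQuery"
    else
        # Toolkit 11.4 or older: samples ship inside the CUDA install tree.
        printf '%s\n' "$cuda_root/samples/1_Utilities/deviceQuery"
    fi
}
```

A caller might run `cd "$(device_query_dir "$cuda_ver" /usr/local/cuda /opt/cuda-samples)"` before building, where both root directories are again placeholders.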

Tests

  • Ran developer build with CUDA 11.7 installed, validation test passed.
  • Ran developer build with CUDA 11.4 installed, validation test passed.

References

Checklist

  • Make sure you are pointing to the right branch and add a label in the PR title (i.e. 2.x vs 3.x)
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@davprat davprat added the skip-changelog-update and 3.x labels Aug 30, 2022
@davprat davprat changed the title Change CUDA install validation test to support both v11.4 and v11.x > 11.4 for ARM 3.x Change CUDA install validation test to support both v11.4 and v11.x > 11.4 for ARM Aug 30, 2022

codecov bot commented Aug 30, 2022

Codecov Report

Merging #4305 (24f043e) into develop (a32bafb) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           develop    #4305   +/-   ##
========================================
  Coverage    88.29%   88.29%           
========================================
  Files          158      158           
  Lines        13349    13349           
========================================
  Hits         11787    11787           
  Misses        1562     1562           


@davprat davprat marked this pull request as ready for review August 30, 2022 21:28
@davprat davprat requested review from a team as code owners August 30, 2022 21:28
@davprat davprat force-pushed the PCLUSTER-5187 branch 2 times, most recently from efc80e9 to f1096fa Compare September 1, 2022 19:31
… 11.4 on ARM

Starting with CUDA Toolkit v11.5, the samples, which include the deviceQuery source code,
were moved out of the CUDA install package and into their own GitHub repository. The structure
of the sample directory tree also changed, so compiling and executing the deviceQuery
utility requires different paths.
set -v
cuda_ver="{{ validate.CudaVersion.outputs.stdout }}"
if [ ${cuda_ver} \> '11.4' ]; then
PATTERN=$(grep -F "default['cluster']['nvidia']['cuda_sample_version']" {{ CookbookDefaultFile }})

Contributor

I couldn't find default['cluster']['nvidia']['cuda_sample_version'].
Is this related to another PR? Can you please reference it into the description?

Contributor Author

That change is coming in a yet-to-be-created PR for parallelcluster-cookbook. So technically, with this PR, cuda_ver will be empty; the logic here still kind of works, but it is definitely sloppy. I will fix this.
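
One way to make the missing-attribute case explicit, rather than relying on an empty value, is sketched below. The helper name `read_cuda_sample_version` and the assumed shape of the attribute line (a quoted value after `=`) are illustrative guesses, since the cookbook change is not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to cuda_sample_version in a
# cookbook attributes file, or return non-zero if the attribute is missing.
# Assumes a line of the form (the version shown is only an example):
#   default['cluster']['nvidia']['cuda_sample_version'] = '11.6'
read_cuda_sample_version() {
    file=$1
    key="default['cluster']['nvidia']['cuda_sample_version']"
    # grep -F treats the key as a fixed string, so the brackets are literal.
    line=$(grep -F "$key" "$file" | head -n 1)
    if [ -z "$line" ]; then
        echo "cuda_sample_version not found in $file" >&2
        return 1
    fi
    # Drop everything up to the opening quote after '=', then drop the
    # closing quote and anything following it.
    printf '%s\n' "$line" | sed -e "s/^[^=]*=[[:space:]]*['\"]//" -e "s/['\"].*\$//"
}
```

With a guard like this, the validation step can fail loudly (or fall back to the bundled-samples path) when the attribute has not been added yet.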

Contributor

@lukeseawalker lukeseawalker left a comment

Starting with NVIDIA CUDA Toolkit v11.5, the samples, which include the deviceQuery source code, were moved out of the CUDA install package and into their own GitHub repository.

Does this mean that the samples are not in the CUDA bundle we use? If so, when and where are we downloading these samples?

- |
set -v
cuda_ver="{{ validate.CudaVersion.outputs.stdout }}"
if [ ${cuda_ver} \> '11.4' ]; then

Contributor

Mind that this is a string comparison, and cuda_ver=011.5 will fail the test. It's OK if we are confident about the versioning format.
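
To illustrate the concern: a lexical comparison orders "011.5" before "11.4" (because "0" sorts before "1") and also orders "11.10" before "11.4". A minimal sketch of a field-aware comparison follows. The helper name `version_gt` is hypothetical, and it assumes `sort -V` (version sort, as provided by GNU coreutils) is available on the build image.

```shell
#!/bin/sh
# Return success (0) when version $1 is strictly newer than version $2.
# Relies on `sort -V`; its availability on the image is an assumption,
# not something stated in this PR.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

cuda_ver="11.10"   # illustrative value only
if version_gt "$cuda_ver" "11.4"; then
    echo "use external samples layout"
else
    echo "use bundled samples layout"
fi
```

The strict inequality check comes first so that equal versions (for example 11.4 against 11.4) take the bundled-samples branch.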

Contributor Author

This will be retrieved from 'attributes/default.rb' in the cookbook recipes, so it is under our control. I can add some more processing to make sure it's normalized though if you like?

Contributor Author

So the sample download will come from GitHub and is done as part of my yet-to-be-created pull request for the cookbook package. I'm just trying to set up the test here so that it will work with either the embedded samples in 11.4 or the external samples in 11.7, or at least that is my goal. Let me know if there is a better way to do this.

Contributor

If the value is under our control I'm good with this, tnx

@davprat davprat merged commit 994d6a3 into aws:develop Sep 8, 2022
@davprat davprat deleted the PCLUSTER-5187 branch September 8, 2022 15:46
dreambeyondorange pushed a commit to dreambeyondorange/aws-parallelcluster that referenced this pull request Sep 9, 2022
… 11.4 on ARM (aws#4305)

Starting with CUDA Toolkit v11.5, the samples, which include the deviceQuery source code,
were moved out of the CUDA install package and into their own GitHub repository. The structure
of the sample directory tree also changed, so compiling and executing the deviceQuery
utility requires different paths.

Co-authored-by: Luca Carrogu <[email protected]>