Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -23,6 +23,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 **CHANGES**
 - Upgrade NVIDIA driver to version 470.141.03.
 - Upgrade NVIDIA Fabric Manager to version 470.141.03.
+- Upgrade NVIDIA CUDA Toolkit to version 11.7.1.
 - Disable cron job tasks man-db and mlocate, which may have a negative impact on node performance.
 - Add support for generating Slurm Configuration files for Compute Resources with Multiple Instance Types.
 - Reduce timeout from 50 to a maximum of 5min in case of DynamoDB connection issues at compute node bootstrap.
6 changes: 4 additions & 2 deletions attributes/default.rb
@@ -189,11 +189,13 @@
 # NVIDIA
 default['cluster']['nvidia']['enabled'] = 'no'
 default['cluster']['nvidia']['driver_version'] = '470.141.03'
-default['cluster']['nvidia']['cuda_version'] = '11.4'
+default['cluster']['nvidia']['cuda_version'] = '11.7'
+default['cluster']['nvidia']['cuda_samples_version'] = '11.6'
 default['cluster']['nvidia']['driver_url_architecture_id'] = arm_instance? ? 'aarch64' : 'x86_64'
 default['cluster']['nvidia']['cuda_url_architecture_id'] = arm_instance? ? 'linux_sbsa' : 'linux'
 default['cluster']['nvidia']['driver_url'] = "https://us.download.nvidia.com/tesla/#{node['cluster']['nvidia']['driver_version']}/NVIDIA-Linux-#{node['cluster']['nvidia']['driver_url_architecture_id']}-#{node['cluster']['nvidia']['driver_version']}.run"
-default['cluster']['nvidia']['cuda_url'] = "https://developer.download.nvidia.com/compute/cuda/11.4.4/local_installers/cuda_11.4.4_470.82.01_#{node['cluster']['nvidia']['cuda_url_architecture_id']}.run"
+default['cluster']['nvidia']['cuda_url'] = "https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_#{node['cluster']['nvidia']['cuda_url_architecture_id']}.run"
+default['cluster']['nvidia']['cuda_samples_url'] = "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v#{node['cluster']['nvidia']['cuda_samples_version']}.tar.gz"

 # NVIDIA fabric-manager
 # The package name of Fabric Manager for alinux2 and centos7 is nvidia-fabric-manager-version
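
To make the attribute interpolation in this hunk easier to review, here is a minimal plain-Ruby sketch of how the three download URLs resolve on x86_64 and on Arm. The helper name `nvidia_urls` and its `arm:` keyword are hypothetical stand-ins for the cookbook's `arm_instance?` check and node attributes; the version strings are copied from the diff.

```ruby
# Hypothetical helper: a plain-Ruby stand-in for the Chef attribute
# interpolation in attributes/default.rb. Not part of the cookbook.
def nvidia_urls(arm:)
  driver_version = '470.141.03'
  cuda_samples_version = '11.6'
  driver_arch = arm ? 'aarch64' : 'x86_64'   # driver_url_architecture_id
  cuda_arch = arm ? 'linux_sbsa' : 'linux'   # cuda_url_architecture_id

  {
    driver_url: "https://us.download.nvidia.com/tesla/#{driver_version}/" \
                "NVIDIA-Linux-#{driver_arch}-#{driver_version}.run",
    cuda_url: 'https://developer.download.nvidia.com/compute/cuda/11.7.1/' \
              "local_installers/cuda_11.7.1_515.65.01_#{cuda_arch}.run",
    cuda_samples_url: 'https://github.com/NVIDIA/cuda-samples/archive/' \
                      "refs/tags/v#{cuda_samples_version}.tar.gz"
  }
end

nvidia_urls(arm: false).each { |name, url| puts "#{name}: #{url}" }
nvidia_urls(arm: true).each { |name, url| puts "#{name}: #{url}" }
```

Note that only the driver and CUDA installer URLs vary by architecture; the samples archive is a source tarball and is the same for both.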
24 changes: 24 additions & 0 deletions cookbooks/aws-parallelcluster-install/recipes/nvidia.rb
@@ -83,6 +83,30 @@
   creates "/usr/local/cuda-#{node['cluster']['nvidia']['cuda_version']}"
 end

+# Get CUDA Sample Files
+cuda_samples_directory = "/usr/local/cuda-#{node['cluster']['nvidia']['cuda_version']}/samples"
+cuda_tmp_sample_file = "/tmp/cuda-sample.tar.gz"
+remote_file cuda_tmp_sample_file do
+  source node['cluster']['nvidia']['cuda_samples_url']
+  mode '0644'
+  retries 3
+  retry_delay 5
+  not_if { ::File.exist?(cuda_samples_directory) }
+end
+
+# Unpack CUDA Samples
+bash 'cuda.sample install' do
+  user 'root'
+  group 'root'
+  cwd '/tmp'
+  code <<-CUDA
+    set -e
+    tar xf "#{cuda_tmp_sample_file}" --directory "/usr/local/"
+    rm -f "#{cuda_tmp_sample_file}"
+  CUDA
+  creates cuda_samples_directory
+end
+
 cookbook_file 'blacklist-nouveau.conf' do
   source 'nvidia/blacklist-nouveau.conf'
   path '/etc/modprobe.d/blacklist-nouveau.conf'
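
One thing that may be worth checking in the samples hunk: both idempotency guards (`not_if` and `creates`) test for `/usr/local/cuda-<cuda_version>/samples`, while the tarball is unpacked directly into `/usr/local/`. The sketch below compares the two paths under the assumption that a GitHub tag archive unpacks into a top-level directory named `<repo>-<tag without the leading v>`; that layout is an assumption here, not something the diff states, and the variable names other than those in the recipe are hypothetical.

```ruby
# Values mirror the attributes in this diff.
cuda_version = '11.7'
cuda_samples_version = '11.6'

# Path tested by `not_if` and `creates` in the recipe.
guard_path = "/usr/local/cuda-#{cuda_version}/samples"

# Assumption: the archive's top-level directory is "cuda-samples-<version>",
# so `tar xf ... --directory /usr/local/` would produce this path.
extracted_path = File.join('/usr/local', "cuda-samples-#{cuda_samples_version}")

guards_match = (guard_path == extracted_path)

puts "guard path:     #{guard_path}"
puts "extracted path: #{extracted_path}"
puts "guards match extraction target: #{guards_match}"
```

If the assumption holds and nothing else creates the guarded directory, the download and unpack steps would run again on every converge; verifying the directory layout on a built node would settle it.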