[single_stage_detector] Updated run_and_time.sh for customizing number of GPUs on single node #808
Dear MLCommons team,
I appreciate your work, which has helped me verify the training performance after applying hardware resource virtualization to our bare-metal server.
I want to validate model training performance in a reasonable time in both virtualized and non-virtualized environments. However, this is very challenging with the default Open Images dataset and our relatively small GPUs on a single node. Thus, I made some changes to the performance evaluation script, which may help others with similar use cases.
This pull request modifies the `run_and_time.sh` of SSD training and introduces the following changes:
- Added a `DATASET` environment variable for dataset customization. This allows the script to run the training script with datasets other than the default Open Images dataset. (For example, I use the `coco` dataset.) (Reverted: [single_stage_detector] Updated run_and_time.sh for customizing number of GPUs on single node #808 (comment))
- Changed `--nproc_per_node=1` of the `torchrun` command line to `--nproc_per_node=${DGXNGPU}`. This enables the script to train with more GPUs on a single node.
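The changes above can be sketched as follows. This is not the actual `run_and_time.sh`, just a minimal illustration of how the two environment variables parameterize the launch command; `DGXNGPU` and `DATASET` match the PR, while the training script name and its arguments are assumptions. The command is echoed rather than executed so the sketch runs anywhere.

```shell
# Fall back to defaults when the variables are not set by the caller.
DGXNGPU=${DGXNGPU:-1}            # number of GPUs to use on this single node
DATASET=${DATASET:-open-images}  # dataset name; e.g. DATASET=coco

# Build the (illustrative) torchrun invocation with the customized GPU count
# and dataset, then print it instead of launching a real training run.
CMD="torchrun --nproc_per_node=${DGXNGPU} train.py --dataset ${DATASET}"
echo "${CMD}"
```

Invoking the script with `DGXNGPU=4 DATASET=coco` would then launch `torchrun` with four processes on the node and pass the alternate dataset through to training.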