[single_stage_detector] Updated run_and_time.sh for customizing number of GPUs on single node #808
Dear MLCommons team,
I appreciate your work, which has helped me verify the training performance after applying hardware resource virtualization to our bare-metal server.
I want to validate model training performance in a reasonable time in both virtualized and non-virtualized environments. However, this is very challenging with the default Open Images dataset and our relatively small GPUs on a single node. Thus, I made some changes to the performance evaluation script, which may help others with similar use cases.
This pull request modifies the `run_and_time.sh` of SSD training and introduces the following changes:
- Added a `DATASET` environment variable for dataset customization. This allows the script to run the training script with datasets other than the default Open Images dataset. (For example, I use the `coco` dataset.) (Reverted: [single_stage_detector] Updated run_and_time.sh for customizing number of GPUs on single node #808 (comment))
- Changed `--nproc_per_node=1` of the `torchrun` command line to `--nproc_per_node=${DGXNGPU}`. This enables the script to train with more GPUs on a single node.
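The changes above can be sketched as follows. This is not the actual `run_and_time.sh`, just a minimal illustration of how the two environment variables parameterize the launch command; `DGXNGPU` and `DATASET` match the PR, while the training script name and its arguments are assumptions. The command is echoed rather than executed so the sketch runs anywhere.

```shell
# Fall back to defaults when the variables are not set by the caller.
DGXNGPU=${DGXNGPU:-1}            # number of GPUs to use on this single node
DATASET=${DATASET:-open-images}  # dataset name; e.g. DATASET=coco

# Build the (illustrative) torchrun invocation with the customized GPU count
# and dataset, then print it instead of launching a real training run.
CMD="torchrun --nproc_per_node=${DGXNGPU} train.py --dataset ${DATASET}"
echo "${CMD}"
```

Invoking the script with `DGXNGPU=4 DATASET=coco` would then launch `torchrun` with four processes on the node and pass the alternate dataset through to training.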