Skip to content

COLL TUNED: Use per-rank data size instead of total size for decision in allgatherv #8209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 17, 2020

Conversation

devreal
Copy link
Contributor

@devreal devreal commented Nov 13, 2020

The total size depends on number of ranks so the usual ranges don't work.
Thus, use the average across all ranks to make a decision.

Signed-off-by: Joseph Schuchart [email protected]

The total size depends on number of ranks so the usual ranges don't work.
Thus, use the average across all ranks to make a decision.

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal
Copy link
Contributor Author

devreal commented Nov 13, 2020

@rajachan Can you give this branch a try to see if the numbers get better on your side? I have made some minor tweaks based on the data I get on our system. However, it appears that the switching points shift based on the number of ranks per node (with constant number of total ranks). I have observed that for other algorithms as well. I think we need to revisit the tuning approach to perhaps include more topology information (this effect may (have) become more prominent with rising core counts...)

@rajachan
Copy link
Member

@rajachan Can you give this branch a try to see if the numbers get better on your side?

They do. A tad bit :) This is a rather small job with 96 ranks (data on right with your PR):

[1,0]<stdout>:32                   2445.09	[1,0]<stdout>:32                    248.96
[1,0]<stdout>:64                   2398.30	[1,0]<stdout>:64                    251.44
[1,0]<stdout>:128                  2482.61	[1,0]<stdout>:128                   298.83
[1,0]<stdout>:256                  2441.82	[1,0]<stdout>:256                   424.72
[1,0]<stdout>:512                  2487.58	[1,0]<stdout>:512                   507.56
[1,0]<stdout>:1024                 4997.56	[1,0]<stdout>:1024                  674.26
[1,0]<stdout>:2048                 5020.70	[1,0]<stdout>:2048                 1066.09
[1,0]<stdout>:4096                 5040.57	[1,0]<stdout>:4096                 5336.04
[1,0]<stdout>:8192                 5371.58	[1,0]<stdout>:8192                 5509.40
[1,0]<stdout>:16384                8816.53	[1,0]<stdout>:16384                8019.78
[1,0]<stdout>:32768               13425.71	[1,0]<stdout>:32768               13454.88
[1,0]<stdout>:65536               26798.66	[1,0]<stdout>:65536               26835.25
[1,0]<stdout>:131072              53313.24	[1,0]<stdout>:131072              53153.74
[1,0]<stdout>:262144             106347.85	[1,0]<stdout>:262144             106278.15
[1,0]<stdout>:524288             214197.14	[1,0]<stdout>:524288             213896.30
[1,0]<stdout>:1048576            440599.42	[1,0]<stdout>:1048576            439966.04

@jsquyres jsquyres merged commit 12796a4 into open-mpi:master Nov 17, 2020
@devreal devreal deleted the fix-tuned-allgatherv branch October 3, 2022 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants