make topology signature include a compressed list of available PUs
When mpirun collects topologies it first asks for signatures and then
only asks a host to send its topology if the signature is something new.
But with heterogeneous cgroups the signature doesn't have enough data
to notice differences that matter.
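As a rough illustration of the problem (not the actual mpirun code), the
signature-then-topology exchange can be modeled as a cache keyed by
signature. The function and host names below are hypothetical, and the
"topology" is just a placeholder string.

```python
def collect_topologies(hosts, get_signature, get_topology):
    """Fetch each distinct topology once, keyed by its signature."""
    known = {}     # signature -> topology
    per_host = {}  # host -> signature
    for host in hosts:
        sig = get_signature(host)
        per_host[host] = sig
        if sig not in known:
            # Only hosts with an unseen signature are asked for a topology.
            known[sig] = get_topology(host)
    return known, per_host


# Two hypothetical hosts whose cgroups expose different PUs but whose
# count-only signatures are identical.
signatures = {
    "nodeA": "1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le",
    "nodeB": "1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le",
}
topologies = {"nodeA": "PUs 12-15", "nodeB": "PUs 40-43"}

fetched = []


def fetch(host):
    fetched.append(host)
    return topologies[host]


known, per_host = collect_topologies(["nodeA", "nodeB"], signatures.get, fetch)
print(fetched)  # nodeB's topology is never requested
```

In this model nodeB is treated as having nodeA's topology, which is the
mismatch a richer signature is meant to avoid.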
An example of the previous signature on one of our machines is
    2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many NUMA nodes, sockets, cores, PUs, etc.
the host has. If I create a small cgroup with just 4 PUs from one of
the cores, the signature becomes
    1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le
But if I did that on two different machines and used different cores
on each, the signature would look the same. One way to distinguish
such signatures is to also list the available PUs by os_index (I might
be okay with logical indexes if it was logical under a WHOLE-SYSTEM
topology, but in the non-WHOLE-SYSTEM topologies we're using I don't
think the logical indexes would be good enough).
This list could be large so I've included a couple different
compressions of it and use whichever comes out shorter.
The first is a comma-separated list that compresses simple ranges like
10-15 and also detects patterns analogous to MPI_Type_vector(), written
as first-last+stride*count, e.g.
    {16,17,18,19,20,21,22,23} = 16-23
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 1-5+10*3
    {2,3, 10,11, 18,19, 26,27,
     102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208} = 2-3+8*4,102-103+8*3,200-208
    {1,3,6,7,9,11,12} = 1,3,6,7,9,11,12
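A minimal sketch of one way to produce the list form above. The
thresholds are assumptions chosen to match the four examples, not taken
from the patch: a run needs at least 3 indexes to be written as a range,
and a pattern needs at least 3 equally sized, equally spaced blocks. The
name compress_list is hypothetical, and the "a+stride*count" form for
single-index blocks is my own guess at a notation.

```python
def compress_list(indexes, min_repeat=3, min_range=3):
    """Compress PU indexes into ranges and vector-like patterns."""
    # Step 1: split the sorted indexes into maximal consecutive runs.
    runs = []
    for i in sorted(set(indexes)):
        if runs and i == runs[-1][1] + 1:
            runs[-1][1] = i
        else:
            runs.append([i, i])

    def fmt(lo, hi, force_range=False):
        if lo == hi:
            return str(lo)
        if force_range or hi - lo + 1 >= min_range:
            return f"{lo}-{hi}"
        # Short runs are cheaper (or equal) written out one by one.
        return ",".join(str(v) for v in range(lo, hi + 1))

    # Step 2: greedily merge equal-length runs with a constant stride.
    parts = []
    k = 0
    while k < len(runs):
        lo, hi = runs[k]
        count = 1
        if k + 1 < len(runs):
            stride = runs[k + 1][0] - lo
            while (k + count < len(runs)
                   and runs[k + count][1] - runs[k + count][0] == hi - lo
                   and runs[k + count][0] == lo + count * stride):
                count += 1
        if count >= min_repeat:
            parts.append(f"{fmt(lo, hi, True)}+{stride}*{count}")
            k += count
        else:
            parts.append(fmt(lo, hi))
            k += 1
    return ",".join(parts)


print(compress_list([1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25]))
```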
The second compression is a hex bitmask of the indexes (PU 0 is the
high bit of the first hex digit), further shrunk by writing any char
that is repeated 5+ times as char*count, so
    {16,17,18,19,20,21,22,23} = 0000ff
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 7c1f07c
    {2,3, 10,11, 18,19, 26,27,
     102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208} = 30303030*18,303030*20,ff8
    {1,3,6,7,9,11,12} = 5358
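A sketch of the hex form, reverse-engineered from the examples rather
than copied from the patch. It assumes each hex digit covers 4 PUs with
the lowest index in the high bit, that a run of 5+ identical digits
becomes digit*count, and that a comma follows the count when more digits
come after it. The name compress_hex is hypothetical.

```python
def compress_hex(indexes, min_repeat=5):
    """Encode PU indexes as a hex bitmask with run-length shortening."""
    idx = set(indexes)
    if not idx:
        return ""

    # Build one hex digit per group of 4 PUs; PU 4n is the high bit.
    digits = []
    for n in range(max(idx) // 4 + 1):
        value = 0
        for bit in range(4):
            if 4 * n + bit in idx:
                value |= 8 >> bit
        digits.append("0123456789abcdef"[value])

    # Collapse runs of min_repeat or more identical digits.
    out = []
    k = 0
    while k < len(digits):
        j = k
        while j < len(digits) and digits[j] == digits[k]:
            j += 1
        run = j - k
        if run >= min_repeat:
            out.append(f"{digits[k]}*{run}")
            if j < len(digits):
                out.append(",")  # separates the count from later digits
        else:
            out.append(digits[k] * run)
        k = j
    return "".join(out)


print(compress_hex([1, 3, 6, 7, 9, 11, 12]))
```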
So final signature strings end up looking like the following. In these
examples the PU field is prefixed with HX for the hex form and HL for
the list form.
example with PUs 0-175:
    2N:2S:22L3:22L2:44L1:44C:176H:HXf*44:ppc64le:le
    here "f*44" won the compression contest against "0-175"
example with PUs 12-15:
    1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
    here "000f" won against "12-15"
example with every other block of 4 PUs:
    2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+8*22:ppc64le:le
    here "0-3+8*22" won against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f"
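The selection step can be sketched as below. The function names are
hypothetical, the two candidate strings are passed in already
compressed, and preferring the hex form on a tie is an assumption (none
of the examples above has a tie).

```python
def pu_field(list_form, hex_form):
    """Pick the shorter encoding and tag it HL (list) or HX (hex)."""
    if len(hex_form) <= len(list_form):  # tie goes to hex: an assumption
        return "HX" + hex_form
    return "HL" + list_form


def build_signature(counts, list_form, hex_form, arch, endian):
    """Join the object counts, the PU field, and the arch/endian tags."""
    return ":".join([counts, pu_field(list_form, hex_form), arch, endian])


sig = build_signature("2N:2S:22L3:22L2:44L1:44C:176H",
                      "0-175", "f*44", "ppc64le", "le")
print(sig)
```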
Signed-off-by: Mark Allen <[email protected]>