Commit 25c3984

make topology signature include a compressed list of available PUs
When mpirun collects topologies it first asks for signatures and then only asks a host to send its topology if the signature is something new. But with heterogeneous cgroups the signature doesn't have enough data to notice differences that matter.

An example of the previous signature on one of our machines is
    2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many numa nodes, sockets, cores, PUs etc the host has.

If I create a small cgroup with just 4 PUs from one of the cores, the signature becomes
    1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le
But if I did that on two different machines and used different cores on each, the signature would look the same.

One way to distinguish such signatures is to also list the available PUs by os_index (I might be okay with logical indexes if it was logical under a WHOLE-SYSTEM topology, but in the non-WHOLE-SYSTEM topologies we're using I don't think the logical indexes would be good enough).

This list could be large, so I've included a couple of different compressions of it and use whichever comes out shorter.

The first is a comma-separated list that compresses simple ranges like 10-15 and also detects patterns analogous to MPI_Type_vector(), e.g.
    {16,17,18,19,20,21,22,23} = 16-23
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 1-5+10*3
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119, 200,201,202,203,204,205,206,207,208 } = 2-3+8*4,102-103+8*3,200-208
    {1,3,6,7,9,11,12} = 1,3,6,7,9,11,12

The second compression is a hex string containing the indexes, further shrunk by noticing if the same char is repeated 5+ times, so
    {16,17,18,19,20,21,22,23} = 0000ff
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 7c1f07c
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119, 200,201,202,203,204,205,206,207,208 } = 30303030*18,303030*20,ff8
    {1,3,6,7,9,11,12} = 5358

So final signature strings end up as things like:

example with PUs 0-175:
    2N:2S:22L3:22L2:44L1:44C:176H:HXf*44:ppc64le:le
    here "f*44" won the compression contest against "0-175"
example with PUs 12-15:
    1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
    here "000f" won against "12-15"
example with every other block of 4 PUs:
    2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+8*22:ppc64le:le
    here "0-3+8*22" won against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f"

Signed-off-by: Mark Allen <[email protected]>
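
To make the hex encoding above concrete, here is a minimal sketch of just the bitmask step, assuming a sorted, non-empty list of PU indexes. The function name `pu_hex_mask` is hypothetical and is not part of this commit; the run-length step (the `f*44` form) and the leading `X` marker are deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the commit: encode a sorted list of PU
 * indexes as a hex string. Each hex digit covers four PUs, with the
 * lowest-numbered PU in the most significant bit of the digit.
 * Returns a malloc'ed string the caller must free, or NULL on failure. */
static char *pu_hex_mask(const int *pus, int n)
{
    static const char digits[] = "0123456789abcdef";
    assert(pus != NULL && n > 0);

    /* input is sorted, so the last entry is the largest index */
    int ndigits = pus[n - 1] / 4 + 1;
    char *str = malloc((size_t)ndigits + 1);
    if (!str) { return NULL; }
    memset(str, 0, (size_t)ndigits + 1);

    /* first pass: accumulate the 4-bit value of each digit */
    for (int i = 0; i < n; ++i) {
        /* PU 4k sets bit value 8, PU 4k+3 sets bit value 1 */
        str[pus[i] / 4] |= (char)(8 >> (pus[i] % 4));
    }
    /* second pass: turn each 4-bit value into its hex character */
    for (int i = 0; i < ndigits; ++i) {
        str[i] = digits[(unsigned char)str[i]];
    }
    return str;   /* the final byte is still 0 from memset */
}
```

Called on {16,...,23} this sketch yields `0000ff`, and on {1,3,6,7,9,11,12} it yields `5358`, matching the examples in the message. Putting the lowest PU in the high bit means a digit reads left to right in the same order as the PUs it covers.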
1 parent ca59d47 commit 25c3984

opal/mca/hwloc/base/hwloc_base_util.c

Lines changed: 289 additions & 2 deletions
@@ -2251,6 +2251,254 @@ int opal_hwloc_get_sorted_numa_list(hwloc_topology_t topo, char* device_name, op
     return OPAL_ERR_NOT_FOUND;
 }
 
+static int
+mycompare(const void* inta, const void* intb)
+{
+    if (*(int*)inta < *(int*)intb) { return -1; }
+    if (*(int*)inta > *(int*)intb) { return 1; }
+    return 0;
+}
+
+// Patterns to compress adequately:
+// 1. basic 0,1,2,3 to 0-3
+// 2. repeated ranges of same len at same offset, eg 0,1,2 10,11,12 20,21,22 = 0-2+10*3
+// note: assumes input is already sorted
+static char*
+mycompress1(int *a, int n)
+{
+    char *str;
+    int i, maxlen, len;
+
+    maxlen = 128;
+    len = 0;
+    str = malloc(maxlen);
+    if (!str) { return NULL; }
+    str[0] = 0;
+
+    // start with "L" followed by the list
+    str[len++] = 'L';
+    str[len] = 0;
+
+    int pattern_blocksize = 0; // contiguous elements per block
+    int pattern_blockcount = 0; // number of blocks
+    int pattern_stride = 0;
+    int pattern_first_block_starting_offset = -1;
+    int pattern_prev_block_starting_offset = -1;
+    int pattern_nelements = 0;
+    int is_start_of_new_pattern = 0;
+    for (i=0; i<n; ++i) {
+        // see if this element a[i] is the start of a new pattern or
+        // if it's adding to the previous, we may increment i further
+        // processing multiple elements into the previous pattern
+        // only update the pattern_* data for the previous pattern
+        // (if we're starting a new pattern, we put it into pattern_*,
+        // later, after printing what was already in pattern_*)
+
+        // start by supposing it is a new pattern, change that decision
+        // if we process it into the existing pattern_* below
+        is_start_of_new_pattern = 1;
+
+        if (pattern_blocksize > 0) {
+            // detect [pppx]
+            if (pattern_blockcount == 1 &&
+                a[i] == pattern_prev_block_starting_offset + pattern_blocksize)
+            {
+                ++pattern_blocksize;
+                ++pattern_nelements;
+                is_start_of_new_pattern = 0;
+            } else {
+                // detect [pppp xxxx] or [pppp pppp xxxx] etc
+                if (pattern_blockcount == 1 ||
+                    (pattern_blockcount > 1 &&
+                     (a[i] == pattern_prev_block_starting_offset + pattern_stride)))
+                {
+                    // this a[i] starts at the right offset, now does it contain
+                    // pattern_blocksize contiguous entries?
+                    int j, has_enough_entries = 1;
+                    for (j=1; j<pattern_blocksize; ++j) {
+                        if (i+j >= n) { has_enough_entries = 0; }
+                        else if (a[i+j] != a[i] + j) { has_enough_entries = 0; }
+                    }
+                    if (has_enough_entries) {
+                        pattern_stride = a[i] - pattern_prev_block_starting_offset;
+                        pattern_prev_block_starting_offset = a[i];
+                        ++pattern_blockcount;
+                        pattern_nelements += pattern_blocksize;
+                        is_start_of_new_pattern = 0;
+                        i = i+j-1;
+                    }
+                }
+            }
+        }
+
+        // if started_new_pattern || i is the last element
+        //   if previous pattern exists
+        //     print previous pattern
+        //   record new pattern
+        if (is_start_of_new_pattern || i==n-1) {
+            if (pattern_blocksize > 0) {
+                // print previous pattern
+                // make sure there's room for 4 ints plus some punctuation
+                // being conservative in supposing all ints used are <8 chars
+                if (len + 40 >= maxlen) {
+                    maxlen *= 1.15;
+                    maxlen += 128;
+                    str = realloc(str, maxlen);
+                    if (!str) { return NULL; }
+                }
+                // If the string isn't 'long enough' for compressed notation to be worthwhile:
+                if (!(pattern_nelements > 4 || (pattern_nelements >= 2 && pattern_blockcount == 1))) {
+                    int j, k;
+                    for (j=0; j<pattern_blockcount; ++j) {
+                        for (k=0; k<pattern_blocksize; ++k) {
+                            if (len != 1) { // the string starts with "L" for list
+                                sprintf(&str[len], ",");
+                                ++len;
+                            }
+                            sprintf(&str[len], "%d", pattern_first_block_starting_offset + j*pattern_stride + k);
+                            len = strlen(str);
+                        }
+                    }
+                }
+                else {
+                    if (len != 1) { // the string starts with "L" for list
+                        sprintf(&str[len], ",");
+                        ++len;
+                    }
+                    if (pattern_blockcount > 1) {
+                        sprintf(&str[len], "%d-%d+%d*%d",
+                            pattern_first_block_starting_offset,
+                            pattern_first_block_starting_offset + pattern_blocksize - 1,
+                            pattern_stride, pattern_blockcount);
+                    } else {
+                        sprintf(&str[len], "%d-%d",
+                            pattern_first_block_starting_offset,
+                            pattern_first_block_starting_offset + pattern_blocksize - 1);
+                    }
+                    len = strlen(str);
+
+                }
+            }
+
+            // record new pattern
+            pattern_first_block_starting_offset = a[i];
+            pattern_prev_block_starting_offset = a[i];
+            pattern_blocksize = 1;
+            pattern_blockcount = 1;
+            pattern_stride = 0;
+            pattern_nelements = 1;
+        }
+        // The overall logic till this point is
+        // loop i over some of a[], and either
+        // 1. put a[i] in the previous pattern (and print if it's end of a[])
+        // 2. put a[i] in a new pattern and print the previous pattern
+        // This leaves out printing in the case of a[i] being a new pattern
+        // and also being the end of a[]
+        if (is_start_of_new_pattern && i==n-1) {
+            if (len != 1) { // the string starts with "L", so len 1 means nothing printed yet
+                sprintf(&str[len], ",");
+                ++len;
+            }
+            sprintf(&str[len], "%d", pattern_first_block_starting_offset);
+            len = strlen(str);
+        }
+    }
+
+    return str;
+}
+// This one just makes a hex string of the available bits
+// update: and does a tiny bit of trivial compression of adjacent
+// hex characters
+// note: assumes input is already sorted
+static char*
+mycompress2(int *a, int n)
+{
+    unsigned char *hex_values;
+    char *str, prev_char, *p;
+    int i, len, max, count;
+
+    max = a[n-1];
+    hex_values = malloc((max / 4 + 4) * sizeof(unsigned char));
+    str = malloc(max/4 + 32);
+    if (!str || !hex_values) { free(str); free(hex_values); return NULL; }
+
+    // start with "X" followed by the hex bitmask
+    len = 0;
+    str[len++] = 'X';
+
+    for (i=0; i<=max/4; ++i) {
+        hex_values[i] = 0;
+    }
+    for (i=0; i<n; ++i) {
+        int bit = a[i];
+
+        hex_values[bit/4] |= (unsigned char)(8>>(bit % 4));
+        // eg if bit = 11 then hex_values[2] |= 8>>3, eg [0000;0000;0001]
+    }
+
+    for (i=0; i<=max/4; ++i) {
+        sprintf(&str[len], "%x", hex_values[i]);
+        ++len;
+    }
+    free(hex_values);
+    str[len++] = 0;
+
+    // Add a trivial compression of adjacent chars that are the same so 00000 = 0*5 (runs of 4 or fewer are left as-is),
+    p = str;
+    count = 0;
+    prev_char = (char)0;
+    for (i=0; i<len+1; ++i) {
+        if (i==len || str[i] != prev_char) {
+            // new section, so print pattern for previously stored stuff,
+            // then store new section
+            if (prev_char != (char)0) {
+                if (count <= 4) {
+                    int j;
+                    for (j=0; j<count; ++j) {
+                        *p = prev_char;
+                        ++p;
+                    }
+                } else {
+                    sprintf(p, "%c*%d", prev_char, count);
+                    p += strlen(p);
+                    if (i < len-1) {
+                        *p = ',';
+                        ++p;
+                    }
+                }
+            }
+            prev_char = str[i];
+            count = 1;
+        } else {
+            // new char is part of previous section
+            ++count;
+        }
+    }
+    *p = 0;
+
+    return str;
+}
+
+// Use two completely different compressions and pick the shorter
+static char*
+mycompress(int *a, int n)
+{
+    char *str1, *str2;
+
+    str1 = mycompress1(a, n);
+    str2 = mycompress2(a, n);
+
+    if (!str1 || (str2 && strlen(str2) < strlen(str1))) { // also covers a failed mycompress1/2
+        free(str1);
+        return str2;
+    }
+    free(str2);
+    return str1;
+}
+
+// The only parsing of this string I can find is a strrchr(str, ':')
+// to evaluate the endianness of the nodes, so we have to let that
+// remain as the last entry
 char* opal_hwloc_base_get_topo_signature(hwloc_topology_t topo)
 {
     int nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt;
@@ -2288,8 +2536,47 @@ char* opal_hwloc_base_get_topo_signature(hwloc_topology_t topo)
     endian = "unknown";
 #endif
 
-    opal_asprintf(&sig, "%dN:%dS:%dL3:%dL2:%dL1:%dC:%dH:%s:%s",
-                  nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch, endian);
+    // The signature used to just contain number of nodes, sockets, cores,
+    // hardware threads, etc. With heterogeneous cgroups it's not hard to
+    // have the counts come out the same even though the topologies of the
+    // available elements are different enough to matter
+    //
+    // Example hardware:
+    //     [..../..../..../....][..../..../..../....]
+    //      0    4    8    12    16   20   24   28
+    //     cgset -r cpuset.cpus=10,11,14,15 mycgroup1
+    //     cgset -r cpuset.cpus=26,27,30,31 mycgroup2
+    // the above cgroups would leave only these hardware threads active:
+    //     mycgroup1: [~~~~/~~~~/~~HH/~~HH][~~~~/~~~~/~~~~/~~~~]
+    //     mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~HH/~~HH]
+    // The non-WHOLE-SYSTEM topology would only contain the available part
+    // of the tree and each host would report as 1 socket, 2 cores, etc
+    //
+    // I think a compressed list of the os_indexes in the available mask
+    // should be a modest-sized addition to the string that would suffice
+    // to make the signatures differ for different cgroups
+
+    int npu;
+    int *avail_pus;
+    char *avail_pus_string;
+    obj = NULL;
+    npu = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
+    avail_pus = malloc(npu * sizeof(int));
+    if (!avail_pus) { return strdup("malloc failed"); }
+    npu = 0;
+    while ((obj = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_PU, obj)) != NULL) {
+        avail_pus[npu++] = obj->os_index;
+    }
+    qsort(avail_pus, npu, sizeof(int), &mycompare);
+
+    avail_pus_string = mycompress(avail_pus, npu);
+    opal_asprintf(&sig, "%dN:%dS:%dL3:%dL2:%dL1:%dC:%dH:H%s:%s:%s",
+                  nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt,
+                  avail_pus_string?avail_pus_string:"", arch, endian);
+
+    free(avail_pus);
+    if (avail_pus_string) { free(avail_pus_string); }
+
     return sig;
 }

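The cgroup example in the signature comment can be reproduced with a small standalone sketch. It assumes nothing from hwloc or OPAL: `pu_range_list` and `make_sig` are hypothetical helpers, the signature is cut down to three counts, and the list step only merges plain ranges (no stride detection, no hex alternative), so its output is not what the committed code would print for these cgroups.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a sorted PU list as comma-separated ranges,
 * e.g. {10,11,14,15} -> "10-11,14-15". Output is truncated if out is
 * too small; it is always NUL-terminated. */
static void pu_range_list(const int *pus, int n, char *out, size_t outsz)
{
    size_t len = 0;
    assert(out != NULL && outsz > 0);
    out[0] = '\0';
    for (int i = 0; i < n; ) {
        int j = i;
        int w;
        /* extend j to the end of the contiguous run starting at i */
        while (j + 1 < n && pus[j + 1] == pus[j] + 1) { ++j; }
        if (j > i) {
            w = snprintf(out + len, outsz - len, "%s%d-%d",
                         len ? "," : "", pus[i], pus[j]);
        } else {
            w = snprintf(out + len, outsz - len, "%s%d",
                         len ? "," : "", pus[i]);
        }
        if (w < 0 || (size_t)w >= outsz - len) { return; } /* truncated */
        len += (size_t)w;
        i = j + 1;
    }
}

/* Hypothetical helper: a cut-down signature of socket, core and hardware
 * thread counts, optionally followed by an "HL" tagged PU list. */
static void make_sig(char *out, size_t outsz,
                     int nsocket, int ncore, int nhwt, const char *pu_list)
{
    if (pu_list) {
        snprintf(out, outsz, "%dS:%dC:%dH:HL%s", nsocket, ncore, nhwt, pu_list);
    } else {
        snprintf(out, outsz, "%dS:%dC:%dH", nsocket, ncore, nhwt);
    }
}
```

With mycgroup1 = {10,11,14,15} and mycgroup2 = {26,27,30,31}, both hosts report 1 socket, 2 cores and 4 hardware threads, so the counts-only strings are identical; once the PU list is appended they differ.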