Commit 42fed5d

make topology signature include a compressed list of available PUs
When mpirun collects topologies it first asks for signatures, and then only asks a host to send its topology if the signature is something new. But with heterogeneous cgroups the signature doesn't have enough data to notice differences that matter.

An example of the previous signature on one of our machines is
    2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many NUMA nodes, sockets, cores, PUs etc. the host has. If I create a small cgroup with just 4 PUs from one of the cores, the signature becomes
    1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le
But if I did that on two different machines and used different cores on each, the signature would look the same.

One way to distinguish such signatures is to also list the available PUs by os_index. (I might be okay with logical indexes if they were logical under a WHOLE-SYSTEM topology, but in the non-WHOLE-SYSTEM topologies we're using I don't think the logical indexes would be good enough.) This list could be large, so I've included a couple of different compressions of it and use whichever comes out shorter.
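To make the collision concrete, here is a minimal sketch (not code from this commit) of two hosts whose cgroups expose different PUs but identical counts. The `host_view` struct, the function names, and the choice of PUs 0-3 versus 8-11 are hypothetical; the `pu_field` values are what the hex encoding described in this message would be expected to give for those PU sets.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-host summary; not an Open MPI or hwloc type. */
struct host_view {
    int nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt;
    const char *pu_field;  /* compressed list of available PUs */
};

/* Previous signature format: object counts only. */
static void count_signature(const struct host_view *h, char *buf, size_t len)
{
    assert(h != NULL && buf != NULL);
    snprintf(buf, len, "%dN:%dS:%dL3:%dL2:%dL1:%dC:%dH:ppc64le:le",
             h->nnuma, h->nsocket, h->nl3, h->nl2, h->nl1, h->ncore, h->nhwt);
}

/* Signature with the extra "H<compressed PUs>" field before arch and endian. */
static void pu_signature(const struct host_view *h, char *buf, size_t len)
{
    assert(h != NULL && buf != NULL);
    snprintf(buf, len, "%dN:%dS:%dL3:%dL2:%dL1:%dC:%dH:H%s:ppc64le:le",
             h->nnuma, h->nsocket, h->nl3, h->nl2, h->nl1, h->ncore, h->nhwt,
             h->pu_field);
}

/* A topology only needs to be fetched when the signature is new. */
static int same_signature(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

With host A restricted to PUs 0-3 (`pu_field` "Xf") and host B to PUs 8-11 (`pu_field` "X00f"), `count_signature` gives the same string for both, so B's topology would never be requested, while `pu_signature` tells them apart.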
The first is a comma-separated list that compresses simple ranges like 10-15 and also detects patterns analogous to MPI_Type_vector(), e.g.
    {16,17,18,19,20,21,22,23} = 16-23
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 1-5+10*3
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119, 200,201,202,203,204,205,206,207,208 } = 2-3+8*4,102-103+8*3,200-208
    {1,3,6,7,9,11,12} = 1,3,6,7,9,11,12

The second compression is a hex string containing the indexes, further shrunk by noticing if the same char is repeated 5+ times, so
    {16,17,18,19,20,21,22,23} = 0000ff
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 7c1f07c
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119, 200,201,202,203,204,205,206,207,208 } = 30303030*18,303030*20,ff8
    {1,3,6,7,9,11,12} = 5358

So final signature strings end up looking like this:

example with PUs 0-175:
    2N:2S:22L3:22L2:44L1:44C:176H:HXf*44:ppc64le:le
here "f*44" won the compression contest against "0-175"

example with PUs 12-15:
    1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
here "000f" won against "12-15"

example with every other block of 4 PUs:
    2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+8*22:ppc64le:le
here "0-3+8*22" won against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f"

Signed-off-by: Mark Allen <[email protected]>
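The hex examples above can be reproduced with a short stand-alone sketch. This is a simplified re-implementation for illustration, not the `mycompress2()` from the diff: the function name `hex_encode` is hypothetical, the leading "X" marker is omitted, and the comma rule (a comma after a collapsed run unless it ends the string) is inferred from the examples in this message.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: encode a sorted list of non-negative indexes as hex digits,
 * 4 indexes per digit with the lowest index in the high bit, then
 * collapse runs of 5 or more identical digits into "c*N".
 * Returns a malloc'd string the caller must free, or NULL on failure. */
static char *hex_encode(const int *a, int n)
{
    assert(a != NULL && n > 0);
    int ndigits = a[n - 1] / 4 + 1;

    /* A collapsed run "c*N," is never longer than the N >= 5 digits it
     * replaces, so ndigits + 1 bytes always suffice for the output. */
    unsigned char *digits = calloc((size_t)ndigits, 1);
    char *out = malloc((size_t)ndigits + 1);
    if (!digits || !out) {
        free(digits);
        free(out);
        return NULL;
    }

    /* e.g. index 11 sets the low bit of digit 2: 8 >> 3 == 1 */
    for (int i = 0; i < n; ++i) {
        digits[a[i] / 4] |= (unsigned char)(8 >> (a[i] % 4));
    }

    size_t len = 0;
    for (int i = 0; i < ndigits; ) {
        int run = 1;
        while (i + run < ndigits && digits[i + run] == digits[i]) {
            ++run;
        }
        if (run >= 5) {
            /* collapsed run, followed by a comma unless it ends the string */
            len += (size_t)snprintf(out + len, (size_t)ndigits + 1 - len,
                                    "%x*%d%s", (unsigned)digits[i], run,
                                    (i + run < ndigits) ? "," : "");
        } else {
            for (int k = 0; k < run; ++k) {
                out[len++] = "0123456789abcdef"[digits[i]];
            }
        }
        i += run;
    }
    out[len] = '\0';
    free(digits);
    return out;
}
```

Called with {16,...,23} this yields "0000ff" (a run of four zeros is too short to collapse), and with PUs 0-175 it yields "f*44", matching the signatures shown above.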
1 parent 0a091d0 commit 42fed5d

1 file changed

+289
-2
lines changed

opal/mca/hwloc/base/hwloc_base_util.c

Lines changed: 289 additions & 2 deletions
@@ -2273,6 +2273,254 @@ int opal_hwloc_get_sorted_numa_list(hwloc_topology_t topo, char* device_name, op
     return OPAL_ERR_NOT_FOUND;
 }

+static int
+mycompare(const void* inta, const void* intb)
+{
+    if (*(int*)inta < *(int*)intb) { return -1; }
+    if (*(int*)inta > *(int*)intb) { return 1; }
+    return 0;
+}
+
+// Patterns to compress adequately:
+// 1. basic 0,1,2,3 to 0-3
+// 2. repeated ranges of same len at same offset, eg 0,1,2 10,11,12 20,21,22 = 0,1,2+10*3
+// note: assumes input is already sorted
+static char*
+mycompress1(int *a, int n)
+{
+    char *str;
+    int i, maxlen, len;
+
+    maxlen = 128;
+    len = 0;
+    str = malloc(maxlen);
+    if (!str) { return NULL; }
+    str[0] = 0;
+
+    // start with "L" followed by the list
+    str[len++] = 'L';
+    str[len] = 0;
+
+    int pattern_blocksize = 0; // contiguous elements per block
+    int pattern_blockcount = 0; // number of blocks
+    int pattern_stride = 0;
+    int pattern_first_block_starting_offset = -1;
+    int pattern_prev_block_starting_offset = -1;
+    int pattern_nelements = 0;
+    int is_start_of_new_pattern = 0;
+    for (i=0; i<n; ++i) {
+        // see if this element a[i] is the start of a new pattern or
+        // if it's adding to the previous, we may increment i further
+        // processing multiple elements into the previous pattern
+        // only update the pattern_* data for the previous pattern
+        // (if we're starting a new pattern, we put it into pattern_*,
+        // later, after printing what was already in pattern_*)
+
+        // start by supposing it is a new pattern, change that decision
+        // if we process it into the existing pattern_* below
+        is_start_of_new_pattern = 1;
+
+        if (pattern_blocksize > 0) {
+            // detect [pppx]
+            if (pattern_blockcount == 1 &&
+                a[i] == pattern_prev_block_starting_offset + pattern_blocksize)
+            {
+                ++pattern_blocksize;
+                ++pattern_nelements;
+                is_start_of_new_pattern = 0;
+            } else {
+                // detect [pppp xxxx] or [pppp pppp xxxx] etc
+                if (pattern_blockcount == 1 ||
+                    (pattern_blockcount > 1 &&
+                    (a[i] == pattern_prev_block_starting_offset + pattern_stride)))
+                {
+                    // this a[i] starts at the right offset, now does it contain
+                    // pattern_blocksize contiguous entries?
+                    int j, has_enough_entries = 1;
+                    for (j=1; j<pattern_blocksize; ++j) {
+                        if (i+j >= n) { has_enough_entries = 0; }
+                        else if (a[i+j] != a[i] + j) { has_enough_entries = 0; }
+                    }
+                    if (has_enough_entries) {
+                        pattern_stride = a[i] - pattern_prev_block_starting_offset;
+                        pattern_prev_block_starting_offset = a[i];
+                        ++pattern_blockcount;
+                        pattern_nelements += pattern_blocksize;
+                        is_start_of_new_pattern = 0;
+                        i = i+j-1;
+                    }
+                }
+            }
+        }
+
+        // if started_new_pattern || i is the last element
+        //     if previous pattern exists
+        //         print previous pattern
+        //     record new pattern
+        if (is_start_of_new_pattern || i==n-1) {
+            if (pattern_blocksize > 0) {
+                // print previous pattern
+                // make sure there's room for 4 ints plus some punctuation
+                // being conservative in supposing all ints used are <8 chars
+                if (len + 40 >= maxlen) {
+                    maxlen *= 1.15;
+                    maxlen += 128;
+                    str = realloc(str, maxlen);
+                    if (!str) { return NULL; }
+                }
+                // If the string isn't 'long enough' for compressed notation to be worthwhile:
+                if (!(pattern_nelements > 4 || (pattern_nelements >= 2 && pattern_blockcount == 1))) {
+                    int j, k;
+                    for (j=0; j<pattern_blockcount; ++j) {
+                        for (k=0; k<pattern_blocksize; ++k) {
+                            if (len != 1) { // the string starts with "L" for list
+                                sprintf(&str[len], ",");
+                                ++len;
+                            }
+                            sprintf(&str[len], "%d", pattern_first_block_starting_offset + j*pattern_stride + k);
+                            len = strlen(str);
+                        }
+                    }
+                }
+                else {
+                    if (len != 1) { // the string starts with "L" for list
+                        sprintf(&str[len], ",");
+                        ++len;
+                    }
+                    if (pattern_blockcount > 1) {
+                        sprintf(&str[len], "%d-%d+%d*%d",
+                                pattern_first_block_starting_offset,
+                                pattern_first_block_starting_offset + pattern_blocksize - 1,
+                                pattern_stride, pattern_blockcount);
+                    } else {
+                        sprintf(&str[len], "%d-%d",
+                                pattern_first_block_starting_offset,
+                                pattern_first_block_starting_offset + pattern_blocksize - 1);
+                    }
+                    len = strlen(str);
+
+                }
+            }
+
+            // record new pattern
+            pattern_first_block_starting_offset = a[i];
+            pattern_prev_block_starting_offset = a[i];
+            pattern_blocksize = 1;
+            pattern_blockcount = 1;
+            pattern_stride = 0;
+            pattern_nelements = 1;
+        }
+        // The overall logic till this point is
+        // loop i over some of a[], and either
+        // 1. put a[i] in the previous pattern (and print if it's end of a[])
+        // 2. put a[i] in a new pattern and print the previous pattern
+        // This leaves out printing in the case of a[i] being a new pattern
+        // and also being the end of a[]
+        if (is_start_of_new_pattern && i==n-1) {
+            if (len != 0) {
+                sprintf(&str[len], ",");
+                ++len;
+            }
+            sprintf(&str[len], "%d", pattern_first_block_starting_offset);
+            len = strlen(str);
+        }
+    }
+
+    return str;
+}
+// This one just makes a hex string of the available bits
+// update: and does a tiny bit of trivial compression of adjacent
+// hex characters
+// note: assumes input is already sorted
+static char*
+mycompress2(int *a, int n)
+{
+    unsigned char *hex_values;
+    char *str, prev_char, *p;
+    int i, len, max, count;
+
+    max = a[n-1];
+    hex_values = malloc((max / 4 + 4) * sizeof(unsigned char));
+    str = malloc(max/4 + 32);
+    if (!str || !hex_values) { return NULL; }
+
+    // start with "X" followed by the hex bitmask
+    len = 0;
+    str[len++] = 'X';
+
+    for (i=0; i<=max/4; ++i) {
+        hex_values[i] = 0;
+    }
+    for (i=0; i<n; ++i) {
+        int bit = a[i];
+
+        hex_values[bit/4] |= (unsigned char)(8>>(bit % 4));
+        // eg if bit = 11 then hex_values[2] |= 8>>3, eg [0000;0000;0001]
+    }
+
+    for (i=0; i<=max/4; ++i) {
+        sprintf(&str[len], "%x", hex_values[i]);
+        ++len;
+    }
+    free(hex_values);
+    str[len++] = 0;
+
+    // Add a trivial compression of adjacent chars that are the same so 0000 = 0*4,
+    p = str;
+    count = 0;
+    prev_char = (char)0;
+    for (i=0; i<len+1; ++i) {
+        if (i==len || str[i] != prev_char) {
+            // new section, so print pattern for previously stored stuff,
+            // then store new section
+            if (prev_char != (char)0) {
+                if (count <= 4) {
+                    int j;
+                    for (j=0; j<count; ++j) {
+                        *p = prev_char;
+                        ++p;
+                    }
+                } else {
+                    sprintf(p, "%c*%d", prev_char, count);
+                    p += strlen(p);
+                    if (i < len-1) {
+                        *p = ',';
+                        ++p;
+                    }
+                }
+            }
+            prev_char = str[i];
+            count = 1;
+        } else {
+            // new char is part of previous section
+            ++count;
+        }
+    }
+    *p = 0;
+
+    return str;
+}
+
+// Use two completely different compressions and pick the shorter
+static char*
+mycompress(int *a, int n)
+{
+    char *str1, *str2;
+
+    str1 = mycompress1(a, n);
+    str2 = mycompress2(a, n);
+
+    if (strlen(str2) < strlen(str1)) {
+        free(str1);
+        return str2;
+    }
+    free(str2);
+    return str1;
+}
+
+// The only parsing of this string I can find is a strrchr(str,":")
+// to evaluate the endianness of the nodes, so we have to let that
+// remain as the last entry
 char* opal_hwloc_base_get_topo_signature(hwloc_topology_t topo)
 {
     int nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt;
@@ -2310,8 +2558,47 @@ char* opal_hwloc_base_get_topo_signature(hwloc_topology_t topo)
     endian = "unknown";
 #endif

-    opal_asprintf(&sig, "%dN:%dS:%dL3:%dL2:%dL1:%dC:%dH:%s:%s",
-                  nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch, endian);
+    // The signature used to just contain number of nodes, sockets, cores,
+    // hardware threads, etc. With heterogeneous cgroups it's not hard to
+    // have the counts come out the same even though the topologies of the
+    // available elements are different enough to matter
+    //
+    // Example hardware:
+    //   [..../..../..../....][..../..../..../....]
+    //    0    4    8    12    16   20   24   28
+    //   cgset -r cpuset.cpus=10,11,14,15 mycgroup1
+    //   cgset -r cpuset.cpus=26,27,30,31 mycgroup2
+    // the above cgroups would leave only these hardware threads active:
+    //   mycgroup1: [~~~~/~~~~/~~HH/~~HH][~~~~/~~~~/~~~~/~~~~]
+    //   mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~HH/~~HH]
+    // The non-WHOLE-SYSTEM topology would only contain the available part
+    // of the tree and each host would report as 1 socket, 2 cores, etc
+    //
+    // I think a compressed list of the os_indexes in the available mask
+    // should be a modest-sized addition to the string that would suffice
+    // to make the signatures differ for different cgroups
+
+    int npu;
+    int *avail_pus;
+    char *avail_pus_string;
+    obj = NULL;
+    npu = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
+    avail_pus = malloc(npu * sizeof(int));
+    if (!avail_pus) { return strdup("malloc failed"); }
+    npu = 0;
+    while ((obj = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_PU, obj)) != NULL) {
+        avail_pus[npu++] = obj->os_index;
+    }
+    qsort(avail_pus, npu, sizeof(int), &mycompare);
+
+    avail_pus_string = mycompress(avail_pus, npu);
+    opal_asprintf(&sig, "%dN:%dS:%dL3:%dL2:%dL1:%dC:%dH:H%s:%s:%s",
+                  nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt,
+                  avail_pus_string?avail_pus_string:"", arch, endian);
+
+    free(avail_pus);
+    if (avail_pus_string) { free(avail_pus_string); }
+
     return sig;
 }
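The diff's comment about strrchr explains why the new PU field is placed before the architecture and endianness. The following sketch illustrates that constraint under stated assumptions: `sig_last_field` is a hypothetical helper, not an Open MPI function, and it assumes (as the formats in the commit message suggest) that the compressed field never contains a ':'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch: return the text after the last ':' of a signature, or NULL if
 * the string contains no ':'. The result points into the input string. */
static const char *sig_last_field(const char *sig)
{
    assert(sig != NULL);
    const char *colon = strrchr(sig, ':');
    return colon ? colon + 1 : NULL;
}
```

For both "2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le" and "2N:2S:22L3:22L2:44L1:44C:176H:HXf*44:ppc64le:le" this returns "le", so a reader that only looks at the last field is unaffected by the added one.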

0 commit comments
