Fixing the partial pack unpack issue. #8769

Merged (1 commit) on Apr 13, 2021

Conversation

@bosilca (Member) commented Apr 6, 2021

When unpacking a partial predefined element, check the boundaries of the
description vector type, and adjust the memory pointer accordingly (to
reflect not only when a single basic type was correctly unpacked, but
also when an entire blocklen has been unpacked).

Fixes #8466.

Signed-off-by: George Bosilca [email protected]

@hoopoepg (Contributor) commented Apr 6, 2021

@bosilca it seems the failures are relevant to this change

@bosilca (Member, Author) commented Apr 6, 2021

Interesting, I can replicate this on Intel, but the failures don't appear on the M1. Let me take a look.

@markalle (Contributor) commented Apr 7, 2021

I've avoided reading the unpack_partial bookkeeping before, but now that I've read it I have a lot of questions.

First, structurally, why do we have the asymmetry where iov_ptr/iov_len_local is updated by partial_length, so it steps through the 'from' buffer byte by byte when partial datatype unpacking happens, while the conv_ptr being unpacked into has a coarser update that leaves it at the beginning of the datatype until the whole thing is unpacked, then steps by the whole datatype size that was just unpacked? I'm guessing that for homogeneous vanilla byte-copying it's unnecessary, and it's there for heterogeneous conversions where the partial bytes we're writing aren't actually correct until the end, so we don't want the conv_ptr bookkeeping to indicate we've unpacked data that we're not truly done unpacking yet?

But even then I still don't understand the need for unused_bytes, especially in the homogeneous byte-copying case, which seems to be the only thing handled in opal_unpack_partial_predefined(). It really appears to be doing the following (for example, if we're unpacking a 4-byte int and partial_length = 3 bytes unpacked so far):

   unpack buffer at conv_ptr is [ data data data, nodatayet ]
   we make saved_buffer as [ data data data, nodatayet ]
   and make temporary_buffer as [ 7F 7F 7F newdata ]
   copy the whole thing so conv_ptr gets [ 7F 7F 7F newdata ]
   walk the 7F entries of conv_ptr restoring data,data,data back out of saved_buffer

I think we're using a special sentinel value to restore data that we didn't need to overwrite in the first place. We already know partial_length = 3, so why aren't we just copying the newdata to conv_ptr + partial_length?
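
A minimal sketch of that direct copy for the homogeneous case; element_size and bytes_avail are assumed names for quantities the convertor would know, not actual fields:

    /* resume copying into the element at the already-unpacked offset */
    size_t missing = element_size - partial_length;          /* bytes still needed */
    size_t to_copy = (bytes_avail < missing) ? bytes_avail : missing;
    memcpy( conv_ptr + partial_length, iov_ptr, to_copy );   /* land mid-element */
    partial_length += to_copy;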

I could see conversions getting more complex than the vanilla byte-copying case shown above, but as near as I can tell opal_unpack_partial_predefined() is all vanilla byte copying like the above, with no conversions. And I'm trying to figure out whether conversions would require the whole "unused_byte" concept; even then, I think I'd just copy the data byte by byte and run a conversion after the byte copy if relevant.

And the biggest bookkeeping confusion I have so far is why opal_unpack_partial_predefined uses UNPACK_PREDEFINED_DATATYPE, which sure looks like it's literally a memcpy plus bookkeeping that you then take care to un-bookkeep via single_elem and by resetting the arguments the macro changed. It looks like you're using a complex higher-level macro with lots of bookkeeping where you don't want any of its bookkeeping behavior, and you're using it for a vanilla memcpy where you already have all the necessary offsets (?).

@bosilca (Member, Author) commented Apr 7, 2021

Thanks for the review, these are all very good questions that I can hopefully answer clearly enough.

  1. Because the iov data has been consumed while the data in the typemap has not yet been completely delivered (since it is potentially partial). This code supports more than the simple contiguous memcpy case; in fact, the case where we do not need any conversion is handled in a different code path (where the receive convertor is just a scatter) and we simply memcpy contiguous bytes.
  2. In the homogeneous case we should not go through this code. With the exception of some floating-point representations where we manipulate the bits, as long as we are converting between little and big endian, byte placement in memory is the key to a successful conversion. Thus, the idea in the code is that whatever piece of information we receive, we place it in the correct memory location by moving only the bytes that differ from the unused byte (see the sketch after this list). This approach stores all temporaries in the user memory, allowing the convertor and the caller to be memoryless (nobody needs to remember the partially moved data).
  3. In the homogeneous case your example would work, but this code is not supposed to be used in the homogeneous case.
  4. opal_unpack_partial_predefined is not supposed to do a vanilla copy; it should end up calling the pFunction conversion for a single element. Let me take a look at this part, I think it might not do what was expected, in which case my entire explanation here reflects what the code should have been doing rather than what it actually does.
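
A minimal sketch of the unused-byte scheme from point 2, assuming a pFunction-style convert_one callback does the full-element conversion; every name here is illustrative, not the actual opal_unpack_partial_predefined code:

    #include <stddef.h>
    #include <string.h>

    #define UNUSED_BYTE 0x7F  /* sentinel for "no new data at this position" */

    /* The user memory itself holds the partial result, so nothing needs to
     * be remembered between calls (the "memoryless" property above). */
    static void unpack_partial_sketch( unsigned char *user_ptr,       /* element in user memory */
                                       const unsigned char *new_bytes,
                                       size_t offset, size_t new_len,
                                       size_t elem_size,
                                       void (*convert_one)(const unsigned char *, unsigned char *) )
    {
        unsigned char saved[16], temp[16];         /* assume elem_size <= 16 */
        memcpy( saved, user_ptr, elem_size );      /* remember the partial result */
        memset( temp, UNUSED_BYTE, elem_size );    /* mark positions with no new data */
        memcpy( temp + offset, new_bytes, new_len );
        convert_one( temp, user_ptr );             /* full-element conversion */
        for( size_t i = 0; i < elem_size; i++ )    /* restore what the sentinel clobbered */
            if( UNUSED_BYTE == user_ptr[i] )
                user_ptr[i] = saved[i];
    }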

@gpaulsen (Member) commented Apr 7, 2021

@bosilca @markalle Are you two interested in a webex? I can host one whenever is convenient for you two.

@markalle (Contributor) commented Apr 7, 2021

For me the problem with a webex is that it takes me a minimum of about 30 minutes to process the bookkeeping of one function, so I'm not very useful trying to evaluate this code in real time.

The explanation that this isn't meant for the vanilla homogeneous case makes more sense. So I'll be curious whether my current reading is wrong, because when I looked up the UNPACK_PREDEFINED_DATATYPE macro it appeared to boil down to a vanilla memcpy.

If I imagine the same code going into a pFunction at the bottom, doing byte reversal for example, it becomes more plausible. Is it meant to also handle size changes, like for an MPI_LONG? That's something my #7919 commit had extra cases for, when the conversion grows or shrinks the number of bytes as well as reversing them.

I still don't like unused_byte though. The iov data aren't necessarily all available to look at up front, so the loop that selects a value could run multiple times and reach different conclusions while gradually unpacking a single predefined type (maybe unlikely enough not to really matter, but still). In all the non-resizing cases I think I'd start with a vanilla memcpy of the bytes and then pFunction the result once all the data is available. Even in the resizing LONG case, I think the only extra information needed beyond a vanilla memcpy is whether the from/to side is big or little endian, so you can decide whether to copy the high or the low bits.
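
A sketch of that deferred-conversion alternative; user_ptr, chunk_len, elem_size, and convert_one are assumed names:

    /* Hypothetical alternative bookkeeping: plain byte accumulation, with
     * the conversion deferred until the element is complete. */
    memcpy( user_ptr + partial_length, iov_ptr, chunk_len );
    partial_length += chunk_len;
    if( partial_length == elem_size ) {
        convert_one( user_ptr, user_ptr );  /* e.g. an in-place byte swap */
        partial_length = 0;                 /* element done; reset bookkeeping */
    }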

@markalle (Contributor) commented Apr 8, 2021

Am I understanding the top-level behavior correctly: common Pack/Unpack goes through opal_generic_simple_unpack_function(), which boils down to vanilla memcpy as it bookkeeps its way through iov and conv_ptr, while external32 goes into opal_unpack_general_function(), which does similar bookkeeping but uses pFunctions at the bottom for conversions?

What I'm seeing is that both use opal_unpack_partial_datatype/predefined(), and that function still looks entirely vanilla memcpy-based to me. So I lean toward saying opal_generic_simple_unpack_function() is right (although I still think the main memcpy inside opal_unpack_partial_datatype/predefined() can't be the right macro to use at that level).

Then in opal_unpack_general_function() I don't understand its bookkeeping for partials; I don't think it's handling them. But if it were, I think it would have to use something pFunction-based rather than opal_unpack_partial_predefined().

@markalle (Contributor) commented Apr 8, 2021

Another part of the design I was wondering about: to what extent are iov[] entries required to be fully consumed? E.g., I see the iov[].iov_len fields get updated to tell the caller how much was actually used.

In the case of pack this doesn't strike me as too unreasonable. For example, maybe the caller gives iov_count=3 with iov[].iov_len entries of {10,64,64}, and when we pack our 16 bytes we only use 8,8,0, so we return iov_count=2 with the iov[].iov_len entries set to {8,8}, showing what we packed into. For unpack this doesn't make quite as much sense, but if iov_count=1 it could still be okay, I think.

My rough understanding of the heterogeneous case is that we're not packing partial datatypes, just declining to fully fill iov[] entries if the boundaries don't line up. Is the higher-level code, e.g. up in opal_convertor_unpack(), okay with partial use of its iov[] entries?

@markalle (Contributor) commented Apr 8, 2021

I was starting to experiment with partial.c to see what happens if I artificially say that one side is big endian. But before I got to that, I'm not understanding the calls. I modified the code in your test to have just 3 INTs in the packed buf:
01 00 00 00, 02 00 00 00, 03 00 00 00
Then I try to convert 5 bytes followed by 7 bytes, but rbuf isn't coming out right. I assume I'm just making the calls wrong, but I thought this is about what partial.c is doing.

So before I look into the bookkeeping, is my test below legitimate?

    int *rbuf, *packed;
    struct iovec iov[1];
    uint32_t iov_count;
    size_t max_data, length;
    opal_convertor_t *convertor;

    rbuf = malloc( 4 * sizeof(int) );
    packed = malloc( 4 * sizeof(int) );

    packed[0] = 1;
    packed[1] = 2;
    packed[2] = 3;
    memset(rbuf, -1, 4 * sizeof(int));
    convertor = opal_convertor_create( opal_local_arch, 0 );
    opal_convertor_prepare_for_recv( convertor, MPI_INT, 3, rbuf );
    length = 0;  /* how many bytes have been unpacked so far */

    /* first chunk: 5 bytes (one full int plus 1 byte of the next) */
    iov[0].iov_base = (char*)packed + length;  /* byte offset into the stream */
    iov[0].iov_len = 5;
    max_data = iov[0].iov_len;
    iov_count = 1;
    opal_convertor_unpack( convertor, iov, &iov_count, &max_data );
    length += max_data;
    printf("unpacked %zu bytes:", max_data);
    print_buf(rbuf, 16);

    /* second chunk: the remaining 7 bytes */
    iov[0].iov_base = (char*)packed + length;
    iov[0].iov_len = 7;
    max_data = iov[0].iov_len;
    iov_count = 1;
    opal_convertor_unpack( convertor, iov, &iov_count, &max_data );
    length += max_data;
    printf("unpacked %zu bytes:", max_data);
    print_buf(rbuf, 16);

@markalle (Contributor) commented:

I guess I can review as "approved" because I'm convinced the changes are improvements and are at least as correct as what was there before.

I still have questions in the discussion above that are in the category of "does this cover everything?", but I'm at least confident the bookkeeping still handles correctly everything it used to handle correctly, plus the fixed case.

@bosilca (Member, Author) commented Apr 13, 2021

The iov entries are not supposed to be consumed entirely, so that you should (with a little bit of code) be able to use them on an incoming stream to unpack local types. I don't like that I had to alter the iov either, but in combination with the fact that iov_count is also altered, we can make it work as long as the upper level keeps a copy of the original iov.

Going back to your example (iov_count=3, iov[].iov_len = {10,64,64}): you are correct, we should return {8,8} with iov_count set to 2 for pack, assuming we are doing a sender-side pack and that the datatype is not locally contiguous. This is the main reason we need to handle partial data: we can send the data via a different packing mechanism (one that uses an optimized datatype description because the typemap is contiguous in memory) and unpack it into a non-contiguous memory layout. The unpack is not symmetric; it must handle all cases and never leave anything in the input buffer. This means that calling unpack with iov_count=3, iov[].iov_len = {10,64,64} should return iov_count=2, iov[].iov_len = {0, 58, 64} after unpacking 2 doubles, the first one from a single iov entry and the second one from two (2 bytes from one and 6 bytes from the other).
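
To make the caller-side mechanics concrete, a hedged sketch reusing the numbers above (receive_stream and the convertor setup are assumed); since unpack rewrites iov_len and iov_count, the caller works on a copy:

    char *buf = receive_stream;                  /* assumed incoming bytes */
    struct iovec orig[3] = { { buf,      10 },
                             { buf + 10, 64 },
                             { buf + 74, 64 } };
    struct iovec iov[3];
    uint32_t iov_count = 3;
    size_t   max_data  = 0;
    memcpy( iov, orig, sizeof(orig) );           /* give unpack a scratch copy */
    opal_convertor_unpack( convertor, iov, &iov_count, &max_data );
    /* max_data: bytes actually unpacked; orig[] still holds the original
       offsets for computing where the next chunk starts. */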

Based on a quick look at your code, I think it is legitimate and should work.

Let's move this discussion to a datatype issue while we release the 5.0.

@markalle (Contributor) commented:

I'm more familiar with opal_generic_simple_unpack_function() and opal_unpack_general_function() in the context of MPI_Unpack/Unpack_external() than with other code paths like opal_convertor_unpack(). So for continued discussion of the test case, I opened this bug: #8818

@dalcinl (Contributor) commented Apr 22, 2021

@markalle After the merge of this PR, is this code of yours supposed to work as expected? It is not working for me. https://gist.github.com/markalle/ad7e69f026471e2baa8e842c938d8048

@gpaulsen (Member) commented:

@dalcinl I'm talking to @markalle now, and we believe that for this test (https://gist.github.com/markalle/ad7e69f026471e2baa8e842c938d8048) to pass, we need PR #8735 rather than this PR.

@dalcinl (Contributor) commented Apr 30, 2021

@gpaulsen Looks like #8735 did not make it (mpi4py test log). The code in the gist you linked is not working on my side either.

@markalle Is the code in your gist supposed to work after the merge of #8735 ?

dalcinl added a commit to mpi4py/mpi4py that referenced this pull request on May 1, 2021

@awlauria (Contributor) commented May 3, 2021

@dalcinl @gpaulsen maybe we should open up a new issue for this failure?

@dalcinl (Contributor) commented May 4, 2021

@awlauria I opened #8918 to follow up.

Successfully merging this pull request may close these issues.

datatype unpack: potential incorrect assert