Fixing the partial pack unpack issue. #8769
Conversation
@bosilca it seems the failures are relevant.
Interesting, I can replicate on Intel but they don't appear on M1. Let me take a look.
Commit fb07960 (force-pushed from 6dc87ad):

When unpacking a partial predefined element check the boundaries of the
description vector type, and adjust the memory pointer accordingly (to
reflect not only when a single basic type was correctly unpacked, but
also when an entire blocklen has been unpacked).

Signed-off-by: George Bosilca <[email protected]>
I've avoided reading the unpack_partial bookkeeping before, but now that I've read it I have a lot of questions.

First, structurally: why do we have the asymmetry where iov_ptr/iov_len_local are updated by partial_length, so they step through the 'from' buffer byte by byte when partial datatype unpacking happens, while the conv_ptr being unpacked into gets a coarser update that leaves it at the beginning of the datatype until the whole thing is unpacked, and then steps by the whole datatype size that was just unpacked? I'm guessing that for homogeneous vanilla byte-copying this is unnecessary, and it's there for heterogeneous conversions where perhaps the partial bytes we're writing aren't actually correct until the end, so we don't want the conv_ptr bookkeeping to indicate we've unpacked data that we're not truly done unpacking yet?

But even then I still don't understand the need for unused_bytes, especially in the homogeneous byte-copying case, which seems to be the only thing handled in opal_unpack_partial_predefined(). It really appears to be doing the following (for example, if we're unpacking a 4-byte int and partial_length = 3 bytes unpacked so far):
I think we're using special code to restore data that we didn't need to overwrite in the first place. We already know partial_length = 3, so why aren't we just copying the new data at conv_ptr + partial_length? I could see conversions getting more complex than the vanilla byte-copying case above, but as near as I can tell opal_unpack_partial_predefined() is all vanilla byte copying, no conversions. And I'm trying to figure out whether conversions would require the whole "unused_byte" concept; even then, I think I'd just copy the data byte by byte and run a conversion after the byte copy if relevant.

The biggest bookkeeping confusion I have so far is why opal_unpack_partial_predefined() uses UNPACK_PREDEFINED_DATATYPE, which sure looks like it's literally a memcpy plus bookkeeping that you're then taking care to un-bookkeep via single_elem and by resetting the arguments the macro changed. It looks like you're using a complex higher-level macro with lots of bookkeeping where you don't want any of its bookkeeping behavior, and you're using it for a vanilla memcpy where you already have all the necessary offsets (?).
Thanks for the review, these are all very good questions that I can hopefully answer clearly enough.
For me the problem with a WebEx is that it takes me 30 minutes minimum to process the bookkeeping of one function, so I'm not very useful trying to evaluate this code in real time. The explanation that this isn't meant for the vanilla homogeneous case makes more sense, so I'll be curious whether my current reading is wrong: when I looked up the UNPACK_PREDEFINED_DATATYPE macro, it looked like it boiled down to a vanilla memcpy. If I imagine the same code but going into a pFunction at the bottom, and it's just doing byte reversal for example, it becomes more plausible. Is it meant to also handle size changes, like for an MPI_LONG? That's something my #7919 commit had extra cases to handle, for conversions that grow or shrink the number of bytes as well as reversing them.

I still don't like unused_byte, though. The iov data aren't necessarily all available to look at up front, so the loop that selects a value could run multiple times and reach different conclusions while gradually unpacking a single predefined type (maybe unlikely enough not to really matter, but still). In all the non-resizing cases I think I'd start with a vanilla memcpy of the bytes and then pFunction the result after all the data is available. And even in the resizing MPI_LONG case, I think the extra information needed beyond a vanilla memcpy is just whether the from/to side is big or little endian, so you can decide whether to copy the high or the low bits.
Am I understanding the top-ish-level behavior correctly: common Pack/Unpack goes through opal_generic_simple_unpack_function(), which boils down to vanilla memcpy as it bookkeeps its way through iov and conv_ptr, while external32 goes into opal_unpack_general_function(), which does similar bookkeeping but uses pFunctions at the bottom for conversions? What I'm seeing is that both use opal_unpack_partial_datatype/predefined(), and that function still looks entirely vanilla memcpy-based to me. So I lean toward saying opal_generic_simple_unpack_function() is right (although I still think the main memcpy inside opal_unpack_partial_datatype/predefined() can't be the right macro to use at that level). Then in opal_unpack_general_function() I don't understand its bookkeeping for partials; I don't think it's handling them. But if it were, I think it would have to use something pFunction-based rather than opal_unpack_partial_predefined().
Another part of the design I was wondering about: to what extent are iov[] entries required to be fully consumed? E.g., I see the iov[].iov_len fields get updated to tell the caller how much was actually used. In the case of pack this doesn't strike me as too unreasonable. For example, maybe the caller gives iov_count=3 with iov[].iov_len entries of 10,64,64, and when we pack our 16 bytes we only use 8,8,0, so we return iov_count=2 with the iov[].iov_len entries set to 8,8, showing what we packed into. For unpack this doesn't make quite as much sense, but if iov_count=1 it could still be okay, I think. My rough understanding of the heterogeneous case is that we're not packing partial datatypes, and just declining to fully fill iov[] entries if the boundaries don't line up. Is the higher-level code, in opal_convertor_unpack() for example, okay with partial use of its iov[] entries?
I was starting to experiment with partial.c to see what happens if I try to artificially say that one side is big endian. But before I got to that, I'm not understanding the calls. I modified the code in your test to just have 3 INTs in the packed buf. So before I look into the bookkeeping, is my test below legitimate?
I guess I can review as "approved" because I'm convinced the changes are improvements and at least as correct as what was there before. I still have questions in the discussion above in the category of "does this cover everything?", but I'm at least confident the bookkeeping still handles correctly anything it used to handle correctly, plus the fixed case.
The iov entries are not supposed to be consumed entirely, such that you should (with a little bit of code) be able to use them on an incoming stream to unpack local types. I don't like either that I had to alter the iov, but in combination with the fact that the iov_count is also altered, we can make it work as long as the upper level keeps a copy of the original iov. Going back to your example: based on a quick look at your test, I think the code is legitimate and should work. Let's move this discussion to a datatype issue while we release 5.0.
I'm more familiar with opal_generic_simple_unpack_function() and opal_unpack_general_function() in the context of MPI_Unpack/Unpack_external() than with other codepaths like opal_convertor_unpack(). So for continued discussion of the testcase, I made this bug: |
@markalle After the merge of this PR, is this code of yours supposed to work as expected? It is not working for me. https://gist.github.com/markalle/ad7e69f026471e2baa8e842c938d8048 |
@dalcinl I'm talking to @markalle now, and we believe that for this test (https://gist.github.com/markalle/ad7e69f026471e2baa8e842c938d8048) to pass, we need #8735 PR instead of THIS PR. |
When unpacking a partial predefined element check the boundaries of the
description vector type, and adjust the memory pointer accordingly (to
reflect not only when a single basic type was correctly unpacked, but
also when an entire blocklen has been unpacked).
Fixes #8466.
Signed-off-by: George Bosilca <[email protected]>