Skip to content

yalla with irregular contig datatype -- Fixes 3566 #3567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 24, 2017

Conversation

markalle
Copy link
Contributor

Yalla has a macro PML_YALLA_INIT_MXM_REQ_DATA that checks if a datatype
is contiguous via opal_datatype_is_contiguous_memory_layout(dt,count)
and if so it selects a size and lb that presumably is what will rdma, as

            ompi_datatype_type_size(_dtype, &size); \
            ompi_datatype_type_lb(_dtype, &lb); \

This failed when I gave it a datatype constructed as [ ...] with extent 4.
What I mean by that datatype is

    lens[0] = 3;
    disps[0] = 1;
    types[0] = MPI_CHAR;
    MPI_Type_struct(1, lens, disps, types, &tmpdt);
    MPI_Type_create_resized(tmpdt, 0, 4, &mydt);

So there are 3 chars at offset 1, and the LB is 0 and the UB is 4.

So that macro decides that size=4 and lb=0 and later I suppose size is getting
updated to 3 for the final rdma, and so a send of a buffer
[ 0 1 2 3 ] gets recved as [ 0 1 2 _ ]. I think it should use the true lb
and the true extent.

For "regular" contig datatypes it would be the same, and for the irregular
ones that are still deemed contiguous by that utility function it should
still be the right thing to use.

Signed-off-by: Mark Allen [email protected]

Yalla has a macro PML_YALLA_INIT_MXM_REQ_DATA that checks if a datatype
is contiguous via opal_datatype_is_contiguous_memory_layout(dt,count)
and if so it selects a size and lb that presumably is what will rdma, as
            ompi_datatype_type_size(_dtype, &size); \
            ompi_datatype_type_lb(_dtype, &lb); \

This failed when I gave it a datatype constructed as [ ...] with extent 4.
What I mean by that datatype is
    lens[0] = 3;
    disps[0] = 1;
    types[0] = MPI_CHAR;
    MPI_Type_struct(1, lens, disps, types, &tmpdt);
    MPI_Type_create_resized(tmpdt, 0, 4, &mydt);
So there are 3 chars at offset 1, and the LB is 0 and the UB is 4.

So that macro decides that size=4 and lb=0 and later I suppose size is getting
updated to 3 for the final rdma, and so a send of a buffer
[ 0 1 2 3 ] gets recved as [ 0 1 2 _ ]. I think it should use the true lb
and the true extent.

For "regular" contig datatypes it would be the same, and for the irregular
ones that are still deemed contiguous by that utility function it should
still be the right thing to use.

Signed-off-by: Mark Allen <[email protected]>
@markalle
Copy link
Contributor Author

Here's a gist with a testcase example that hits the problem when run with the yalla pml:
https://gist.github.com/markalle/24f101dfb20b6dfc5ebbe3c8aaa9505a

@gpaulsen
Copy link
Member

@jladd-mlnx @artpol84 - Please take a look.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice catch.

@bwbarrett
Copy link
Member

bot:ompi:retest

@jjhursey
Copy link
Member

Fixes Issue #3566

@jladd-mlnx
Copy link
Member

@yosefe Please review and bless.

@jladd-mlnx jladd-mlnx merged commit 02c288c into open-mpi:master May 24, 2017
@jladd-mlnx
Copy link
Member

@gpaulsen @markalle Please back-port to 2.0.x, 2.1.x, and PR to 3.0.

@markalle
Copy link
Contributor Author

I'm tempted to propose that a utility function that identifies is_contiguous(count, dt) should have a couple extra arguments so it could return "yes it's contiguous, and here's the offset and length of the contiguous region represented by that count and dt."

I think the answer is always a fairly simple true_lb and count*true_extent, so I guess the utility function doesn't really need that feature, but it could be a convenience.

@bosilca
Copy link
Member

bosilca commented May 24, 2017

opal_datatype_span is your friend.

@markalle
Copy link
Contributor Author

Thanks, I agree, that's exactly the right way to compute the byte range for (count,dt) pairs. I just submitted another bug and pull request that I switched to use opal_datatype_span() if you want to review the use of opal_datatype_span() (in osc rdma):
#3576
#3577

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants