Skip to content

oshmem yoda spml call to add_procs seems wrong #2023

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jsquyres opened this issue Aug 27, 2016 · 10 comments
Closed

oshmem yoda spml call to add_procs seems wrong #2023

jsquyres opened this issue Aug 27, 2016 · 10 comments
Assignees
Milestone

Comments

@jsquyres
Copy link
Member

When compiling master and v2.x:

spml_yoda.c: In function 'mca_spml_yoda_add_procs':
spml_yoda.c:645:33: warning: passing argument 1 of 'mca_pml.pml_add_procs' from incompatible pointer type [-Wincompatible-pointer-types]
     rc = MCA_PML_CALL(add_procs(procs, nprocs));
                                 ^
../../../../ompi/mca/pml/pml.h:550:41: note: in definition of macro 'MCA_PML_CALL'
 #define MCA_PML_CALL(a) mca_pml.pml_ ## a
                                         ^
spml_yoda.c:645:33: note: expected 'struct ompi_proc_t **' but argument is of type 'oshmem_proc_t ** {aka struct oshmem_proc_t **}'
     rc = MCA_PML_CALL(add_procs(procs, nprocs));
                                 ^
../../../../ompi/mca/pml/pml.h:550:41: note: in definition of macro 'MCA_PML_CALL'
 #define MCA_PML_CALL(a) mca_pml.pml_ ## a
                                         ^

I note that pml add_procs takes an (ompi_proc_t **), but https://github.com/open-mpi/ompi/blob/master/oshmem/mca/spml/yoda/spml_yoda.c#L645 is calling with an (oshmem_proc_t **).

This was added just 10 days ago by @ggouaillardet in 6b7bc64.

However, oshmem_proc_t is defined as (https://github.com/open-mpi/ompi/blob/master/oshmem/proc/proc.h#L44-L58):

struct oshmem_proc_t {
    opal_proc_t                    super;
    /* endpoint data */
    void *proc_endpoints[OMPI_PROC_ENDPOINT_TAG_MAX];
    /*
     * All transport channels are globally ordered.
     * pe(s) can talk to each other via subset of transports
     * these holds indexes of each transport into global array
     *  proc -> id, where id can be btl id in yoda or mxm ptl id
     *  in ikrit
     *  spml is supposed to fill this during add_procs()
     **/
    int                             num_transports;
    char                           *transport_ids;
};

which does not contain an (ompi_proc_t), thereby making the (implicit) cast from (oshmem_proc_t**) to (ompi_proc_t**) incorrect.

@ggouaillardet Can you please have a look?

@jsquyres
Copy link
Member Author

Actually, looking at ompi/proc/proc.h, I guess (ompi_proc_t) is compatible with (oshmem_proc_t), so I guess the implicit cast is functional. Ugh -- this is strange and unexpected (i.e., it doesn't fit in with the rest of the OMPI code base convention).

Can we do something a a little less unexpected with the (oshmem_proc_t) type -- e.g., make it have its first member be ompi_proc_t? I realize something would need to be done with the padding in the ompi_proc_t (since oshmem is using it differently).

@jsquyres jsquyres modified the milestones: v2.0.2, v2.0.1 Aug 27, 2016
jsquyres added a commit to jsquyres/ompi that referenced this issue Aug 27, 2016
This is really only a partial fix.  Per
open-mpi#2023, the (oshmem_proc_t) and
(ompi_proc_t) types should be unified properly (e.g., oshmem_proc_t
should inherit an ompi_proc_t, or something like that).  The fact that
this code works by casting between two unrelated types is fragile and
susceptible to break in the future.

Signed-off-by: Jeff Squyres <[email protected]>
@ggouaillardet
Copy link
Contributor

I was typing when you posted your last comment...

I can easily silence the warning.

struct ompi_proc_t {
    opal_proc_t                     super;

    /* endpoint data */
    void *proc_endpoints[OMPI_PROC_ENDPOINT_TAG_MAX];

    char padding[16];         /* for future extensions (OSHMEM uses this area also)*/
};

and

struct oshmem_proc_t {
    opal_proc_t                    super;
    /* endpoint data */
    void *proc_endpoints[OMPI_PROC_ENDPOINT_TAG_MAX];
    /*
     * All transport channels are globally ordered.
     * pe(s) can talk to each other via subset of transports
     * these holds indexes of each transport into global array
     *  proc -> id, where id can be btl id in yoda or mxm ptl id
     *  in ikrit
     *  spml is supposed to fill this during add_procs()
     **/
    int                             num_transports;
    char                           *transport_ids;
};

the only difference is ompi has padding whereas oshmem has real data.

bottom line, if an ompi_proc_t is expected, it is fine to pass an oshmem_proc_t,
but the compiler cannot figure it out by itself.

I have several options on top of my head

  1. type casting to silence the warning
  2. make oshmem_proc_t an union that includes ompi_proc_t
  3. get rid of oshmem_proc_t and replace the padding of ompi_proc_t with the data used by oshmem (this is a kind of abstraction violation, and only fly if there is no more project xxx that defines xxx_proc_t)
  4. get rid of oshmem_proc_t and update oshmem so it accesses num_transports and transport_ids via accessors (I am thinking macros to keep it simple)

please share your thoughts and let me know if this should be added to the agenda of the next weekly meeting.

@jsquyres
Copy link
Member Author

I added the cast in #2024.

@ggouaillardet Maybe another option is to have a common / shared type that both ompi_proc_t and oshmem_proc_t use as their first member, and then also have an assert() somewhere to ensure that the sizeof(oshmem_proc_t) <= sizeof(ompi_proc_t) (i.e., so that oshmem's extra data doesn't exceed the padding in the ompi_proc_t)...?

@ggouaillardet
Copy link
Contributor

btw, oshmem_proc_t should likely end with a 8 bytes padding if sizeof(char *) == 4 (assuming the compiler adds a 4 bytes padding between the int and the char* in 64 bits arch)
a safer option would be to swap the two elements and pad 4 or 8 bytes

@jsquyres
Copy link
Member Author

@ggouaillardet Good point; I don't know if there's any code that assumes that the sizeof the two types are the same. It would probably be best to ensure that they are the same (e.g., perhaps an assert() should be added to ensure that sizeof(oshmem_proc_t) == sizeof(ompi_proc_t) -- not <=, as I proposed above).

@jsquyres
Copy link
Member Author

The more I think about it, it probably is important to guarantee that the sizes of the two types are identical. Otherwise, there wouldn't be any padding at the end of (ompi_proc_t), right?

@ggouaillardet
Copy link
Contributor

that makes sense to me.
I like option 4. best, since it does not require any modifications in ompi
(though oshmem changes might be quite important) and it is future-proof
(assuming we put the right asserts at the right places)

@jsquyres
Copy link
Member Author

I'm not familiar enough with the OSHMEM code -- is it easy to make a PR for option 4?

@ggouaillardet
Copy link
Contributor

I will give it a try this week.
that should not be hard, but that could be a bit tedious ...

@jsquyres
Copy link
Member Author

@ggouaillardet Excellent; thanks.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Aug 29, 2016
and use these macros to access oshmem related per proc data :
 - OSHMEM_PROC_NUM_TRANSPORTS(proc)
 - OSHMEM_PROC_TRANSPORT_IDS(proc)

Fixes open-mpi#2023
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Aug 30, 2016
store oshmem related per proc data in a oshmem_proc_data_t, that is
written in the padding section of a ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi#2023
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Aug 30, 2016
store oshmem related per proc data in an oshmem_proc_data_t struct,
that is stored in the padding section of an ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi#2023
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Aug 31, 2016
store oshmem related per proc data in an oshmem_proc_data_t struct,
that is stored in the padding section of an ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi#2023
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Aug 31, 2016
store oshmem related per proc data in an oshmem_proc_data_t struct,
that is stored in the padding section of an ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi#2023
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Sep 1, 2016
store oshmem related per proc data in an oshmem_proc_data_t struct,
that is stored in the padding section of an ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi#2023
ggouaillardet added a commit to ggouaillardet/ompi-release that referenced this issue Sep 1, 2016
store oshmem related per proc data in an oshmem_proc_data_t struct,
that is stored in the padding section of an ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi/ompi#2023

(back-ported from commit open-mpi/ompi@0a25420)
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016
store oshmem related per proc data in an oshmem_proc_data_t struct,
that is stored in the padding section of an ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi#2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants