Skip to content

Cleanup the temporary memory allocation in collectives #1091

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 4, 2015
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
124 changes: 124 additions & 0 deletions ompi/mca/coll/base/README.memory_management
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
/* This comment applies to all collectives (including the basic
* module) where we allocate a temporary buffer. For the next few
* lines of code, it's tremendously complicated how we decided that
* this was the Right Thing to do. Sit back and enjoy. And prepare
* to have your mind warped. :-)
*
* Recall some definitions (I always get these backwards, so I'm
* going to put them here):
*
* extent: the length from the lower bound to the upper bound -- may
* be considerably larger than the buffer required to hold the data
* (or smaller! But it's easiest to think about when it's larger).
*
* true extent: the exact number of bytes required to hold the data
* in the layout pattern in the datatype.
*
* For example, consider the following buffer (just talking about
* true_lb, extent, and true extent -- extrapolate for true_ub:
*
* A B C
* --------------------------------------------------------
* | | |
* --------------------------------------------------------
*
* There are multiple cases:
*
* 1. A is what we give to MPI_Send (and friends), and A is where
* the data starts, and C is where the data ends. In this case:
*
* - extent: C-A
* - true extent: C-A
* - true_lb: 0
*
* A C
* --------------------------------------------------------
* | |
* --------------------------------------------------------
* <=======================extent=========================>
* <======================true extent=====================>
*
* 2. A is what we give to MPI_Send (and friends), B is where the
* data starts, and C is where the data ends. In this case:
*
* - extent: C-A
* - true extent: C-B
* - true_lb: positive
*
* A B C
* --------------------------------------------------------
* | | User buffer |
* --------------------------------------------------------
* <=======================extent=========================>
* <===============true extent=============>
*
* 3. B is what we give to MPI_Send (and friends), A is where the
* data starts, and C is where the data ends. In this case:
*
* - extent: C-A
* - true extent: C-A
* - true_lb: negative
*
* A B C
* --------------------------------------------------------
* | | User buffer |
* --------------------------------------------------------
* <=======================extent=========================>
* <======================true extent=====================>
*
* 4. MPI_BOTTOM is what we give to MPI_Send (and friends), B is
* where the data starts, and C is where the data ends. In this
* case:
*
* - extent: C-MPI_BOTTOM
* - true extent: C-B
* - true_lb: [potentially very large] positive
*
* MPI_BOTTOM B C
* --------------------------------------------------------
* | | User buffer |
* --------------------------------------------------------
* <=======================extent=========================>
* <===============true extent=============>
*
* So in all cases, for a temporary buffer, all we need to malloc()
* is a buffer of size true_extent. We therefore need to know two
* pointer values: what value to give to MPI_Send (and friends) and
* what value to give to free(), because they might not be the same.
*
* Clearly, what we give to free() is exactly what was returned from
* malloc(). That part is easy. :-)
*
* What we give to MPI_Send (and friends) is a bit more complicated.
* Let's take the 4 cases from above:
*
* 1. If A is what we give to MPI_Send and A is where the data
* starts, then clearly we give to MPI_Send what we got back from
* malloc().
*
* 2. If B is what we get back from malloc, but we give A to
* MPI_Send, then the buffer range [A,B) represents "dead space"
* -- no data will be put there. So it's safe to give B-true_lb to
* MPI_Send. More specifically, the true_lb is positive, so B-true_lb is
* actually A.
*
* 3. If A is what we get back from malloc, and B is what we give to
* MPI_Send, then the true_lb is negative, so A-true_lb will actually equal
* B.
*
* 4. Although this seems like the weirdest case, it's actually
* quite similar to case #2 -- the pointer we give to MPI_Send is
* smaller than the pointer we got back from malloc().
*
* Hence, in all cases, we give (return_from_malloc - true_lb) to MPI_Send.
*
* This works fine and dandy if we only have (count==1), which we
* rarely do. ;-) So we really need to allocate (true_extent +
* ((count - 1) * extent)) to get enough space for the rest. This may
* be more than is necessary, but it's ok.
*
* Simple, no? :-)
*
*/


11 changes: 4 additions & 7 deletions ompi/mca/coll/base/coll_base_allgather.c
Original file line number Diff line number Diff line change
Expand Up @@ -167,19 +167,16 @@ int ompi_coll_base_allgather_intra_bruck(const void *sbuf, int scount,
- copy blocks from shift buffer starting at block [rank] in rbuf.
*/
if (0 != rank) {
ptrdiff_t true_extent, true_lb;
char *free_buf = NULL, *shift_buf = NULL;
ptrdiff_t span, gap;

err = ompi_datatype_get_true_extent(rdtype, &true_lb, &true_extent);
if (MPI_SUCCESS != err) { line = __LINE__; goto err_hndl; }
span = opal_datatype_span(&rdtype->super, (size - rank) * rcount, &gap);

free_buf = (char*) calloc(((true_extent +
((ptrdiff_t)(size - rank) * (ptrdiff_t)rcount - 1) * rext)),
sizeof(char));
free_buf = (char*)calloc(span, sizeof(char));
if (NULL == free_buf) {
line = __LINE__; err = OMPI_ERR_OUT_OF_RESOURCE; goto err_hndl;
}
shift_buf = free_buf - true_lb;
shift_buf = free_buf - gap;

/* 1. copy blocks [0 .. (size - rank - 1)] from rbuf to shift buffer */
err = ompi_datatype_copy_content_same_ddt(rdtype, ((ptrdiff_t)(size - rank) * (ptrdiff_t)rcount),
Expand Down
25 changes: 10 additions & 15 deletions ompi/mca/coll/base/coll_base_allreduce.c
Original file line number Diff line number Diff line change
Expand Up @@ -135,8 +135,8 @@ ompi_coll_base_allreduce_intra_recursivedoubling(const void *sbuf, void *rbuf,
int ret, line, rank, size, adjsize, remote, distance;
int newrank, newremote, extra_ranks;
char *tmpsend = NULL, *tmprecv = NULL, *tmpswap = NULL, *inplacebuf = NULL;
ptrdiff_t true_lb, true_extent, lb, extent;
ompi_request_t *reqs[2] = {NULL, NULL};
OPAL_PTRDIFF_TYPE span, gap;

size = ompi_comm_size(comm);
rank = ompi_comm_rank(comm);
Expand All @@ -154,12 +154,8 @@ ompi_coll_base_allreduce_intra_recursivedoubling(const void *sbuf, void *rbuf,
}

/* Allocate and initialize temporary send buffer */
ret = ompi_datatype_get_extent(dtype, &lb, &extent);
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
ret = ompi_datatype_get_true_extent(dtype, &true_lb, &true_extent);
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }

inplacebuf = (char*) malloc(true_extent + (ptrdiff_t)(count - 1) * extent);
span = opal_datatype_span(&dtype->super, count, &gap);
inplacebuf = (char*) malloc(span);
if (NULL == inplacebuf) { ret = -1; line = __LINE__; goto error_hndl; }

if (MPI_IN_PLACE == sbuf) {
Expand Down Expand Up @@ -629,9 +625,9 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
int segcount, max_segcount, num_phases, phase, block_count, inbi;
size_t typelng;
char *tmpsend = NULL, *tmprecv = NULL, *inbuf[2] = {NULL, NULL};
ptrdiff_t true_lb, true_extent, lb, extent;
ptrdiff_t block_offset, max_real_segsize;
ompi_request_t *reqs[2] = {NULL, NULL};
OPAL_PTRDIFF_TYPE lb, extent, gap;

size = ompi_comm_size(comm);
rank = ompi_comm_rank(comm);
Expand All @@ -649,10 +645,6 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
}

/* Determine segment count based on the suggested segment size */
ret = ompi_datatype_get_extent(dtype, &lb, &extent);
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
ret = ompi_datatype_get_true_extent(dtype, &true_lb, &true_extent);
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
ret = ompi_datatype_type_size( dtype, &typelng);
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
segcount = count;
Expand Down Expand Up @@ -685,7 +677,10 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
early_blockcount, late_blockcount );
COLL_BASE_COMPUTE_BLOCKCOUNT( early_blockcount, num_phases, inbi,
max_segcount, k);
max_real_segsize = true_extent + (ptrdiff_t)(max_segcount - 1) * extent;

ret = ompi_datatype_get_extent(dtype, &lb, &extent);
if (MPI_SUCCESS != ret) { line = __LINE__; goto error_hndl; }
max_real_segsize = opal_datatype_span(&dtype->super, max_segcount, &gap);

/* Allocate and initialize temporary buffers */
inbuf[0] = (char*)malloc(max_real_segsize);
Expand Down Expand Up @@ -740,8 +735,8 @@ ompi_coll_base_allreduce_intra_ring_segmented(const void *sbuf, void *rbuf, int
block_count = ((rank < split_rank)? early_blockcount : late_blockcount);
COLL_BASE_COMPUTE_BLOCKCOUNT(block_count, num_phases, split_phase,
early_phase_segcount, late_phase_segcount)
phase_count = ((phase < split_phase)?
(early_phase_segcount) : (late_phase_segcount));
phase_count = ((phase < split_phase)?
(early_phase_segcount) : (late_phase_segcount));
phase_offset = ((phase < split_phase)?
((ptrdiff_t)phase * (ptrdiff_t)early_phase_segcount) :
((ptrdiff_t)phase * (ptrdiff_t)late_phase_segcount + split_phase));
Expand Down
20 changes: 9 additions & 11 deletions ompi/mca/coll/base/coll_base_alltoall.c
Original file line number Diff line number Diff line change
Expand Up @@ -43,10 +43,10 @@ mca_coll_base_alltoall_intra_basic_inplace(const void *rbuf, int rcount,
{
mca_coll_base_module_t *base_module = (mca_coll_base_module_t*) module;
int i, j, size, rank, err = MPI_SUCCESS, line;
OPAL_PTRDIFF_TYPE ext, gap;
MPI_Request *preq;
char *tmp_buffer;
size_t max_size;
ptrdiff_t ext, true_lb, true_ext;

/* Initialize. */

Expand All @@ -60,14 +60,14 @@ mca_coll_base_alltoall_intra_basic_inplace(const void *rbuf, int rcount,

/* Find the largest receive amount */
ompi_datatype_type_extent (rdtype, &ext);
ompi_datatype_get_true_extent ( rdtype, &true_lb, &true_ext);
max_size = true_ext + ext * (rcount-1);
max_size = opal_datatype_span(&rdtype->super, rcount, &gap);

/* Allocate a temporary buffer */
tmp_buffer = calloc (max_size, 1);
if (NULL == tmp_buffer) {
return OMPI_ERR_OUT_OF_RESOURCE;
}
tmp_buffer -= gap;
max_size = ext * rcount;

/* in-place alltoall slow algorithm (but works) */
Expand Down Expand Up @@ -199,7 +199,7 @@ int ompi_coll_base_alltoall_intra_bruck(const void *sbuf, int scount,
int i, k, line = -1, rank, size, err = 0;
int sendto, recvfrom, distance, *displs = NULL, *blen = NULL;
char *tmpbuf = NULL, *tmpbuf_free = NULL;
ptrdiff_t rlb, slb, tlb, sext, rext, tsext;
OPAL_PTRDIFF_TYPE sext, rext, span, gap;
struct ompi_datatype_t *new_ddt;

if (MPI_IN_PLACE == sbuf) {
Expand All @@ -213,25 +213,23 @@ int ompi_coll_base_alltoall_intra_bruck(const void *sbuf, int scount,
OPAL_OUTPUT((ompi_coll_base_framework.framework_output,
"coll:base:alltoall_intra_bruck rank %d", rank));

err = ompi_datatype_get_extent (sdtype, &slb, &sext);
err = ompi_datatype_type_extent (sdtype, &sext);
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }

err = ompi_datatype_get_true_extent(sdtype, &tlb, &tsext);
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }

err = ompi_datatype_get_extent (rdtype, &rlb, &rext);
err = ompi_datatype_type_extent (rdtype, &rext);
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }

span = opal_datatype_span(&sdtype->super, size * scount, &gap);

displs = (int *) malloc(size * sizeof(int));
if (displs == NULL) { line = __LINE__; err = -1; goto err_hndl; }
blen = (int *) malloc(size * sizeof(int));
if (blen == NULL) { line = __LINE__; err = -1; goto err_hndl; }

/* tmp buffer allocation for message data */
tmpbuf_free = (char *) malloc(tsext + ((ptrdiff_t)scount * (ptrdiff_t)size - 1) * sext);
tmpbuf_free = (char *)malloc(span);
if (tmpbuf_free == NULL) { line = __LINE__; err = -1; goto err_hndl; }
tmpbuf = tmpbuf_free - slb;
tmpbuf = tmpbuf_free - gap;

/* Step 1 - local rotation - shift up by rank */
err = ompi_datatype_copy_content_same_ddt (sdtype,
Expand Down
13 changes: 7 additions & 6 deletions ompi/mca/coll/base/coll_base_alltoallv.c
Original file line number Diff line number Diff line change
Expand Up @@ -38,16 +38,16 @@

int
mca_coll_base_alltoallv_intra_basic_inplace(const void *rbuf, const int *rcounts, const int *rdisps,
struct ompi_datatype_t *rdtype,
struct ompi_communicator_t *comm,
mca_coll_base_module_t *module)
struct ompi_datatype_t *rdtype,
struct ompi_communicator_t *comm,
mca_coll_base_module_t *module)
{
mca_coll_base_module_t *base_module = (mca_coll_base_module_t*) module;
int i, j, size, rank, err=MPI_SUCCESS;
MPI_Request *preq;
char *tmp_buffer;
size_t max_size, rdtype_size;
ptrdiff_t ext;
OPAL_PTRDIFF_TYPE ext, gap;

/* Initialize. */

Expand All @@ -63,16 +63,17 @@ mca_coll_base_alltoallv_intra_basic_inplace(const void *rbuf, const int *rcounts
/* Find the largest receive amount */
ompi_datatype_type_extent (rdtype, &ext);
for (i = 0, max_size = 0 ; i < size ; ++i) {
size_t size = ext * rcounts[i];

size_t size = opal_datatype_span(&rdtype->super, rcounts[i], &gap);
max_size = size > max_size ? size : max_size;
}
/* The gap will always be the same as we are working on the same datatype */

/* Allocate a temporary buffer */
tmp_buffer = calloc (max_size, 1);
if (NULL == tmp_buffer) {
return OMPI_ERR_OUT_OF_RESOURCE;
}
tmp_buffer += gap;

/* in-place alltoallv slow algorithm (but works) */
for (i = 0 ; i < size ; ++i) {
Expand Down
20 changes: 10 additions & 10 deletions ompi/mca/coll/base/coll_base_gather.c
Original file line number Diff line number Diff line change
Expand Up @@ -49,8 +49,8 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
char *ptmp = NULL, *tempbuf = NULL;
ompi_coll_tree_t* bmtree;
MPI_Status status;
MPI_Aint sextent, slb, strue_lb, strue_extent;
MPI_Aint rextent, rlb, rtrue_lb, rtrue_extent;
MPI_Aint sextent, sgap, ssize;
MPI_Aint rextent, rgap, rsize;
mca_coll_base_module_t *base_module = (mca_coll_base_module_t*) module;
mca_coll_base_comm_t *data = base_module->base_data;

Expand All @@ -64,14 +64,14 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
COLL_BASE_UPDATE_IN_ORDER_BMTREE( comm, base_module, root );
bmtree = data->cached_in_order_bmtree;

ompi_datatype_get_extent(sdtype, &slb, &sextent);
ompi_datatype_get_true_extent(sdtype, &strue_lb, &strue_extent);
ompi_datatype_type_extent(sdtype, &sextent);
ompi_datatype_type_extent(rdtype, &rextent);
ssize = opal_datatype_span(&sdtype->super, scount * size, &sgap);
rsize = opal_datatype_span(&rdtype->super, rcount * size, &rgap);

vrank = (rank - root + size) % size;

if (rank == root) {
ompi_datatype_get_extent(rdtype, &rlb, &rextent);
ompi_datatype_get_true_extent(rdtype, &rtrue_lb, &rtrue_extent);
if (0 == root){
/* root on 0, just use the recv buffer */
ptmp = (char *) rbuf;
Expand All @@ -83,12 +83,12 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
} else {
/* root is not on 0, allocate temp buffer for recv,
* rotate data at the end */
tempbuf = (char *) malloc(rtrue_extent + ((ptrdiff_t)rcount * (ptrdiff_t)size - 1) * rextent);
tempbuf = (char *) malloc(rsize);
if (NULL == tempbuf) {
err= OMPI_ERR_OUT_OF_RESOURCE; line = __LINE__; goto err_hndl;
}

ptmp = tempbuf - rtrue_lb;
ptmp = tempbuf - rgap;
if (sbuf != MPI_IN_PLACE) {
/* copy from sbuf to temp buffer */
err = ompi_datatype_sndrcv((void *)sbuf, scount, sdtype,
Expand All @@ -106,12 +106,12 @@ ompi_coll_base_gather_intra_binomial(const void *sbuf, int scount,
/* other non-leaf nodes, allocate temp buffer for data received from
* children, the most we need is half of the total data elements due
* to the property of binimoal tree */
tempbuf = (char *) malloc(strue_extent + ((ptrdiff_t)scount * (ptrdiff_t)size - 1) * sextent);
tempbuf = (char *) malloc(ssize);
if (NULL == tempbuf) {
err= OMPI_ERR_OUT_OF_RESOURCE; line = __LINE__; goto err_hndl;
}

ptmp = tempbuf - strue_lb;
ptmp = tempbuf - sgap;
/* local copy to tempbuf */
err = ompi_datatype_sndrcv((void *)sbuf, scount, sdtype,
ptmp, scount, sdtype);
Expand Down
Loading