
specify multiple aggregation functions #3

Closed

@d1manson
Discussed initially in this other issue.

Design of syntax

Suggested syntax for input:

# using tuple...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=('mean', 'std', 'min', 'max'))
# using list...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=['mean', 'std', 'min', 'max'])
# using comma/semicolon and/or white-space-delimited string
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')

For output, you could argue that a dict, class or namedtuple would be a safer solution, as the user is less likely to mix up the order. A namedtuple is probably the best of these: it unpacks naturally if the user wants it to, and otherwise can be treated like a class/dict. Incidentally, the field_names arg for namedtuple supports basically the same set of input syntaxes described above for aggregate. You would need to dynamically generate the namedtuple classes based on the requested funcs list and store the class definitions in a little cache, but that's easy enough to do.
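
As a rough illustration, the cached, dynamically generated namedtuple classes could look something like the sketch below (all names here are illustrative, not part of any existing aggregate API):

from collections import namedtuple

_result_class_cache = {}

def _get_result_class(func_names):
    # func_names is e.g. ('mean', 'std', 'min', 'max'); build the class once and cache it
    key = tuple(func_names)
    if key not in _result_class_cache:
        _result_class_cache[key] = namedtuple('AggregateResult', key)
    return _result_class_cache[key]

res_cls = _get_result_class(('mean', 'std'))
result = res_cls(mean=[1.0, 2.0], std=[0.1, 0.2])
m, s = result          # unpacks like a tuple
print(result.mean)     # or behaves like an object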

If the output is a namedtuple, the following is possible:

# direct unpacking...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')
# using as object...
a_agg = aggregate(group_idx, a, funcs='mean std min max')
plt.errorbar(a_agg.mean, yerr=a_agg.std) # or whatever

As previously discussed, the advantage of accepting multiple functions is that the aggregation machinery then has scope for a variety of optimizations (though it could do the simplest thing and loop over each function in turn, or simply raise NotImplementedError). Any alternative syntax (e.g. wrapping the aggregation into an object and then providing methods on that object in the way that pandas does) is likely to require some degree of explicit caching and a fair amount of complex logic to manage it, whereas this syntax should keep the logic reasonably straightforward and permit optimal use of memory.

Some suggestions on possible optimizations

In JIT-C implementations there is the option of squashing everything into a single pass over the array, which hopefully would offer very nice speed-ups. To get the most out of the cache it would probably make sense to arrange the multiple outputs for a given group contiguously in memory, e.g. for min and max, the output would be [min_0 max_0 min_1 max_1 .. min_n max_n]. Whether or not the output is produced in this manner will not actually be visible to the user, as they will only get views onto it, which can be provided with basic numpy indexing, e.g. return custom_tuple(min=combined[:, 0], max=combined[:, 1]). I think that is safe, because the views into the combined array do not overlap, so the user could never accidentally overwrite one variable and unexpectedly affect another. One thing to note, though, is that different outputs may need different dtypes - it is still possible to achieve this with numpy ndarrays, but it's an extra implementation detail to get right.
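
A minimal numpy-only sketch of the interleaved layout and the non-overlapping views (illustrative only; the single-pass kernel is faked and the MinMax class is assumed, not existing code):

import numpy as np
from collections import namedtuple

MinMax = namedtuple('MinMax', 'min max')

n_groups = 3
combined = np.empty((n_groups, 2))      # one row per group -> [min_0 max_0 min_1 max_1 ...] in memory
combined[:, 0] = [0.1, 0.4, 0.2]        # pretend these came from the single-pass kernel
combined[:, 1] = [0.9, 0.8, 0.7]

result = MinMax(min=combined[:, 0], max=combined[:, 1])   # non-overlapping views, no copies
result.min[0] = -1.0                    # writes through to the combined buffer...
assert combined[0, 0] == -1.0
assert result.max[0] == 0.9             # ...but cannot clobber the other output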

In the numpy implementation there are a few functions which go nicely together: sum, mean, var and std all start with a sum, so that work can be reused; min and max can be run using argsort(a) combined with the first/last trick; any and all both need to work on a.astype(bool); and allnan and anynan both need to work on isnan(a).
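
Something like the following sketch shows how the intermediates could be shared (np.bincount stands in here for whatever the real numpy implementation uses):

import numpy as np

group_idx = np.array([0, 0, 1, 1, 1, 2])
a = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 5.0])

counts = np.bincount(group_idx)
sums = np.bincount(group_idx, weights=a)      # shared by sum, mean, var and std
means = sums / counts                         # mean reuses the sums
sq_sums = np.bincount(group_idx, weights=a * a)
variances = sq_sums / counts - means ** 2     # var reuses counts, sums and means
stds = np.sqrt(variances)                     # std is just the sqrt of var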

Of course, if nan- versions are requested, that overhead only needs to be incurred once for multiple functions. The same is true of multi_ravel_index, and potentially all bounds-checking only needs to be done once as well.
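
For example, paying the isnan cost once for several requested functions might look roughly like this (the masking strategy is just one possible approach, not necessarily what the implementation would do):

import numpy as np

group_idx = np.array([0, 0, 1, 1, 2])
a = np.array([1.0, np.nan, 2.0, 4.0, np.nan])

good = ~np.isnan(a)                              # computed once...
idx_clean, a_clean = group_idx[good], a[good]

counts = np.bincount(idx_clean, minlength=3)     # ...then reused by nansum, nanmean, etc.
nansums = np.bincount(idx_clean, weights=a_clean, minlength=3)
with np.errstate(invalid='ignore'):
    nanmeans = nansums / counts                  # groups with no valid values come out as nan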

edit: and if custom functions are involved, then it's obviously easier to loop over each of them for a given array rather than redoing all the sorting etc. for each function... although admittedly that has got to be a pretty rare usage case, and the custom function could have dealt with it internally anyway... though I guess this is relevant to the pure-python implementation, as everything there is done in this manner.
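
In that spirit, a sort-once-then-loop approach for custom functions could look like the following (purely illustrative; the helper name and return format are made up):

import numpy as np

def aggregate_multi_custom(group_idx, a, funcs):
    order = np.argsort(group_idx, kind='stable')        # sorting/bounds work done once
    sorted_idx, sorted_a = group_idx[order], a[order]
    boundaries = np.flatnonzero(np.diff(sorted_idx)) + 1
    group_slices = np.split(sorted_a, boundaries)
    # each custom function then just loops over the pre-computed group slices
    return {f.__name__: [f(g) for g in group_slices] for f in funcs}

group_idx = np.array([0, 1, 0, 2, 1])
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(aggregate_multi_custom(group_idx, a, [np.median, np.ptp]))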
