## Description
Discussed initially in this other issue.
## Design of syntax
Suggested syntax for input:
```python
# using a tuple...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=('mean', 'std', 'min', 'max'))
# using a list...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=['mean', 'std', 'min', 'max'])
# using a comma/semicolon- and/or whitespace-delimited string...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')
```
For output, you could argue that a dict, class, or namedtuple would be a safer solution, as the user is less likely to mix up the order. A `namedtuple` is probably a pretty good solution because it will naturally unpack if the user wants it to; otherwise the user can treat it like a class/dict. Incidentally, the `field_names` arg for `namedtuple` supports basically the same set of input syntaxes described above for `aggregate`. I guess you would need to dynamically generate the `namedtuple` classes based on the requested `funcs` list, and then store the class definitions in a little cache - but that's easy enough to do.
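A minimal sketch of that cache, assuming a helper named `_make_result_class` (hypothetical, not an existing API), which expects `funcs` already parsed into a list of names:

```python
from collections import namedtuple

_result_class_cache = {}

def _make_result_class(funcs):
    # Generate the namedtuple class on first use, then reuse it for any
    # later call requesting the same combination of funcs.
    key = tuple(funcs)
    if key not in _result_class_cache:
        # namedtuple's field_names accepts a sequence of names, so the
        # parsed funcs list can be passed straight through.
        _result_class_cache[key] = namedtuple('AggregateResult', key)
    return _result_class_cache[key]
```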
If the output is a namedtuple, the following is possible:

```python
# direct unpacking...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')

# using as an object...
a_agg = aggregate(group_idx, a, funcs='mean std min max')
plt.errorbar(x, a_agg.mean, yerr=a_agg.std)  # or whatever
```
As previously discussed, the advantage of accepting multiple functions is that the aggregation machinery will then have scope for a variety of optimizations (though it could do the simplest thing and loop over each function in turn, or simply raise `NotImplementedError`). Any alternative syntax (e.g. wrapping the aggregation into an object and then providing methods on that object, in the way that `pandas` does) is likely to require some degree of explicit caching and a bunch of complex logic to deal with it, whereas this syntax should keep any logic reasonably straightforward and permit optimal use of memory.
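For reference, the no-fusion fallback could be as simple as this sketch (reusing the hypothetical `_make_result_class` helper from above):

```python
def aggregate_multi(group_idx, a, funcs, **kwargs):
    # Accept a tuple/list of names, or a comma/semicolon/whitespace-
    # delimited string, as described above.
    if isinstance(funcs, str):
        funcs = funcs.replace(',', ' ').replace(';', ' ').split()
    result_class = _make_result_class(funcs)
    # Simplest thing possible: one full pass over the data per function.
    return result_class(*(aggregate(group_idx, a, func=f, **kwargs)
                          for f in funcs))
```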
## Some suggestions on possible optimizations
In JIT/C implementations there is the option of squashing everything into a single pass over the array, which hopefully would offer very nice speed-ups. To get the most out of the cache it would probably make sense to arrange each of the multiple outputs for a given group contiguously in memory, e.g. for `min max`, the output would be `[min_0 max_0 min_1 max_1 ... min_n max_n]`. Whether or not the output is produced in this manner will not actually be visible to the user, as they will only get views onto it, which can be provided with basic numpy indexing: `return custom_tuple(min=combined[:, 0], max=combined[:, 1])`. I think that will be safe, because the views into the `combined` array do not overlap, so the user could never accidentally overwrite one variable and unexpectedly affect another. One thing to note, though, is that different outputs may need different `dtype`s - it is still possible to achieve this with numpy ndarrays, but it's an extra implementation detail to get right.
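A small sketch of that shared-buffer layout (illustrative only; `MinMax` here plays the role of `custom_tuple` above):

```python
import numpy as np
from collections import namedtuple

n_groups = 5
# One row per group: in C order the flat memory layout is the interleaved
# [min_0 max_0 min_1 max_1 ...] described above.
combined = np.empty((n_groups, 2))
# ... a fused single-pass kernel would fill `combined` here ...

MinMax = namedtuple('MinMax', ('min', 'max'))
# Non-overlapping views onto the shared buffer; no copies are made.
result = MinMax(min=combined[:, 0], max=combined[:, 1])
assert result.min.base is combined
```

Mixed `dtype`s would complicate this single-buffer picture; one option is a separate buffer per dtype, which is the extra implementation detail mentioned above.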
In the `numpy` implementation there are a few functions which go nicely together: `sum mean var std` all start with a `sum`, so that can be reused etc.; `min max` can both be run using a single `argsort(a)` combined with the first/last trick; `any all` both need to work on `a.astype(bool)`; and `allnan anynan` both need to work on `isnan(a)`.
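For example, here is a rough sketch (not the library's actual code) of how one sort can serve both `min` and `max` via the first/last trick:

```python
import numpy as np

def grouped_min_max(group_idx, a, n_groups):
    # Sort by group first, value second: one sort shared by both funcs.
    order = np.lexsort((a, group_idx))
    g, v = group_idx[order], a[order]
    # Within each group's sorted run, the first element is the min...
    first = np.concatenate(([True], g[1:] != g[:-1]))
    # ...and the last element is the max.
    last = np.concatenate((g[1:] != g[:-1], [True]))
    mins = np.full(n_groups, np.nan)
    maxs = np.full(n_groups, np.nan)
    mins[g[first]] = v[first]
    maxs[g[last]] = v[last]
    return mins, maxs

# grouped_min_max(np.array([0, 1, 0, 1]), np.array([3., 1., 2., 5.]), 2)
# -> (array([2., 1.]), array([3., 5.]))
```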
Of course, if `nan-` versions are used, then that overhead only needs to be paid once for multiple functions. The same is true of `multi_ravel_index`, and potentially all bounds-checking only needs to be done once.
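A sketch of paying the `nan`-handling cost once (the helper name `_strip_nans` is illustrative):

```python
import numpy as np

def _strip_nans(group_idx, a):
    # isnan is evaluated a single time, then every requested nan- variant
    # (nanmean, nanstd, nanmin, ...) operates on the filtered arrays.
    good = ~np.isnan(a)
    return group_idx[good], a[good]
```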
edit: and if custom functions are involved, then it's obviously easier to loop over each of them for a given array than to redo all the sorting etc. for each function - although admittedly that has to be a pretty rare usage case, and the custom function could have dealt with it internally anyway. I guess this is also relevant to the pure-python implementation, since there everything is done in this manner.