## Description
Discussed initially in this other issue.
## Design of syntax
Suggested syntax for input:
```python
# using a tuple...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=('mean', 'std', 'min', 'max'))
# using a list...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs=['mean', 'std', 'min', 'max'])
# using a comma/semicolon- and/or whitespace-delimited string...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')
```
For output, you could argue that a dict, class, or namedtuple would be a safer solution, as the user is less likely to mix up the order. A `namedtuple` is probably a pretty good solution because it will naturally unpack if the user wants it to; otherwise the user can treat it like a class/dict. Incidentally, the `field_names` arg for `namedtuple` supports basically the same set of input syntaxes described above for `aggregate`. I guess you would need to dynamically generate the `namedtuple` classes based on the requested `funcs` list, and then store the class definitions in a little cache - but that's easy enough to do.
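A minimal sketch of that cache, assuming a helper named `_make_result_class` (hypothetical, not an existing API), which expects `funcs` already parsed into a list of names:

```python
from collections import namedtuple

_result_class_cache = {}

def _make_result_class(funcs):
    # Generate the namedtuple class on first use, then reuse it for any
    # later call requesting the same combination of funcs.
    key = tuple(funcs)
    if key not in _result_class_cache:
        # namedtuple's field_names accepts a sequence of names, so the
        # parsed funcs list can be passed straight through.
        _result_class_cache[key] = namedtuple('AggregateResult', key)
    return _result_class_cache[key]
```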
If the output is a namedtuple, the following is possible:

```python
# direct unpacking...
a_mean, a_std, a_min, a_max = aggregate(group_idx, a, funcs='mean std min max')

# using as an object...
a_agg = aggregate(group_idx, a, funcs='mean std min max')
plt.errorbar(x, a_agg.mean, yerr=a_agg.std)  # or whatever
```
As previously discussed, the advantage of accepting multiple functions is that the aggregation machinery will then have scope for a variety of optimizations (though it could do the simplest thing and loop over each function in turn, or simply raise `NotImplementedError`). Any alternative syntax (e.g. wrapping the aggregation into an object and then providing methods on that object, in the way that `pandas` does) is likely to require some degree of explicit caching and a bunch of complex logic to deal with it, whereas this syntax should keep any logic reasonably straightforward and permit optimal use of memory.
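For reference, the no-fusion fallback could be as simple as this sketch (reusing the hypothetical `_make_result_class` helper from above):

```python
def aggregate_multi(group_idx, a, funcs, **kwargs):
    # Accept a tuple/list of names, or a comma/semicolon/whitespace-
    # delimited string, as described above.
    if isinstance(funcs, str):
        funcs = funcs.replace(',', ' ').replace(';', ' ').split()
    result_class = _make_result_class(funcs)
    # Simplest thing possible: one full pass over the data per function.
    return result_class(*(aggregate(group_idx, a, func=f, **kwargs)
                          for f in funcs))
```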
## Some suggestions on possible optimizations
In JIT/C implementations there is the option of squashing everything into a single pass over the array, which hopefully would offer very nice speed-ups. To get the most out of the cache it would probably make sense to arrange each of the multiple outputs for a given group contiguously in memory, e.g. for `min max`, the output would be `[min_0 max_0 min_1 max_1 ... min_n max_n]`. Whether or not the output is produced in this manner will not actually be visible to the user, as they will only get views onto it, which can be provided with basic numpy indexing: `return custom_tuple(min=combined[:, 0], max=combined[:, 1])`. I think that will be safe, because the views into the `combined` array do not overlap, so the user could never accidentally overwrite one variable and unexpectedly affect another. One thing to note, though, is that different outputs may need different `dtype`s - it is still possible to achieve this with numpy ndarrays, but it's an extra implementation detail to get right.
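A small sketch of that shared-buffer layout (illustrative only; `MinMax` here plays the role of `custom_tuple` above):

```python
import numpy as np
from collections import namedtuple

n_groups = 5
# One row per group: in C order the flat memory layout is the interleaved
# [min_0 max_0 min_1 max_1 ...] described above.
combined = np.empty((n_groups, 2))
# ... a fused single-pass kernel would fill `combined` here ...

MinMax = namedtuple('MinMax', ('min', 'max'))
# Non-overlapping views onto the shared buffer; no copies are made.
result = MinMax(min=combined[:, 0], max=combined[:, 1])
assert result.min.base is combined
```

Mixed `dtype`s would complicate this single-buffer picture; one option is a separate buffer per dtype, which is the extra implementation detail mentioned above.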
In the `numpy` implementation there are a few functions which go nicely together: `sum mean var std` all start with a `sum`, so that can be reused etc.; `min max` can both be run using a single `argsort(a)` combined with the first/last trick; `any all` both need to work on `a.astype(bool)`; and `allnan anynan` both need to work on `isnan(a)`.
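For example, here is a rough sketch (not the library's actual code) of how one sort can serve both `min` and `max` via the first/last trick:

```python
import numpy as np

def grouped_min_max(group_idx, a, n_groups):
    # Sort by group first, value second: one sort shared by both funcs.
    order = np.lexsort((a, group_idx))
    g, v = group_idx[order], a[order]
    # Within each group's sorted run, the first element is the min...
    first = np.concatenate(([True], g[1:] != g[:-1]))
    # ...and the last element is the max.
    last = np.concatenate((g[1:] != g[:-1], [True]))
    mins = np.full(n_groups, np.nan)
    maxs = np.full(n_groups, np.nan)
    mins[g[first]] = v[first]
    maxs[g[last]] = v[last]
    return mins, maxs

# grouped_min_max(np.array([0, 1, 0, 1]), np.array([3., 1., 2., 5.]), 2)
# -> (array([2., 1.]), array([3., 5.]))
```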
Of course, if `nan-` versions are used, then that overhead only needs to be paid once for multiple functions. The same is true of `multi_ravel_index`, and potentially all bounds-checking only needs to be done once.
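A sketch of paying the `nan`-handling cost once (the helper name `_strip_nans` is illustrative):

```python
import numpy as np

def _strip_nans(group_idx, a):
    # isnan is evaluated a single time, then every requested nan- variant
    # (nanmean, nanstd, nanmin, ...) operates on the filtered arrays.
    good = ~np.isnan(a)
    return group_idx[good], a[good]
```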
edit: and if custom functions are involved, then it's obviously easier to loop over each of them for a given array than to redo all the sorting etc. for each function - although admittedly that has to be a pretty rare usage case, and the custom function could have dealt with it internally anyway. I guess this is also relevant to the pure-python implementation, since there everything is done in this manner.