Skip to content

Order statistics #461

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

LukeMathWalker
Copy link
Member

@LukeMathWalker LukeMathWalker commented Jun 2, 2018

Motivation

Percentiles are a nice-to-have feature, but they turn out to be extremely useful when dealing with Machine Learning projects (in particular, to support efficient and distribution-agnostic splitting criteria for Decision Trees operating on continuous values, paper - Sec 3.3)

Overview

I have added an array method called percentile_axis_mut that shuffles in-place each 1-dimensional lane along the specified axis and returns the desired percentile.
The function signature is pretty similar to np.percentile. Main external differences/limitations:

  • q is required to be in range [0, 1] instead of range [0, 100];
  • no out parameter;
  • overwrite_input is True, being a _mut method. The False behaviour can be recovered by adding a percentile_axis method taking its argument as immutable reference;
  • interpolation=lower is the only supported (and thus default) behaviour. The function can be easily extended to support highest and nearest strategy, because they both require the recovery of a single element by index;
  • keepdims=False.

To keep functions small and separate concerns I have also implemented two new methods for ArrayViewMut1: ith_mut and partition_mut.
They can be used as stepping stones to provide ith_axis_mut and partition_axis_mut methods for n-dimensional arrays. Ideally I would have liked to implement both ith_mut and partition_mut for n-dimensional arrays, but I ran into some doubts:

Array::from_iter(a.iter())

to emulate np.flatten but I'd actually like to emulate np.ravel behaviour in this case to avoid extra memory allocation.

Algorithm

np.percentile uses introselect while my partition_axis_mut/ith_mut relies on randomized quickselect: same expected complexity in the average case, O(n), while quickselect has O(n^2) complexity in the worst case compared to O(n) complexity in the worst case for introselect.
Nothing against introselect - I have started with quickselect because it was easier to implement, but I structured the code to make it simple to switch to introselect with a minimum amount of effort (i.e. we just need a working implementation of median of medians).

To use randomized quickselect I need to sample random integers => I need to use the rand crate.
As far as I understand, ndarray does not want to add rand as a dependency, thus justifying the existence of ndarray-rand.
Choosing the pivot index randomly improves running time increasing the number of inputs falling in the "average case" bucket. But if rand is an issue, it is sufficient to modify random_pivot, a private helper function used by ith_mut, into a function returning a pivot index using a deterministic algorithm (first value, last value, middle value, whatever) to get rid of the dependency.

Trait bounds

All new methods work for A: [...] + Ord, thus excluding f32 and f64 as valid element types. Because of the peculiarities ofNaN handling, I'd imagine that specific float equivalents of these methods will have to be added (as in np.nanpercentile).
This will be an on-going problem for all order-induced statistics (maximum, minimum, etc.): I am happy to work on it, but I'd like to first understand what kind of design choices would be preferred @bluss @jturner314.

P.S.

LukeMathWalker and others added 30 commits May 14, 2018 07:17
Implemented partition and a small test, which currently fails due to
index overflow.
mutability to modify in place the array passed as argument. Added some
tests for partition.
get i from q. This provides the right behaviour when calling the
function on arrays with odd and even number of values when computing the
median value. Added some test for percentile_axis_mut.
…==0. Implemented a test to check that percentile 0 returns the array minimum
@LukeMathWalker
Copy link
Member Author

I'd like to wrap up this PR in the next couple of weeks - can I get your view on how we want to move forward with it with respect to the last comments? @bluss @jturner314

@LukeMathWalker
Copy link
Member Author

LukeMathWalker commented Sep 2, 2018

percentile_axis_mut now supports Lower, Upper, Midpoint, Nearest and Linear as strategies using this signature:

    pub fn percentile_axis_mut<I>(&mut self, axis: Axis, q: f64) -> Array<A, D::Smaller>
        where D: RemoveAxis,
              A: Ord + Clone,
              S: DataMut,
              I: Interpolate<A>,

with

pub trait Interpolate<T> {
    [private utility methods]
}

pub struct Upper;
pub struct Lower;
pub struct Nearest;
pub struct Midpoint;
pub struct Linear;

[implementations]

I am still not 100% sure that the Interpolate trait should be public.

PartialOrd types are still not supported in this implementation.

@jturner314
Copy link
Member

I'm sorry for taking so long to get back to this.

I think that for the time being, it makes the most sense to put this functionality in a separate crate. It doesn't need access to ndarray internals, and it would be good to refine the API before we commit to it in ndarray. It's easier to make breaking changes in a crate separate from ndarray. I've created a repository ndarray-stats that I'd like to release on crates.io. I've copied the percentile functions you provided in this PR, added a few more functions, and worked on NaN handling. (View the docs with cargo doc --open.)

See the MaybeNan trait and the method variants in the QuantileExt trait for the NaN handling. It provides support for f32, f64, and Option<integer> types. I'd like to add a quantile_axis_partialord_mut method to QuantileExt, similar to min_partialord and max_partialord.

Other functionality that would be good to add at some point are NaN-supporting sum, mean, variance, standard deviation, and histogram methods (basically everything in https://docs.scipy.org/doc/numpy/reference/routines.statistics.html).

What do you think?

@LukeMathWalker
Copy link
Member Author

I agree that working on a separate sub-crate could allow for a faster and easier iteration, especially with respect to API design.
Happy to move my efforts there :) @jturner314

@jturner314
Copy link
Member

Okay, sounds good. By the way, if you'd like to be the maintainer of the ndarray-stats crate, that would be fine with me. I haven't claimed the name on crates.io yet. I just created a repository myself because that was the most expedient thing to do.

@LukeMathWalker
Copy link
Member Author

LukeMathWalker commented Sep 4, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants