Order statistics #461

LukeMathWalker · 2018-06-02T18:07:25Z

Motivation

Percentiles are a nice-to-have feature, but they turn out to be extremely useful in machine learning projects (in particular, to support efficient and distribution-agnostic splitting criteria for decision trees operating on continuous values, paper - Sec 3.3).

Overview

I have added an array method called percentile_axis_mut that shuffles in-place each 1-dimensional lane along the specified axis and returns the desired percentile.
The function signature is pretty similar to np.percentile. Main external differences/limitations:

q is required to be in range [0, 1] instead of range [0, 100];
no out parameter;
overwrite_input is True, since this is a _mut method. The False behaviour can be recovered by adding a percentile_axis method that takes its argument by immutable reference;
interpolation=lower is the only supported (and thus default) behaviour. The function can be easily extended to support the higher and nearest strategies, because they both require the recovery of a single element by index;
keepdims=False.
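To make the `q` range and the `lower` strategy concrete, here is a minimal reference sketch that sorts a copy of a lane and indexes into it. The names `lower_index` and `percentile_lower_sorted` are hypothetical, and the index formula `floor(q * (n - 1))` follows NumPy's convention for `lower`; it is an assumption that the PR maps `q` to an index the same way.

```rust
/// Index of the `q`-th percentile under a "lower" strategy.
/// Assumption: NumPy's convention, floor(q * (n - 1)); the PR itself
/// may derive the index from `q` differently.
fn lower_index(q: f64, len: usize) -> usize {
    assert!((0.0..=1.0).contains(&q), "q must lie in [0, 1]");
    assert!(len > 0, "cannot take a percentile of an empty lane");
    (q * (len - 1) as f64).floor() as usize
}

/// Reference behaviour only: sorts a copy in O(n log n) and indexes it.
/// A selection algorithm can return the same element without a full sort.
fn percentile_lower_sorted<T: Ord + Clone>(lane: &[T], q: f64) -> T {
    let mut sorted = lane.to_vec();
    sorted.sort();
    sorted[lower_index(q, sorted.len())].clone()
}
```

With this convention, `q = 0.0` yields the minimum, `q = 1.0` the maximum, and `q = 0.5` on an even-length lane yields the lower of the two middle elements.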

To keep functions small and separate concerns I have also implemented two new methods for ArrayViewMut1: ith_mut and partition_mut.
They can be used as stepping stones to provide ith_axis_mut and partition_axis_mut methods for n-dimensional arrays. Ideally I would have liked to implement both ith_mut and partition_mut for n-dimensional arrays, but I ran into some doubts:

should we flatten the array implicitly, as numpy does in np.partition if axis=None, or is it better to make it explicit that the method runs on 1-dimensional objects?
how do I flatten an array? In ndarray for NumPy users @jturner314 suggests using

Array::from_iter(a.iter())

to emulate np.flatten but I'd actually like to emulate np.ravel behaviour in this case to avoid extra memory allocation.

Algorithm

np.percentile uses introselect, while my partition_axis_mut/ith_mut relies on randomized quickselect. Both run in O(n) expected time, but quickselect degrades to O(n^2) in the worst case, whereas introselect stays O(n).
Nothing against introselect - I started with quickselect because it was easier to implement, but I structured the code so that switching to introselect takes minimal effort (i.e. we just need a working implementation of median of medians).
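As an illustration of the selection approach discussed here, the following is a minimal quickselect sketch on a plain slice. It is not the PR's code: `partition` and `select_ith` are hypothetical free functions, the Lomuto partition scheme is an assumption, and the pivot is simply the middle element of the current window so that the example needs no `rand` dependency.

```rust
/// Partitions `v` around the value at `pivot_index` (Lomuto scheme).
/// Returns the pivot's final position: every element before it is
/// smaller, every element after it is greater than or equal to it.
fn partition<T: Ord>(v: &mut [T], pivot_index: usize) -> usize {
    assert!(!v.is_empty(), "cannot partition an empty slice");
    let last = v.len() - 1;
    v.swap(pivot_index, last); // park the pivot at the end
    let mut store = 0;
    for i in 0..last {
        if v[i] < v[last] {
            v.swap(i, store);
            store += 1;
        }
    }
    v.swap(store, last); // move the pivot to its sorted position
    store
}

/// Rearranges `v` in place and returns the element that would sit at
/// index `i` if `v` were sorted. Expected O(n), worst case O(n^2).
fn select_ith<T: Ord>(v: &mut [T], i: usize) -> &T {
    assert!(i < v.len(), "index out of bounds");
    // Invariant: the answer lies in the window v[lo..hi], lo <= i < hi.
    let (mut lo, mut hi) = (0, v.len());
    while hi - lo > 1 {
        // Middle-of-window pivot (assumption; the PR picks it at random).
        let p = lo + partition(&mut v[lo..hi], (hi - lo) / 2);
        if i == p {
            break;
        }
        if i < p {
            hi = p;
        } else {
            lo = p + 1;
        }
    }
    &v[i]
}
```

Only the window that contains index `i` is partitioned again, which is what keeps the expected cost linear instead of the O(n log n) of a full sort.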

To use randomized quickselect I need to sample random integers => I need to use the rand crate.
As far as I understand, ndarray does not want to add rand as a dependency, thus justifying the existence of ndarray-rand.
Choosing the pivot index randomly improves running time by increasing the number of inputs that fall in the "average case" bucket. But if rand is an issue, it is sufficient to turn random_pivot, a private helper function used by ith_mut, into a function that returns a pivot index using a deterministic algorithm (first value, last value, middle value, whatever) to get rid of the dependency.
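One possible deterministic pivot rule is median-of-three, sketched below on a plain slice. The function name `median_of_three_pivot` is hypothetical and this is only one of the options listed above, not what the PR implements; adversarial inputs can still trigger the O(n^2) worst case.

```rust
/// Hypothetical deterministic stand-in for a random pivot chooser:
/// returns the index of the median of the first, middle, and last
/// elements of `v`.
fn median_of_three_pivot<T: Ord>(v: &[T]) -> usize {
    assert!(!v.is_empty(), "cannot choose a pivot in an empty lane");
    let mut candidates = [0, v.len() / 2, v.len() - 1];
    // Order the three candidate indices by the values they point to.
    candidates.sort_by(|&i, &j| v[i].cmp(&v[j]));
    candidates[1]
}
```

Compared with always taking the first or last element, this rule avoids the quadratic behaviour on already sorted or reverse-sorted lanes.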

Trait bounds

All new methods work for A: [...] + Ord, thus excluding f32 and f64 as valid element types. Because of the peculiarities of NaN handling, I'd imagine that specific float equivalents of these methods will have to be added (as in np.nanpercentile).
This will be an on-going problem for all order-induced statistics (maximum, minimum, etc.): I am happy to work on it, but I'd like to first understand what kind of design choices would be preferred @bluss @jturner314.
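To illustrate why floats need separate treatment, here is a minimal NaN-skipping sketch in the spirit of np.nanpercentile. `f64` implements `PartialOrd` but not `Ord`, so the sketch filters out NaNs first and only then sorts with `partial_cmp`. The function name `nan_percentile_lower` is hypothetical, and the `floor(q * (n - 1))` index follows NumPy's `lower` convention as an assumption.

```rust
/// Hypothetical NaN-skipping "lower" percentile for f64 values.
/// Returns `None` when no non-NaN value is present.
fn nan_percentile_lower(values: &[f64], q: f64) -> Option<f64> {
    assert!((0.0..=1.0).contains(&q), "q must lie in [0, 1]");
    // Drop NaNs: the remaining values are totally ordered.
    let mut kept: Vec<f64> = values.iter().copied().filter(|x| !x.is_nan()).collect();
    if kept.is_empty() {
        return None;
    }
    kept.sort_by(|a, b| a.partial_cmp(b).expect("NaNs were filtered out"));
    // Assumption: NumPy-style index for the "lower" strategy.
    let index = (q * (kept.len() - 1) as f64).floor() as usize;
    Some(kept[index])
}
```

Skipping NaNs is only one policy; propagating them (returning NaN or `None` as soon as one is seen) is an equally defensible design choice.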

P.S.

If "Add map_mut and map_axis_mut" (#460) gets merged I can use it to simplify the body of percentile_axis_mut, which now relies on @jturner314's clever workaround (see "map_axis with FnMut(ArrayViewMut1<'a, A>)", #452);
I am not very convinced by ith_mut/ith as method names. I was considering pick_mut/pick or find_mut/find as alternatives.

LukeMathWalker · 2018-08-31T09:40:25Z

I'd like to wrap up this PR in the next couple of weeks - can I get your view on how we want to move forward with it with respect to the last comments? @bluss @jturner314

LukeMathWalker · 2018-09-02T15:00:11Z

percentile_axis_mut now supports Lower, Upper, Midpoint, Nearest and Linear as strategies using this signature:

    pub fn percentile_axis_mut<I>(&mut self, axis: Axis, q: f64) -> Array<A, D::Smaller>
        where D: RemoveAxis,
              A: Ord + Clone,
              S: DataMut,
              I: Interpolate<A>,

with

pub trait Interpolate<T> {
    [private utility methods]
}

pub struct Upper;
pub struct Lower;
pub struct Nearest;
pub struct Midpoint;
pub struct Linear;

[implementations]

I am still not 100% sure that the Interpolate trait should be public.

PartialOrd types are still not supported in this implementation.
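The pattern of selecting a strategy through a type parameter can be sketched as follows. This is a simplified illustration, not the PR's trait: it works on an already sorted `f64` slice instead of `A: Ord`, the associated function `interpolate` and the free function `percentile` are hypothetical, and the tie-breaking rule in `Nearest` is an assumption.

```rust
/// Hypothetical, simplified strategy trait: combines the two order
/// statistics that surround the requested position.
trait Interpolate {
    /// `fraction` is the fractional part of `q * (n - 1)`, in [0, 1).
    fn interpolate(lower: f64, upper: f64, fraction: f64) -> f64;
}

// Zero-sized marker types: they carry no data and are used only as
// type-level switches.
struct Lower;
struct Upper;
struct Nearest;
struct Midpoint;
struct Linear;

impl Interpolate for Lower {
    fn interpolate(lower: f64, _upper: f64, _fraction: f64) -> f64 { lower }
}
impl Interpolate for Upper {
    fn interpolate(_lower: f64, upper: f64, _fraction: f64) -> f64 { upper }
}
impl Interpolate for Nearest {
    // Assumption: ties (fraction == 0.5) go to the upper neighbour.
    fn interpolate(lower: f64, upper: f64, fraction: f64) -> f64 {
        if fraction < 0.5 { lower } else { upper }
    }
}
impl Interpolate for Midpoint {
    fn interpolate(lower: f64, upper: f64, _fraction: f64) -> f64 { (lower + upper) / 2.0 }
}
impl Interpolate for Linear {
    fn interpolate(lower: f64, upper: f64, fraction: f64) -> f64 {
        lower + fraction * (upper - lower)
    }
}

/// Percentile of an already sorted, non-empty slice under strategy `I`.
fn percentile<I: Interpolate>(sorted: &[f64], q: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=1.0).contains(&q));
    let position = q * (sorted.len() - 1) as f64;
    let lower = sorted[position.floor() as usize];
    let upper = sorted[position.ceil() as usize];
    I::interpolate(lower, upper, position.fract())
}
```

A call such as `percentile::<Linear>(&sorted, 0.5)` picks the strategy at compile time, so no runtime branch on an enum is needed. Note that `Midpoint` and `Linear` require arithmetic on the element type, which `Ord` alone does not provide.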

jturner314 · 2018-09-04T05:12:38Z

I'm sorry for taking so long to get back to this.

I think that for the time being, it makes the most sense to put this functionality in a separate crate. It doesn't need access to ndarray internals, and it would be good to refine the API before we commit to it in ndarray. It's easier to make breaking changes in a crate separate from ndarray. I've created a repository ndarray-stats that I'd like to release on crates.io. I've copied the percentile functions you provided in this PR, added a few more functions, and worked on NaN handling. (View the docs with cargo doc --open.)

See the MaybeNan trait and the method variants in the QuantileExt trait for the NaN handling. It provides support for f32, f64, and Option<integer> types. I'd like to add a quantile_axis_partialord_mut method to QuantileExt, similar to min_partialord and max_partialord.
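As a rough illustration of what a `PartialOrd`-based variant has to deal with, the sketch below computes a minimum and gives up as soon as a comparison is undefined. It is not the ndarray-stats API: `min_partial` is a hypothetical free function on slices, and returning `None` on an undefined comparison is an assumed policy.

```rust
use std::cmp::Ordering;

/// Hypothetical minimum for `PartialOrd` element types.
/// Returns `None` for an empty slice or when any comparison is
/// undefined (for example, when a NaN is present).
fn min_partial<T: PartialOrd + Copy>(values: &[T]) -> Option<T> {
    let mut iter = values.iter().copied();
    let mut best = iter.next()?;
    // A value that cannot be compared with itself (NaN) is rejected.
    let _ = best.partial_cmp(&best)?;
    for x in iter {
        if x.partial_cmp(&best)? == Ordering::Less {
            best = x;
        }
    }
    Some(best)
}
```

The same function works for integers, where every comparison is defined, so `min_partial(&[4, 2, 9])` simply yields `Some(2)`.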

Other functionality that would be good to add at some point includes NaN-supporting sum, mean, variance, standard deviation, and histogram methods (basically everything in https://docs.scipy.org/doc/numpy/reference/routines.statistics.html).

What do you think?

LukeMathWalker · 2018-09-04T06:07:51Z

I agree that working on a separate sub-crate could allow for a faster and easier iteration, especially with respect to API design.
Happy to move my efforts there :) @jturner314

jturner314 · 2018-09-04T17:42:57Z

Okay, sounds good. By the way, if you'd like to be the maintainer of the ndarray-stats crate, that would be fine with me. I haven't claimed the name on crates.io yet. I just created a repository myself because that was the most expedient thing to do.

LukeMathWalker · 2018-09-04T17:44:59Z

I'd be happy to do it - I definitely plan to engage for the long term in this niche of the Rust ecosystem.

LukeMathWalker and others added 30 commits May 14, 2018 07:17

Added function signature.

ee0d54d

Determine the index of the desired element if the array were to be sorted.

9b775d0

Added macro use for slice

a4bec9d

Barebone implementation of randomized_select with the signature of randomized_partition.

470af6c

Added rand as a dependency.

f2eec96

Added extern crate rand to lib.rs.

92d3433

Basic implementation of randomized_partition first section. Missing partition.

8e25ae0

Using swap where needed, instead of manually doing the swap using clone.

fc3e570

Implemented partition and a small test, which currently fails due to index overflow.

Cleaned comments and debugging prints.

0f5fb47

Fixed index overflow bug.

177e03b

Fixed issues with index overflow.

811e84f

Remove useless type annotation.

b618377

Refactored randomized_partition and partition to properly use mutability to modify in place the array passed as argument. Added some tests for partition.

8994eae

Restored previous version of randomized_select.

9359134

Modified all function signatures to accept mutable views instead of owned arrays. Adapted tests as well.

43d765e

Added macro_use to get azip! in numerics.

0dec699

Added macro_use to macrozip to get azip! in numeric.

9315413

Added an implementation of percentile_axis that requires a mutable reference to the array (thus renamed _mut)

212d1b7

We need to increment the index by 1 to get the correct result

ccc8b04

Mapping passes i instead of i+1 to randomized_select, but we use ceil to get i from q. This provides the right behaviour when calling the function on arrays with odd and even number of values when computing the median value. Added some test for percentile_axis_mut.

54f227a

Added some documentation for percentile_axis_mut.

461e258

Added an assert to prevent invalid q values.

184d65d

Changed end of recursion condition - it needs to check for n==1 not n==0. Implemented a test to check that percentile 0 returns the array minimum

e290bff

Added a new test for the maximum, as well as testing it on arr1.

911a656

Split test into two more meaningful tests.

c57fb25

Added more documentation

2107f56

Updated to version 0.5.0 of rand.

07bc655

Updated to use the proper way of generating random integers.

891fe95

Updated docs.

540c495

Updated docs.

e22f889

LukeMathWalker and others added 20 commits September 1, 2018 19:05

Merge branch 'master' of https://github.com/bluss/ndarray into order-statistics

c6c459d

Implementing interpolation strategies as trait bounds.

1ec4e42

Providing some default methods.

63bbb07

Mysterious type annotation error.

3d1a196

Fixed.

10a72d0

Making Lower, Upper, Nearest public.

f8b7713

Removed empty line.

e7d1f61

Re-exporting functions, structs and traits.

cec96ca

Fixed import.

c786eac

Fixed test cases.

744a427

Updated docs.

bdc8c3a

Fixed docs.

b526445

Fixed warnings on unused variables.

1b39c61

Merge pull request #1 from LukeMathWalker/interpolation-strategy

00ddf37

Interpolation strategy

Implemented Midpoint. Refactored code to change types of variables.

e90555e

Defined Linear.

1d094e8

Added Interpolate implementation for Linear.

f5f77ea

Fixed docs!

ff51479

Merge pull request #2 from LukeMathWalker/add-more-interpolation-strategies

15aa626

Add more interpolation strategies

Removed unused constraint on A.

d4e7de7

Reducing the scope of search for upper_index element if lower_index was already found.

caf58ee

jturner314 mentioned this pull request Oct 28, 2018

Implement scalar_min and scalar_max for A: Ord #512

Closed

LukeMathWalker closed this Nov 8, 2018

LukeMathWalker deleted the order-statistics branch January 4, 2019 17:51
