Order statistics #461
Conversation
Commits:

- `randomized_partition`.
- Implemented partition and a small test, which currently fails due to index overflow.
- … mutability to modify the array passed as argument in place. Added some tests for partition.
- … owned arrays. Adapted tests as well.
- … reference to the array (thus renamed `_mut`).
- … get `i` from `q`. This provides the right behaviour when calling the function on arrays with an odd or even number of values when computing the median. Added some tests for `percentile_axis_mut`.
- … == 0. Implemented a test to check that percentile 0 returns the array minimum.
I'd like to wrap up this PR in the next couple of weeks - can I get your view on how we want to move forward with it with respect to the last comments? @bluss @jturner314
Interpolation strategy

… Add more interpolation strategies

```rust
pub fn percentile_axis_mut<I>(&mut self, axis: Axis, q: f64) -> Array<A, D::Smaller>
where
    D: RemoveAxis,
    A: Ord + Clone,
    S: DataMut,
    I: Interpolate<A>,
```

with

```rust
pub trait Interpolate<T> {
    // [private utility methods]
}

pub struct Upper;
pub struct Lower;
pub struct Nearest;
pub struct Midpoint;
pub struct Linear;

// [implementations]
```

I am still not 100% sure that the […]
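For illustration, here is a minimal sketch (my own, not the PR's actual code) of how the `Lower`, `Upper`, and `Nearest` strategies could map a percentile `q` in `[0, 1]` to an index; the helper names are hypothetical, and the formula follows the usual NumPy convention that the "exact" position of percentile `q` over `len` sorted values is `q * (len - 1)`:

```rust
/// Exact (fractional) position of percentile `q` over `len` sorted values.
fn float_percentile_index(q: f64, len: usize) -> f64 {
    q * (len - 1) as f64
}

/// `Lower`: round the exact position down.
fn lower_index(q: f64, len: usize) -> usize {
    float_percentile_index(q, len).floor() as usize
}

/// `Upper`: round the exact position up.
fn upper_index(q: f64, len: usize) -> usize {
    float_percentile_index(q, len).ceil() as usize
}

/// `Nearest`: round to the closest integer position.
fn nearest_index(q: f64, len: usize) -> usize {
    float_percentile_index(q, len).round() as usize
}

fn main() {
    // The median of 5 elements sits exactly at index 2, so all strategies agree.
    assert_eq!(lower_index(0.5, 5), 2);
    assert_eq!(upper_index(0.5, 5), 2);
    // For 4 elements the exact position is 1.5: lower picks 1, upper picks 2.
    assert_eq!(lower_index(0.5, 4), 1);
    assert_eq!(upper_index(0.5, 4), 2);
    println!("ok");
}
```

`Midpoint` and `Linear` differ in that they combine the values at the lower and upper indices rather than selecting a single element by index.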
I'm sorry for taking so long to get back to this. I think that for the time being, it makes the most sense to put this functionality in a separate crate. It doesn't need access to […] See the […] Other functionality that would be good to add at some point includes NaN-supporting sum, mean, variance, standard deviation, and histogram methods (basically everything in https://docs.scipy.org/doc/numpy/reference/routines.statistics.html). What do you think?
I agree that working on a separate sub-crate could allow for faster and easier iteration, especially with respect to API design.
Okay, sounds good. By the way, if you'd like to be the maintainer of the […]
I'd be happy to do it - I definitely plan to engage for the long term in this niche of the Rust ecosystem.
Motivation
Percentiles are a nice-to-have feature, but they turn out to be extremely useful in Machine Learning projects (in particular, to support efficient and distribution-agnostic splitting criteria for Decision Trees operating on continuous values - see the linked paper, Sec. 3.3).
Overview
I have added an array method called `percentile_axis_mut` that shuffles in place each 1-dimensional lane along the specified axis and returns the desired percentile. The function signature is pretty similar to `np.percentile`. Main external differences/limitations:

- `q` is required to be in the range `[0, 1]` instead of `[0, 100]`;
- there is no `out` parameter;
- `overwrite_input` is effectively `True`, this being a `_mut` method. The `False` behaviour can be recovered by adding a `percentile_axis` method taking its argument by immutable reference;
- `interpolation=lower` is the only supported (and thus default) behaviour. The function can easily be extended to support the `higher` and `nearest` strategies, because they both only require the recovery of a single element by index;
- the behaviour corresponds to `keepdims=False`.

To keep functions small and separate concerns, I have also implemented two new methods for `ArrayViewMut1`: `ith_mut` and `partition_mut`. They can be used as stepping stones to provide `ith_axis_mut` and `partition_axis_mut` methods for n-dimensional arrays. Ideally I would have liked to implement both `ith_mut` and `partition_mut` for n-dimensional arrays, but I ran into some doubts:

- should there be a default `axis=-1`, or is it better to make it explicit that the method runs on 1-dimensional objects?
- […] to emulate `np.flatten`, but I'd actually like to emulate `np.ravel`
behaviour in this case to avoid extra memory allocation.

Algorithm

`np.percentile` uses introselect, while my `partition_axis_mut`/`ith_mut` rely on randomized quickselect: both have the same expected O(n) complexity in the average case, but quickselect has O(n^2) worst-case complexity, compared to O(n) worst-case complexity for introselect. Nothing against introselect - I started with quickselect because it was easier to implement, but I structured the code to make switching to introselect simple with a minimum amount of effort (i.e. we just need a working implementation of median of medians).
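As a sketch of the quickselect approach described above (my own illustration, not the PR's code), operating on a plain mutable slice; a deterministic last-element pivot is used here so the example has no `rand` dependency, whereas the PR samples the pivot index randomly:

```rust
/// Lomuto partition around the last element.
/// Returns the pivot's final index; smaller-or-equal elements end up before it.
fn partition<T: Ord>(data: &mut [T]) -> usize {
    let pivot = data.len() - 1;
    let mut store = 0;
    for i in 0..pivot {
        if data[i] <= data[pivot] {
            data.swap(i, store);
            store += 1;
        }
    }
    data.swap(store, pivot);
    store
}

/// Select the i-th smallest element (0-based), reordering `data` in place.
/// Expected O(n) time; O(n^2) in the worst case, as discussed above.
fn select_ith<T: Ord>(data: &mut [T], i: usize) -> &T {
    let q = partition(data);
    if i == q {
        &data[q]
    } else if i < q {
        // The target lies in the left partition.
        select_ith(&mut data[..q], i)
    } else {
        // The target lies in the right partition; indices shift by `q + 1`.
        let offset = q + 1;
        select_ith(&mut data[offset..], i - offset)
    }
}

fn main() {
    let mut v = vec![9, 1, 7, 3, 5];
    // The median (2nd smallest, 0-based) of {1, 3, 5, 7, 9} is 5.
    assert_eq!(*select_ith(&mut v, 2), 5);
    println!("ok");
}
```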
To use randomized quickselect I need to sample random integers, which means I need to use the `rand` crate. As far as I understand, `ndarray` does not want to add `rand` as a dependency, which is precisely what justifies the existence of `ndarray-rand`. Choosing the pivot index randomly improves the running time by increasing the number of inputs that fall in the "average case" bucket. But if `rand` is an issue, it is sufficient to modify `random_pivot`, a private helper function used by `ith_mut`, into a function that returns a pivot index using a deterministic algorithm (first value, last value, middle value, whatever) to get rid of the dependency.

Trait bounds
All new methods work for `A: [...] + Ord`, thus excluding `f32` and `f64` as valid element types. Because of the peculiarities of `NaN` handling, I'd imagine that specific float equivalents of these methods will have to be added (as in `np.nanpercentile`). This will be an ongoing problem for all order-induced statistics (maximum, minimum, etc.): I am happy to work on it, but I'd like to first understand what kind of design choices would be preferred @bluss @jturner314.
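One common workaround for the `Ord` bound (sketched here as an illustration; this wrapper is not part of the PR) is a newtype that rejects `NaN` at construction time and can therefore implement `Ord` for floats, similar in spirit to the `ordered-float` crate:

```rust
use std::cmp::Ordering;

/// Illustrative newtype: a float guaranteed not to be NaN,
/// so a total order (and hence `Ord`) can be implemented safely.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NotNan(f64);

impl NotNan {
    /// Reject NaN at construction time.
    fn new(x: f64) -> Option<NotNan> {
        if x.is_nan() { None } else { Some(NotNan(x)) }
    }
}

impl Eq for NotNan {}

impl PartialOrd for NotNan {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for NotNan {
    fn cmp(&self, other: &Self) -> Ordering {
        // Safe to unwrap: NaN is excluded by construction.
        self.0.partial_cmp(&other.0).unwrap()
    }
}

fn main() {
    let mut xs: Vec<NotNan> = [3.0, 1.0, 2.0]
        .iter()
        .map(|&x| NotNan::new(x).unwrap())
        .collect();
    xs.sort(); // `sort` requires `Ord`, which floats gain via the wrapper
    assert_eq!(xs[0], NotNan(1.0));
    assert!(NotNan::new(f64::NAN).is_none());
    println!("ok");
}
```

This sidesteps the `NaN` question rather than answering it; `nanpercentile`-style methods would instead need a policy for skipping `NaN` values.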
P.S. […]

- the `percentile_axis_mut` function body, which now relies on @jturner314's clever workaround (see "map_axis with FnMut(ArrayViewMut1<'a, A>)" #452);
- `ith_mut`/`ith` as method names. I was considering `pick_mut`/`pick` or `find_mut`/`find` as alternatives.