Faster iterator for arbitrary order #469
Would it be possible to simply use …? Imo, the simplest name should be used for the most frequent case. Of course, I'm not aware of all use-cases, but here we would always use the "fast" iter, not the "logical" one, as we simply don't care about the order. Do you plan on removing the adapters (fold, scalar_sum, visit, etc.)? We can simply call …
I think this is a good idea @jturner314, but I maintain the opinion that the Iterator API is not a good way to deliver good performance for ndarray.
Comments on arbitrary order iterator
I'd prefer to avoid doing this for a few reasons: …

For these reasons, I'd like for …
I don't think so. These are sufficiently useful and commonly used that I think we should keep them. The …
That's a good point. I forgot about the issues with …. I just realized something else: what type do we return from an arbitrary-order iterator? It would be nice to use …. Edit: We might be able to use ….

**Alternative idea**

So, after this realization, I have a different proposal. Instead of providing arbitrary-order iterators, provide the following methods on …
These are sufficient to provide the functionality of most of the iterator adapters I listed in my first comment. We could also add …
but I'm not sure if we could outperform …. @bluss What do you think of this alternative idea? By the way, my original motivation for this issue was "It would be nice to have …"
For arbitrary order iterators, yeah I'd be most concerned about the …. The current Iter is already wrapping an enum, and in one of the two cases it's a slice iterator. For many constructions, the enum check is lifted out of the loop! (The inversion turns it into a conditional with a loop in each branch, which is code bloat!)
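A minimal sketch of the enum-wrapped iterator shape described above (hypothetical `ElemIter` type and fields, not ndarray's actual `Iter` internals): one variant defers to the std slice iterator for contiguous memory, the other steps by an explicit stride, and in a plain `for` loop the optimizer can often hoist the variant check out of the hot path.

```rust
use std::slice;

// Hypothetical sketch of a two-variant element iterator; not ndarray's `Iter`.
enum ElemIter<'a, A> {
    /// Contiguous memory: just the std slice iterator.
    Slice(slice::Iter<'a, A>),
    /// Non-contiguous memory (1-D, non-negative stride, for simplicity).
    Strided { data: &'a [A], index: usize, len: usize, stride: usize },
}

impl<'a, A> Iterator for ElemIter<'a, A> {
    type Item = &'a A;
    fn next(&mut self) -> Option<&'a A> {
        match self {
            ElemIter::Slice(it) => it.next(),
            ElemIter::Strided { data, index, len, stride } => {
                if *index == *len {
                    None
                } else {
                    let d: &'a [A] = *data; // copy out the shared reference
                    let elem = &d[*index * *stride];
                    *index += 1;
                    Some(elem)
                }
            }
        }
    }
}

fn main() {
    let storage = [1, 2, 3, 4, 5, 6];
    // Every other element of `storage`, as the strided case would see it.
    let it = ElemIter::Strided { data: &storage, index: 0, len: 3, stride: 2 };
    assert_eq!(it.copied().collect::<Vec<i32>>(), vec![1, 3, 5]);
}
```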
Indexed Zip does outperform indexed iter by something, at least.
Thus the current Iter is a good model for arbitrary order iterators? Just initialize it differently. Just an idea.

Edit: This is kind of the low-tech way, but it's not good enough for arrays that are not contiguous.
Yes, that's true. I generally don't think of …

I'm not worried about that as long as …

I didn't realize that. That makes the most straightforward "arbitrary order" implementation quite easy then – just use the …
That's really surprising and impressive to me.
Yes, I think that would work. The cases I can think of where you could improve over …. This could be accomplished by adding a third iterator variant …
Instead of using IxDyn, isn't it possible to shuffle the axes around / merge axes that are contiguous together? Set the remaining axes to length 1. But it mostly seems to be a benefit if you can merge into the axis that becomes the core of the innermost loop. But in some cases the "innermost" would be an axis of length 1, and then it's a simple win to move the next axis into its place.

I also want to do this for Zip. Starting simple, for example merging the next-to-inner axis with the innermost if we can. (In Zip the "innermost" axis of the loop is the (n − 1)-th axis.)
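A small sketch of the merge rule being described, as a hand-rolled helper on plain shape/stride slices (hypothetical `can_merge`, not an ndarray API): axis `take` can be folded into axis `into` exactly when one step along `take` covers `len(into)` steps along `into`.

```rust
// Hypothetical helper, not ndarray API: two axes are mergeable when
// stride[take] == stride[into] * len(into) (trivially true if either axis
// has length <= 1). The merged axis then has length len(take) * len(into)
// and stride stride[into].
fn can_merge(shape: &[usize], strides: &[isize], take: usize, into: usize) -> bool {
    shape[take] <= 1
        || shape[into] <= 1
        || strides[take] == strides[into] * shape[into] as isize
}

fn main() {
    // A 3x4 row-major array (strides [4, 1]) collapses to one contiguous
    // axis of length 12.
    assert!(can_merge(&[3, 4], &[4, 1], 0, 1));
    // Taking every other row of a 6x4 row-major array gives shape [3, 4]
    // with strides [8, 1]; the gap between rows prevents merging.
    assert!(!can_merge(&[3, 4], &[8, 1], 0, 1));
}
```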
I've been working on a project to generalize ….

```rust
/// Optimizes the producer, possibly changing the order, and adjusts `axes`
/// into good iteration order (assuming the last index moves the fastest).
///
/// This function may change the shape of the producer and the order of
/// iteration. Optimization is performed only on the given `axes`; all other
/// axes are left unchanged.
///
/// When choosing axes to attempt merging, it only tries merging axes when the
/// absolute stride of the `take` axis is >= the absolute stride of the `into`
/// axis.
///
/// The suggested iteration order is in order of descending absolute stride
/// (except for axes of length <= 1, which are positioned as outer axes). This
/// isn't necessarily the optimal iteration order, but it should be a
/// reasonable heuristic in most cases.
///
/// **Panics** if any of the axes in `axes` are out of bounds or if an axis is
/// repeated more than once.
pub fn optimize_any_ord_axes<T, D>(producer: &mut T, axes: &mut D)
where
    T: NdReshape + ?Sized,
    D: Dimension,
{
    // ...
}
```
For `Zip`:

```rust
impl<A, B> NdReshape for Zip<A, B>
where
    A: NdReshape,
    B: NdReshape<Dim = A::Dim>,
{
    // ...
    fn approx_abs_strides(&self) -> Self::Dim {
        let a = self.a.approx_abs_strides();
        let b = self.b.approx_abs_strides();
        a + b
    }
    // ...
}
```

For `FoldAxesProducer`:

```rust
impl<P, Df, I, F> NdReshape for FoldAxesProducer<P, Df, I, F>
where
    P: NdReshape,
    Df: Dimension,
    P::Dim: SubDim<Df, Out = I::Dim>,
    I: NdReshape,
{
    // ...
    fn approx_abs_strides(&self) -> Self::Dim {
        let mut strides = self.init.approx_abs_strides();
        let inner_strides = self.inner.approx_abs_strides();
        for (ax, s) in strides.slice_mut().iter_mut().enumerate() {
            *s += inner_strides[self.outer_to_inner[ax]];
        }
        strides
    }
    // ...
}
```

This approach matches or outperforms the current version of …
That sounds really exciting, @jturner314. I was going to say there is always one unique shortest-stride axis, isn't there, if we count axes of length > 1; but that's just for one producer. How do you select axes to merge with many producers?
There are a couple of questions when dealing with multiple producers: (1) What is the best order to iterate over the axes? (2) How do we decide which axes to merge/invert?

Answering the first question is not easy because there is potentially a tradeoff between trying to pick short strides for the inner loop(s) (to use the processor's cache effectively) and trying to make the inner loop(s) as long as possible (to help with branch prediction and because determining the next index / pointer location after reaching the end of the axis is relatively expensive compared to just incrementing a pointer). For example, consider a 100×2 row-major array without merging the axes. Performing the inner loop over axis 0 has a longer stride (worse caching) but makes the inner loop longer (better branch prediction and cheaper on average to calculate the next pointer), while performing the inner loop over axis 1 has a shorter stride (better caching) but makes the inner loop very short (worse branch prediction and more expensive to calculate the next pointer on average). Initially, I tried to develop a cost model, but that was too complicated, so I just settled on the heuristic of sorting the axes by descending absolute stride (the last axis is the inner loop), where adapters that combine producers (like …) ….

The second question is simpler to answer. The goal is to merge as many axes as possible. It might seem like it's necessary to try merging every axis into every other axis, but it turns out that's not necessary. Let's ignore axes of length 0 or 1. Then, axes can only be merged if the absolute value of the ….

Note, however, that the axes that can be merged are not necessarily consecutive in the sorted order. For example, consider shape = [10, 2, 2], strides of first producer = [1, 20, 10], strides of second producer = [15, 2, 1]. The sum of strides is [16, 22, 11]. Sorted by descending stride, this gives axes 1, 0, 2. Axis 1 can be merged into axis 2, but that wouldn't be obvious if we just looked at consecutive axes in the sorted order (1 into 0 or 0 into 2). So, we need to try merging each axis with all axes with larger absolute stride. Fortunately, we usually don't need to try all these cases because once a ….

The algorithm is O(n²) where n is the number of axes, but n is usually pretty small. The implementation needs some cleanup (and I think I can replace the custom …):

```rust
/// Executes `optimize_any_ord_axes` for all axes and returns the suggested
/// iteration order.
pub fn optimize_any_ord<T>(producer: &mut T) -> T::Dim
where
    T: NdReshape + ?Sized,
{
    let mut axes = T::Dim::zeros(producer.ndim());
    for (i, ax) in axes.slice_mut().iter_mut().enumerate() {
        *ax = i;
    }
    unsafe { optimize_any_ord_axes_unchecked(producer, &mut axes) };
    axes
}

/// Optimizes the producer, possibly changing the order, and adjusts `axes`
/// into good iteration order (assuming the last index moves the fastest).
///
/// This function may change the shape of the producer and the order of
/// iteration. Optimization is performed only on the given `axes`; all other
/// axes are left unchanged.
///
/// When choosing axes to attempt merging, it only tries merging axes when the
/// absolute stride of the `take` axis is >= the absolute stride of the `into`
/// axis.
///
/// The suggested iteration order is in order of descending absolute stride
/// (except for axes of length <= 1, which are positioned as outer axes). This
/// isn't necessarily the optimal iteration order, but it should be a
/// reasonable heuristic in most cases.
///
/// **Panics** if any of the axes in `axes` are out of bounds or if an axis is
/// repeated more than once.
pub fn optimize_any_ord_axes<T, D>(producer: &mut T, axes: &mut D)
where
    T: NdReshape + ?Sized,
    D: Dimension,
{
    assert_valid_unique_axes::<T::Dim>(producer.ndim(), axes.slice());
    unsafe { optimize_any_ord_axes_unchecked(producer, axes) }
}

/// `unsafe` because `axes` are not checked to ensure that they're in-bounds
/// and not repeated.
unsafe fn optimize_any_ord_axes_unchecked<T, D>(producer: &mut T, axes: &mut D)
where
    T: NdReshape + ?Sized,
    D: Dimension,
{
    if axes.ndim() == 0 {
        return;
    }
    // TODO: Should there be a minimum producer size for the more advanced (and
    // costly) optimizations?

    // Determine initial order of axes. Sort axes by descending absolute stride
    // (except for axes with length <= 1, which are moved to the left).
    {
        let shape = producer.shape();
        let abs_strides = producer.approx_abs_strides();
        axes.slice_mut().sort_unstable_by(|&a, &b| {
            if shape[a] <= 1 || shape[b] <= 1 {
                shape[a].cmp(&shape[b])
            } else {
                abs_strides[b].cmp(&abs_strides[a])
            }
        });
    }
    // Merge as many axes with lengths > 1 as possible and move `take` axes
    // (which now have length <= 1) to the left.
    if let Some(mut rest) = axes
        .slice()
        .iter()
        .enumerate()
        .find(|(_, &ax)| producer.len_of(Axis(ax)) > 1)
        .map(|(i, _)| i)
    {
        let mut i = axes.ndim() - 1;
        while i > rest {
            let mut t_inc = i;
            while t_inc > rest {
                //println!("i={}, t={}, axes={:?}, shape={:?}", i, t_inc - 1, axes, producer.shape());
                let take = Axis(axes[t_inc - 1]);
                let into = Axis(axes[i]);
                match producer.can_merge_axes(take, into) {
                    CanMerge::IfUnchanged | CanMerge::IfEither => {
                        producer.merge_axes(take, into);
                        // TODO: Would it be better to delay reordering until
                        // the end, and then partition/sort?
                        roll(&mut axes.slice_mut()[rest..t_inc], 1);
                        rest += 1;
                    }
                    CanMerge::IfInverted => {
                        producer.invert_axis(take);
                        producer.merge_axes(take, into);
                        roll(&mut axes.slice_mut()[rest..t_inc], 1);
                        rest += 1;
                    }
                    CanMerge::Never => {
                        t_inc -= 1;
                    }
                }
            }
            i -= 1;
        }
    }
}
```

The …
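To make the ordering heuristic concrete, here is a standalone sketch (hypothetical helper on plain slices, not the `NdReshape` API above) that reproduces the example from this comment: shape `[10, 2, 2]` with producer strides `[1, 20, 10]` and `[15, 2, 1]`.

```rust
// Hypothetical helper: order axes by descending summed absolute stride,
// moving axes of length <= 1 to the left, mirroring the sort above.
fn axis_order(shape: &[usize], summed_abs_strides: &[usize]) -> Vec<usize> {
    let mut axes: Vec<usize> = (0..shape.len()).collect();
    axes.sort_unstable_by(|&a, &b| {
        if shape[a] <= 1 || shape[b] <= 1 {
            shape[a].cmp(&shape[b])
        } else {
            summed_abs_strides[b].cmp(&summed_abs_strides[a])
        }
    });
    axes
}

fn main() {
    let shape = [10, 2, 2];
    let strides_a = [1usize, 20, 10];
    let strides_b = [15usize, 2, 1];
    // Sum the producers' absolute strides per axis: [16, 22, 11].
    let summed: Vec<usize> = strides_a.iter().zip(&strides_b).map(|(x, y)| x + y).collect();
    // Descending summed stride gives the order [1, 0, 2] (axis 2 innermost).
    // Axis 1 is still mergeable into axis 2 even though they are not adjacent
    // in this order, which is why each axis is checked against all axes of
    // larger absolute stride.
    assert_eq!(axis_order(&shape, &summed), vec![1, 0, 2]);
}
```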
With Zip, this optimization would run only if the result is not that all producers are of the same c/f layout, right? With things like roll vs std rotate, I think of code size. It's not certain that a big blob of SIMD-ified rotating code is what we need; dimensions are short and we will never prioritize the cases where ndim is very large. See equality testing of IxDyn: it's written to have smaller code size and avoid slice's PartialEq implementation.
Sure, that would work. Alternatively, I think it's possible to improve the implementation of …
I'm not sure what you mean. I don't see any SIMD instructions in ….

```rust
/// Rolls the slice by the given shift.
///
/// Rolling is like a shift, except that elements shifted off the end are moved
/// to the other end. Rolling is performed in the direction of `shift`
/// (positive for right, negative for left).
fn roll<T>(slice: &mut [T], mut shift: isize) {
    let len = slice.len();
    if len == 0 {
        return;
    }
    // Minimize the absolute shift.
    shift = shift % len as isize;
    if shift > len as isize / 2 {
        shift -= len as isize;
    } else if shift < -(len as isize) / 2 {
        shift += len as isize;
    }
    // Perform the roll.
    if shift >= 0 {
        for _ in 0..shift {
            for i in 0..(len - 1) {
                slice.swap(i, len - 1);
            }
        }
    } else {
        for _ in 0..(-shift) {
            for i in (1..len).rev() {
                slice.swap(i, 0);
            }
        }
    }
}
```

This is a fairly large function, but if it's not inlined, that should be fine, right? It's worth noting that we could specialize the …
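For comparison, a small usage sketch relating the `roll` above to the std slice rotation methods (this only illustrates the equivalence in behavior; whether `rotate_right`'s implementation is the right code-size tradeoff is the open question here):

```rust
fn main() {
    // `roll` as defined above: a positive shift moves elements toward the
    // end, wrapping the tail around to the front.
    let mut a = [0, 1, 2, 3, 4];
    roll(&mut a, 1);
    assert_eq!(a, [4, 0, 1, 2, 3]);

    // The std equivalent of `roll(&mut b, 1)`.
    let mut b = [0, 1, 2, 3, 4];
    b.rotate_right(1);
    assert_eq!(b, [4, 0, 1, 2, 3]);

    // A negative shift rolls left, matching `rotate_left`.
    let mut c = [0, 1, 2, 3, 4];
    roll(&mut c, -2);
    assert_eq!(c, [2, 3, 4, 0, 1]);
}
```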
Are you referring to this? Why is it implemented like that instead of using:

```rust
impl<T: PartialEq> PartialEq for IxDynRepr<T> {
    fn eq(&self, rhs: &Self) -> bool {
        self.deref() == rhs.deref()
    }
}
```
About rotate, that's what I know of its implementation, but I'll check later. About dim eq: that's what I wanted to call attention to. It's implemented that way to avoid slice's PartialEq for code size and memcmp overhead. Fwiw, the difference in performance was tested at the time.
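A sketch of the kind of hand-rolled comparison being described (hypothetical `dims_eq`, not ndarray's actual `IxDynRepr` code): an explicit element loop avoids dispatching to the generic slice `PartialEq`/memcmp machinery, trading a few lines of source for smaller generated code on short dimension vectors.

```rust
// Hypothetical helper illustrating the tradeoff; not the actual IxDyn code.
fn dims_eq(a: &[usize], b: &[usize]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    for i in 0..a.len() {
        if a[i] != b[i] {
            return false;
        }
    }
    true
}

fn main() {
    assert!(dims_eq(&[3, 4, 5], &[3, 4, 5]));
    assert!(!dims_eq(&[3, 4, 5], &[3, 4]));
}
```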
Oh, I understand now! You're talking about the amount of code for a given ndim, especially for arrays with small ndim. (Specializing on dimension types or …)
`.iter()` provides an iterator over all the elements, but it always iterates in logical order, which may be slow depending on the memory layout of the array. In some cases, however, the order of iteration doesn't matter. Recent issues regarding these types of cases include #466 and #468. Examples of methods where order doesn't matter include the most common uses of these from the `Iterator` trait:

- `.fold()`
- `.for_each()`
- `.all()` and `.any()`
- `.find()`
- `.min()`, `.max()`, `.min_by()`, `.max_by()`, `.min_by_key()`, `.max_by_key()`
- `.sum()`, `.product()`

and these from `Itertools`:

- `.cartesian_product()`
- `.unique()`, `.unique_by()`
- `.combinations()`
- `.all_equal()`
- `.foreach()`
- `.fold_results()`, `.fold_options()`, `.fold1()`, `.tree_fold()`, `.fold_while()`
- `.sorted()`, `.sorted_by()`, `.sorted_by_key()`
- `.partition_map()`
- `.into_group_map()`
- `.minmax()`, `.minmax_by_key()`, `.minmax_by()`

We have already implemented some of these "arbitrary order" adapters as individual methods on `ArrayBase`, including `.fold()`, `.scalar_sum()`, and `.visit()`. However, it doesn't make sense to create separate methods for all of the possible iterator adapters.

As a result, I'd like to add "arbitrary order" `.iter()`, `.iter_mut()`, `.indexed_iter()`, and `.indexed_iter_mut()` methods designed to iterate in the fastest possible order so that we can hopefully get good performance with iterator adapters.

What does everyone think these "arbitrary order" iterators should be named? I've thought of `.iter_arbitrary()` and `.iter_unordered()`, but those names seem somewhat unclear and unnecessarily verbose.
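To make the motivating case concrete, a sketch against the existing `ndarray` API (array shape and values chosen arbitrarily): on a transposed view, `.iter()` must follow logical order and therefore strides through memory, while an order-insensitive reduction like a sum would give the same result walking the elements in memory order.

```rust
use ndarray::Array2;

fn main() {
    // Integer elements so the two summation orders agree exactly.
    let a = Array2::<u64>::from_shape_fn((1000, 1000), |(i, j)| (i + j) as u64);
    let t = a.t(); // transposed view: logical order != memory order

    // Logical-order iteration over the view: strided memory access.
    let by_logical_order: u64 = t.iter().sum();

    // The same reduction over the untransposed array walks memory in order;
    // an "arbitrary order" iterator would be free to pick this kind of
    // traversal automatically.
    let by_memory_order: u64 = a.iter().sum();

    assert_eq!(by_logical_order, by_memory_order);
}
```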