Faster iterator for arbitrary order #469

Open · jturner314 opened this issue Jun 26, 2018 · 15 comments

@jturner314 (Member)

.iter() provides an iterator over all the elements, but it always iterates in logical order, which may be slow depending on the memory layout of the array. In some cases, however, the order of iteration doesn't matter. Recent issues regarding these types of cases include #466 and #468. Examples of methods where order doesn't matter include the most common uses of these from the Iterator trait

  • .fold()
  • .for_each()
  • .all() and .any()
  • .find()
  • .min(), .max(), .min_by(), .max_by(), .min_by_key(), .max_by_key()
  • .sum(), .product()

and these from Itertools

  • .cartesian_product()
  • .unique(), .unique_by()
  • .combinations()
  • .all_equal()
  • .foreach()
  • .fold_results(), .fold_options(), .fold1(), .tree_fold1(), .fold_while()
  • .sorted(), .sorted_by(), .sorted_by_key()
  • .partition_map()
  • .into_group_map()
  • .minmax(), .minmax_by_key(), .minmax_by()

We have already implemented some of these "arbitrary order" adapters as individual methods on ArrayBase, including .fold(), .scalar_sum(), and .visit(). However, it doesn't make sense to create separate methods for all of the possible iterator adapters.

As a result, I'd like to add "arbitrary order" .iter(), .iter_mut(), .indexed_iter(), and .indexed_iter_mut() methods designed to iterate in the fastest possible order so that we can hopefully get good performance with iterator adapters.
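For example, a plain sum over a transposed view already pays for logical order. A minimal sketch (iter_arbitrary is a hypothetical name, not existing API):

use ndarray::Array2;

fn main() {
    let a = Array2::<f64>::ones((1000, 1000));
    let t = a.t(); // transposed view: logical order strides through memory

    // Logical-order iteration: each step jumps 1000 elements in memory.
    let total: f64 = t.iter().sum();

    // A hypothetical arbitrary-order iterator could walk the underlying
    // buffer contiguously instead, since addition doesn't care about order:
    // let total: f64 = t.iter_arbitrary().sum();
    assert_eq!(total, 1_000_000.0);
}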

What does everyone think these "arbitrary order" iterators should be named?

I've thought of .iter_arbitrary() and .iter_unordered(), but those names seem somewhat unclear and unnecessarily verbose.

@nilgoyette (Collaborator)

Would it be possible to simply use iter() and rename the current iter() to iter_logical()? You're at v0.11 so breaking changes are expected.

IMO, the simplest name should be used for the most frequent case. Of course, I'm not aware of all use-cases, but here we would always use the "fast" iter, not the "logical" one, as we simply don't care about the order.

Do you plan on removing the adapters (fold, scalar_sum, visit, etc.)? We can simply call iter().adapter() and be done with it, I guess.

@bluss (Member) commented Jun 27, 2018

I think this is a good idea @jturner314, but I maintain the opinion that the Iterator API is not a good way to deliver good performance for ndarray traversal, at least not for algorithms that end up going through the Iterator::next method. (We already have some special-case fold implementations that maintain the order but still avoid the overhead of next.)

@jturner314 (Member, Author) commented Jun 27, 2018

Comments on arbitrary order iterator

> Would it be possible to simply use iter() and rename the current iter() to iter_logical()? You're at v0.11 so breaking changes are expected.

I'd prefer to avoid doing this for a few reasons:

  1. Changing the behavior of .iter() could easily break someone's existing code in subtle, hard-to-detect ways. Breaking changes are fine, but I prefer breaking changes that make it obvious that the user's code is broken (by causing compiler warnings or errors).

  2. If we changed .iter() to use arbitrary order, and someone isn't paying close attention, it would be easy to write new code that is wrong in subtle ways. For example, my guess is that most people write unit tests that use row-major arrays (simply because row-major arrays are the easiest to create in ndarray). As a result, it would be easy to implicitly depend on the order of iteration without realizing it or even detecting the issue in tests.

  3. The typical behavior of methods named .iter() for Rust collections is that they operate in a predictable order.

For these reasons, I'd like for .iter() to follow a predictable order and for the names of "arbitrary order" iterators to clearly indicate that they have arbitrary order. I'd like to cause users to make a conscious choice to use an arbitrary-order iterator.

> Do you plan on removing the adapters (fold, scalar_sum, visit, etc.)?

I don't think so. These are sufficiently useful and commonly-used that I think we should keep them. The .iter_any_order().adapter() combination is also quite a bit more verbose.

> I think this is a good idea, but I maintain the opinion that the Iterator API is not a good way to deliver good performance for ndarray traversal, at least not for algorithms that end up going through the Iterator::next method

That's a good point. I forgot about the issues with Iterator::next.

I just realized something else: what type do we return from an arbitrary-order iterator? It would be nice to use ::std::slice::Iter for contiguous arrays, but we'd need a separate iterator for arbitrary layouts. As a result, we'd need to return an enum with multiple variants. This would hurt performance because .next() would have to check the enum variant, then call .next() on the inner iterator for that variant. The only way to avoid this is to provide specialized implementations for all of the iterator methods that check the iterator variant only once. That still doesn't help with other iterator adapters, though, such as those from Itertools.

Edit: We might be able to use ndarray::iter::Iter<'a, A, IxDyn> as a common iterator type to avoid the enum, and for contiguous arrays use a shape of IxDyn(&[len]) and strides of IxDyn(&[1]).
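A rough sketch of that enum-dispatch problem (the names here are hypothetical, not existing API):

enum AnyOrderIter<'a, A> {
    // Contiguous arrays could use the fast slice iterator...
    Contiguous(std::slice::Iter<'a, A>),
    // ...while arbitrary layouts need the general strided traversal.
    Strided(ndarray::iter::Iter<'a, A, ndarray::IxDyn>),
}

impl<'a, A> Iterator for AnyOrderIter<'a, A> {
    type Item = &'a A;
    fn next(&mut self) -> Option<&'a A> {
        // Naively, every call pays for this variant check before the inner
        // next(); specialized fold/for_each could check it once up front.
        match self {
            AnyOrderIter::Contiguous(it) => it.next(),
            AnyOrderIter::Strided(it) => it.next(),
        }
    }
}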

Alternative idea

So, after this realization, I have a different proposal. Instead of providing arbitrary-order iterators, provide the following methods on ArrayBase:

  • .fold_while()
  • .fold_while_mut()
  • .fold_mut()
  • .visit_mut()

These are sufficient to provide the functionality of most of the iterator adapters I listed in my first comment.
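For example, .any() reduces to a short-circuiting fold. A minimal sketch built on the existing Zip::fold_while and FoldWhile (the free function itself is just an illustration):

use ndarray::{ArrayBase, Data, Dimension, FoldWhile, Zip};

// Illustration: `any` in terms of a short-circuiting fold over one array.
fn any<A, S, D>(arr: &ArrayBase<S, D>, mut pred: impl FnMut(&A) -> bool) -> bool
where
    S: Data<Elem = A>,
    D: Dimension,
{
    Zip::from(arr)
        .fold_while(false, |_, x| {
            if pred(x) {
                FoldWhile::Done(true) // short-circuit on the first match
            } else {
                FoldWhile::Continue(false)
            }
        })
        .into_inner()
}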

We could also add

  • .fold_indexed()
  • .fold_while_indexed()
  • .fold_indexed_mut()
  • .fold_while_indexed_mut()

but I'm not sure if we could outperform .indexed_iter() by much.

@bluss What do you think of this alternative idea?

By the way, my original motivation for this issue was "It would be nice to have any and all methods for #468." => "Okay, we don't have those, but is there a good workaround?" => "You could use .fold(), but that doesn't short-circuit, or you could use .iter(), but that's slow. What about a .fold_while() method?" => "Wait, this problem occurs more generally; what about an 'arbitrary order' iterator?" On further reflection, it may make sense just to go back to the .fold_while() idea.

@bluss (Member) commented Jun 28, 2018

Zip already covers all those fold while combinations IIUC. It's far from perfect of course.

For arbitrary order iterators, yeah, I'd be most concerned about the a.arbitrary_iter().zip(b.arbitrary_iter()) construction backfiring, pairing items in an unexpected way.

The current Iter is already wrapping an enum, and in one of the two cases it's a slice iterator. For many constructions, the enum check is lifted out of the loop! (The inversion turns it into a conditional with a loop in each branch, which is code bloat!)
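A minimal illustration of that lifting, with hypothetical types (the real ElementsRepr differs):

enum Repr<'a> {
    Slice(std::slice::Iter<'a, f64>),
    Strided(std::iter::StepBy<std::slice::Iter<'a, f64>>),
}

// After the optimizer lifts the variant check out of the fold, there is one
// tight loop per variant instead of a check per element (loop unswitching).
fn sum_unswitched(repr: Repr<'_>) -> f64 {
    match repr {
        Repr::Slice(it) => it.sum(),
        Repr::Strided(it) => it.sum(),
    }
}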

@bluss (Member) commented Jun 28, 2018

Indexed Zip does outperform indexed iter by something, at least.
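For reference, the indexed Zip form looks like this (using the current method name for_each; older ndarray versions call it apply):

use ndarray::{Array2, Zip};

fn main() {
    let a = Array2::<f64>::ones((4, 5));
    let mut count = 0;
    // Zip drives the loop itself instead of going through Iterator::next,
    // which is where the win over .indexed_iter() comes from.
    Zip::indexed(&a).for_each(|(i, j), &x| {
        assert_eq!(x, a[(i, j)]);
        count += 1;
    });
    assert_eq!(count, 20);
}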

@bluss (Member) commented Jun 28, 2018

So the current Iter is a good model for arbitrary-order iterators? Just initialize it differently... Just an idea.

Edit: This is kind of the low-tech way, but it's not good enough for arrays that are not contiguous.

@jturner314 (Member, Author)

> Zip already covers all those fold while combinations IIUC. It's far from perfect of course.

Yes, that's true. I generally don't think of Zip for iterating over single arrays, but I probably should. I think it would still be worth adding a .fold_while() method to ArrayBase just for convenience. The other methods I suggested probably aren't common enough to warrant their own method on ArrayBase since it's fairly straightforward to accomplish the same thing with Zip.
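For reference, the existing Zip form of such a fold (the proposed array method would just shorten this):

use ndarray::{array, FoldWhile, Zip};

fn main() {
    let a = array![[1, 2], [3, 4]];
    // Short-circuiting search for the first element > 2, in producer order.
    let found = Zip::from(&a)
        .fold_while(None, |acc, &x| {
            if x > 2 {
                FoldWhile::Done(Some(x))
            } else {
                FoldWhile::Continue(acc)
            }
        })
        .into_inner();
    assert_eq!(found, Some(3));
}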

> For arbitrary order iterators, yeah, I'd be most concerned about the a.arbitrary_iter().zip(b.arbitrary_iter()) construction backfiring, pairing items in an unexpected way.

I'm not worried about that as long as .arbitrary_iter() is clearly named. People should be using a.iter().zip(b.iter()) or Zip for zipping arrays.

> The current Iter is already wrapping an enum, and in one of the two cases it's a slice iterator.

I didn't realize that. That makes the most straightforward "arbitrary order" implementation quite easy, then: just use the ElementsRepr::Slice variant in more cases (contiguous arrays with arbitrary layout) and rearrange the shape and strides in the ElementsBase iterator (the ElementsRepr::Counted variant) for speed (like fold does by putting the narrowest axis at the end).
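For instance, a contiguous column-major array already reduces to the slice case after reversing the axes. A quick check (sketch):

use ndarray::{Array2, ShapeBuilder};

fn main() {
    let f = Array2::<f64>::zeros((3, 4).f()); // column-major but contiguous
    // Reversing the axes yields a standard-layout view of the same buffer,
    // so the slice fast path applies:
    assert!(f.t().is_standard_layout());
}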

> For many constructions, the enum check is lifted out of the loop!

That's really surprising and impressive to me.

> So the current Iter is a good model for arbitrary-order iterators? Just initialize it differently... Just an idea.
>
> Edit: This is kind of the low-tech way, but it's not good enough for arrays that are not contiguous.

Yes, I think that would work.

The cases I can think of where you could improve over Iter are where there are multiple contiguous sections. For example:

let a = Array3::<i32>::ones((3, 2, 2));
let v = a.slice(s![..;2, .., ..]);
// v has shape=[2, 2, 2], strides=[8, 2, 1], layout=Custom (0x0)

Here v is split into two contiguous pieces, so it may be advantageous to use IxDyn with axes 1 and 2 combined.

This could be accomplished by adding a third iterator variant CountedDyn(D) where D = ElementsBase<'a, A, IxDyn>.

@bluss (Member) commented Nov 18, 2018

Instead of using IxDyn, isn't it possible to shuffle the axes around and merge axes that are contiguous with each other? Set the remaining axes to length 1.

But it mostly seems to be a benefit if you can merge into the axis that becomes the core of the innermost loop. But in some cases the "innermost" would be an axis of length 1, and then it's a simple win to move the next axis into its place.

I also want to do this for Zip. Starting simple, for example merging the next-to-inner axis with the innermost if we can. (In Zip, the "innermost" axis of the loop is the (n - 1)-th axis.)
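Concretely, the usual merge rule is that a take axis folds into an into axis when stride_take == len_into * stride_into, after which the take axis can be parked at length 1. A sketch of the condition (not ndarray's exact code):

// E.g. a C-order 3x4 array: axis 0 has stride 4; axis 1 has length 4 and
// stride 1. Since 4 == 4 * 1, the whole array iterates as one merged axis
// of length 12 with stride 1.
fn mergeable(len_into: isize, stride_take: isize, stride_into: isize) -> bool {
    stride_take == len_into * stride_into
}

fn main() {
    assert!(mergeable(4, 4, 1));  // the C-order 3x4 example above
    assert!(!mergeable(3, 4, 1)); // strides no longer line up
}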

@jturner314 (Member, Author)

> Instead of using IxDyn, isn't it possible to shuffle the axes around and merge axes that are contiguous with each other? Set the remaining axes to length 1.

I've been working on a project to generalize NdProducer/Zip to handle operations like .map()/.fold_axes()/.collect()/etc., and that's the approach that I found worked the best. I have a function like this:

/// Optimizes the producer, possibly changing the order, and adjusts `axes`
/// into good iteration order (assuming the last index moves the fastest).
///
/// This function may change the shape of the producer and the order of
/// iteration. Optimization is performed only on the given `axes`; all other
/// axes are left unchanged.
///
/// When choosing axes to attempt merging, it only tries merging axes when the
/// absolute stride of the `take` axes is >= the absolute stride of the `into`
/// axis.
///
/// The suggested iteration order is in order of descending absolute stride
/// (except for axes of length <= 1, which are positioned as outer axes). This
/// isn't necessarily the optimal iteration order, but it should be a
/// reasonable heuristic in most cases.
///
/// **Panics** if any of the axes in `axes` are out of bounds or if an axis is
/// repeated more than once.
pub fn optimize_any_ord_axes<T, D>(producer: &mut T, axes: &mut D)
where
    T: NdReshape + ?Sized,
    D: Dimension,
{
    // ...
}

optimize_any_ord_axes determines which axes can be merged, merges as many axes as possible (possibly inverting axes if necessary), and adjusts axes into the best order for iteration. It uses a heuristic of trying to merge into axes with a smaller "approximate absolute stride", so the axes with the shortest stride end up as long as possible and are the inner loops of the iteration.

For Zip, I've implemented the approx. abs. stride of each axis as the sum of the corresponding approx. abs. strides of the two producers being zipped:

impl<A, B> NdReshape for Zip<A, B>
where
    A: NdReshape,
    B: NdReshape<Dim = A::Dim>,
{
    // ...
    fn approx_abs_strides(&self) -> Self::Dim {
        let a = self.a.approx_abs_strides();
        let b = self.b.approx_abs_strides();
        a + b
    }
    // ...
}

For FoldAxes (an adapter that folds over a subset of the axes), I've implemented the approx. abs. stride of each axis as the sum of the corresponding stride of the "init" producer and the strides of the axes being folded into the axis:

impl<P, Df, I, F> NdReshape for FoldAxesProducer<P, Df, I, F>
where
    P: NdReshape,
    Df: Dimension,
    P::Dim: SubDim<Df, Out = I::Dim>,
    I: NdReshape,
{
    // ...
    fn approx_abs_strides(&self) -> Self::Dim {
        let mut strides = self.init.approx_abs_strides();
        let inner_strides = self.inner.approx_abs_strides();
        for (ax, s) in strides.slice_mut().iter_mut().enumerate() {
            *s += inner_strides[self.outer_to_inner[ax]];
        }
        strides
    }
    // ...
}

This approach matches or outperforms the current version of Zip in most cases I've tried except for small arrays where the gain in iteration performance doesn't outweigh the cost of optimizing (sorting/inverting/merging) the axes. I think using a simpler (less expensive) optimizer for smaller arrays would help, but I haven't gotten around to trying that yet.

@bluss (Member) commented Nov 22, 2018

That sounds really exciting, @jturner314.

I was going to say that there is always one unique shortest-stride axis, isn't there, if we count only axes of length > 1; but that's just for one producer. How do you select axes to merge with many producers?

@jturner314 (Member, Author)

There are a couple of questions when dealing with multiple producers. (1) What is the best order to iterate over the axes? (2) How do we decide what axes to merge/invert?

Answering the first question is not easy because there is potentially a tradeoff between trying to pick short strides for the inner loop(s) (to use the processor's cache effectively) and trying to make the inner loop(s) as long as possible (to help with branch prediction and because determining the next index / pointer location after reaching the end of the axis is relatively expensive compared to just incrementing a pointer). For example, consider a 100×2 row-major array without merging the axes. Performing the inner loop over axis 0 has a longer stride (worse caching) but makes the inner loop longer (better branch prediction and cheaper on average to calculate the next pointer), while performing the inner loop over axis 1 has a shorter stride (better caching) but makes the inner loop very short (worse branch prediction and more expensive to calculate the next pointer on average).
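A sketch of the two loop orders for that 100×2 case (illustrative, using a raw Vec instead of ndarray):

fn main() {
    let (rows, cols) = (100usize, 2usize);
    let data = vec![1.0f64; rows * cols]; // row-major: strides [2, 1]

    // Inner loop over axis 1: stride-1 access (good locality), but only 2
    // iterations per inner loop, so the loop restarts 100 times.
    let mut s1 = 0.0;
    for i in 0..rows {
        for j in 0..cols {
            s1 += data[i * cols + j];
        }
    }

    // Inner loop over axis 0: stride-2 access (worse locality), but 100
    // iterations per inner loop, so it restarts only twice.
    let mut s2 = 0.0;
    for j in 0..cols {
        for i in 0..rows {
            s2 += data[i * cols + j];
        }
    }
    assert_eq!(s1, s2);
}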

Initially, I tried to develop a cost model, but that was too complicated, so I just settled on the heuristic of sorting the axes by descending absolute stride (the last axis is the inner loop), where adapters that combine producers (like Zip) simply add the absolute strides of their inner producers. I think this is a reasonable strategy because it's pretty fast to do and seems to work well in most cases. I'd like to do some benchmarking to see what axis order is actually the best in different cases to maybe improve the heuristic, but I haven't gotten around to it yet.

The second question is simpler to answer. The goal is to merge as many axes as possible. It might seem like it's necessary to try merging every axis into every other axis, but it turns out that's not necessary. Let's ignore axes of length 0 or 1. Then, axes can only be merged if the absolute value of the take stride is larger than the absolute value of the into stride, t > i. If we have multiple producers, that's also true for the sum of the strides: (t₁ > i₁ and t₂ > i₂) implies (t₁ + t₂ > i₁ + i₂). So, by making Zip sum the absolute strides of the inner producers, we only need to try merging axes in order of decreasing stride.

Note, however, that the axes that can be merged are not necessarily consecutive in the sorted order. For example, consider shape = [10, 2, 2], strides of first producer = [1, 20, 10], strides of second producer = [15, 2, 1]. The sum of strides is [16, 22, 11]. Sorted by descending stride, this gives axes 1, 0, 2. Axis 1 can be merged into axis 2, but that wouldn't be obvious if we just looked at consecutive axes in the sorted order (1 into 0 or 0 into 2). So, we need to try merging each axis with all axes with larger absolute stride. Fortunately, we usually don't need to try all these cases because once a take axis is merged and its new length is 1, it can be ignored.

The algorithm is O(n²) where n is the number of axes, but n is usually pretty small. The implementation needs some cleanup (and I think I can replace the custom roll function with rotate_right from std) but should express the general idea:

/// Executes `optimize_any_ord_axes` for all axes and returns the suggested
/// iteration order.
pub fn optimize_any_ord<T>(producer: &mut T) -> T::Dim
where
    T: NdReshape + ?Sized,
{
    let mut axes = T::Dim::zeros(producer.ndim());
    for (i, ax) in axes.slice_mut().iter_mut().enumerate() {
        *ax = i;
    }
    unsafe { optimize_any_ord_axes_unchecked(producer, &mut axes) };
    axes
}

/// Optimizes the producer, possibly changing the order, and adjusts `axes`
/// into good iteration order (assuming the last index moves the fastest).
///
/// This function may change the shape of the producer and the order of
/// iteration. Optimization is performed only on the given `axes`; all other
/// axes are left unchanged.
///
/// When choosing axes to attempt merging, it only tries merging axes when the
/// absolute stride of the `take` axes is >= the absolute stride of the `into`
/// axis.
///
/// The suggested iteration order is in order of descending absolute stride
/// (except for axes of length <= 1, which are positioned as outer axes). This
/// isn't necessarily the optimal iteration order, but it should be a
/// reasonable heuristic in most cases.
///
/// **Panics** if any of the axes in `axes` are out of bounds or if an axis is
/// repeated more than once.
pub fn optimize_any_ord_axes<T, D>(producer: &mut T, axes: &mut D)
where
    T: NdReshape + ?Sized,
    D: Dimension,
{
    assert_valid_unique_axes::<T::Dim>(producer.ndim(), axes.slice());
    unsafe { optimize_any_ord_axes_unchecked(producer, axes) }
}

/// `unsafe` because `axes` are not checked to ensure that they're in-bounds
/// and not repeated.
unsafe fn optimize_any_ord_axes_unchecked<T, D>(producer: &mut T, axes: &mut D)
where
    T: NdReshape + ?Sized,
    D: Dimension,
{
    if axes.ndim() == 0 {
        return;
    }

    // TODO: Should there be a minimum producer size for the more advanced (and
    // costly) optimizations?

    // Determine initial order of axes. Sort axes by descending absolute stride
    // (except for axes with length <= 1, which are moved to the left).
    {
        let shape = producer.shape();
        let abs_strides = producer.approx_abs_strides();
        axes.slice_mut().sort_unstable_by(|&a, &b| {
            if shape[a] <= 1 || shape[b] <= 1 {
                shape[a].cmp(&shape[b])
            } else {
                abs_strides[b].cmp(&abs_strides[a])
            }
        });
    }

    // Merge as many axes with lengths > 1 as possible and move `take` axes
    // (which now have length <= 1) to the left.
    if let Some(mut rest) = axes
        .slice()
        .iter()
        .enumerate()
        .find(|(_, &ax)| producer.len_of(Axis(ax)) > 1)
        .map(|(i, _)| i)
    {
        let mut i = axes.ndim() - 1;
        while i > rest {
            let mut t_inc = i;
            while t_inc > rest {
                let take = Axis(axes[t_inc - 1]);
                let into = Axis(axes[i]);
                match producer.can_merge_axes(take, into) {
                    CanMerge::IfUnchanged | CanMerge::IfEither => {
                        producer.merge_axes(take, into);
                        // TODO: Would it be better to delay reordering until the end, and then partition/sort?
                        roll(&mut axes.slice_mut()[rest..t_inc], 1);
                        rest += 1;
                    }
                    CanMerge::IfInverted => {
                        producer.invert_axis(take);
                        producer.merge_axes(take, into);
                        roll(&mut axes.slice_mut()[rest..t_inc], 1);
                        rest += 1;
                    }
                    CanMerge::Never => {
                        t_inc -= 1;
                    }
                }
            }
            i -= 1;
        }
    }
}

The producer may be a producer such as Zip that combines multiple producers. Note that a .can_merge_axes() method helps with things like Zip because it can check if the axes can be merged in both inner producers before actually merging them. (Otherwise, if Zip just called .merge_axes() on both inner producers without knowing if both could be merged or not, only one may get merged.) It can also identify cases where inverting an axis will allow the axes to be merged.
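A sketch of how Zip might combine the answers from its two inner producers (a hypothetical combination rule, reading the CanMerge variants from the snippet above):

#[derive(Clone, Copy)]
enum CanMerge {
    Never,
    IfUnchanged, // mergeable only without inverting the take axis
    IfInverted,  // mergeable only after inverting the take axis
    IfEither,    // mergeable either way
}

// Zip can merge two axes only if both inner producers agree, and an
// inversion is only usable if it works for both sides.
fn combine(a: CanMerge, b: CanMerge) -> CanMerge {
    use CanMerge::*;
    match (a, b) {
        (Never, _) | (_, Never) => Never,
        (IfEither, x) | (x, IfEither) => x,
        (IfUnchanged, IfUnchanged) => IfUnchanged,
        (IfInverted, IfInverted) => IfInverted,
        _ => Never, // one side needs inversion, the other forbids it
    }
}

fn main() {
    use CanMerge::*;
    assert!(matches!(combine(IfUnchanged, IfEither), IfUnchanged));
    assert!(matches!(combine(IfUnchanged, IfInverted), Never));
}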

@bluss (Member) commented Nov 23, 2018

With Zip, this optimization would run only if the layout check finds that the producers don't all share the same C/F layout, right?

With things like roll vs. std rotate, I think of code size. It's not certain that a big blob of SIMD-ified rotating code is what we need... dimensions are short, and we will never prioritize the cases where ndim is very large.

See the equality testing of IxDyn; it's written to have smaller code size and avoid slice's PartialEq implementation.

@jturner314 (Member, Author)

> With Zip, this optimization would run only if the layout check finds that the producers don't all share the same C/F layout, right?

Sure, that would work. Alternatively, I think it's possible to improve the implementation of optimize_any_ord_axes_unchecked so that it operates in linear time for c/f layout arrays and hopefully wouldn't cost much more than checking if the arrays have c/f layout. (sort_unstable_by operates in linear time if the list is already sorted or in reverse order, and if axes are merged left-to-right instead of right-to-left, no rolling is necessary in the c/f case.)

> With things like roll vs. std rotate, I think of code size. It's not certain that a big blob of SIMD-ified rotating code is what we need... dimensions are short, and we will never prioritize the cases where ndim is very large.

I'm not sure what you mean. I don't see any SIMD instructions in std's rotate methods. My current roll implementation is this:

/// Rolls the slice by the given shift.
///
/// Rolling is like a shift, except that elements shifted off the end are moved
/// to the other end. Rolling is performed in the direction of `shift`
/// (positive for right, negative for left).
fn roll<T>(slice: &mut [T], mut shift: isize) {
    let len = slice.len();
    if len == 0 {
        return;
    }

    // Minimize the absolute shift.
    shift = shift % len as isize;
    if shift > len as isize / 2 {
        shift -= len as isize;
    } else if shift < -(len as isize) / 2 {
        shift += len as isize;
    }

    // Perform the roll.
    if shift >= 0 {
        for _ in 0..shift {
            for i in 0..(len - 1) {
                slice.swap(i, len - 1);
            }
        }
    } else {
        for _ in 0..(-shift) {
            for i in (1..len).rev() {
                slice.swap(i, 0);
            }
        }
    }
}
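For reference, roll agrees with slice::rotate_right:

fn main() {
    let mut a = [1, 2, 3, 4, 5];
    roll(&mut a, 2);
    assert_eq!(a, [4, 5, 1, 2, 3]);

    // The std equivalent mentioned above:
    let mut b = [1, 2, 3, 4, 5];
    b.rotate_right(2);
    assert_eq!(b, [4, 5, 1, 2, 3]);
}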

This is a fairly large function, but if it's not inlined, that should be fine, right?

It's worth noting that we could specialize the optimize_/roll implementations for specific ndim if desired. (For example, 1-D and 2-D producers would be much simpler.)

> See the equality testing of IxDyn; it's written to have smaller code size and avoid slice's PartialEq implementation.

Are you referring to this? Why is it implemented like that instead of using deref and equality on slices, like the following?

impl<T: PartialEq> PartialEq for IxDynRepr<T> {
    fn eq(&self, rhs: &Self) -> bool {
        self.deref() == rhs.deref()
    }
}

@bluss (Member) commented Nov 27, 2018

About rotate, that's what I know of its implementation, but I'll check later.

About dim eq: that's what I wanted to call attention to. It's implemented that way to avoid slice's PartialEq for code size and memcmp overhead. FWIW, the difference in performance was tested at the time.

@jturner314 (Member, Author)

Oh, I understand now! You're talking about the amount of code for a given ndim, especially for arrays with small ndim. (Specializing on dimension types or IxDynRepr::Inline helps with this.) I was confused because I thought you were talking about the total amount of code in the final binary. (In this case, specializing on ndim/IxDynRepr::Inline should be avoided.)
