Skip to content

Conversation

@Jutho
Copy link
Contributor

@Jutho Jutho commented Feb 26, 2015

Use stagedfunction definition of sub2ind and ind2sub to generate specialized code for every N in dims::NTuple{N,Integer}. See #10331 for related discussion and performance check. I can do more performance benchmarking if somebody can point me to a good test suite that depends heavily on the use of those 2 functions.

@ViralBShah
Copy link
Member

I wonder if there are better names for these functions, that can be significantly better than the matlab names.

@ViralBShah
Copy link
Member

BTW, what are the performance gains on some examples? I do not have a code handy that uses these a lot.

@ViralBShah
Copy link
Member

Sparse indexing and various find* routines on sparse matrices use these functions. Perhaps this helps some sparse perf benchmarks?

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

I also don't have a set of benchmark examples and the problem is that they get optimized away in a simple for loop that just tries to call this functions a number of times with fixed argument. Ah but maybe I can avoid that behavior by letting the arguments of the function call depend on the iteration variable. I'll report back soon.

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

With the following test code (only for the hard case: ind2sub)

function test(N)
    d1=N
    @time for i=1:N;s1=ind2sub((d1,),i);end
    d2=ceil(Int,N^(1/2))
    @time for i=1:N;s2=ind2sub((d2,d2),i);end
    d3=ceil(Int,N^(1/3))
    @time for i=1:N;s3=ind2sub((d3,d3,d3),i);end
    d4=ceil(Int,N^(1/4))
    @time for i=1:N;s4=ind2sub((d4,d4,d4,d4),i);end
    d5=ceil(Int,N^(1/5))
    @time for i=1:N;s5=ind2sub((d5,d5,d5,d5,d5),i);end
end

I obtain on master:

elapsed time: 3.38e-7 seconds (0 bytes allocated)
elapsed time: 0.001355097 seconds (0 bytes allocated)
elapsed time: 0.02040562 seconds (0 bytes allocated)
elapsed time: 0.290835901 seconds (228 MB allocated, 4.73% gc time in 10 pauses with 0 full sweep)
elapsed time: 0.382516052 seconds (305 MB allocated, 4.53% gc time in 14 pauses with 0 full sweep)

and with this PR:

elapsed time: 4.26e-7 seconds (0 bytes allocated)
elapsed time: 0.001568929 seconds (0 bytes allocated)
elapsed time: 0.043808195 seconds (0 bytes allocated)
elapsed time: 0.059630003 seconds (0 bytes allocated)
elapsed time: 0.082583044 seconds (0 bytes allocated)

Clearly, the 1D case is just optimized away (the time does not depend on N) both on master and with this PR. In this PR, the code for the 1d case is just ind-=1;return (ind+1,). I guess the -1 +1 will always be optimized away, or will never have any significant effect`.

The 2d case is roughly the same or slightly slower. The 3d case seems a factor 2 slower with this PR, this I will need to fix, either by a specialized version or a more clever formulation of the stagedfunction. I could also add a specialized version for the 1d case to get rid of the -1+1 and anyway noticed that I need to fix two ambiguity warnings in the build process.

For higher dimensions, this PR is significantly faster.

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

Not sure what was wrong before, but the new timings with the latest commit are:

elapsed time: 2.81e-7 seconds (0 bytes allocated)
elapsed time: 0.001192448 seconds (0 bytes allocated)
elapsed time: 0.010499385 seconds (0 bytes allocated)
elapsed time: 0.019768971 seconds (0 bytes allocated)
elapsed time: 0.031037068 seconds (0 bytes allocated)

which beat the timings on master for every case, and even more significantly as before in the higher-dimensional case.

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

grepping ind2sub in julia/base/*.jl only returns function definitions, so it seems like this is not used anywhere in find. There are indeed some occurrences in julia/base/sparse/sparsematrix.jl but these are all the two-dimensional case, so I don't think there will be any performance effect.

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

The improved timings seem to come from the fact that I now explicitly forced ind2sub to be inlined. I forgot that I added that line, but it seems to have a significant effect (for this particular test function). Not sure if it is always a good idea to inline this, although the generated code is pretty short ( 2N+1 lines for N dimensions).

@JeffBezanson
Copy link
Member

We really have to avoid using staged functions for too many things. They make it basically impossible to statically compile programs. Staged functions become totally useless with the JIT off. For simpler functions like this, I think we should try to do compiler improvements and rewrite them in other ways.

@kmsquire
Copy link
Member

For the purpose of statically compiling, wouldn't it be possible to keep track of what functions were generated by staged functions, e.g., during a particular run of code, and allowing the user to precompile those functions? Obviously, this wouldn't work if the final code took a different code path (although even here, an appropriate error could be issued), but a 90% solution could still be quite useful.

@JeffBezanson
Copy link
Member

Obviously, this wouldn't work if the final code took a different code path

Exactly :)

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

We really have to avoid using staged functions for too many things. They make it basically impossible to statically compile programs. Staged functions become totally useless with the JIT off. For simpler functions like this, I think we should try to do compiler improvements and rewrite them in other ways.

I didn't know this. I am happy with closing this PR, or trying to see whether I can improve the ind2sub/sub2ind code without stagedfunctions, but I don't really know what the general conclusion is. What should be the general rule when one can use stagedfunction? Would it help if e.g. the most used cases (e.g. 1 and 2 dimensions, possibly up to 3 or 4) are regular method definitions and then the higher ones are a stagedfunction or is this also not ok.

More generally, to avoid having to write stagedfunction definitions in the case where I use them often, it would help if there would be the following functionality:

  • an ndims(a::AbstractArray) method that returns a result where the dimensionality is encoded as a type parameter, e.g. something like Val{N} but then maybe a dedicated type (e.g. Dim{N}?) that supports some basic arithmetic (min, max, +, -, ...). Of course, the only way I would know how to implement this currently would be with stagedfunction.
  • a function to create a tuple where the length and the type of values can be inferred, e.g. something like ntuple(::Dim{N}(), sometypedfunction).

Is this worth discussing in a separate issue? I guess @timholy will have some more interesting request to add to the list.

@JeffBezanson
Copy link
Member

Yes, it would be great to develop more tricks for writing code like this without the full sledgehammer of staged functions, e.g. #5560.

Generally what works now is "unary" encodings of numbers in tuple lengths, e.g. https://gist.github.com/JeffBezanson/10612461.

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

Thanks. I think I once saw that Gist once before, but had forgotten about it. This is neat.

I'd still like some specific advice on how to proceed with this PR. I still find it strange that these functions (ind2sub/sub2ind), which have hardly any use within Base, should not be stagedfunctions, whereas several of the more basic functions for multidimensional arrays that have recently been rewritten all use or depend on stagedfunction.

@tknopp
Copy link
Contributor

tknopp commented Feb 26, 2015

I use ind2sub in https://github.com/tknopp/NFFT.jl for implementing a slow reference implementation for the NDFT. There is also a test file.
When I wrote this, there was no 1D fallback for ind2sub. Not sure if this is included in this PR.

@Jutho
Copy link
Contributor Author

Jutho commented Feb 26, 2015

Ok, using tail on the dims argument I was able to write non-staged versions of these functions which improve on the existing definitions. With the test function above (but also including an N=6 case), master yields

test(1000000)
elapsed time: 3.65e-7 seconds (0 bytes allocated)
elapsed time: 0.001170862 seconds (0 bytes allocated)
elapsed time: 0.019838434 seconds (0 bytes allocated)
elapsed time: 0.294538723 seconds (228 MB allocated, 4.08% gc time in 10 pauses with 0 full sweep)
elapsed time: 0.372374532 seconds (305 MB allocated, 3.53% gc time in 14 pauses with 0 full sweep)
elapsed time: 0.460680693 seconds (389 MB allocated, 5.09% gc time in 18 pauses with 0 full sweep)

and this latest PR yields:

julia> test(1000000)
elapsed time: 2.8e-7 seconds (0 bytes allocated)
elapsed time: 0.001197907 seconds (0 bytes allocated)
elapsed time: 0.010452432 seconds (0 bytes allocated)
elapsed time: 0.060237796 seconds (0 bytes allocated)
elapsed time: 0.157890394 seconds (122 MB allocated, 2.92% gc time in 6 pauses with 0 full sweep)
elapsed time: 0.24761047 seconds (205 MB allocated, 5.06% gc time in 9 pauses with 0 full sweep)

So this PR is clearly faster, but has allocation overhead as soon as N>4, which will hopefully go away with better compilation. In particular, there is already a significant difference, since unlike with the definitions in master, the return type is correctly inferred also for N>= 4, and so this allocation overhead is not due to the type not being known.

@Jutho Jutho changed the title staged ind2sub/sub2ind improved ind2sub/sub2ind Feb 26, 2015
@timholy
Copy link
Member

timholy commented Feb 27, 2015

@JeffBezanson, to clarify: a statically-compiled program would still have type inference->native code generation, but not parsing/lowering? Otherwise I don't understand the distinction you're making, and would like to know more.

@timholy
Copy link
Member

timholy commented Feb 27, 2015

(The point being that

Obviously, this wouldn't work if the final code took a different code path

is relevant even for generic functions that have not yet been JITed.)

@Jutho
Copy link
Contributor Author

Jutho commented Feb 27, 2015

is relevant even for generic functions that have not yet been JITed.)

I was wondering about the same thing but didn't dare to ask as I really don't know how these things work deep down.

@jakebolewski
Copy link
Member

You would have to fallback on running code through an interpreter.

@JeffBezanson
Copy link
Member

That is correct. Then one problem is that we don't have an interpreter for the full language; e.g. llvm intrinsics are not supported. The other problem is that it would be extremely slow, especially since staged functions are typically used for performance.

@timholy
Copy link
Member

timholy commented Mar 4, 2015

Sorry to be dense, but let me ask for a direct answer to my real question: is JITing still available in a statically-compiled program? If not, then I still don't see any difference between staged functions and generic functions.

@timholy
Copy link
Member

timholy commented Mar 4, 2015

Oh, I think I now understand which part of my question @jakebolewski was responding to, and so my understanding is:

  • JIT is not supported in a statically-compiled program
  • One could run code with an interpreter
  • Currently interpreter support is limited

If this is right, then the concern about staged functions still seems to be a red herring: if you cache the code returned by the stagedfunction, by design it's going to be faster to run via the interpreter than whatever code the stagedfunction would have replaced---the whole point of a stagedfunction is to generate optimized julia code.

@StefanKarpinski
Copy link
Member

I believe that part of the confusion stems from there being two versions of "statically compiled": with and without JIT support. With JIT support, staged functions are no problem. Without JIT support, they are a problem. The difference between staged functions and macros in this respect is that you can know that there are no more macro expansions required, whereas it doesn't seem possible to know that you won't encounter more code generated by a staged function. For normal generic functions, you can always generate a slow, generic fallback, but for staged functions, you can't even do that because you don't yet have the code that you need to generate the slow generic fallback for.

@mbauman
Copy link
Member

mbauman commented Mar 5, 2015

For normal generic functions, you can always generate a slow, generic fallback, but for staged functions, you can't even do that because you don't yet have the code that you need to generate the slow generic fallback for.

Doesn't it just require two slow interpreted passes? One to generate the staged AST and a second to interpret it? There could be an extra optimization to cache AST instead of the JIT-ed function, of course, but I don't see why it's that different.

@Jutho
Copy link
Contributor Author

Jutho commented Mar 5, 2015

Otherwise, might it be an option to requie a non-staged fallback for every stagedfunction, which in the worst case just trows a MethodError but preferably contains a less efficient but still working alternative.

@JeffBezanson
Copy link
Member

Yes, technically an interpreter can be used (as already discussed above).

However there is another reason to discourage staged functions, which is that their type inference is basically all-or-nothing, to preserve type safety. This is a tough problem. We could, for example, store all results and check monotonicity directly, giving an error if we catch a staged function "lying". But this is pretty complicated, and also pretty hard on staged function authors. It also leaves you open to unanticipated errors far in the future.

@Jutho
Copy link
Contributor Author

Jutho commented Mar 6, 2015

For a non-specialist, what do you mean with the monotonicity argument?

And a completely different question: Even though I am enjoying the general discussion and learning a lot, I'd also like to know the specific opinion on the current PR? I believe even the non-staged version in the latest commit still provides an improvement over the original definitions, though the simple benchmark results are of course less convincing then those of the stagedfunction approach used initially. The major improvement over the original code is that the recursive definition allows to infer the correct return type of ind2sub for higher N.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 29, 2015

@ehaber99 , the topic of this PR was not to improve the vectorized version of ind2sub and its counterpart, but just the versions for a single value of the arguments. I agree that also the vectorized version needs to be improved, but it was decided a few comments above that that would be the topic of a second PR. Currently, the vectorized form just calls the single value form in combination with some map construct.

The single value methods are certainly already performing much better with the definitions in the current commit. Essential to this is that the recursive definitions of ind2sub and sub2ind are getting inlined; the latter has an explicit @_inline_meta() call, the former did not. Remaining issues:

  • With the current commit, the definition of ind2sub does not seem to get inlined when called from within a @generated function. As discussed in my previous post, this seems to be fixed by also adding an explicit @_inline_meta() construct in that definition. Maybe inline_worthy is not called when compiling the expression body of a @generated function, but that's just a wild guess? Edit: I am no longer able to reproduce this. Not sure what was happening before.
  • The current definitions of the single argument ind2sub and sub2ind seem to combine less well with the map construction currently used in the definition of the vectorized methods. I have no idea why this is the case. So maybe we need to replace the definition of the vectorized functions straight away within this PR after all. I see two possibilities:
    1. The vectorized version still calls the single value methods ind2sub and sub2ind, but to get rid of any type uncertainty the vectorized method itself uses a @generated definition.
    2. Along the lines of the proposal of @ehaber99 , which might even be more cache friendly or susceptible to a simd speedup

@Jutho
Copy link
Contributor Author

Jutho commented Apr 29, 2015

I finally understand what @mbauman and probably @timholy were talking about. It's not the inlining of ind2sub that goes wrong, it's just that the @generated functions still have a vararg argument which is causing all the overhead, even though dedicated code is generated for every given argument length.

@timholy
Copy link
Member

timholy commented Apr 29, 2015

But don't be confused about what you see in a "naked" call from the REPL and how it will work out in practice when compiled into functions. If you copy the example in http://docs.julialang.org/en/latest/manual/metaprogramming/#an-advanced-example, then

g(dims, i1, i2, i3, i4, i5, i6) = sub2ind_gen(dims, i1, i2, i3, i4, i5, i6)
g (generic function with 1 method)

julia> @code_llvm g((2,2,2,2,2,2), 2, 2, 2, 2, 2, 2)

define i64 @julia_g_44425([6 x i64], i64, i64, i64, i64, i64, i64) {
top:
  %7 = extractvalue [6 x i64] %0, 0
  %8 = add i64 %2, -1
  %9 = extractvalue [6 x i64] %0, 1
  %10 = add i64 %3, -1
  %11 = extractvalue [6 x i64] %0, 2
  %12 = add i64 %4, -1
  %13 = extractvalue [6 x i64] %0, 3
  %14 = add i64 %5, -1
  %15 = extractvalue [6 x i64] %0, 4
  %16 = add i64 %6, -1
  %17 = mul i64 %16, %15
  %18 = add i64 %14, %17
  %19 = mul i64 %18, %13
  %20 = add i64 %12, %19
  %21 = mul i64 %20, %11
  %22 = add i64 %10, %21
  %23 = mul i64 %22, %9
  %24 = add i64 %8, %23
  %25 = mul i64 %24, %7
  %26 = add i64 %25, %1
  ret i64 %26
}

which is just gorgeous. In contrast, even if you put @inline in front of all the definitions in this PR, you don't get anything nearly so nice. This is why I think that, despite Jeff's reservations, we basically need to go with the generated function version until this gets fixed.

@Jutho
Copy link
Contributor Author

Jutho commented Apr 29, 2015

Yes that's exactly the point also for the post of @mbauman above. The reason that his normal function gave short LLVM code while the @generated ones did not was just because the short one had no more varargs whereas all of the different @generated versions were still using varargs. Further using those in another function which specifies the number of arguments also yields short code; that's what got me confused above. But most importantly, this is true with the current definitions of ind2sub and sub2ind, so these can remain as they are.

To make the vectorized definitions of these functions fast, I needed to implement new code along the lines suggested by @ehaber99 . I tweaked it some more and now obtain for this example, with the commit that I will push in a minute:

Julia sub2ind 2D elapsed time: 0.10945286 seconds
My sub2ind 2D    elapsed time: 0.12169844 seconds
Julia ind2sub 2D elapsed time: 0.355919331 seconds
My ind2sub 2D    elapsed time: 0.423536101 seconds
Julia sub2ind 3D elapsed time: 0.151631702 seconds
My sub2ind 3D    elapsed time: 0.165676097 seconds
Julia ind2sub 3D elapsed time: 0.704246112 seconds
My ind2sub 3D    elapsed time: 0.696756499 seconds
Julia sub2ind 5D elapsed time: 0.015101847 seconds
My sub2ind 5D    elapsed time: 0.031233035 seconds
Julia ind2sub 5D elapsed time: 0.118229111 seconds
My ind2sub 5D    elapsed time: 0.142487001 seconds

which looks pretty good to me.

@Jutho Jutho force-pushed the jh/stagedsub2ind branch from 7b1f14d to a22532d Compare April 30, 2015 00:07
@timholy
Copy link
Member

timholy commented Apr 30, 2015

I wonder if I'm misunderstanding you, but my impression is that we're saying opposite things. My point is that the best combination is to implement sub2ind as a generated function, and then let it be called by anything (including an ordinary function). You seem to be arguing the reverse. In my example above, sub2ind_gen is a generated function, and it gives great results when called from g. Unfortunately, the version in this PR gives much worse results.

@timholy
Copy link
Member

timholy commented Apr 30, 2015

Am I misunderstanding your timings? In every case the ones labeled "My" are slower than the ones labeled "Julia."

@Jutho
Copy link
Contributor Author

Jutho commented Apr 30, 2015

Am I misunderstanding your timings? In every case the ones labeled "My" are slower than the ones labeled "Julia."

Well, I just run the test file of @ehaber99 against my latest commit, so the "my" timings refer to the code of @ehaber99, the Julia timings are the ones corresponding to my new implementation in the latest commit. But apparently there is still an error in CI; I got to tired yesterday evening.

I wonder if I'm misunderstanding you, but my impression is that we're saying opposite things. My point is that the best combination is to implement sub2ind as a generated function, and then let it be called by anything (including an ordinary function). You seem to be arguing the reverse. In my example above, sub2ind_gen is a generated function, and it gives great results when called from g. Unfortunately, the version in this PR gives much worse results.

Why do you think that. The code generated by the current definitions of sub2ind and ind2sub is also as compact and fast as I think are possible in the examples I checked, but maybe I am overseeing something?
Now I see, it does not, it only works up to N=4. Well then I will restore the @generated definitions. I really was under the impression that with the new tuple redesign it worked well also for higher N.

@timholy
Copy link
Member

timholy commented Apr 30, 2015

If there's something small we can change to fix the current ones, we should do that. Maybe it would be worth inserting a breakpoint here to see what's happening.

@timholy
Copy link
Member

timholy commented Apr 30, 2015

Well, I just run the test file of @ehaber99 against my latest comment, so the "my" timings refer to the code of @ehaber99, the Julia timings are the ones corresponding to my new implementation in the latest commit.

That makes much more sense, thanks for explaining!

@Jutho Jutho force-pushed the jh/stagedsub2ind branch from a22532d to d0212fc Compare April 30, 2015 10:02
@Jutho
Copy link
Contributor Author

Jutho commented Apr 30, 2015

Ok, so that's the final, test-succeeding commit which does not use @generated functions. I can now replace them with the @generated definitions, but it might be useful to keep that as two separate commits for further reference.

@Jutho
Copy link
Contributor Author

Jutho commented May 1, 2015

I think this is more or less ready. I guess it could use a final review; it ain't much, just a few lines of code.

The only point I am not fully satisfied with is about the special casing of ind2sub for N=0. The general policy we decided on was not to throw BoundsErrors from ind2sub or sub2ind but just produce a consistent result (which would also be out of bounds and thus trigger an error when used further down to actually index an array). This is however impossible with ind2sub for a 0-dimensional array, where the only result can be the empty tuple (). I do not see a way around this.

@Jutho
Copy link
Contributor Author

Jutho commented May 1, 2015

One more fix after all; with this I guess the output of all ind2sub and sub2ind methods can be completely inferred.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it help to Expr(:meta, :inline) this to avoid the allocation of the tuple? Or does this always get inlined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably does not get inlined. Would it be necessary for sub2ind to be inlined? Before this was necessary because of the recursive definition. Probably it is still necessary if this is relied on in indexing? I will restore the exlicit inline behavior.

@timholy
Copy link
Member

timholy commented May 2, 2015

LGTM.

I don't know if it's a coincidence, but that travis segfault happened in arrayops, which is where ind2sub is tested. It's therefore possible that this is revealing a deeper problem, but since this also passes the tests frequently I wouldn't let that possibility hold this up.

It's interesting that we don't seem to test sub2ind. #10525 will add a lot of indirect tests for it, however.

@Jutho
Copy link
Contributor Author

Jutho commented May 3, 2015

It seems to be a 32 bit issue (which is why I didn't caught it), but it is likely caused by this final fix. I will rebase and change the git history such that the final fix is part of the first commit, and add the inline statement discussed in the line comments. This will cause a new CI run; let's see what happens.

@Jutho Jutho force-pushed the jh/stagedsub2ind branch from 9dc6142 to 8c718f1 Compare May 3, 2015 13:35
@Jutho
Copy link
Contributor Author

Jutho commented May 3, 2015

All looks good now.

timholy added a commit that referenced this pull request May 3, 2015
@timholy timholy merged commit c874559 into JuliaLang:master May 3, 2015
@timholy
Copy link
Member

timholy commented May 3, 2015

Thanks for doing this!

timholy added a commit that referenced this pull request Apr 28, 2016
Finishes #10337, and is the complement of #9256
@Jutho Jutho deleted the jh/stagedsub2ind branch August 23, 2018 08:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.