replace Base.tanh with faster tanh #1272
Conversation
SLEEFPirates doesn't seem to do much on the machine I tested with, beyond the performance boost with
Running with GPU arrays, however, would need special handling.
Yeah, the dispatch with
Could LV make the GPU passes no-ops? I would rather not lose the generic nature of the code, especially since it's diminishing returns for the majority of models, where the batch size is usually not that large.
You could just do
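One way to get that restriction is plain dispatch. A minimal sketch (the names `fast_tanh` and the method layout are illustrative, not Flux API): keep the generic broadcast as the fallback and only opt plain CPU arrays into the fast path, so GPU arrays keep hitting the generic method.

```julia
# Hypothetical sketch: generic fallback stays GPU-safe, while plain CPU
# arrays of hardware floats dispatch to a fast path.
fast_tanh(x::AbstractArray) = tanh.(x)        # generic fallback (GPU-safe)
fast_tanh(x::Array{<:Union{Float32,Float64}}) =
    map(tanh, x)  # stand-in for a SIMD kernel such as SLEEFPirates.tanh
```

Since the method is selected on the concrete array type, no runtime check is needed; an unknown accelerator array type simply never sees the fast path.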
If LV did that (even if as a wrapper macro), it could be used by downstream packages without adding special handling, at least for different accelerators. I tested it with using
I would also be interested in seeing how it would interact with more general use cases like
Just listing things that we should have clarity on in order to understand what the tradeoffs are.
I think it would be much harder for LV to do that since it's a macro, so it runs at parse time, not compile time, and doesn't have type information. It could add type checks for everything that shows up in the expression, but that seems difficult to get right.
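To illustrate the constraint: a macro only ever sees syntax at expansion time, never types, so any type-dependent choice has to be deferred to dispatch at runtime/compile time. A small sketch (the `pick_path` names are made up, not LV's actual mechanism):

```julia
# A macro sees only the expression it wraps, e.g. :(f(x)) -- no types.
macro trace(ex)
    println("expanding: ", ex)
    esc(ex)
end

# Type-dependent choices therefore go through ordinary dispatch instead
# (hypothetical names for illustration):
pick_path(x::Array)         = "simd"
pick_path(x::AbstractArray) = "generic"
```

A generated function, by contrast, does run with type information, which is why the discussion below turns to that route.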
@DhairyaLGandhi It's this already implemented in FluxML/NNlib.jl#199? |
The generated function of course has type information, but it also (currently) loses the original representation of the loops, and instead is only given a summary of what the loops did, meaning it wouldn't necessarily currently be able to reconstruct the original loops. Broadcasting doesn't use that right now, but I've been saying for a while that I was planning to switch it for the same approach. Currently:
It basically
With
Unfortunately, I believe because AbstractDeviceArray{<:Base.HWReal} <: DenseArray{<:Base.HWReal}, that
All the checks within the library assume
Yes, we should really have a trait for whether things opt in or not, and start spreading that around the ecosystem to make this better supported. Guessing can get you pretty far, but at some point libraries need a way to tell you this information.
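A hedged sketch of what such an opt-in trait could look like (the `supports_avx`/`apply_tanh` names are made up here; ArrayInterface.jl's `fast_scalar_indexing` plays a similar role in the ecosystem today):

```julia
# Hypothetical opt-in trait: array types declare support explicitly,
# and unknown types default to opting out.
supports_avx(::Type{<:Array}) = true           # plain CPU arrays opt in
supports_avx(::Type{<:AbstractArray}) = false  # everything else opts out

# Libraries then ask the array type instead of guessing from the type tree:
apply_tanh(x) = supports_avx(typeof(x)) ? map(tanh, x) : tanh.(x)
```

The point is the inversion of responsibility: the array package asserts the capability, rather than LV inferring it from `<: DenseArray`.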
(Cribbing my comment from the discourse thread) @AStupidBear #199 helps substantially for standalone Case in point:
Is there a generalizable way to hook into this kind of fused broadcast as well?
If you do a 5-arg mul, then that is pulled into the matmul kernel and doesn't need to fuse with the other one.
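For reference, the 5-arg form is `LinearAlgebra.mul!(C, A, B, α, β)`, which computes `C = α*A*B + β*C` in a single kernel call, so the accumulation happens inside the matmul rather than as a separate broadcast:

```julia
using LinearAlgebra

# 5-arg mul!: fuse the α*A*B + β*C accumulation into the matmul kernel.
A, B = rand(4, 4), rand(4, 4)
C = ones(4, 4)
mul!(C, A, B, 2.0, 1.0)   # C now holds 2.0*(A*B) + 1.0*ones(4, 4)
```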
@chriselrod do I understand this correctly: overloading
@DhairyaLGandhi I'll update broadcasting to use
For example, I'd like LoopVectorization to support
But there'd have to be a place to define this. A library like
EDIT:
A couple of points here: (1) SLEEFPirates by itself relies on the autovectorizer for SIMD, which does not always work. For example:
It works with the simple loop, but inspecting the native/LLVM code from the broadcast shows we didn't get SIMD. Starting with Julia >= 1.5, the
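The two forms being compared look roughly like this (pure Base sketch; `@simd` only grants the compiler permission to vectorize, and whether it actually does depends on the scalar `tanh` implementation and the Julia version):

```julia
# Explicit loop form: the autovectorizer can often handle this.
function tanh_loop!(y, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = tanh(x[i])
    end
    return y
end

# Broadcast form: goes through Base's broadcast machinery instead.
tanh_broadcast!(y, x) = (y .= tanh.(x))
```

You can check which one vectorized by inspecting `@code_llvm tanh_loop!(y, x)` for vector types like `<8 x float>`.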
costs.jl defines costs, as well as an
In the future, once it integrates more with the compiler so we can introspect functions to ensure they're defined for
Recently, Tim Holy suggested something like this for defining costs of unknown functions:
Not perfect -- both
If the cost it uses doesn't represent the actual cost well, it may make bad unrolling decisions. The current assumption is that unknowns are expensive, so that unrolling won't be profitable. I figure it is better to play it safe. For a function
About 14.5x faster, roughly what we'd expect from AVX512 + single precision. You can test and benchmark how functions via:
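A hedged sketch of such a comparison using only Base's `@elapsed` (in practice `BenchmarkTools.@btime` gives far more reliable numbers; the SLEEFPirates line is commented out since it assumes the package is loaded):

```julia
# Crude timing comparison for a tanh broadcast (warm up first so we
# don't measure compilation).
x = rand(Float32, 10^6)
y = similar(x)

base_tanh!(y, x) = (y .= tanh.(x))

base_tanh!(y, x)                     # warm up / compile
t_base = @elapsed base_tanh!(y, x)
println("Base.tanh broadcast: ", t_base, " s")
# Swap in the candidate to compare, e.g. if SLEEFPirates is loaded:
# t_fast = @elapsed (y .= SLEEFPirates.tanh.(x))
```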
Calculating 16
To be clear, in this broadcast:
It'll use
The design goal of this piece to me is as follows:
I'm not sure exactly how it would interact with other ADs in the ecosystem just yet, but it is a fairly common trick to replace Zygote with something else to test for performance/correctness, etc. I would appreciate some feedback on whether these represent roughly the right expectations, since it directly impacts the API for user-defined functions.
Here we usually deal with
I think the basic goal is to accelerate known functions and support LV more generally across the framework, while not dropping support for existing language primitives like control flow.
This is a speculative change, based on some recent discussions, to help speed up common tasks in Flux; the tanh from SLEEFPirates was found to make a significant difference. This issue is to discuss the viability of doing this by default in Flux and the considerations we would have in doing so.
cc @ChrisRackauckas
PR Checklist
@dhairyagandhi96
(for API changes).