Add vectorization in elementwise_util #9432
base: gh/swolchok/439/head
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9432
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit c0ad8ac with merge base 45008d6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Overall this makes sense, but I left some questions around why we need the can_use_vectorized-based approach.
template <typename T>                                                    \
auto func_name(at::vec::Vectorized<T> vec) {                             \
  if constexpr (!::executorch::runtime::is_floating_point<T>::value) {   \
    return at::vec::convert<float>(vec).func_name();                     \
Is this a valid thing to do? That is, convert (say) an instance of at::vec::Vectorized<int8_t> to float and apply func_name? Maybe I am misunderstanding how this works.
Hmm, if I'm reading https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec_convert.h correctly, this is not quite correct. Good catch! (It will silently drop the int8_t elements that don't fit in the float vector; the right thing to do is convert to VectorizedN, do the computation, and then convert back.) Better make sure we have tests here.
this should be fixed in the latest update.
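For readers following the thread, here is a conceptual sketch of the issue and the fix (plain C++ with std::array standing in for SIMD registers; this is not the actual at::vec API): a register holds more int8_t lanes than float lanes, so a one-to-one convert drops the tail lanes, while the widen/compute/narrow pattern covers all of them.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr int kBytes = 32;                       // e.g. one AVX2 register
using Int8Vec = std::array<int8_t, kBytes>;      // 32 int8_t lanes
using FloatVec = std::array<float, kBytes / 4>;  // only 8 float lanes fit

// Lossy: a single 8-lane float vector can only hold the first 8 of the
// 32 int8_t lanes; the remaining 24 lanes are silently dropped.
inline FloatVec convert_lossy(const Int8Vec& v) {
  FloatVec out{};
  for (int i = 0; i < static_cast<int>(out.size()); ++i) {
    out[i] = static_cast<float>(v[i]);
  }
  return out;
}

// The fix described above: widen into a group of float vectors (a
// VectorizedN-style group), apply the float function to every lane,
// then narrow back so no lane is lost.
template <typename FloatFn>
inline Int8Vec widen_compute_narrow(const Int8Vec& v, FloatFn fn) {
  std::array<FloatVec, 4> wide{};  // 4 * 8 float lanes == 32 int8_t lanes
  for (int i = 0; i < kBytes; ++i) {
    wide[i / 8][i % 8] = static_cast<float>(v[i]);
  }
  for (auto& fv : wide) {
    for (auto& x : fv) {
      x = fn(x);
    }
  }
  Int8Vec out{};
  for (int i = 0; i < kBytes; ++i) {
    out[i] = static_cast<int8_t>(wide[i / 8][i % 8]);
  }
  return out;
}

int main() {
  Int8Vec v{};
  for (int i = 0; i < kBytes; ++i) v[i] = static_cast<int8_t>(i - 16);
  Int8Vec out = widen_compute_narrow(v, [](float x) { return std::abs(x); });
  return out[31] == 15 ? 0 : 1;  // the last lane is preserved, unlike convert_lossy
}
```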
 */
#define ET_INTERNAL_VECTORIZED_FLOAT_UNARY_FUNC(func_name) \
  namespace executorch {                                    \
  inline namespace math {                                   \
What does `inline` on a namespace do?
In short, it makes the namespace mostly transparent: in most circumstances its members can be used as if they were declared directly in the enclosing namespace. See https://en.cppreference.com/w/cpp/language/namespace
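A minimal standalone illustration of `inline namespace` (the `half` helper is made up for the example and is not part of ExecuTorch): members of an inline namespace are reachable through the enclosing namespace.

```cpp
namespace executorch {
inline namespace math {
// Hypothetical helper for illustration only.
inline float half(float x) {
  return x * 0.5f;
}
} // namespace math
} // namespace executorch

int main() {
  // Both spellings refer to the same function because `math` is inline.
  float a = executorch::math::half(2.0f);
  float b = executorch::half(2.0f);
  return (a == b) ? 0 : 1;
}
```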
 * corresponding operator is a "float op" in TensorIterator parlance
 * (i.e., uses something like build_borrowing_binary_float_op()),
I don't know if anyone reading this code would understand what this means.
But this provides an answer to my earlier question.
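For readers unfamiliar with the TensorIterator term: a "float op" computes in floating point even when its inputs are integral (for example, sigmoid of an int tensor produces a float tensor). A rough sketch of that dtype rule, with made-up names:

```cpp
// Hypothetical sketch of the "float op" dtype rule: integral/bool inputs are
// promoted so the op computes (and typically outputs) in floating point.
enum class ScalarType { Float, Double, Int, Long, Bool };

inline bool is_floating_point(ScalarType t) {
  return t == ScalarType::Float || t == ScalarType::Double;
}

inline ScalarType float_op_compute_type(ScalarType common_input_type) {
  return is_floating_point(common_input_type) ? common_input_type
                                              : ScalarType::Float;
}

int main() {
  // e.g. atan2(int, int) computes in float; float/double inputs stay as-is.
  return float_op_compute_type(ScalarType::Int) == ScalarType::Float ? 0 : 1;
}
```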
kernels/portable/cpu/op_where.cpp
@@ -47,7 +47,7 @@ Tensor& where_out(
       CTYPE_COMPUTE,
       op_name,
       utils::SupportedTensorDtypes::SAME_AS_COMMON>(
-      [](const auto val_a, const auto val_b, const auto val_c) {
+      [](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b, const CTYPE_COMPUTE val_c) {
Is this an unrelated change?
No, it's related: we can't vectorize this op, so spelling out CTYPE_COMPUTE keeps its lambda off the vectorized path.
@@ -51,6 +56,34 @@ inline int64_t scalar_to<int64_t>(const Scalar& s) {
 }
 
+namespace internal {
+template <typename Ignore, typename T>
+using ignore_first_yield_second = T;
I like these names
if constexpr (std::is_invocable_v<
                  Op,
                  ignore_first_yield_second<Args, Vec>...>) {
  // For bool, we will get a false positive if we rely on only the
you mean bool return type?
A bool CTYPE_COMPUTE.
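To make the detection trick concrete, here is a simplified, self-contained sketch (names like FakeVec and can_use_vectorized_v are made up; this is not the actual source) of how ignore_first_yield_second plus std::is_invocable_v lets the framework ask "can this scalar lambda also be called with Vectorized arguments?":

```cpp
#include <type_traits>

template <typename Ignore, typename T>
using ignore_first_yield_second = T;

struct FakeVec {  // stand-in for at::vec::Vectorized<CTYPE_COMPUTE>
  FakeVec operator+(FakeVec) const { return {}; }
};

// Args... are the scalar argument types the lambda was written against; we ask
// whether the same callable can also be invoked with one FakeVec per argument.
template <typename Op, typename... Args>
constexpr bool can_use_vectorized_v =
    std::is_invocable_v<Op, ignore_first_yield_second<Args, FakeVec>...>;

int main() {
  auto generic = [](auto a, auto b) { return a + b; };        // works for FakeVec too
  auto scalar_only = [](float a, float b) { return a + b; };  // floats only
  static_assert(can_use_vectorized_v<decltype(generic), float, float>);
  static_assert(!can_use_vectorized_v<decltype(scalar_only), float, float>);
  return 0;
}
```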
((inputs.first->scalar_type() ==
  CppTypeToScalarType<CTYPE_COMPUTE>::value) &&
 ...));
Man, this is a good reminder of all the template metaprogramming magic.
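As a refresher on the fold-expression pattern used above, a tiny standalone example (FakeTensor and the enum are hypothetical) that checks a property across every element of a parameter pack:

```cpp
#include <cstdio>

enum class ScalarType { Float, Int, Bool };

struct FakeTensor {
  ScalarType dtype;
  ScalarType scalar_type() const { return dtype; }
};

template <typename... Inputs>
bool all_inputs_are_float(const Inputs&... inputs) {
  // Fold over &&: expands to
  // (inputs1.scalar_type() == Float) && (inputs2.scalar_type() == Float) && ...
  return ((inputs.scalar_type() == ScalarType::Float) && ...);
}

int main() {
  FakeTensor a{ScalarType::Float}, b{ScalarType::Float}, c{ScalarType::Int};
  std::printf("%d %d\n", all_inputs_are_float(a, b), all_inputs_are_float(a, b, c));  // 1 0
  return 0;
}
```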
    ...);
if (!any_is_broadcasted) {
  using Vec = at::vec::Vectorized<CTYPE_COMPUTE>;
  ::executorch::extension::parallel_for(
I think doing this blindly for each op is a bit risky in that not all multithreading is always better: some ops benefit from a smaller grain size while others from a larger one.
IIRC xnnpack blindly parallelizes absolutely everything; we're doing strictly better here
Yeah, I am not comparing with xnnpack. In fact, one big part of the reason we ended up leveraging the optimized op lib for some of the llama stuff was exactly that: it blindly parallelized everything, and that actually hurt perf.
const auto vectorized_begin =
    begin + (Vec::size() - begin % Vec::size()) % Vec::size();
Feels like something that has a good chance of hiding a bug. I hope we test this enough; I doubt our existing test cases will exercise this code path.
Although I do see you treat the scalar leftovers at both the head and the tail separately.
Hope we test this enough.
Good point; adding tests for each affected op with lengths sufficient to hit the vectorized path.
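To make the head/tail arithmetic easy to check, here is a small standalone sketch (kVecSize is a made-up stand-in for Vec::size()) of the same rounding expressions; a test can sweep many (begin, end) pairs against exactly this logic:

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kVecSize = 8;  // stand-in for Vec::size()

int64_t round_up(int64_t begin) {
  // (kVecSize - begin % kVecSize) % kVecSize is 0 when begin is already aligned.
  return begin + (kVecSize - begin % kVecSize) % kVecSize;
}

int64_t round_down(int64_t end) {
  return end - end % kVecSize;
}

int main() {
  assert(round_up(0) == 0);      // already aligned: no scalar head
  assert(round_up(3) == 8);      // elements [3, 8) handled as the scalar head
  assert(round_up(8) == 8);
  assert(round_down(21) == 16);  // elements [16, 21) handled as the scalar tail
  // The body [round_up(begin), round_down(end)) is processed Vec::size() at a time.
  return 0;
}
```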
        inputs.first->sizes(), out.sizes()) &&
    ...);
if (!any_is_broadcasted) {
  using Vec = at::vec::Vectorized<CTYPE_COMPUTE>;
So the one point of contention for me is why we need vectorized_math.h, which largely just trampolines to the underlying Vectorized methods. Mainly, you don't even need to use can_use_vectorized, because on non-accelerated platforms Vectorized falls back to a scalar impl even if Vec::size() != …. Maybe you said that the generated code would be worse if we forced Vectorized, but I am not sure why. The rest makes sense.
However, a place where I can potentially see that being useful is for a dtype that Vectorized doesn't support; but for float I am not sure. So it would help if you could clarify that.
why do we need vectorized_math.h which largely is doing trampoline to underlying vectorized methods
Without it, you can't take the same lambda you already wrote for scalars and reuse it for Vectorized. (The change isn't zero, because you have to point at executorch::math, but crucially it doesn't require separate code.)
OK, we synced offline. The major value prop as I understood it: you can write your lambdas without having to explicitly use Vectorized. Why might explicitly using Vectorized not be good? Because Vectorized may not have everything you need to implement your lambda. So as an op author you don't have to worry about Vectorized when writing your lambda (although you do have to use the executorch::math ops). And later, if Vectorized adds support for the executorch::math ops you used, you get vectorization for free without having to go back and rewrite your lambda with Vectorized. So this is nice.
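A tiny sketch of that value proposition (mymath and FakeVec are made-up stand-ins for executorch::math and at::vec::Vectorized, and the "vector" has a single lane just to keep the example short): the op author writes one generic lambda against the math trampolines, and it works for scalars now and for the vector type wherever an overload exists.

```cpp
#include <cmath>

struct FakeVec {  // stand-in for at::vec::Vectorized<float>
  float lane;     // pretend single-lane "vector" to keep the sketch tiny
};

namespace mymath {  // stand-in for executorch::math
inline float exp(float x) { return std::exp(x); }
inline FakeVec exp(FakeVec v) { return FakeVec{std::exp(v.lane)}; }
}  // namespace mymath

int main() {
  // The op author writes this lambda once, against mymath:: functions.
  auto op = [](auto x) { return mymath::exp(x); };

  float scalar_result = op(1.0f);             // scalar path
  FakeVec vector_result = op(FakeVec{1.0f});  // vectorized path, for free
  return (scalar_result == vector_result.lane) ? 0 : 1;
}
```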
Mixed dtype should be uncommon. Here is how we can specialize for the common case. Prepares us to tackle #9241.

Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles.

To check performance, I edited examples/models/toy_model/model.py so that MulModule used inputs of size 3000, 2000 instead of 3, 2. I exported it with `python3 -m examples.portable.scripts.export --model_name mul` and saved the resulting `mul.pte`. Then I built in release mode with optimized kernels on, but with mul.out removed from kernels/optimized/optimized.yaml, so that we would use the optimized_portable_kernels build of kernels/portable/op_mul.cpp. Finally, I ran 3 trials on my M1 MacBook Pro using `cmake-out/executor_runner --model_path mul3kby2k.pte --num_executions 1000 --cpu_threads 2`.

Resulting times for 1000 iterations in ms:
Previous diff: 8295, 8187, 8139
This diff: 2953, 2806, 2861

(For comparison, the actual optimized mul kernel took around 1000 ms to run 1000 iterations, and #9432 later in the stack arrived at similar numbers.)
This is a first cut at #9241. In this PR I've vectorized a small initial set of ops: atan2, clamp, fmod_Scalar, maximum, minimum, mul, pow, and sigmoid. In addition, the following ops should have gotten vectorized automatically because they already used generic lambdas: add, div, rsub, sub. I've left covering ops that use the unary_ufunc_* utilities in pattern.h for a follow-up push, because pattern.h and elementwise_util need some work before we can migrate pattern.h's utilities to be backed by elementwise_util.

This PR adds an interesting testing problem: in theory, all operators might need test cases long enough to tickle vectorization, because we might accidentally vectorize ops unexpectedly and break their lambdas due to unanticipated differences in semantics. I address this issue by using Vectorized for the scalar prologue/epilogue in debug mode (we run tests in both debug and release) so that we can detect broken lambdas. I additionally intentionally introduced a bug in the vectorized path in elementwise_util and manually verified that we saw test failures for each vectorized op called out above.