-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27684][SQL] Avoid conversion overhead for primitive types #24636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #105515 has finished for PR 24636 at commit
|
| val initArg = if (CatalystTypeConverters.isPrimitive(dt)) { | ||
| val convertedTerm = ctx.freshName("conv") | ||
| s""" | ||
| |${CodeGenerator.boxedType(dt)} $convertedTerm = ${eval.value}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why do we need this extra convertedTerm for the boxing? Could you instead do
Object $argTerm = ${eval.isNull} ? null : ${eval.value};
and avoid the use of an extra variable name? Or if you want more typechecking, do
${CodeGenerator.boxedType(dt)} $argTerm = ${eval.isNull} ? null : ${eval.value};
and used the boxed type as $argTerm's type?
To avoid repetition and more tightly scope the conditional part of the argument convert logic, we even might consider something like this
val boxedType = CodeGenerator.boxedType(dt)
val maybeConverted = if (CatalystTypeConverters.isPrimitive(dt)) {
eval.value
} else {
"$convertersTerm[$i].apply(${eval.value})"
}
s"$boxedType $argTerm = ${eval.isNull} ? null : $maybeConverted;"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, actually my first trial was exactly what you are suggesting here, but it didn't work: indeed it can cause compilation error (the error message is something like no common type for void and int). Then, I also tried:
val boxedType = CodeGenerator.boxedType(dt)
val maybeConverted = if (CatalystTypeConverters.isPrimitive(dt)) {
s"(${boxedType}) eval.value"
} else {
"$convertersTerm[$i].apply(${eval.value})"
}
s"$boxedType $argTerm = ${eval.isNull} ? null : $maybeConverted;"
but this fails too with a confusing error message. Honestly, I am not sure why this 2nd solution doesn't work, since I tried taking the code and compiling it with jdk and it worked. My best guess is that it is a janino bug which doesn't support it.
I did several trials but I haven't found any better alternative as this seemed the only syntax working with janino.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting!
I've filed a bug against Janino to report this issue: janino-compiler/janino#90
| val (funcArgs, initArgs) = evals.zipWithIndex.zip(children.map(_.dataType)).map { | ||
| case ((eval, i), dt) => | ||
| val argTerm = ctx.freshName("arg") | ||
| val initArg = if (CatalystTypeConverters.isPrimitive(dt)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use CodeGenerator.isPrimitiveType? We can save the change to CatalystTypeConverters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't work, you can see the UT failures. For types like timestamp it is not the same.
|
Test build #105532 has finished for PR 24636 at commit
|
caea444 to
ad57acf
Compare
|
Test build #105533 has finished for PR 24636 at commit
|
This reverts commit ad57acf.
|
Test build #105537 has finished for PR 24636 at commit
|
|
retest this please |
|
Test build #105562 has finished for PR 24636 at commit
|
|
Good PR. I will review this carefully. One minor comment: I like performance improvements using |
|
Thanks for you comment @kiszk . I added the warm-up, and the result is barely the same. Here is the code: and the results are: Any more comments/concerns? |
| doRunBenchmarkWithPrimitiveTypes(sampleUDF, cardinality) | ||
| } | ||
|
|
||
| codegenBenchmark("long/nullable int/string to primitive", cardinality) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a typo -- should this be "long/nullable int to primitive"
| doRunBenchmarkWithMixedTypes(sampleUDF, cardinality) | ||
| } | ||
|
|
||
| codegenBenchmark("long/nullable int to primitive", cardinality) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"long/nullable int to primitive" , change to "long/nullable int/string to primitive"
| * Results will be written to "benchmarks/UDFBenchmark-results.txt". | ||
| * }}} | ||
| */ | ||
| object UDFBenchmark extends SqlBasedBenchmark { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mgaido91 for your work on this performance improvement. I'm curious if you tried the JIRA test case from @JoshRosen with your changes. How close does this get us? Also do you think it might be worthwhile to add that test in this benchmark suite as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, everything can be added. If you think it is critical, we can add it. Indeed, the test reported in the description is very similar, as it is doing a + 1, which is not so different from an identity. I think the point here is to identify how much from the overhead is saved, and the tests performed show that the overhead is reduced significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway I added it, as shown in the results the overhead is now ~ 20% instead of ~50%
|
Test build #105673 has finished for PR 24636 at commit
|
|
Test build #105675 has finished for PR 24636 at commit
|
|
retest this please |
|
Test build #105690 has finished for PR 24636 at commit
|
|
retest this please |
|
Test build #105693 has finished for PR 24636 at commit
|
| val convertedTerm = ctx.freshName("conv") | ||
| s""" | ||
| |${CodeGenerator.boxedType(dt)} $convertedTerm = ${eval.value}; | ||
| |Object $argTerm = ${eval.isNull} ? null : $convertedTerm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove ${eval.isNull} ? ... if ${eval.isNull} is compile-time constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO maybe we can do this in a separate PR?
I wouldn't be surprised if there's other places in Spark (beyond this method / file) where we could apply similar fixes (and if we're going to apply this in a lot of places then it might even be nice to write some sort of helper for generating / managing null checks.
Does this PR look good to you otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @JoshRosen . It would be better to define a generic approach for addressing this and there are several instances of it. I can create another JIRA and start a PR for that if you're ok with it.
|
cc @ueshin |
|
@kiszk @ueshin @gatorsmile, does this PR now look good to you? If so, I'd like to get this merged soon so that it doesn't go stale. |
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for the delay.
LGTM.
|
I've merged this to master. Thanks @mgaido91 (and to everyone who helped with review)! |
|
thank you all! |
What changes were proposed in this pull request?
As outlined in the JIRA by @JoshRosen, our conversion mechanism from catalyst types to scala ones is pretty inefficient for primitive data types. Indeed, in these cases, most of the times we are adding useless calls to
identityfunction or anyway to functions which return the same value. Using the information we have when we generate the code, we can avoid most of these overheads.How was this patch tested?
Here is a simple test which shows the benefit that this PR can bring:
The output before the patch is:
after, we get:
which is ~5X faster.
Moreover a benchmark has been added for Scala UDF. The output after the patch can be seen in this PR, before the patch, the output was:
The improvements are particularly visible in the second case, ie. when only primitive types are used as inputs.