[SPARK-43393][SQL] Address sequence expression overflow bug. #41072
Conversation
cc @gengliangwang FYI
Force-pushed from 4224247 to a40285c
Force-pushed from d642ca6 to b8d44fa
cloud-fan left a comment
LGTM if tests pass
}
val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
if (len > MAX_ROUNDED_ARRAY_LENGTH) {
  throw new IllegalArgumentException(s"Too long sequence: $len. Should be <= " +
If the exception is a user-facing error, let's introduce an error class and raise a Spark exception with it.
Sorry for the delay. What is the reasoning behind this change? It seems that all errors thrown in this file prefer Java-defined exceptions over their Spark wrappers in SparkException.scala.
If the decision is to use Spark wrappers for this expression only, is SparkIllegalArgumentException the right wrapper?
Ah, I originally misunderstood your comment. With @ankurdave's help I learned about the errors defined in QueryExecutionErrors.scala and that they are in fact already used in this file. I thought createArrayWithElementsExceedLimitError might be the most fitting.
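For context, a minimal sketch of what this resolution could look like. The exact signature of `createArrayWithElementsExceedLimitError` is an assumption here, not a quote of the real helper in `QueryExecutionErrors.scala`:

```scala
// Hedged sketch: raise a Spark-defined error instead of a bare
// IllegalArgumentException. The helper is assumed to take the attempted
// element count; check QueryExecutionErrors.scala for the actual signature.
if (len > MAX_ROUNDED_ARRAY_LENGTH) {
  throw QueryExecutionErrors.createArrayWithElementsExceedLimitError(len)
}
```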
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
case _: ArithmeticException =>
  val safeLen =
    BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
  if (safeLen > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
Maybe just use an assert? An AssertionError is also treated as an internal error.
I personally like the current exception better, since it's more descriptive of the actual problem: trying to create too large an array, with the user's intended size and what the limit is. If you feel strongly, I can change it to an assertion.
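To make the trade-off concrete, a hedged sketch of the two options under discussion, using the same overflow case as the PR description; `limit` stands in for `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`, and neither variant is presented as the merged code:

```scala
// The exact length for sequence(0, Long.MaxValue, 1), recomputed with BigInt.
val safeLen: BigInt = BigInt(1) + (BigInt(Long.MaxValue) - BigInt(0)) / BigInt(1)
val limit = Int.MaxValue - 15  // mirrors MAX_ROUNDED_ARRAY_LENGTH

// Reviewer's option: an internal assertion; the resulting AssertionError
// is surfaced by Spark as an internal error.
// assert(safeLen <= limit)

// Author's option: a descriptive, user-facing exception that carries the
// attempted length and the limit.
if (safeLen > limit) {
  throw new IllegalArgumentException(
    s"Too long sequence: $safeLen. Should be <= $limit")
}
```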
thanks, merging to master/3.5!
What changes were proposed in this pull request?
Spark has a (long-standing) overflow bug in the `sequence` expression.
Consider the following operations:
```
spark.sql("CREATE TABLE foo (l LONG);")
spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")
spark.sql("SELECT sequence(0, l) FROM foo;").collect()
```
The result of these operations will be:
```
Array[org.apache.spark.sql.Row] = Array([WrappedArray()])
```
This empty array is an unintended consequence of overflow.
The sequence is applied to the values `0` and `Long.MaxValue` with a step size of `1`, which uses the length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary, looks like:
```
0111111111111111111111111111111111111111111111111111111111111111
- 0000000000000000000000000000000000000000000000000000000000000000
------------------------------------------------------------------
0111111111111111111111111111111111111111111111111111111111111111
/ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
0111111111111111111111111111111111111111111111111111111111111111
+ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
1000000000000000000000000000000000000000000000000000000000000000
```
The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes because the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The subsequent cast via `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457), resulting in a length of `0` and an empty array.
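A few lines of standalone Scala reproduce the arithmetic above. This mirrors the length computation but is an illustration, not the Spark source; the `MAX_ROUNDED_ARRAY_LENGTH` value is `Int.MaxValue - 15`, as defined in `ByteArrayMethods`:

```scala
val MAX_ROUNDED_ARRAY_LENGTH = Int.MaxValue - 15  // mirrors ByteArrayMethods

val start = 0L
val stop = Long.MaxValue
val step = 1L

val len = (stop - start) / step + 1L     // wraps around in 64-bit two's complement
assert(len == Long.MinValue)             // overflowed to a negative value
assert(len <= MAX_ROUNDED_ARRAY_LENGTH)  // so the length check passes
assert(len.toInt == 0)                   // and the Int cast truncates to zero
```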
Other overflows are similarly problematic.
This PR addresses the issue by checking numeric operations in the length computation for overflow.
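A minimal sketch of that approach, assembled from the hunks quoted above: checked `Math.*Exact` arithmetic, with an exact `BigInt` recomputation when the checked path overflows. Names and structure here are illustrative, not the exact merged code:

```scala
val MAX_ROUNDED_ARRAY_LENGTH = Int.MaxValue - 15  // mirrors ByteArrayMethods

def sequenceLength(start: Long, stop: Long, step: Long): Int = {
  try {
    // Checked arithmetic: subtractExact/addExact throw ArithmeticException
    // on overflow instead of silently wrapping.
    val delta = Math.subtractExact(stop, start)
    val len = if (stop == start) 1L else Math.addExact(1L, delta / step)
    if (len > MAX_ROUNDED_ARRAY_LENGTH) {
      throw new IllegalArgumentException(
        s"Too long sequence: $len. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
    }
    len.toInt
  } catch {
    case _: ArithmeticException =>
      // The checked path overflowed, so the true length certainly exceeds
      // the limit; recompute exactly with BigInt for the error message.
      val safeLen = BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
      throw new IllegalArgumentException(
        s"Too long sequence: $safeLen. Should be <= $MAX_ROUNDED_ARRAY_LENGTH")
  }
}
```

With this guard, the reproduction above raises a descriptive error instead of silently returning an empty array.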
Why are the changes needed?
There is a correctness bug from overflow in the `sequence` expression.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tests added in `CollectionExpressionsSuite.scala`.
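A hedged sketch of what such a regression test could look like. The actual suite uses Catalyst's expression-testing helpers; this standalone version exercises the `sequenceLength` sketch above with plain ScalaTest:

```scala
import org.scalatest.funsuite.AnyFunSuite

class SequenceOverflowSuite extends AnyFunSuite {
  test("SPARK-43393: sequence length computation does not overflow") {
    // Before the fix this silently produced an empty array; now it must
    // fail with an error naming the attempted length and the limit.
    val e = intercept[IllegalArgumentException] {
      sequenceLength(0L, Long.MaxValue, 1L)  // from the sketch above
    }
    assert(e.getMessage.contains("Too long sequence"))
  }
}
```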
Closes #41072 from thepinetree/spark-sequence-overflow.
Authored-by: Deepayan Patra <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit afc4c49)
Signed-off-by: Wenchen Fan <[email protected]>
Thank you, @thepinetree and @cloud-fan. Given that this is a long-standing overflow bug, do you think we can have this fix in the other live release branches as well?
SGTM. @thepinetree, can you help create the backport PRs? Thanks!
Oh, this seems to break branch-3.5.
Let me revert this from branch-3.5. Given the situation, we can do the backports from branch-3.5 down to branch-3.3 as three separate PRs, @thepinetree.
Thank you so much!
Could you fix the compilation of your PRs, @thepinetree?
Gentle ping, @thepinetree. All backport PRs are failing at the compilation stage.
Hi @dongjoon-hyun! Yes, sorry; I thought they'd be simple backports apart from an import conflict. I'll fix the compilation errors appropriately when I have a chance tonight.
Thank you so much, @thepinetree!
beliefer left a comment
late LGTM.
Quick update, @dongjoon-hyun: it looks like the 3.4/3.5 backports should be good to go once some flakiness is resolved (documentation and one Spark Connect suite). 3.4/3.5 had an older version of the errors (the same across both versions), and I made the changes accordingly.
