
Conversation

@thepinetree (Contributor)

What changes were proposed in this pull request?

Spark has a (long-standing) overflow bug in the `sequence` expression.

Consider the following operations:

spark.sql("CREATE TABLE foo (l LONG);")
spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")
spark.sql("SELECT sequence(0, l) FROM foo;").collect()

The result of these operations will be:

```
Array[org.apache.spark.sql.Row] = Array([WrappedArray()])
```

This is an unintended consequence of overflow.

The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1`, which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary, looks like:

```
  0111111111111111111111111111111111111111111111111111111111111111
- 0000000000000000000000000000000000000000000000000000000000000000
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
/ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
+ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
  1000000000000000000000000000000000000000000000000000000000000000
```

The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes because the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following `toInt` cast then [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457), yielding a length of `0` and hence an empty array.
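A minimal sketch in plain Scala (standalone variables, not the actual Spark code) reproduces the wrap-around and truncation:

```scala
// Sketch only: the unchecked arithmetic described above, outside of Spark.
val start = 0L
val stop = Long.MaxValue
val step = 1L

val len = (stop - start) / step + 1L // overflows to Long.MinValue
val maxLen = Int.MaxValue - 15L      // stand-in for MAX_ROUNDED_ARRAY_LENGTH (value assumed)

println(len)           // -9223372036854775808
println(len <= maxLen) // true: the negative value slips past the length check
println(len.toInt)     // 0: the upper bits are truncated, so an empty array is built
```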

Other overflows are similarly problematic.

This PR addresses the issue by checking numeric operations in the length computation for overflow.
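As a rough illustration of that approach, here is a simplified sketch (loosely modeled on the diff excerpts quoted in the review threads below; the name `checkedSequenceLength`, the structure, and the thrown exception are illustrative rather than the exact merged code):

```scala
import org.apache.spark.unsafe.array.ByteArrayMethods

// Sketch only: compute the sequence length with exact arithmetic and fall back to
// BigInt when an intermediate operation overflows, so the limit check always sees
// the true mathematical length. Assumes step != 0 (validated elsewhere).
def checkedSequenceLength(start: Long, stop: Long, step: Long): Int = {
  val len: BigInt =
    try {
      if (stop == start) {
        BigInt(1)
      } else {
        val delta = Math.subtractExact(stop, start) // throws ArithmeticException on overflow
        BigInt(Math.addExact(1L, delta / step))
      }
    } catch {
      case _: ArithmeticException =>
        // Exact math overflowed; redo the computation in arbitrary precision.
        BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
    }
  if (len > BigInt(ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH)) {
    throw new IllegalArgumentException(
      s"Too long sequence: $len. Should be <= ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}")
  }
  len.toInt
}
```

Performing the limit check on the `BigInt` value means an over-long sequence is reported with the size the user actually requested rather than a wrapped-around number.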

Why are the changes needed?

There is a correctness bug from overflow in the `sequence` expression.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests added in `CollectionExpressionsSuite.scala`.

github-actions bot added the SQL label on May 6, 2023
thepinetree changed the title from "[WIP][SPARK-43393][SQL] Address sequence expression overflow bug." to "[SPARK-43393][SQL] Address sequence expression overflow bug." on May 6, 2023
@HyukjinKwon (Member)

cc @gengliangwang FYI

thepinetree force-pushed the spark-sequence-overflow branch from d642ca6 to b8d44fa on May 10, 2023 19:39
@cloud-fan (Contributor) left a comment:

LGTM if tests pass

Review thread on the length computation in the diff:

```
    }
    val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
    if (len > MAX_ROUNDED_ARRAY_LENGTH) {
      throw new IllegalArgumentException(s"Too long sequence: $len. Should be <= " +
```
Member:

If the exception is a user-facing error, let's introduce an error class and raise a Spark exception with it.

@thepinetree (Contributor Author):

Sorry for the delay. What is the reasoning behind this change? It seems that all errors thrown in this file prefer Java-defined exceptions over their Spark wrappers in `SparkException.scala`.

If the decision is to use Spark wrappers for this expression only, is `SparkIllegalArgumentException` the right wrapper?

@thepinetree (Contributor Author):

Ah, I originally misunderstood your comment. With @ankurdave's help I learned about the errors defined in `QueryExecutionErrors.scala` and that they are actually used in this file. I thought `createArrayWithElementsExceedLimitError` might be the most fitting.

github-actions bot commented on Oct 2, 2023:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Oct 2, 2023
github-actions bot closed this on Oct 3, 2023
cloud-fan reopened this on Nov 14, 2023
cloud-fan removed the Stale label on Nov 14, 2023

Another review thread, on the `BigInt` fallback in the diff:

```
    case _: ArithmeticException =>
      val safeLen =
        BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
      if (safeLen > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
```
Contributor:

Maybe just use an assert? Assertion errors are also treated as internal errors.

@thepinetree (Contributor Author):

I personally like the current exception better, since it's more descriptive of the actual problem: trying to create too large an array (with the user's intended size) and what the limit is. If you feel strongly, I can change it to an assertion.

@cloud-fan (Contributor)

thanks, merging to master/3.5!

cloud-fan closed this in afc4c49 on Nov 15, 2023
cloud-fan pushed a commit that referenced this pull request on Nov 15, 2023

Closes #41072 from thepinetree/spark-sequence-overflow.

Authored-by: Deepayan Patra <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit afc4c49)
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun (Member)

Thank you, @thepinetree and @cloud-fan. Given that this is a long-standing overflow bug, do you think we can have this fix in the other live release branches, branch-3.4 and branch-3.3? I'm especially interested in branch-3.4 as the release manager of Apache Spark 3.4.2.

@cloud-fan (Contributor)

SGTM. @thepinetree can you help to create backport PRs? thanks!

@dongjoon-hyun (Member) commented on Nov 15, 2023

Oh this seems to break branch-3.5.

[Screenshot of the branch-3.5 build failure, 2023-11-15 9:31 AM]

Let me revert this from branch-3.5. Given the situation, we can start the backports from branch-3.5 down to branch-3.3 as three separate PRs, @thepinetree.

@thepinetree (Contributor Author)

@dongjoon-hyun (Member)

Thank you so much!

@dongjoon-hyun (Member)

Could you fix the compilation of your PRs, @thepinetree?

@dongjoon-hyun (Member)

Gentle ping, @thepinetree. All backporting PRs are broken at the compilation stage.

@thepinetree (Contributor Author)

Hi @dongjoon-hyun! Yes, sorry, I thought they'd be a simple backport apart from an import conflict. I'll fix the compilation errors when I have a chance tonight.

@dongjoon-hyun (Member)

Thank you so much, @thepinetree!

@beliefer (Contributor) left a comment:

Late LGTM.

@thepinetree (Contributor Author)

Quick update, @dongjoon-hyun: it looks like the 3.4/3.5 backports should be good to go once some flakiness is resolved (documentation build and one Spark Connect suite).

- 3.4/3.5 had an older version of the errors (the same across both versions), and I made changes accordingly.
- 3.3 had even older error definitions and needed some more changes on top of that.
