
Conversation

@thepinetree (Contributor)

What changes were proposed in this pull request?

Spark has a (long-standing) overflow bug in the `sequence` expression.

Consider the following operations:

spark.sql("CREATE TABLE foo (l LONG);")
spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")
spark.sql("SELECT sequence(0, l) FROM foo;").collect()

The result of these operations will be:

```
Array[org.apache.spark.sql.Row] = Array([WrappedArray()])
```

This is an unintended consequence of overflow.

The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1`, which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary, looks like:

```
  0111111111111111111111111111111111111111111111111111111111111111
- 0000000000000000000000000000000000000000000000000000000000000000
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
/ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
+ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
  1000000000000000000000000000000000000000000000000000000000000000
```

The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes because the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following `toInt` cast then [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457), yielding a length of `0` and hence an empty array.
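A minimal sketch in plain Scala (standalone variables, not the actual Spark code) reproduces the wrap-around and truncation:

```scala
// Sketch only: the unchecked arithmetic described above, outside of Spark.
val start = 0L
val stop = Long.MaxValue
val step = 1L

val len = (stop - start) / step + 1L // overflows to Long.MinValue
val maxLen = Int.MaxValue - 15L      // stand-in for MAX_ROUNDED_ARRAY_LENGTH (value assumed)

println(len)           // -9223372036854775808
println(len <= maxLen) // true: the negative value slips past the length check
println(len.toInt)     // 0: the upper bits are truncated, so an empty array is built
```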

Other overflows are similarly problematic.

This PR addresses the issue by checking numeric operations in the length computation for overflow.
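As a rough illustration of that approach, here is a simplified sketch (loosely modeled on the diff excerpts quoted in the review threads below; the name `checkedSequenceLength`, the structure, and the thrown exception are illustrative rather than the exact merged code):

```scala
import org.apache.spark.unsafe.array.ByteArrayMethods

// Sketch only: compute the sequence length with exact arithmetic and fall back to
// BigInt when an intermediate operation overflows, so the limit check always sees
// the true mathematical length. Assumes step != 0 (validated elsewhere).
def checkedSequenceLength(start: Long, stop: Long, step: Long): Int = {
  val len: BigInt =
    try {
      if (stop == start) {
        BigInt(1)
      } else {
        val delta = Math.subtractExact(stop, start) // throws ArithmeticException on overflow
        BigInt(Math.addExact(1L, delta / step))
      }
    } catch {
      case _: ArithmeticException =>
        // Exact math overflowed; redo the computation in arbitrary precision.
        BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
    }
  if (len > BigInt(ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH)) {
    throw new IllegalArgumentException(
      s"Too long sequence: $len. Should be <= ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}")
  }
  len.toInt
}
```

Performing the limit check on the `BigInt` value means an over-long sequence is reported with the size the user actually requested rather than a wrapped-around number.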

Why are the changes needed?

There is a correctness bug from overflow in the `sequence` expression.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests added in `CollectionExpressionsSuite.scala`.

github-actions bot added the SQL label on May 6, 2023
thepinetree changed the title from "[WIP][SPARK-43393][SQL] Address sequence expression overflow bug." to "[SPARK-43393][SQL] Address sequence expression overflow bug." on May 6, 2023
@HyukjinKwon (Member)

cc @gengliangwang FYI

thepinetree force-pushed the spark-sequence-overflow branch from d642ca6 to b8d44fa on May 10, 2023 19:39
@cloud-fan (Contributor) left a comment:

LGTM if tests pass

Review thread on the length computation in the diff:

```
    }
    val len = if (stop == start) 1L else Math.addExact(1L, (delta / estimatedStep.toLong))
    if (len > MAX_ROUNDED_ARRAY_LENGTH) {
      throw new IllegalArgumentException(s"Too long sequence: $len. Should be <= " +
```
Member:

If the exception is a user-facing error, let's introduce an error class and raise a Spark exception with it.

@thepinetree (Contributor Author):

Sorry for the delay. What is the reasoning behind this change? It seems that all errors thrown in this file prefer Java-defined exceptions over their Spark wrappers in `SparkException.scala`.

If the decision is to use Spark wrappers for this expression only, is `SparkIllegalArgumentException` the right wrapper?

@thepinetree (Contributor Author):

Ah, I originally misunderstood your comment. With @ankurdave's help I learned about the errors defined in `QueryExecutionErrors.scala` and that they are actually used in this file. I thought `createArrayWithElementsExceedLimitError` might be the most fitting.

github-actions bot commented on Oct 2, 2023:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Oct 2, 2023
github-actions bot closed this on Oct 3, 2023
cloud-fan reopened this on Nov 14, 2023
cloud-fan removed the Stale label on Nov 14, 2023

Another review thread, on the `BigInt` fallback in the diff:

```
    case _: ArithmeticException =>
      val safeLen =
        BigInt(1) + (BigInt(stop) - BigInt(start)) / BigInt(step)
      if (safeLen > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
```
Contributor:

Maybe just use an assert? Assertion errors are also treated as internal errors.

@thepinetree (Contributor Author):

I personally like the current exception better, since it's more descriptive of the actual problem: trying to create too large an array (with the user's intended size) and what the limit is. If you feel strongly, I can change it to an assertion.

@cloud-fan (Contributor)

thanks, merging to master/3.5!

cloud-fan closed this in afc4c49 on Nov 15, 2023
cloud-fan pushed a commit that referenced this pull request on Nov 15, 2023

Closes #41072 from thepinetree/spark-sequence-overflow.

Authored-by: Deepayan Patra <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit afc4c49)
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun (Member)

Thank you, @thepinetree and @cloud-fan. Given that this is a long-standing overflow bug, do you think we can have this fix in the other live release branches, branch-3.4 and branch-3.3? I'm especially interested in branch-3.4 as the release manager of Apache Spark 3.4.2.

@cloud-fan (Contributor)

SGTM. @thepinetree can you help to create backport PRs? thanks!

@dongjoon-hyun (Member) commented on Nov 15, 2023

Oh this seems to break branch-3.5.

[Screenshot of the branch-3.5 build failure, 2023-11-15 9:31 AM]

Let me revert this from branch-3.5. Given the situation, we can start the backports from branch-3.5 down to branch-3.3 as three separate PRs, @thepinetree.

@thepinetree (Contributor Author)

@dongjoon-hyun (Member)

Thank you so much!

@dongjoon-hyun (Member)

Could you fix the compilation of your PRs, @thepinetree?

@dongjoon-hyun (Member)

Gentle ping, @thepinetree. All backporting PRs are broken at the compilation stage.

@thepinetree (Contributor Author)

Hi @dongjoon-hyun! Yes, sorry, I thought they'd be a simple backport apart from an import conflict. I'll fix the compilation errors when I have a chance tonight.

@dongjoon-hyun (Member)

Thank you so much, @thepinetree!

@beliefer (Contributor) left a comment:

Late LGTM.

@thepinetree (Contributor Author)

Quick update, @dongjoon-hyun: it looks like the 3.4/3.5 backports should be good to go once some flakiness is resolved (documentation build and one Spark Connect suite).

- 3.4/3.5 had an older version of the errors (the same across both versions), and I made changes accordingly.
- 3.3 had even older error definitions and needed some more changes on top of that.
