[SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily #34369
Conversation
…in ParquetFileFormat" This reverts commit 4badb76.
```scala
      iter.asInstanceOf[Iterator[InternalRow]]
    } catch {
      case e: Throwable =>
        // SPARK-23457: In case there is an exception in initialization, close the iterator to
```
Shall we let the caller `FileScanRDD` close the iterator when hitting errors?
In general, I think `FileScanRDD` does close the iterator when hitting exceptions, because it uses a task completion listener to do so. The only case where it will not close the iterator is when the exception prevents `FileScanRDD` from getting a reference to the iterator, as is the case here.
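For context, here is a minimal sketch of that close-via-listener pattern and the one window it cannot cover. The listener registry and names like `readFile` are hypothetical stand-ins, not the actual `FileScanRDD` internals:

```scala
import java.io.Closeable

object CloseOnCompletionSketch {
  // Stand-in for TaskContext.addTaskCompletionListener: prepending means the
  // listeners later run in reverse order of registration.
  private var listeners: List[() => Unit] = Nil
  def addTaskCompletionListener(f: () => Unit): Unit = listeners ::= f
  def runTaskCompletionListeners(): Unit = listeners.foreach(_.apply())

  def main(args: Array[String]): Unit = {
    var currentIterator: Iterator[Int] with Closeable = null
    // Registered eagerly, before any file is opened: closes whatever iterator
    // reference is held at task completion.
    addTaskCompletionListener(() => if (currentIterator != null) currentIterator.close())

    def readFile(): Iterator[Int] with Closeable =
      new Iterator[Int] with Closeable {
        private val rows = Iterator(1, 2, 3)
        def hasNext: Boolean = rows.hasNext
        def next(): Int = rows.next()
        def close(): Unit = println("iterator closed")
      }

    try {
      // If readFile() throws *here*, currentIterator is never assigned and the
      // listener has nothing to close -- the only uncovered case.
      currentIterator = readFile()
    } finally {
      runTaskCompletionListeners() // closes the iterator on success and failure alike
    }
  }
}
```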
ah I see!
Test build #144538 has finished for PR 34369 at commit

Kubernetes integration test starting
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure

Test build #144543 has finished for PR 34369 at commit

Thank you for pinging me, @ankurdave.

cc @sunchao
```scala
        iter.closeIfNeeded()
      case iter: Closeable =>
        iter.close()
      case _ => // do nothing
```
When does this happen? Only when `currentIterator` is null?
There are currently two cases aside from null:
- `OrcFileFormat` produces an ordinary non-`Closeable` `Iterator` due to `unwrapOrcStructs()`.
- The user can create a `FileScanRDD` with an arbitrary `readFunction` that does not return a `Closeable` `Iterator`.

It would be ideal if we could disallow these cases and require the iterator to be `Closeable`, but it seems that would require changing public APIs.
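For illustration, a small self-contained sketch (an assumed simplification, not the actual `OrcFileFormat` code) of how a transformation step silently drops the `Closeable` capability:

```scala
import java.io.Closeable

object NonCloseableIteratorSketch {
  def main(args: Array[String]): Unit = {
    // An iterator that owns a resource and exposes close():
    val source = new Iterator[Int] with Closeable {
      private val rows = Iterator(1, 2, 3)
      def hasNext: Boolean = rows.hasNext
      def next(): Int = rows.next()
      def close(): Unit = println("resources released")
    }

    // Iterator#map returns a plain Iterator, so a `case iter: Closeable`
    // match arm no longer fires for the transformed iterator.
    val transformed: Iterator[Int] = source.map(_ * 2)
    assert(!transformed.isInstanceOf[Closeable])
  }
}
```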
Test build #144544 has finished for PR 34369 at commit
sunchao left a comment:
LGTM - just one minor question
```diff
- private lazy val internalIter = readCurrentFile()
- // vectorized Parquet reader. Here we use a lazily initialized variable to delay the
- // creation of iterator so that we will throw exception in `getNext`.
+ private var internalIter: Iterator[InternalRow] = null
```
Hm, why is this change necessary?
If the downstream operator never pulls any rows from the iterator, then the first time we access `internalIter` will be when `close()` is called. If `internalIter` is a `lazy val`, this will trigger a call to `readCurrentFile()`, which is unnecessary and may throw. Changing `internalIter` from a `lazy val` to a `var` lets us avoid this unnecessary call.

Several tests fail without this change, including `AvroV1Suite`.
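A minimal reproduction of that pitfall, with hypothetical names rather than the actual Spark code:

```scala
object LazyValPitfall {
  class CurrentFileReader {
    private def readCurrentFile(): Iterator[Int] = {
      println("opening file (expensive, may throw)")
      Iterator(1, 2, 3)
    }

    // Before: any access, including from close(), forces readCurrentFile().
    private lazy val lazyIter: Iterator[Int] = readCurrentFile()
    // After: stays null until a row is actually requested.
    private var varIter: Iterator[Int] = null

    def closeLazy(): Unit = if (lazyIter != null) println("closed") // triggers the read
    def closeVar(): Unit = if (varIter != null) println("closed")   // safe no-op
  }

  def main(args: Array[String]): Unit = {
    new CurrentFileReader().closeVar()  // prints nothing
    new CurrentFileReader().closeLazy() // prints "opening file (expensive, may throw)" then "closed"
  }
}
```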
Got it. Thanks
Merged to master and branch-3.2.
### What changes were proposed in this pull request?

The previous PR #34245 assumed task completion listeners are registered bottom-up. `ParquetFileFormat#buildReaderWithPartitionValues()` violates this assumption by registering a task completion listener to close its output iterator lazily. Since task completion listeners are executed in reverse order of registration, this listener always runs before other listeners. When the downstream operator contains a Python UDF and the off-heap vectorized Parquet reader is enabled, this results in a use-after-free that causes a segfault.

The fix is to close the output iterator using `FileScanRDD`'s task completion listener.
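To make the ordering concrete, here is a toy model of the execution order (illustrative listener names, not Spark's actual `TaskContext` API):

```scala
object ListenerOrderingDemo {
  // Listeners are stored by prepending, so iteration visits the most recently
  // registered listener first -- reverse order of registration.
  private var listeners: List[String] = Nil
  def register(name: String): Unit = listeners ::= name

  def main(args: Array[String]): Unit = {
    // Registered eagerly, bottom-up, at task start:
    register("FileScanRDD: close current iterator")
    register("Python runner: stop writer thread")
    // Registered lazily on the first read, i.e. last:
    register("ParquetFileFormat: free off-heap column buffers")

    // Prints the Parquet listener first: it frees buffers that a
    // later-running listener (or its thread) may still be reading.
    listeners.foreach(println)
  }
}
```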
### Why are the changes needed?

Without this PR, the Python tests introduced in #34245 are flaky (see details in thread). They intermittently fail with a segfault.
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?

Repeatedly ran one of the Python tests introduced in #34245 using the commands below. Previously, the test was flaky and failed after about 50 runs. With this PR, the test has not failed after 1000 runs.
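```sh
./build/sbt -Phive clean package && ./build/sbt test:compile
seq 1000 | parallel -j 8 --halt now,fail=1 'echo {#}; python/run-tests --testnames pyspark.sql.tests.test_udf'
```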