


@eejbyfeldt eejbyfeldt commented Oct 18, 2023

### What changes were proposed in this pull request?

Fixes a correctness issue in 3.5.0. The problem seems to be that when AQEShuffleRead does a coalesced read, it can return a HashPartitioning with the coalesced number of partitions. This causes a correctness bug, as the partitioning is not compatible for joins with other HashPartitioning instances even though the number of partitions matches. This is resolved in this patch by introducing CoalescedHashPartitioning and making AQEShuffleRead return that instead.

The fix was suggested by @cloud-fan

> AQEShuffleRead should probably return a different partitioning, e.g. CoalescedHashPartitioning. It still satisfies ClusteredDistribution, so Aggregate is fine and there will be no shuffle. For joins, two CoalescedHashPartitionings are compatible if they have the same original partition number and coalesce boundaries, and CoalescedHashPartitioning is not compatible with HashPartitioning.
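
A minimal sketch in Scala of the idea behind the new partitioning (simplified and illustrative, not the exact patched source; `CoalescedBoundary` and `compatibleWith` here only mirror the compatibility rule described above):

```scala
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

// One coalesced partition covers a contiguous range of original reducer ids.
case class CoalescedBoundary(startReducerIndex: Int, endReducerIndex: Int)

// Wraps the original hash partitioning together with the coalesce boundaries.
case class CoalescedHashPartitioning(
    from: HashPartitioning,
    partitions: Seq[CoalescedBoundary]) {

  // A coalesced read exposes the reduced partition count ...
  val numPartitions: Int = partitions.length

  // ... but join compatibility is decided on the original partitioning and
  // the exact coalesce boundaries, never on numPartitions alone. A plain
  // HashPartitioning with the same numPartitions is NOT compatible.
  def compatibleWith(other: CoalescedHashPartitioning): Boolean =
    from == other.from && partitions == other.partitions
}
```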

### Why are the changes needed?

Correctness bug.

### Does this PR introduce _any_ user-facing change?

Yes, fixed correctness issue.

### How was this patch tested?

New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

@eejbyfeldt eejbyfeldt changed the base branch from branch-3.5 to master October 18, 2023 13:58
@eejbyfeldt
Contributor Author

Tagging @cloud-fan and @ulysses-you since they created PRs in this area and might know a better way of fixing the bug.

@github-actions github-actions bot added the SQL label Oct 18, 2023
@ulysses-you
Contributor

I think the issue is that we propagate a coalesced shuffle exchange through InMemoryTableScanExec, and then EnsureRequirements uses the coalesced shuffle exchange to create the shuffle exchange on the other side.
However, the shuffle exchanges are actually not compatible: one side's shuffle is from HashPartitioning(200) coalesced to HashPartitioning(10), while the other side's shuffle is HashPartitioning(10). This causes the join data issue.

                      Scan
                       |
                   Shuffle(200)
                       |
  Scan           AQEShuffleRead(10)
   |                   |
Shuffle(10)   InMemoryTableScanExec
    \            /
         Join    

BTW, if you set spark.sql.shuffle.partitions=5, I think this issue should be resolved.

There are two code places related to this issue:

  1. AQEShuffleRead always assumes the coalesced partitioning is unchanged and just refreshes the partition number. I think this is based on the assumption that all the initial shuffle partition numbers are the same, but that does not hold: EnsureRequirements supports shouldConsiderMinParallelism, which can cause different initial shuffle partition numbers within one query execution.
  2. InMemoryTableScanExec propagates the output partitioning, and it introduces one more query execution, which also breaks the assumption of AQEShuffleRead (see the reproduction sketch after this list).
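
A hedged reproduction sketch of the scenario in the diagram above (column names, row counts, and configs are illustrative, not taken from the PR's tests):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()
import spark.implicits._

// Cached side: the aggregate's final shuffle (200 partitions) may be coalesced
// by AQE, and InMemoryTableScanExec then reports the coalesced partitioning.
val cached = spark.range(0, 1000000)
  .select(($"id" % 100).as("key"))
  .groupBy("key").count()
  .cache()
cached.count() // materialize the cache

// Other side: EnsureRequirements shuffles it to match the *coalesced* count,
// yielding a HashPartitioning with equal numPartitions that can still route
// equal keys to different partitions than the coalesced read.
val other = spark.range(0, 100).select($"id".as("key"))

// On affected versions this join could produce incorrect results.
cached.join(other, "key").count()
```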

@cloud-fan
Contributor

AQEShuffleRead should probably return a different partitioning, e.g. CoalescedHashPartitioning. It still satisfies ClusteredDistribution, so Aggregate is fine and there will be no shuffle. For joins, two CoalescedHashPartitionings are compatible if they have the same original partition number and coalesce boundaries, and CoalescedHashPartitioning is not compatible with HashPartitioning.

@eejbyfeldt
Contributor Author

eejbyfeldt commented Oct 20, 2023

@cloud-fan I took a stab at implementing your suggestion, but the reproduction of the bug still fails. So either I made some mistake or missed some other part of the code that needs to be updated. It would be great if you could provide some feedback.

@cloud-fan
Contributor

@ulysses-you can you take a look when you have time?


@ulysses-you ulysses-you left a comment


Looks fine to me, cc @cloud-fan

@eejbyfeldt eejbyfeldt changed the title [SPARK-45592][SQL][WIP] Correctness issue in AQE with InMemoryTableScanExec [SPARK-45592][SQL] Correctness issue in AQE with InMemoryTableScanExec Oct 23, 2023
@eejbyfeldt
Contributor Author

> We don't need this hack anymore and can safely remove the if branch.

Is the suggestion to do that in this PR or is it better to do it in a follow up?

@cloud-fan
Contributor

@eejbyfeldt nvm, I made a mistake. This is for coalesce; we can add a new partitioning for skew join handling (split and replicate partitions). It's unrelated to this PR and we can do it later.

override val numPartitions: Int = partitions.length

override def toString: String = from.toString
override def sql: String = from.sql

After a second thought, why do we need to hide CoalescedHashPartitioning? Can we run some example queries and check EXPLAIN and SQL web UI?
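
One hedged way to check, assuming a running `spark` session (the query shapes are illustrative):

```scala
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// A small aggregate-then-join query whose first shuffle AQE is likely to coalesce.
val left   = spark.range(0, 100000).select(($"id" % 10).as("key")).groupBy("key").count()
val right  = spark.range(0, 10).select($"id".as("key"))
val joined = left.join(right, "key")

joined.collect()            // run it so AQE finalizes the adaptive plan
joined.explain("formatted") // inspect how the (coalesced) partitioning is rendered
```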


@cloud-fan cloud-fan left a comment


LGTM with only one minor comment

@maryannxue
Contributor

@eejbyfeldt Can you briefly describe the triggering condition of this bug? Does it only occur when coalescing happens to produce exactly the same number of partitions as the other side of the join?

In the meantime, I'm wondering if it would be better to:

  1. not coalesce for the top/last shuffle of the physical plan of InMemoryTableScan
  2. have coalesce rule deal with InMemoryTableScan from the caller side (user of the cache)

This PR, just to address the correctness issue, only needs to do (1). And we can do (2) (a little trickier, I suppose) for performance improvement.

@maryannxue
Contributor

Synced with @cloud-fan offline; (2) in the above suggestion wouldn't work. Let's go ahead with the current fix.

@cloud-fan
Contributor

The failed streaming test is unrelated, and my last comment is quite minor, so let's merge this first to fix the correctness bug. Thanks for your great work!

@cloud-fan cloud-fan closed this in 2be03d8 Oct 31, 2023
cloud-fan pushed a commit that referenced this pull request Oct 31, 2023
Fixes a correctness issue in 3.5.0. The problem seems to be that when AQEShuffleRead does a coalesced read, it can return a HashPartitioning with the coalesced number of partitions. This causes a correctness bug, as the partitioning is not compatible for joins with other HashPartitioning instances even though the number of partitions matches. This is resolved in this patch by introducing CoalescedHashPartitioning and making AQEShuffleRead return that instead.

The fix was suggested by cloud-fan

> AQEShuffleRead should probably return a different partitioning, e.g. CoalescedHashPartitioning. It still satisfies ClusteredDistribution, so Aggregate is fine and there will be no shuffle. For joins, two CoalescedHashPartitionings are compatible if they have the same original partition number and coalesce boundaries, and CoalescedHashPartitioning is not compatible with HashPartitioning.

Correctness bug.

Yes, fixed correctness issue.

New and existing unit test.

No

Closes #43435 from eejbyfeldt/SPARK-45592.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 2be03d8)
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun
Member

Thank you, @eejbyfeldt and all.

eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Nov 9, 2023
dongjoon-hyun pushed a commit that referenced this pull request Nov 12, 2023
…MemoryTableScanExec

### What changes were proposed in this pull request?

This backports #43435 (SPARK-45592) to the 3.4 branch, because the issue was already reported there as SPARK-45282, though it required enabling some extra configuration to hit the bug.

### Why are the changes needed?

Fix correctness issue.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a correctness issue.

### How was this patch tested?

New tests based on the reproduction example in SPARK-45282

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43729 from eejbyfeldt/SPARK-45282.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@JJACOB0806

Hello, is there a timeline for the 3.5.1 release? We are facing this issue in 3.5.0 and would like to know when the next stable version will be rolled out.

@deepakcv

deepakcv commented Jan 12, 2024

Hi, is there a tentative timeline for releasing spark-3.5.1 with these changes?

cloud-fan pushed a commit that referenced this pull request Feb 7, 2024
### What changes were proposed in this pull request?

#43435 and #43760 fix a correctness issue that is triggered when AQE is applied to a cached query plan, specifically when AQE coalesces the final result stage of the cached plan.

The current semantics of `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`
([source code](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L403-L411)) are:

- when true, we enable AQE but disable coalescing the final stage (default)
- when false, we disable AQE

But let’s revisit the semantics of this config: for the caller, the only thing that matters is whether we change the output partitioning of the cached plan, and we should try to apply AQE whenever possible. Thus we want to modify the semantics of `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`:

- when true, we enable AQE and allow coalescing the final stage: this might lead to a perf regression, because it introduces an extra shuffle
- when false, we enable AQE but disable coalescing the final stage (this is actually the `true` semantic of the old behavior)

Also, to keep the default behavior unchanged, we might want to flip the default value of `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` to `false`.
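
Under the new semantics, a minimal sketch of how a user might choose (the config key is real; which value your version defaults to should be verified):

```scala
// Preserve the cached plan's output partitioning: AQE still runs, but the
// final stage of the cached plan is not coalesced (safe for downstream
// consumers that rely on the original partitioning).
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "false")

// Allow AQE to coalesce the final stage of the cached plan as well; this can
// shrink small partitions but may force an extra shuffle in a consumer that
// needed the original partitioning.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
```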

### Why are the changes needed?

To allow AQE coalesce final stage in SQL cached plan. Also make the semantic of `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` more reasonable.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Updated UTs.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45054 from liuzqt/SPARK-46995.

Authored-by: Ziqi Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…MemoryTableScanExec

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025