[SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys #29389

leanken-zz · 2020-08-08T09:46:24Z

What changes were proposed in this pull request?

In SPARK-32290, we introduced several new types of HashedRelation.

EmptyHashedRelation
EmptyHashedRelationWithAllNullKeys

They were all limited to used only in NAAJ scenario. These new HashedRelation could be applied to other scenario for performance improvements.

EmptyHashedRelation could also be used in Normal AntiJoin for fast stop
While AQE is on and buildSide is EmptyHashedRelationWithAllNullKeys, can convert NAAJ to a Empty LocalRelation to skip meaningless data iteration since in Single-Key NAAJ, if null key exists in BuildSide, will drop all records in streamedSide.

This Patch including two changes.

using EmptyHashedRelation to do fast stop for common anti join as well
In AQE, eliminate BroadcastHashJoin(NAAJ) if buildSide is a EmptyHashedRelationWithAllNullKeys

Why are the changes needed?

LeftAntiJoin could apply fast stop when BuildSide is EmptyHashedRelation, While within AQE with EmptyHashedRelationWithAllNullKeys, we can eliminate the NAAJ. This should be a performance improvement in AntiJoin.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

added case in AdaptiveQueryExecSuite.
added case in HashedRelationSuite.
Make sure SubquerySuite JoinSuite SQLQueryTestSuite passed.

In [SPARK-32290](https://issues.apache.org/jira/browse/SPARK-32290), we introduced several new types of HashedRelation. * EmptyHashedRelation * EmptyHashedRelationWithAllNullKeys They were all limited to used only in NAAJ scenario. But as for a improvement, EmptyHashedRelation could also be used in Normal AntiJoin for fast stop, and as for in AQE, we can even eliminate anti join when we knew that buildSide is empty. This Patch including two changes. * In Non-AQE, using EmptyHashedRelation to do fast stop for common anti join as well * In AQE, eliminate anti join if buildSide is a EmptyHashedRelation of ShuffleWriteRecord is 0 LeftAntiJoin could apply `fast stop` when BuildSide is Empty, While within AQE, we can even eliminate the anti join. This should be a performance improvement in AntiJoin. No. * added case in AdaptiveQueryExecSuite. * added case in HashedRelationSuite. * Make sure SubquerySuite JoinSuite SQLQueryTestSuite passed. Change-Id: I718227725e319a54b99c91e6a0d1b9f022343b16

Change-Id: I47ebe0cc2e751b018a6ee02fc3018601805c4060

leanken-zz · 2020-08-08T09:46:46Z

@cloud-fan could you please have a look at this followup?

maropu · 2020-08-09T01:13:04Z

ok to test

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateAntiJoin.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

Change-Id: Icfe57ed02de8dd1b414f4d0c81a51353d49a6ead

leanken-zz · 2020-08-09T02:20:46Z

@maropu updated. thanks for doing this in weekend. ^_^

maropu · 2020-08-09T02:38:12Z

You, too, @leanken ;) thanks for the work, anyway.

SparkQA · 2020-08-09T05:33:47Z

Test build #127231 has finished for PR 29389 at commit 25dd444.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-09T06:53:12Z

Test build #127233 has finished for PR 29389 at commit 2063288.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateAntiJoin.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

Change-Id: If5d776916fc12674068c486d605ff9a1fc28526f

leanken-zz · 2020-08-10T01:30:56Z

@dongjoon-hyun updated, thanks

cloud-fan · 2020-08-10T04:27:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala

    "buildTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to build"),
    "broadcastTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to broadcast"))

+  private var knownRowCount: Option[BigInt] = None


can we be consistent and use metrics to hold rowCount? You can follow how we track dataSize.

Sure will do the update.

cloud-fan · 2020-08-10T04:29:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala


            val dataSize = relation match {
+              case EmptyHashedRelation =>
+                knownRowCount = Some(0L)


See L118. We already know the row count and we even use it to check numRows >= MAX_BROADCAST_TABLE_ROWS

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

cloud-fan · 2020-08-10T04:32:35Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala


 /**
 * A special HashedRelation indicates it built from a empty input:Iterator[InternalRow].
+ * get & getValue will return null just like


BTW where do we call these methods for EmptyHashedRelation?

yes. It will be called. EmptyHashedRelation now is also applied at all joinType. I confirmed it while running UT.

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala

cloud-fan · 2020-08-10T05:42:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateAntiJoin.scala

+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
+    // If the right side is empty, LeftAntiJoin simply returns the left side.
+    // Eliminate Join with left LogicalPlan instead.
+    case Join(left, right, LeftAnti, _, _) if canEliminate(right) =>


This is not very useful, as the anti join is already very fast with the empty right side.

https://github.com/apache/spark/pull/29389/files#r467683621 is more useful, as for now we still need to consume all the left side data and return nothing. It will be a big improvement if we can directly convert the join node to an empty relation with AQE.

This is not very useful, as the anti join is already very fast with the empty right side.

https://github.com/apache/spark/pull/29389/files#r467683621 is more useful, as for now we still need to consume all the left side data and return nothing. It will be a big improvement if we can directly convert the join node to an empty relation with AQE.

In fact, we need to consume all the left side data and return all the rows when it's an EmptyHashedRelation.

In fact, we need to consume all the left side data and return all the rows when it's an EmptyHashedRelation.

Yes, but this rule doesn't change it. By returning the left side plan, we still need to consume all the data of the left side.

eliminate naaj which buildSide is EmptyHashedRelationWithAllNullKeys while AQE is on, should gain more Performance improvement. Change-Id: I13c667628220da923fc59718281642460c80f71d

Change-Id: I0d0a96da107145b9fc681e15f2533abd165f344a

Change-Id: I3039efd9e472431131b52bec0d0619b84e2c6ee5

SparkQA · 2020-08-10T07:05:02Z

Test build #127249 has finished for PR 29389 at commit 257745a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

leanken-zz · 2020-08-10T07:06:34Z

@cloud-fan updated.

leanken-zz · 2020-08-10T07:08:07Z

keep EmptyHashedRelation for common anti join fast stop both for NON-AQE and AQE
rewrite NAAJ to Empty LocalRelation when BuildSide is EmptyHashedRelationWithAllNullKeys with AQE on.

cloud-fan · 2020-08-10T07:53:27Z

@leanken please also update the PR description.

leanken-zz · 2020-08-10T08:12:04Z

Update PR title => Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

leanken-zz · 2020-08-10T08:12:43Z

Update RP desc and Jira desc accordingly, @cloud-fan

SparkQA · 2020-08-10T12:12:44Z

Test build #127270 has finished for PR 29389 at commit 381db11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-10T14:03:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala

  protected override def codegenAnti(ctx: CodegenContext, input: Seq[ExprCode]): String = {
-    if (isNullAwareAntiJoin) {
-      val (broadcastRelation, relationTerm) = prepareBroadcast(ctx)
+    val (broadcastRelation, relationTerm) = prepareBroadcast(ctx)


After a second look, I think this is incorrect. We may go to the else branch, which calls super.codegenAnti and call prepareBroadcast again.

I think we should keep this method untouched, and update HashJoin.codegenAnti to fast stop with EmptyHashedRelation.

I thought about this problem too.
HashJoin.scala does not have the API prepareBroadcast, only the following prepareRelation available, which will not return HashedRelation.

protected def prepareRelation(ctx: CodegenContext): (String, Boolean)

I was thinking that call prepareBroadcast twice is OK since broadcast.Broadcast[HashedRelation] is a future call, and the collect and build action will only happen once, but the relationTerm will be generated twice

is it ok to just change prepareRelation to return a (String, Boolean, HashedRelation) ??
tell me when you decided which way to go. @cloud-fan

I was also consider doing it inside BHJ because in ShuffledHashJoinExec
the hashedRelation is built among generated-code, it might be hard to just return hashedRelation in prepareRelation call.

protected override def prepareRelation(ctx: CodegenContext): (String, Boolean) = { val thisPlan = ctx.addReferenceObj("plan", this) val clsName = classOf[HashedRelation].getName // Inline mutable state since not many join operations in a task val relationTerm = ctx.addMutableState(clsName, "relation", v => s"$v = $thisPlan.buildHashedRelation(inputs[1]);", forceInline = true) (relationTerm, false) }

Change-Id: Ieb631235d4bc8f1446387eb22f0c1a1cc1d3ed27

leanken-zz · 2020-08-10T15:43:44Z

@cloud-fan updated. refine prepareRelation to return

private[joins] case class HashedRelationInfo(
    relationTerm: String,
    keyIsUnique: Boolean,
    isEmptyHashedRelation: Boolean)

Code change int BHJ reverted, and do fast stop in HashJoin.scala codegenAnti using isEmptyHashedRelation

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala

cloud-fan

LGTM except a few nits.

SparkQA · 2020-08-10T21:00:06Z

Test build #127286 has finished for PR 29389 at commit c8eae30.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Change-Id: I726afddeea0c49d5299b7dc9e8e27f0021411418

SparkQA · 2020-08-11T04:28:33Z

Test build #127296 has finished for PR 29389 at commit 5309b66.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

leanken-zz · 2020-08-11T04:30:38Z

@cloud-fan Test passed, ready to merge.

cloud-fan · 2020-08-11T06:23:42Z

thanks, merging to master!

leanken-zz added 2 commits August 8, 2020 17:41

remove useless import.

25dd444

Change-Id: I47ebe0cc2e751b018a6ee02fc3018601805c4060

probot-autolabeler bot added the SQL label Aug 8, 2020

maropu reviewed Aug 9, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateAntiJoin.scala Outdated Show resolved Hide resolved

maropu reviewed Aug 9, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala Outdated Show resolved Hide resolved

maropu reviewed Aug 9, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala Outdated Show resolved Hide resolved

Nit & Code Style.

2063288

Change-Id: Icfe57ed02de8dd1b414f4d0c81a51353d49a6ead

dongjoon-hyun reviewed Aug 10, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateAntiJoin.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed Aug 10, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed Aug 10, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala Outdated Show resolved Hide resolved

nit & indentation.

257745a

Change-Id: If5d776916fc12674068c486d605ff9a1fc28526f

cloud-fan reviewed Aug 10, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala Show resolved Hide resolved

cloud-fan reviewed Aug 10, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala Outdated Show resolved Hide resolved

cloud-fan reviewed Aug 10, 2020

View reviewed changes

leanken-zz added 3 commits August 10, 2020 14:30

eliminate normal anti join when EmptyHashedRelation is not neccesary.

817de1e

eliminate naaj which buildSide is EmptyHashedRelationWithAllNullKeys while AQE is on, should gain more Performance improvement. Change-Id: I13c667628220da923fc59718281642460c80f71d

file renamed to EliminateNullAwareAntiJoin since it's for NAAJ.

caec89e

Change-Id: I0d0a96da107145b9fc681e15f2533abd165f344a

remove SQLConf in EliminateNullAwareAntiJoin.

381db11

Change-Id: I3039efd9e472431131b52bec0d0619b84e2c6ee5

cloud-fan approved these changes Aug 10, 2020

View reviewed changes

leanken-zz changed the title ~~[SPARK-32573][SQL] Eliminate Anti Join when BuildSide is Empty~~ [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys Aug 10, 2020

cloud-fan reviewed Aug 10, 2020

View reviewed changes

refine prepareRelation to avoid call prepareBroadcast twice.

c8eae30

Change-Id: Ieb631235d4bc8f1446387eb22f0c1a1cc1d3ed27

cloud-fan reviewed Aug 10, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala Outdated Show resolved Hide resolved

cloud-fan reviewed Aug 10, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala Outdated Show resolved Hide resolved

cloud-fan approved these changes Aug 10, 2020

View reviewed changes

nit.

5309b66

Change-Id: I726afddeea0c49d5299b7dc9e8e27f0021411418

cloud-fan closed this in c37357a Aug 11, 2020

cloud-fan mentioned this pull request Aug 20, 2020

[SPARK-32649][SQL] Optimize BHJ/SHJ inner/semi join with empty hashed relation #29484

Closed

[SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys #29389

[SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys #29389

Uh oh!

Conversation

leanken-zz commented Aug 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

leanken-zz commented Aug 8, 2020

Uh oh!

maropu commented Aug 9, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

leanken-zz commented Aug 9, 2020

Uh oh!

maropu commented Aug 9, 2020

Uh oh!

SparkQA commented Aug 9, 2020

Uh oh!

SparkQA commented Aug 9, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

leanken-zz commented Aug 10, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Aug 10, 2020

Uh oh!

leanken-zz commented Aug 10, 2020

Uh oh!

leanken-zz commented Aug 10, 2020

Uh oh!

cloud-fan commented Aug 10, 2020

Uh oh!

leanken-zz commented Aug 10, 2020

Uh oh!

leanken-zz commented Aug 10, 2020

Uh oh!

SparkQA commented Aug 10, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leanken-zz commented Aug 10, 2020

Uh oh!

Uh oh!

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Aug 10, 2020

Uh oh!

SparkQA commented Aug 11, 2020

Uh oh!

leanken-zz commented Aug 8, 2020 •

edited

Loading