-
Notifications
You must be signed in to change notification settings - Fork 408
[WIP] shuffle early delete feature for Spark #3569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…able reuse shuffle id when early deletion feature turned on
| PbReportMissingShuffleIdResponse.newBuilder().setSuccess(ret).build() | ||
| context.reply(pbReportMissingShuffleIdResponse) | ||
| /* | ||
| val latestUpstreamShuffleId = shuffleIds.maxBy(_._2._1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the original dedup logic may suffer from a race condition described as following
stage A depends on shuffle 1, due to "too early deletion", the missing report is sent and handled for shuffle 1, at this point, a new shuffle id is generated, so latestUpstreamShuffleId._2._1 is no longer UNKNOWN_MISSING_CELEBORN_SHUFFLE_ID... the missing report is handled again... then mess up everything
| // be cleaned up as it is entirely unusuable | ||
| if (determinate && !isBarrierStage && !isCelebornSkewShuffleOrChildShuffle( | ||
| appShuffleId)) { | ||
| appShuffleId) && !conf.clientShuffleEarlyDeletion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot reuse the shuffle id when this feature is turned on, think about the following
stage B.0 depends on shuffle 1 which was written by stage A.0
due to "too early deletion", shuffle 1 id is lost, we need to run A.1 , now , shuffle 1 has been deleted from "registered shuffle" , if we reuse 1 as the id and send to tasks of A.1, we will fall into errors like "shuffle not registered"
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR resolve a correctness bug?
Does this PR introduce any user-facing change?
How was this patch tested?