Conversation


@otterc otterc commented Mar 22, 2021

What changes were proposed in this pull request?

This PR fixes bugs that cause corruption of push-merged blocks when a client terminates while pushing a block. RemoteBlockPushResolver was introduced in #30062 (SPARK-32916).

There are 2 scenarios where the merged blocks get corrupted:

  1. StreamCallback.onFailure() is called more than once. Initially we assumed that the onFailure callback would be called just once per stream. However, we observed that it is called twice when a client connection is reset. When the client connection is reset, 2 events get triggered in this order:
  • exceptionCaught. This event is propagated to StreamInterceptor. StreamInterceptor.exceptionCaught() invokes callback.onFailure(streamId, cause). This is the first time StreamCallback.onFailure() is invoked.
  • channelInactive. Since the channel closes, the channelInactive event gets triggered, which is again propagated to StreamInterceptor. StreamInterceptor.channelInactive() invokes callback.onFailure(streamId, new ClosedChannelException()). This is the second time StreamCallback.onFailure() is invoked.
  2. The flag isWriting is set prematurely to true. This introduces an edge case where a stream that is trying to merge a duplicate block (created because of a speculative task) may interfere with an active stream if the duplicate stream fails.
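
As a rough illustration of the first fix, a failure path can be made idempotent with a compare-and-set guard, so the second callback (from channelInactive, after exceptionCaught already fired) becomes a no-op. This is a hypothetical sketch, not the actual Spark code; class and method names are illustrative:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch (not the actual RemoteBlockPushResolver code):
// making the failure path idempotent so the second onFailure call,
// triggered by channelInactive after exceptionCaught, cannot clobber
// partition state that a later stream now owns.
class IdempotentStreamCallback {
    private final AtomicBoolean failed = new AtomicBoolean(false);
    private int currentMapIndex;

    IdempotentStreamCallback(int activeMapIndex) {
        this.currentMapIndex = activeMapIndex;
    }

    void setCurrentMapIndex(int mapIndex) { currentMapIndex = mapIndex; }
    int getCurrentMapIndex() { return currentMapIndex; }

    void onFailure(String streamId, Throwable cause) {
        // compareAndSet ensures only the FIRST failure resets the
        // partition; any duplicate callback is a no-op.
        if (failed.compareAndSet(false, true)) {
            currentMapIndex = -1;
        }
    }
}
```

Without the guard, the second onFailure would reset currentMapIndex to -1 even after another stream had taken over the partition.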

This PR also includes additional changes that improve the code:

  1. Using positional writes everywhere, because this simplifies the code, and microbenchmarking showed no performance impact.
  2. Additional minor changes suggested by @mridulm during an internal review.
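
For context on item 1, java.nio's FileChannel supports positional writes directly via write(ByteBuffer, long): each write names its own absolute offset and never consults or advances the channel's implicit position, so no seek bookkeeping is needed. A small standalone example (not the PR's code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalWrite {
    static String demo() throws IOException {
        Path file = Files.createTempFile("push-merged", ".data");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            // Each write names its own absolute offset; the channel's
            // implicit position is never consulted or advanced.
            ch.write(ByteBuffer.wrap("AAAA".getBytes()), 0);
            ch.write(ByteBuffer.wrap("BBBB".getBytes()), 4);
        }
        return new String(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints AAAABBBB
    }
}
```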

Why are the changes needed?

These are bug fixes and simplify the code.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests. I have also tested these changes in LinkedIn's internal fork on a cluster.

Co-authored-by: Chandni Singh [email protected]
Co-authored-by: Min Shen [email protected]

@github-actions github-actions bot added the CORE label Mar 22, 2021

otterc commented Mar 22, 2021

@tgravescs @Ngone51 @attilapiros @mridulm @Victsm
Please help review these bug fixes

@tgravescs

Does this cause data corruption if people use it with the Spark 3.1.1 release? Or are these blocks somehow caught, so the shuffle ends up failing?


otterc commented Mar 22, 2021

Does this cause data corruption if people use it with the Spark 3.1.1 release? Or are these blocks somehow caught, so the shuffle ends up failing?

  • In 3.1.1, push-based shuffle is not complete. As in, the changes needed to fetch merged shuffle blocks are not there, so users on 3.1.1 can't use push-based shuffle.

  • In our implementation of fetching merged shuffle blocks (which is not in 3.1.1), if the client encounters any issues with the merged blocks, it falls back to fetching the original unmerged blocks that make up that merged block.

@AmplabJenkins

Can one of the admins verify this patch?


otterc commented Mar 23, 2021

This test failure is unrelated:

[info] *** 1 TEST FAILED ***
[error] Failed: Total 2998, Failed 1, Errors 0, Passed 2997, Ignored 7, Canceled 1
[error] Failed tests:
[error] 	org.apache.spark.storage.FallbackStorageSuite


@dongjoon-hyun dongjoon-hyun left a comment


Hi, @otterc .
SPARK-32916 is already released with Fix Version: 3.1.0. To have an independent Fix Version, this should have a new JIRA issue.


@Ngone51 Ngone51 left a comment


Hmm... I'm trying to understand how those 2 scenarios cause the merged block corruption.

  1. Do you mean that calling StreamCallback.onFailure() twice causes the block corruption? It seems the only things onFailure does are setCurrentMapIndex(-1) and setEncounteredFailure(true), and they don't touch files, e.g., reset position or truncate.

  2. I can see how the duplicate stream may interfere with an active stream, e.g., the active stream may see getCurrentMapIndex < 0 and isEncounteredFailure=true while writing normally itself. But it seems like the active stream is able to heal itself within the current framework.

I probably missed some details. Could you elaborate more on how corruption happens? Thanks.

@otterc otterc force-pushed the SPARK-32916-followup branch from 5aaaa6f to c33f961 Compare March 23, 2021 16:09
@otterc otterc changed the title [SPARK-32916][FOLLOW-UP][SHUFFLE] Fixes cases of corruption in merged shuffle … [SPARK-34840][SHUFFLE] Fixes cases of corruption in merged shuffle … Mar 23, 2021

otterc commented Mar 23, 2021

Hi, @otterc .
SPARK-32916 is already released with Fix Version: 3.1.0. To have an independent Fix Version, this should have a new JIRA issue.

I created SPARK-34840 to address this. cc. @dongjoon-hyun

@dongjoon-hyun

Thank you, @otterc !

@dongjoon-hyun dongjoon-hyun dismissed their stale review March 23, 2021 16:28

New JIRA is created.


otterc commented Mar 23, 2021

Hmm... I'm trying to understand how those 2 scenarios cause the merged block corruption.

  1. Do you mean that calling StreamCallback.onFailure() twice causes the block corruption? It seems the only things onFailure does are setCurrentMapIndex(-1) and setEncounteredFailure(true), and they don't touch files, e.g., reset position or truncate.
  2. I can see how the duplicate stream may interfere with an active stream, e.g., the active stream may see getCurrentMapIndex < 0 and isEncounteredFailure=true while writing normally itself. But it seems like the active stream is able to heal itself within the current framework.

I probably missed some details. Could you elaborate more on how corruption happens? Thanks.

In both scenarios, the currentMapId of the shuffle partition is modified to -1, which can interfere with an active stream (a stream that is writing). By interfering, I mean it gives another stream that is waiting to merge to the same shuffle partition a chance to start writing before the active stream has completed (successfully or with failure).

Providing examples for both of these:

  1. When onFailure is called twice
  • Say stream1, merging shufflePush_0_1_2, wrote some data and has isWriting=true. Now it failed, so it sets currentMapId of partition_0_2 to -1.
  • Another stream2, which wants to merge shufflePush_0_2_2, can now start merging its bufs to partition_0_2, and it sets currentMapId of partition_0_2 to 2.
  • Another stream3, which wants to merge shufflePush_0_3_2, will defer its buffers because stream2 is the active one right now (currentMapId is 2).
  • stream2 has only merged a few bufs, but then stream1.onFailure() is invoked again, and that changes the currentMapId of partition_0_2 to -1. This becomes a problem because stream2 hasn't completed (successfully or with failure) and now stream3 is allowedToWrite. If stream3 starts writing buffers when stream2 has not appended all its buffers, then the data of shufflePush_0_2_2 will be corrupted.
  2. Duplicate stream
  • Say stream1, merging shufflePush_0_1_2, wrote some data and has isWriting=true. It completed successfully and then sets currentMapId of partition_0_2 to -1.
  • Now stream1Duplicate, which is also trying to merge shufflePush_0_1_2, will be allowedToWrite because the currentMapId of partition_0_2 is -1, and it sets isWriting=true. However, we identify that it is a duplicate stream and just return without modifying currentMapId.
  • stream2, which tries to merge shufflePush_0_2_2, will be allowedToWrite because currentMapId=-1. It sets currentMapId=2 and starts writing.
  • If stream1Duplicate encounters a failure now, it has isWriting on and so can reset currentMapId of partition_0_2. This again gives another stream, say stream3, a chance to be allowedToWrite without stream2 completing.
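
The guarded failure path sketched above can be modeled minimally: a stream releases the partition on failure only if it actually wrote data AND is still the partition's active writer. Class and field names here are illustrative, not the actual RemoteBlockPushResolver internals:

```java
// Minimal hypothetical model of the guarded failure path; names are
// illustrative and not the actual Spark code.
class PartitionInfo {
    int currentMapId = -1; // -1 means no active writer
}

class PushStream {
    final int mapId;
    final PartitionInfo partition;
    boolean isWriting = false; // set only once data is really written

    PushStream(int mapId, PartitionInfo partition) {
        this.mapId = mapId;
        this.partition = partition;
    }

    boolean allowedToWrite() {
        return partition.currentMapId == -1 || partition.currentMapId == mapId;
    }

    void writeChunk() {
        // Claim the partition only when data is actually written.
        partition.currentMapId = mapId;
        isWriting = true;
    }

    void onFailure() {
        // Guard: a failed duplicate/stale stream must not release a
        // partition that another stream currently owns.
        if (isWriting && partition.currentMapId == mapId) {
            partition.currentMapId = -1;
        }
    }
}
```

With this guard, a second (spurious) failure callback from stream1 after stream2 has claimed the partition leaves currentMapId untouched, so stream3 stays deferred.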

I have added UTs for both these cases as well with similar examples.
@Ngone51


Ngone51 commented Mar 24, 2021

@otterc Thanks for the explanation. Now I understand the cause.

To confirm, for the example 2, I think the first 2 steps are not necessary, right?


otterc commented Mar 24, 2021

To confirm, for the example 2, I think the first 2 steps are not necessary, right?

@Ngone51 I think the first 2 steps are necessary, because this edge case can only happen when a stream that is trying to merge a duplicate block (stream1Duplicate in my example) fails. The problem is that we were setting isWriting=true early in such cases, so when the stream fails it can unset currentMapId.

Let me know if I am missing some other cases. I can add UTs for them as well.


Ngone51 commented Mar 24, 2021

  • If stream2Duplicate encounters a failure now, it has isWriting on and so can reset currentMapId of partition_0_2. This again gives a chance to another stream say stream3 to allowedToWrite without stream2 to complete.

So, it should be stream1Duplicate instead of stream2Duplicate here?


otterc commented Mar 24, 2021

  • If stream2Duplicate encounters a failure now, it has isWriting on and so can reset currentMapId of partition_0_2. This again gives a chance to another stream say stream3 to allowedToWrite without stream2 to complete.

So, it should be stream1Duplicate instead of stream2Duplicate here?

Right, that was a typo. Yes, it should be stream1Duplicate. Thanks for pointing it out.
I have edited the example as well, for others who go through it.


Ngone51 commented Mar 24, 2021

Ok, I get it now.


@Ngone51 Ngone51 left a comment


LGTM, except one minor comment.

}
}
}
isWriting = false;

Move this into the if condition scope?


@otterc otterc Mar 24, 2021


I can move this into the if scope, and that would not change the behavior or cause any issues. The only reason I had it outside was to be consistent with where this flag is unset in onComplete. I understand that is a very trivial cosmetic reason, so I can move it.


Ok, keeping it consistent sounds fine. We can leave it as it is since it's trivial.


mridulm commented Mar 25, 2021

LGTM, thanks @otterc.
Merging to master and 3.1

@asfgit asfgit closed this in 6d88212 Mar 25, 2021
asfgit pushed a commit that referenced this pull request Mar 25, 2021
Closes #31934 from otterc/SPARK-32916-followup.

Lead-authored-by: Chandni Singh <[email protected]>
Co-authored-by: Min Shen <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 6d88212)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>

mridulm commented Mar 25, 2021

Thanks for the reviews @Ngone51, @dongjoon-hyun !

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
wangyum pushed a commit that referenced this pull request May 26, 2023