[SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file #32401
Test build #138086 has finished for PR 32401 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Force-pushed from b346876 to 3267fba.
I have resolved the issue (the regression; making it a built-in feature) mentioned by @tgravescs at #32385 (comment), so I think it's ready for review.
Kubernetes integration test starting
Test build #139498 has finished for PR 32401 at commit
Kubernetes integration test status success
How do we write the shuffle index file right now? In this method, we also calculate the partition lengths but we don't write the index file immediately.
Oh actually we did, but it's done by ShuffleMapOutputWriter.commitAllPartitions. Does this checksum file work with custom shuffle extensions?
Force-pushed from 3267fba to 2ca071f.
Test build #139649 has finished for PR 32401 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
I will try to complete the review of this today or tomorrow @Ngone51
TBH I don't think the current shuffle API provides enough abstraction to support checksums. I'm OK with this change since the shuffle API is still private, but we should revisit the shuffle API later so that checksums can be handled on the shuffle implementation side.
The current issue I see is that Spark writes local spill files and then asks the shuffle implementation to "transfer" the spill files. So Spark has to compute the checksum itself while writing the spill files, to reduce the perf overhead.
We can discuss it later.
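The point about checksumming during spill-file writing can be sketched with the JDK's own `CheckedOutputStream`. This is illustrative only: the class and variable names below are mine, not Spark's actual writer code; the idea is that the checksum accumulates as bytes are written, so no second read pass over the spill file is needed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedOutputStream;

public class SpillChecksumSketch {
    public static void main(String[] args) throws IOException {
        byte[] record = "some shuffle partition bytes".getBytes("UTF-8");

        // Stands in for the spill file's FileOutputStream.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // The wrapper updates the Adler32 state on every write.
        CheckedOutputStream out = new CheckedOutputStream(sink, new Adler32());
        out.write(record);
        out.close();

        long checksumDuringWrite = out.getChecksum().getValue();

        // The same bytes checksummed after the fact must agree.
        Adler32 after = new Adler32();
        after.update(record, 0, record.length);

        System.out.println(checksumDuringWrite == after.getValue()); // prints "true"
    }
}
```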
mridulm left a comment:
Some of my comments from the other PR are relevant here as well; I've added a few more comments on this PR.
Thanks for working on this @Ngone51!
Btw, do you have any details on the overhead of introducing this? On both checksum generation time and (in case of failures) validation time?
Given that the validation happens in the shuffle service, we have to allow the checksum algorithm to be configurable with future evolution in mind.
Also, this will need to be conveyed to the ESS (filename extension?) and be configurable in Spark.
The requirement that the checksum itself is a long and that the checksum algo extends java.util.zip.Checksum sounds fine to me.
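A minimal sketch of that contract, with class and method names that are mine (not Spark's actual `MutableCheckedOutputStream` API): because every calculator implements `java.util.zip.Checksum` and reduces to a long, the stream can treat it as a pluggable, swappable field, which is what keeps the algorithm configurable.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Output stream whose Checksum calculator can be replaced at runtime.
class SwappableCheckedOutputStream extends FilterOutputStream {
    private Checksum checksum;

    SwappableCheckedOutputStream(OutputStream out) {
        super(out);
    }

    // Swap calculators, e.g. when the writer moves to the next partition.
    void setChecksum(Checksum checksum) {
        this.checksum = checksum;
    }

    @Override
    public void write(int b) throws IOException {
        if (checksum != null) checksum.update(b);
        out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (checksum != null) checksum.update(b, off, len);
        out.write(b, off, len);
    }
}

public class MutableChecksumDemo {
    public static void main(String[] args) throws IOException {
        SwappableCheckedOutputStream out =
            new SwappableCheckedOutputStream(new ByteArrayOutputStream());

        Checksum adler = new Adler32();
        out.setChecksum(adler);
        out.write("partition-0".getBytes("UTF-8"));

        Checksum crc = new CRC32();  // switch algorithm mid-stream
        out.setChecksum(crc);
        out.write("partition-1".getBytes("UTF-8"));
        out.close();

        // The Adler32 instance only ever saw partition-0's bytes.
        Adler32 check = new Adler32();
        check.update("partition-0".getBytes("UTF-8"));
        System.out.println(adler.getValue() == check.getValue()); // prints "true"
    }
}
```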
Also, this will need to be conveyed to ESS (filename extension ?) and allow to be configurable in spark.
Hmm, I'm not sure I get this concern. Could you elaborate?
The checksum algorithm is currently hard-coded as Adler32 in both the ESS and the Spark executor. From an evolution point of view, if we have to change or support other checksum algos in the future, this becomes an incompatible change: it would require the ESS and the client to be upgraded together.
Thoughts?
Yeah, I agree it would be nice to have it configurable and recorded, perhaps either as an extension on the checksum file or as metadata inside the checksum file. I would expect the ESS to indicate whether it's supported, i.e. perhaps error with "unknown or unsupported checksum type" when trying to diagnose. I think the other part might be push-based shuffle: I assume if it's merging files it may have to recalculate these, so it needs to know which algorithm to use?
+CC @otterc for @tgravescs's point on push-based shuffle.
Yes, for push-based shuffle the server would need to know which algorithm to use. Just thinking out loud: the server can use this checksum to verify whether a block pushed to it was corrupted, by calculating and comparing the checksum.
- This verification may be expensive while the blocks are getting merged, so maybe we can think of doing it asynchronously.
- It may also require adding the checksum to the block-push message, which would be backward incompatible, so maybe we create a new message for it.
Anyways, if we do want the algorithm to be configurable, can we leverage the RegisterExecutor message for it?
We can send the additional information about which algorithm is being used, and if the shuffle server doesn't support it, the checksum is not calculated. This would be similar to how we send the attemptId to the shuffle server for push-based shuffle.
Just a thought.
Anyways, so if we do want the algorithm to be configurable, can we leverage the RegisterExecutor message for it?
The reason I initially proposed adding the checksum algo to the file name itself as a suffix is to minimize the state required to reason about which algorithm is being used. We won't need to pass it from the container to the ESS, persist it across ESS restarts, etc.
Tom's additional suggestion of including it in the checksum file itself as metadata also works; given the current indexing into the file for a given partition (8 * partition_id), metadata at the end of the file might be a more convenient place to record it.
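The 8 * partition_id indexing can be illustrated with a toy layout. Everything below is hypothetical (the file-name suffix, helper names, and checksum values are mine): one 8-byte long per partition means the entry for partition p sits at byte offset 8 * p, and trailing metadata appended after the last entry would not disturb that indexing.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChecksumFileLayout {
    public static void main(String[] args) throws IOException {
        long[] checksums = {111L, 222L, 333L};  // fake per-partition checksums

        // Hypothetical suffix carrying the algorithm name, per the discussion.
        Path file = Files.createTempFile("shuffle", ".checksum.ADLER32");
        try (DataOutputStream out =
                 new DataOutputStream(Files.newOutputStream(file))) {
            for (long c : checksums) out.writeLong(c);  // 8 bytes each
        }

        // Random access: seek straight to partition 2's entry at offset 8 * 2.
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(8L * 2);
            System.out.println(raf.readLong()); // prints 333
        }
        Files.delete(file);
    }
}
```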
The reason I suggested leveraging the RegisterExecutor message to communicate the checksum algo to the ESS is that, for push-based shuffle, communicating the algorithm via the checksum file (which is generated along with the shuffle data) will not work. This matters if, in the future, we want to validate at the remote ESS whether a pushed block is corrupted.
However, leveraging this for push-based shuffle is not part of this change; since it was mentioned, I thought about how push-based shuffle could utilize it.
Excellent point @otterc. This would handle the case of a merger not having any local shuffle files (from executors on that node), but only data pushed to it.
Review threads (outdated, resolved) on:
- core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java
- core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java
- core/src/main/scala/org/apache/spark/internal/config/package.scala
- core/src/main/scala/org/apache/spark/io/MutableCheckedOutputStream.scala
- core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
- core/src/main/scala/org/apache/spark/shuffle/ShufflePartitionPairsWriter.scala
- core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala
Force-pushed from 2ca071f to 47983ca.
@mridulm I have run the TPC-DS benchmark with 3TB datasets internally and there's no regression. I didn't measure the exact generation time, but it's surely trivial according to the benchmark results. For the validation time, I didn't pay attention to it previously since the major issue at the time was checksum calculation; I can take a look later. However, validation is part of error handling, so I don't think it counts toward the performance impact.
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #139937 has finished for PR 32401 at commit
So what did the benchmarking numbers look like? Was there an average hit across the board, or mostly just noise?
Review threads (outdated, resolved) on:
- core/src/main/java/org/apache/spark/shuffle/api/ShuffleMapOutputWriter.java
- core/src/main/scala/org/apache/spark/internal/config/package.scala
- core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java
Force-pushed from 47983ca to b5a2235.
Test build #140652 has finished for PR 32401 at commit
@tgravescs Our internal benchmark runs between the baseline (master) and the target (master + checksum changes). It measures the end-to-end execution time of TPC-DS queries, so the numbers are the execution times of the queries. And from the results, it's mostly just noise.
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #141051 has finished for PR 32401 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #141061 has finished for PR 32401 at commit
Test build #141141 has finished for PR 32401 at commit
Kubernetes integration test starting
Kubernetes integration test status success
I plan to merge this later today. In case there are ongoing reviews, please do comment so that I can hold off.
…checksum file

### What changes were proposed in this pull request?

This is the initial work of adding checksum support for shuffle. It is a piece of #32385, and this PR only adds checksum functionality on the shuffle writer side.

Basically, the idea is to wrap a `MutableCheckedOutputStream`* around the `FileOutputStream` while the shuffle writer generates the shuffle data. But the specific wrapping places differ among the shuffle writers due to their different implementations:

* `BypassMergeSortShuffleWriter` - wraps each partition file
* `UnsafeShuffleWriter` - wraps each spill file directly, since they don't require aggregation or sorting
* `SortShuffleWriter` - wraps the `ShufflePartitionPairsWriter` after merging spill files, since they might require aggregation or sorting

\* `MutableCheckedOutputStream` is a variant of `java.util.zip.CheckedOutputStream` which can change the checksum calculator at runtime. And we use `Adler32`, which is similar to CRC-32 but much faster, to calculate the checksum, the same as `Broadcast`'s checksum.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Yes, added a new conf: `spark.shuffle.checksum`.

### How was this patch tested?

Added unit tests.

Closes #32401 from Ngone51/add-checksum-files.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 4783fb7)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
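A quick sketch of why the calculator is interchangeable: `Adler32` and `CRC32` both implement `java.util.zip.Checksum` and both reduce a byte stream to a long. The helper name below is mine; the point is only that either algorithm is deterministic for a fixed input, so checksums computed at write time and at verify time can be compared.

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class AlgoDemo {
    // Works for any java.util.zip.Checksum implementation.
    static long digest(Checksum c, byte[] data) {
        c.reset();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "shuffle bytes".getBytes("UTF-8");
        // Deterministic for fixed input: two passes must agree.
        System.out.println(digest(new Adler32(), data) == digest(new Adler32(), data)); // prints "true"
        System.out.println(digest(new CRC32(), data) == digest(new CRC32(), data));     // prints "true"
    }
}
```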
Merged to master and branch-3.2. Thanks for the reviews @cloud-fan, @otterc, @tgravescs, @HeartSaVioR
+CC @gengliangwang
@mridulm Thanks for the merge! I saw you also merged this into 3.2, but you may notice that the new confs in this PR are versioned as 3.3.0. My original thought was to merge this feature only to the master branch, given that more pieces are required for this feature and 3.2.0 is going to be released soon. So I'm afraid a partially completed feature won't help in 3.2.
I haven't looked into the code deeply (I just helped fix the RAT issue), but IMHO the value of this PR (whether it's worth shipping in 3.2 or not) depends on whether we "verify" the checksum. If we only write the checksum and don't yet verify it, nothing changes from the end users' point of view, and we should wait for the next PR(s) to be completed. If this PR introduced the checksum verification as well (and proper error messages), personally it would seem worth shipping without waiting for the other PRs.
Thanks @HeartSaVioR. That's actually my concern: this PR only writes the checksum, without verification. Verification is planned to be implemented in a separate PR, and I worry we can't complete it before the 3.2 release.
@Ngone51 I was assuming we will be completing the verification for 3.2, given the earlier WIP PR which had the full impl :-)
@gengliangwang what sort of timeline do we have before we go into RC?
I think there are around 2 weeks left, and it seems promising to merge the verification PR before the RC. +1 to having this feature in 3.2 to improve stability.
Thanks, @mridulm @cloud-fan. I'll try my best to push the validation PR first (I'm working on it right now). We could revert this later if we can't get the validation PR in.
@Ngone51 Yes, let's see if we can make it before 3.2. Thanks for the work!
Thanks for the clarifications! This sounds good.