@HyukjinKwon HyukjinKwon commented Mar 31, 2021

What changes were proposed in this pull request?

This PR proposes to add a workflow that allows developers to run benchmarks and download the result files. After this PR, developers can run benchmarks in GitHub Actions in their own fork.

Why are the changes needed?

  1. It is very easy to use.
  2. We can use (almost) the same environment to run the benchmarks. Based on my few experiments and observations, the CPU, cores, and memory are the same.
  3. It does not burden the ASF's resources at GitHub Actions.

Does this PR introduce any user-facing change?

No, dev-only.

How was this patch tested?

Manually tested in HyukjinKwon#31.

The full set of benchmarks is being run as below:

How do developers use it in their fork?

  1. Go to Actions in your fork, and click "Run benchmarks"

    Screen Shot 2021-03-31 at 10 15 13 PM

  2. Run the benchmarks with JDK 8 or 11, specifying the benchmark classes to run. Glob patterns are supported, just like testOnly in SBT.

    Screen Shot 2021-04-02 at 8 35 02 PM

  3. After the jobs finish, the benchmark results are available at the top of the corresponding workflow run:

    Screen Shot 2021-03-31 at 10 17 21 PM

  4. After downloading it, unzip and untar it in the root directory of the Spark git repository:

    cd .../spark
    mv ~/Downloads/benchmark-results-8.zip .
    unzip benchmark-results-8.zip
    tar -xvf benchmark-results-8.tar
  5. Check the results:

    git status
    ...
        modified:   core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
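
Steps 4 and 5 can also be scripted. The following is a minimal sketch, not part of the PR, assuming the downloaded artifact is a zip file that contains one or more tar files, as the commands above suggest. The function name `unpack_benchmark_results` is hypothetical.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack_benchmark_results(zip_path, repo_root):
    """Unzip the downloaded artifact, then untar every .tar file it contained.

    Returns the sorted file paths (relative to repo_root) found in the tar files.
    """
    repo_root = Path(repo_root)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        tar_names = [n for n in zf.namelist() if n.endswith(".tar")]
        zf.extractall(repo_root)  # same effect as `unzip benchmark-results-8.zip`
    for tar_name in tar_names:
        with tarfile.open(repo_root / tar_name) as tf:
            extracted.extend(m.name for m in tf.getmembers() if m.isfile())
            tf.extractall(repo_root)  # same effect as `tar -xvf benchmark-results-8.tar`
    return sorted(extracted)
```

Calling `unpack_benchmark_results("benchmark-results-8.zip", ".")` from the Spark root directory should leave the updated result files in place, after which `git status` lists them as modified. Only unpack archives that come from a workflow run you trust.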
    

HyukjinKwon commented Mar 31, 2021

Note that I tested a subset of the benchmarks and verified that it works. I am now waiting for the final results of running all benchmarks:

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork" to "[WIP][SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork" Mar 31, 2021
@HyukjinKwon

BTW, I will document this in https://spark.apache.org/developer-tools.html, and add the link to our docs (and probably to the GitHub PR template?).

wangyum commented Mar 31, 2021

It seems we can't run TPCDSQueryBenchmark. But we can support it in the future.

@HyukjinKwon

Yeah, good point. I will make it separate for now.

maropu commented Mar 31, 2021

Cool, it looks useful.

@HyukjinKwon HyukjinKwon changed the title from "[WIP][SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork" to "[SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork" Mar 31, 2021
HyukjinKwon commented Mar 31, 2021

GitHub Actions is unstable right now (https://www.githubstatus.com/). I will retrigger the full benchmarks tomorrow.

    num-splits:
      description: 'Number of job splits'
      required: true
      default: '1'

@HyukjinKwon HyukjinKwon Apr 2, 2021

I had to add this parameter because GitHub Actions limits each job to 6 hours (a workflow is limited to 72 hours), and running the benchmarks sequentially takes up to 50 hours. With this parameter, the benchmarks run in parallel, so I think it's okay, although it might expose too many parameters to control.
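
The arithmetic behind this constraint can be sketched as follows. This is an illustration using only the figures quoted above; the helper name `min_splits` is hypothetical, and it assumes the work divides evenly across jobs, which real benchmarks will not do exactly.

```python
import math

JOB_LIMIT_HOURS = 6    # per-job limit quoted in the comment above
SEQUENTIAL_HOURS = 50  # upper estimate for running everything in one job


def min_splits(total_hours, job_limit_hours):
    """Smallest split count for which an even share of the work fits in one job."""
    return math.ceil(total_hours / job_limit_hours)


print(min_splits(SEQUENTIAL_HOURS, JOB_LIMIT_HOURS))  # 9
print(SEQUENTIAL_HOURS / 20)  # 2.5 hours per job with 20 even splits
```

Because the split is unlikely to be perfectly even, choosing a split count well above this lower bound leaves headroom under the 6-hour limit.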

For example, I am now running all benchmarks in 20 splits (with JDK 11) here:

Screen Shot 2021-04-02 at 8 42 31 PM

which results in 20 jobs that run the benchmarks in parallel (hashed into 20 splits)

Screen Shot 2021-04-02 at 8 42 43 PM
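
One way to picture the combination of the class glob and the hashed splits is the sketch below. It is not the workflow's implementation: the helper names are hypothetical, and CRC32 is a stand-in hash chosen because it is stable across runs; the source only says the benchmarks are "hashed by 20".

```python
import fnmatch
import zlib


def select_benchmarks(class_names, pattern):
    """Keep the class names that match the glob pattern (case-sensitive)."""
    return [n for n in class_names if fnmatch.fnmatchcase(n, pattern)]


def split_index(class_name, num_splits):
    """Map a class name to a split in [0, num_splits) with a stable hash."""
    # Assumption: CRC32 stands in for whatever hash the real workflow uses.
    return zlib.crc32(class_name.encode("utf-8")) % num_splits


def benchmarks_for_split(class_names, pattern, num_splits, cur_split):
    """Benchmarks that the job with index cur_split would run."""
    return [
        n
        for n in select_benchmarks(class_names, pattern)
        if split_index(n, num_splits) == cur_split
    ]


if __name__ == "__main__":
    # The package prefix "example." is hypothetical.
    names = ["example.MapStatusesSerDeserBenchmark", "example.TPCDSQueryBenchmark"]
    for split in range(20):
        assigned = benchmarks_for_split(names, "*", 20, split)
        if assigned:
            print(split, assigned)
```

Each job only needs the pattern, the total number of splits, and its own index, so the jobs can run independently without coordinating. A hash gives no guarantee of balanced run times, which is one more reason to use more splits than the minimum.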

jsonInDS(50 * 1000 * 1000, numIters)
jsonInFile(50 * 1000 * 1000, numIters)
datetimeBenchmark(rowsNum = 10 * 1000 * 1000, numIters)
schemaInferring(5 * 1000 * 1000, numIters)
Member Author

@MaxGekk I had to reduce the size here. Otherwise, the GitHub Actions job dies, complaining that there is no disk space left.

Member Author

With this, all benchmarks should pass now. I will wait for the results before merging it in.

SparkQA commented Apr 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41437/

SparkQA commented Apr 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41437/

SparkQA commented Apr 2, 2021

Test build #136859 has finished for PR 32015 at commit e6beeb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon marked this pull request as ready for review April 3, 2021 11:54
@HyukjinKwon

All of them passed:

I am merging this to master.

Thank you all for the reviews and approvals.

@HyukjinKwon

Merged to master.

MaxGekk pushed a commit that referenced this pull request Apr 3, 2021
…GitHub Actions machines

### What changes were proposed in this pull request?

#32015 added a way to run benchmarks much more easily in the same GitHub Actions build. This PR updates the benchmark results using that workflow.

**NOTE** that, from my observations, GitHub Actions appears to use four types of CPU:

- Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
- Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
- Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
- Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz

From my quick research, they seem to perform roughly similarly:

![Screen Shot 2021-04-03 at 9 31 23 PM](https://user-images.githubusercontent.com/6477701/113478478-f4b57b80-94c3-11eb-9047-f81ca8c59672.png)

I couldn't find enough information about the Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz, but its performance seems roughly similar given the numbers.

So it shouldn't be a big deal, especially given that this way is much easier, encourages contributors to run benchmarks more often, and guarantees the same number of cores and the same memory with the same software.
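
To check which CPU a given runner has, one option on Linux is to read the `model name` entries from `/proc/cpuinfo`. The sketch below is an illustration only; the source does not say how the CPU types were observed, and the helper name `cpu_model_names` is hypothetical.

```python
def cpu_model_names(cpuinfo_text):
    """Return the distinct `model name` values from /proc/cpuinfo-style text."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            name = value.strip()
            if name not in names:
                names.append(name)
    return names


# Sample text in the /proc/cpuinfo layout, using one of the models listed above.
sample = (
    "processor\t: 0\n"
    "model name\t: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz\n"
    "processor\t: 1\n"
    "model name\t: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz\n"
)
print(cpu_model_names(sample))
# ['Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz']
```

On a runner, passing the contents of `/proc/cpuinfo` to this function and logging the result alongside the benchmark output would make it easier to tell which results came from which CPU type.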

### Why are the changes needed?

To establish a baseline for the benchmarks.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It was generated from:

- [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/713575465)
- [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/713154337)

Closes #32044 from HyukjinKwon/SPARK-34950.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-34821-pr branch January 4, 2022 00:54