Skip to content

Conversation

@byungsoo-oh
Copy link

What changes were proposed in this pull request?

This PR fixes an error in BenchmarkBase.scala that occurs when creating a benchmark file in a non-existent directory.

Why are the changes needed?

When submitting a benchmark job using org.apache.spark.benchmark.Benchmarks class with SPARK_GENERATE_BENCHMARK_FILES=1 option, an exception is raised if the directory where the benchmark file will be generated does not exist.
For more information, please refer to SPARK-35266.

Does this PR introduce any user-facing change?

No

How was this patch tested?

After building Spark, manually tested with the following command:

SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "org.apache.spark.ml.linalg.BLASBenchmark"

It successfully generated the benchmark result files.

Why it is sufficient:
As illustrated in the comments in Benchmarks.scala, the command below runs all benchmarks and generates the results:

SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "*"

Of all the benchmarks (55 benchmarks in total), only BLASBenchmark fails due to the proposed issue for the current code in the master branch. Thus, it is currently sufficient to test BLASBenchmark to validate this change.

@github-actions github-actions bot added the CORE label Apr 29, 2021
if (!dir.exists()) {
dir.mkdirs()
}
val file = new File(s"${dir}$resultFileName")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. the new benchmark were added at SPARK-33882 and SPARK-35150, that was after #32015 and #32044.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the point. Thank you!

val file = new File(s"${prefix}benchmarks/$resultFileName")
val dir = new File(s"${prefix}benchmarks/")
if (!dir.exists()) {
dir.mkdirs()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add println and say the directory is going to be created? e.g.)

// scalastyle:off println
println(s"Creating ${dir.getAbsolutePath} for benchmark results.")
// scalastyle:on println

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that the benchmark directory is based on jars paths which are flaky. Might be better to explicitly show.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment :) I added println as suggested.

@HyukjinKwon
Copy link
Member

@byungsoo-oh:

  1. would you mind checking https://github.com/apache/spark/pull/32394/checks?check_run_id=2464430892 and enable GitHub Actions in your forked repository?

  2. It would be great if you're interested in submitting another PR to generate and update the results added at SPARK-33882 and SPARK-35150 (after this PR is merge). It's pretty straightforward to generate them: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks

@HyukjinKwon
Copy link
Member

ok to test

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 3, 2021

@srowen and @zhengruifeng FYI from 9244066 and 5b77ebb. I think it was perfectly fine without including benchmark results (but codes only) because It was a bit weird to upload the results based on different spec machines.

Now there have been some latest changes at #32015 and #32044, and now the PR authors can run the benchmarks in similar specification very easily (https://spark.apache.org/developer-tools.html#github-workflow-benchmarks), and it makes more sense to include benchmark results in a PR :).

@HyukjinKwon
Copy link
Member

ok to test

@byungsoo-oh
Copy link
Author

@byungsoo-oh:

  1. would you mind checking https://github.com/apache/spark/pull/32394/checks?check_run_id=2464430892 and enable GitHub Actions in your forked repository?
  2. It would be great if you're interested in submitting another PR to generate and update the results added at SPARK-33882 and SPARK-35150 (after this PR is merge). It's pretty straightforward to generate them: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks

@HyukjinKwon
Thank you for the comments :)

  1. I enabled it in my forked repository.
  2. I would be delighted if I could submit another PR to add the benchmark results.

@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon
Copy link
Member

Thanks for your first contribution and congrats for being a contributor!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants