Skip to content

Conversation

@jessecai
Copy link
Contributor

What changes were proposed in this pull request?

The _prepare_for_python_RDD method currently broadcasts a pickled command if its length is greater than the hardcoded value 1 << 20 (1M). This change sets this value as a Spark conf instead.

How was this patch tested?

Unit tests, manual tests.

@gatorsmile
Copy link
Member

ok to test

private[spark] val BROADCAST_UDF_THRESHOLD = ConfigBuilder("spark.broadcast.UDFThreshold")
.doc("The threshold at which a serialized command is compressed by broadcast, in " +
"bytes unless otherwise specified")
.bytesConf(ByteUnit.BYTE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add checkValue?

@maropu maropu changed the title [SPARK-28355] Use Spark conf for threshold at which command is compressed by broadcast [SPARK-28355][CORE][PYTHON] Use Spark conf for threshold at which command is compressed by broadcast Jul 11, 2019
"mechanisms to guarantee data won't be corrupted during broadcast")
.booleanConf.createWithDefault(true)

private[spark] val BROADCAST_UDF_THRESHOLD = ConfigBuilder("spark.broadcast.UDFThreshold")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name looks confusing. COMMAND_COMPRESSION_THRESHOLD?

@gatorsmile
Copy link
Member

cc @HyukjinKwon

@SparkQA
Copy link

SparkQA commented Jul 12, 2019

Test build #107560 has finished for PR 25123 at commit 2590b59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private[spark] val BROADCAST_FOR_UDF_COMPRESSION_THRESHOLD =
ConfigBuilder("spark.broadcast.UDFCompressionThreshold")
.doc("The threshold at which a a user-defined function (UDF) is compressed by broadcast, " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also applies to RDD APIs. We can just say, for instance, The threshold at which Python commands for RDD APIs and user-defined function (UDF) are serialized by broadcast .... Feel free to change wording.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the description to include this.

@SparkQA
Copy link

SparkQA commented Jul 12, 2019

Test build #107564 has finished for PR 25123 at commit ca7de40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 12, 2019

Test build #107603 has finished for PR 25123 at commit c716eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2019

Test build #107623 has finished for PR 25123 at commit 1ec3a4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

gatorsmile commented Jul 13, 2019

LGTM

Thanks! Merged to master.

@HyukjinKwon
Copy link
Member

LGTM too

vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
…mand is compressed by broadcast

## What changes were proposed in this pull request?

The `_prepare_for_python_RDD` method currently broadcasts a pickled command if its length is greater than the hardcoded value `1 << 20` (1M). This change sets this value as a Spark conf instead.

## How was this patch tested?

Unit tests, manual tests.

Closes apache#25123 from jessecai/SPARK-28355.

Authored-by: Jesse Cai <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants