[SPARK-23380][PYTHON] Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame #20678
Conversation
python/pyspark/sql/dataframe.py
Outdated
Likewise, the change from here on is due to the removed `else:` block.
Force-pushed from ff9d38b to 7f87d25
cc @ueshin
Test build #87673 has finished for PR 20678 at commit
Test build #87674 has finished for PR 20678 at commit
BryanCutler left a comment
This looks great @HyukjinKwon , thanks for doing it! I just had a comment regarding testing.
buildConf("spark.sql.execution.arrow.fallback.enabled")
  .doc("When true, the optimization by 'spark.sql.execution.arrow.enabled' " +
    "could be disabled when it is unable to be used, and fallback to " +
    "non-optimization.")
Just a suggestion: "When true, optimizations enabled by 'spark.sql.execution.arrow.enabled' will fallback automatically to non-optimized implementations if an error occurs."
python/pyspark/sql/tests.py
Outdated
with self.assertRaisesRegexp(Exception, 'Unsupported type'):
    df.toPandas()

@contextmanager
def arrow_fallback(self, enabled):
I think it would be best to disable fallback for all the tests on setup/teardown. That way if something goes wrong elsewhere, the tests won't start passing due to falling back. For the test where it is enabled, you could do that explicitly. What do you think?
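A rough sketch of what that setup/teardown arrangement could look like (illustrative only; `ArrowTests` and `ReusedSQLTestCase` refer to the existing classes in `python/pyspark/sql/tests.py`, but this exact code is an assumption, not the PR's change):

```python
class ArrowTests(ReusedSQLTestCase):  # ReusedSQLTestCase: existing PySpark test base class

    @classmethod
    def setUpClass(cls):
        super(ArrowTests, cls).setUpClass()
        # Disable fallback for every test so a silent fallback cannot mask a real failure;
        # individual tests re-enable it explicitly when the fallback itself is under test.
        cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
        cls.spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")

    @classmethod
    def tearDownClass(cls):
        cls.spark.conf.unset("spark.sql.execution.arrow.fallback.enabled")
        cls.spark.conf.unset("spark.sql.execution.arrow.enabled")
        super(ArrowTests, cls).tearDownClass()
```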
Yup, makes sense. Will give it a shot.
BTW, while we are here, I was thinking of adding a more generalized version of a util like arrow_fallback to reduce configuration-specific code in the tests, but was hesitant because this approach is new to PySpark. WDYT? I will do another PR for this cleanup if we all feel the same way.
@ueshin, would you have some input for ^ too?
python/pyspark/sql/dataframe.py
Outdated
| "toPandas attempted Arrow optimization because " | ||
| "'spark.sql.execution.arrow.enabled' is set to true; however, " | ||
| "failed by the reason below:\n" | ||
| " %s\n" |
I think it would be fine to move this line up to the previous one to make it a little more compact, but up to you.
No problem at all.
python/pyspark/sql/dataframe.py
Outdated
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

dtype = {}
if self.sql_ctx.getConf("spark.sql.execution.arrow.fallback.enabled", "false") \
We should use the same default value "true" as the default value defined in SQLConf.
Argh, this was my mistake while testing multiple combinations. Will fix it.
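(For reference, a minimal sketch of the corrected check, assuming the Python-side default should mirror the `true` default defined in SQLConf:)

```python
# Sketch only: a default of "true" matches the SQLConf default, so fallback stays on when unset.
fallback_enabled = self.sql_ctx.getConf(
    "spark.sql.execution.arrow.fallback.enabled", "true").lower() == "true"
```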
python/pyspark/sql/session.py
Outdated
# Fallback to create DataFrame without arrow if raise some exception
from pyspark.util import _exception_message

if self.conf.get("spark.sql.execution.arrow.fallback.enabled", "false") \
ditto.
python/pyspark/sql/tests.py
Outdated
self.assertPandasEqual(pdf, pd.DataFrame({u'map': [{u'a': 1}]}))

def test_toPandas_fallback_disabled(self):
    with self.sql_conf("spark.sql.execution.arrow.fallback.enabled", False):
Hey @ueshin and @BryanCutler, do you guys like this idea?
Seems good, but how about using a dict so that multiple configs can be set at the same time?
Yea, I was thinking that too. I took a quick look at the rest of the tests and it seems we are fine with a single pair for now. Will change it to take a dict in place in the future, if you are okay with that too.
Yeah, good idea! +1 on using a dict
Will fix it to use a dict here soon.
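As a rough illustration of the dict-based helper being discussed (a sketch only; the exact restore logic and docstring are assumptions, though the `sql_conf` name matches the usage shown later in this thread):

```python
from contextlib import contextmanager

@contextmanager
def sql_conf(self, pairs):
    """Temporarily set SQL configurations from a dict and restore the old values on exit."""
    assert isinstance(pairs, dict), "pairs should be a dictionary."
    # Remember the current values (None means the key was unset).
    old_values = {key: self.spark.conf.get(key, None) for key in pairs}
    for key, new_value in pairs.items():
        self.spark.conf.set(key, new_value)
    try:
        yield
    finally:
        for key, old_value in old_values.items():
            if old_value is None:
                self.spark.conf.unset(key)
            else:
                self.spark.conf.set(key, old_value)
```

Usage would then look like `with self.sql_conf({"spark.sql.execution.arrow.fallback.enabled": False}): ...`, as in the tests quoted below.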
python/pyspark/sql/dataframe.py
Outdated
| "toPandas attempted Arrow optimization because " | ||
| "'spark.sql.execution.arrow.enabled' is set to true; however, " | ||
| "failed unexpectedly:\n" | ||
| " %s" % _exception_message(e)) |
No need to mention fallback mode in the message like above?
docs/sql-programming-guide.md
Outdated
## Upgrading From Spark SQL 2.3 to 2.4

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
- In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched by `spark.sql.execution.arrow.fallback.enabled`.
Not only in migration section, I think we should also document this config in the section like PySpark Usage Guide for Pandas with Apache Arrow.
Yup, added.
"which can be switched by" -> "which can be switched off by", "which can be switched off with", or "which can be turned off with"
msg = (
    "toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true; however, "
    "failed by the reason below:\n %s\n"
"toPandas attempted Arrow optimization because ..." repeats three times here; maybe we can dedup it.
Hm ... I tried making a "toPandas attempted Arrow optimization because ... %s" template and reusing it, but it seems a little bit overkill.
.booleanConf
.createWithDefault(false)

val ARROW_FALLBACK_ENABLE =
ARROW_FALLBACK_ENABLED instead of ARROW_FALLBACK_ENABLE?
Test build #87696 has finished for PR 20678 at commit
Test build #87695 has finished for PR 20678 at commit
Test build #87719 has finished for PR 20678 at commit
python/pyspark/sql/dataframe.py
Outdated
timezone = None

if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
    should_fallback = False
This variable name is a little confusing to me while I'm tracing the code. How about "use_arrow", with the meaning swapped? Because right now, if a user doesn't have Arrow enabled, we skip the Arrow conversion because of the value of should_fallback, which seems ... odd.
python/pyspark/sql/dataframe.py
Outdated
| "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e)) | ||
| raise RuntimeError(msg) | ||
|
|
||
| if not should_fallback: |
So, if I'm tracing the logic correctly: if Arrow optimizations are enabled, an exception occurs while parsing the schema, and fallback is not enabled, we go down this code path; and if Arrow is not enabled at all, we also go down this code path? It might make sense to add a comment here describing when this path is intended to be taken.
Correct, but there's one more case - we also fall back if PyArrow is not installed (or the version is different). Will add some comments to make this easier to read.
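For reference, a condensed sketch of the decision logic under discussion, using the suggested `use_arrow` name (assumed to live inside `DataFrame.toPandas`; the structure and message text are simplified, not the final code in this PR):

```python
use_arrow = False
if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
    try:
        from pyspark.sql.types import to_arrow_schema
        from pyspark.sql.utils import require_minimum_pyarrow_version

        # Both checks run up front, before any distributed computation:
        # 1. PyArrow must be importable with a supported version.
        require_minimum_pyarrow_version()
        # 2. The schema must be convertible to an Arrow schema.
        to_arrow_schema(self.schema)
        use_arrow = True
    except Exception as e:
        if self.sql_ctx.getConf("spark.sql.execution.arrow.fallback.enabled",
                                "true").lower() != "true":
            raise RuntimeError(
                "Arrow optimization failed and "
                "'spark.sql.execution.arrow.fallback.enabled' is disabled:\n  %s" % e)
        # Otherwise fall through with use_arrow == False; the plain
        # self.collect()-based conversion below handles the fallback.

# if use_arrow:  ... Arrow-based conversion ...
# else:          ... pd.DataFrame.from_records(self.collect(), columns=self.columns) ...
```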
holdenk left a comment
Seems like a useful improvement. I've got a few questions as well :)
docs/sql-programming-guide.md
Outdated
the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.

In addition, optimizations enabled by 'spark.sql.execution.arrow.enabled' will fallback automatically
to non-optimized implementations if an error occurs. This can be controlled by
So we need to be clear that we only do this if an error occurs in schema parsing, not any error.
Let me try to rephrase this doc a bit. The point I was trying to make with this fallback (for now) was to only fall back before the actual distributed computation within Spark.
| "toPandas attempted Arrow optimization because " | ||
| "'spark.sql.execution.arrow.enabled' is set to true; however, " | ||
| "failed unexpectedly:\n %s\n" | ||
| "Note that 'spark.sql.execution.arrow.fallback.enabled' does " |
+1 good job having this explanation in the exception
Will try to clean up soon.
- def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLE)
+ def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLED)

  def arrowFallbackEnable: Boolean = getConf(ARROW_FALLBACK_ENABLED)
nit: Have we used this arrowFallbackEnable definition?
  def rangeExchangeSampleSizePerPartition: Int = getConf(RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION)

- def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLE)
+ def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLED)
Actually, it seems we don't use arrowEnable either.
Test build #87785 has finished for PR 20678 at commit
Gentle ping, I believe this is ready for another look.
ueshin left a comment
LGTM except for a nit.
python/pyspark/sql/dataframe.py
Outdated
raise RuntimeError(msg)

# Try to use Arrow optimization when the schema is supported and the required version
# of PyArrow is found, if 'spark.sql.execution.arrow.fallback.enabled' is enabled.
spark.sql.execution.arrow.enabled instead of spark.sql.execution.arrow.fallback.enabled?
BryanCutler left a comment
LGTM
python/pyspark/sql/tests.py
Outdated
self.assertPandasEqual(pdf, pd.DataFrame({u'map': [{u'a': 1}]}))

def test_toPandas_fallback_disabled(self):
    with self.sql_conf({"spark.sql.execution.arrow.fallback.enabled": False}):
Do you still want this since it is disabled in setUpClass? It doesn't hurt to have it, but just thought I'd ask
Hm .. yup, I don't feel strongly. Will remove it.
Test build #87962 has finished for PR 20678 at commit
Test build #88007 has finished for PR 20678 at commit
retest this please
Test build #88016 has finished for PR 20678 at commit
docs/sql-programming-guide.md
Outdated
the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.

In addition, optimizations enabled by 'spark.sql.execution.arrow.enabled' could fallback automatically
to non-optimized implementations if an error occurs before the actual computation within Spark.
very minor nit: non-optimized implementations --> non-Arrow optimization implementation; this matches the description in the paragraph below.
Test build #88065 has finished for PR 20678 at commit
Merged to master.
What changes were proposed in this pull request?
This PR adds a configuration to control the fallback of Arrow optimization for
`toPandas` and `createDataFrame` with Pandas DataFrame.
How was this patch tested?
Manually tested and unit tests added.
You can test this by exercising `createDataFrame` and `toPandas` with Arrow enabled; a sketch follows below.
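The description's original snippets did not survive formatting here; a hedged reconstruction of a manual check might look like the following (assumes a running Spark with PyArrow installed; the dict column is used only because Arrow did not support MapType at the time, which forces the fallback path):

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# createDataFrame: a dict column maps to MapType, which Arrow could not handle,
# so with fallback enabled (the default) this warns and falls back.
pdf = pd.DataFrame({"m": [{"a": 1}]})
df = spark.createDataFrame(pdf)

# toPandas: same idea in the other direction; succeeds via the fallback.
df.toPandas()

# With fallback disabled, the same call should raise instead of falling back.
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")
try:
    df.toPandas()
except Exception as e:
    print("Failed as expected without fallback: %s" % e)
```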