[SPARK-23380][PYTHON] Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame #20678
Conversation
python/pyspark/sql/dataframe.py
Outdated
Likewise, the change from here on is due to the removed `else:` block.
Force-pushed from ff9d38b to 7f87d25
cc @ueshin
Test build #87673 has finished for PR 20678 at commit
Test build #87674 has finished for PR 20678 at commit
BryanCutler left a comment
This looks great @HyukjinKwon , thanks for doing it! I just had a comment regarding testing.
buildConf("spark.sql.execution.arrow.fallback.enabled")
  .doc("When true, the optimization by 'spark.sql.execution.arrow.enabled' " +
    "could be disabled when it is unable to be used, and fallback to " +
    "non-optimization.")
Just a suggestion: "When true, optimizations enabled by 'spark.sql.execution.arrow.enabled' will fallback automatically to non-optimized implementations if an error occurs."
python/pyspark/sql/tests.py
Outdated
with self.assertRaisesRegexp(Exception, 'Unsupported type'):
    df.toPandas()

@contextmanager
def arrow_fallback(self, enabled):
I think it would be best to disable fallback for all the tests on setup/teardown. That way if something goes wrong elsewhere, the tests won't start passing due to falling back. For the test where it is enabled, you could do that explicitly. What do you think?
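A rough sketch of what that setup/teardown arrangement could look like (illustrative only; `ArrowTests` and `ReusedSQLTestCase` refer to the existing classes in `python/pyspark/sql/tests.py`, but this exact code is an assumption, not the PR's change):

```python
class ArrowTests(ReusedSQLTestCase):  # ReusedSQLTestCase: existing PySpark test base class

    @classmethod
    def setUpClass(cls):
        super(ArrowTests, cls).setUpClass()
        # Disable fallback for every test so a silent fallback cannot mask a real failure;
        # individual tests re-enable it explicitly when the fallback itself is under test.
        cls.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
        cls.spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")

    @classmethod
    def tearDownClass(cls):
        cls.spark.conf.unset("spark.sql.execution.arrow.fallback.enabled")
        cls.spark.conf.unset("spark.sql.execution.arrow.enabled")
        super(ArrowTests, cls).tearDownClass()
```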
Yup, makes sense. Will give it a shot.
BTW, while we are here, I was thinking of adding a more generalized version of a util like arrow_fallback to reduce configuration-specific code in the tests, but was hesitant because this approach is new to PySpark. WDYT? I will do another PR for this cleanup if we all feel the same way.
@ueshin, would you have some input for ^ too?
python/pyspark/sql/dataframe.py
Outdated
| "toPandas attempted Arrow optimization because " | ||
| "'spark.sql.execution.arrow.enabled' is set to true; however, " | ||
| "failed by the reason below:\n" | ||
| " %s\n" |
I think it would be fine to move this line up to the previous one to make it a little more compact, but up to you.
No problem at all.
python/pyspark/sql/dataframe.py
Outdated
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

dtype = {}
if self.sql_ctx.getConf("spark.sql.execution.arrow.fallback.enabled", "false") \
We should use the same default value "true" as the default value defined in SQLConf.
Argh, this was my mistake while testing multiple combinations. Will fix it.
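(For reference, a minimal sketch of the corrected check, assuming the Python-side default should mirror the `true` default defined in SQLConf:)

```python
# Sketch only: a default of "true" matches the SQLConf default, so fallback stays on when unset.
fallback_enabled = self.sql_ctx.getConf(
    "spark.sql.execution.arrow.fallback.enabled", "true").lower() == "true"
```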
python/pyspark/sql/session.py
Outdated
# Fallback to create DataFrame without arrow if raise some exception
from pyspark.util import _exception_message

if self.conf.get("spark.sql.execution.arrow.fallback.enabled", "false") \
ditto.
python/pyspark/sql/tests.py
Outdated
self.assertPandasEqual(pdf, pd.DataFrame({u'map': [{u'a': 1}]}))

def test_toPandas_fallback_disabled(self):
    with self.sql_conf("spark.sql.execution.arrow.fallback.enabled", False):
Hey @ueshin and @BryanCutler, do you guys like this idea?
Seems good, but how about using a dict so that multiple configs can be set at the same time?
Yea, I was thinking that too. I took a quick look at the rest of the tests and it seems we are fine with a single pair for now. Will change it to take a dict in place in the future, if you are okay with that too.
Yeah, good idea! +1 on using a dict
Will fix it to use a dict here soon.
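As a rough illustration of the dict-based helper being discussed (a sketch only; the exact restore logic and docstring are assumptions, though the `sql_conf` name matches the usage shown later in this thread):

```python
from contextlib import contextmanager

@contextmanager
def sql_conf(self, pairs):
    """Temporarily set SQL configurations from a dict and restore the old values on exit."""
    assert isinstance(pairs, dict), "pairs should be a dictionary."
    # Remember the current values (None means the key was unset).
    old_values = {key: self.spark.conf.get(key, None) for key in pairs}
    for key, new_value in pairs.items():
        self.spark.conf.set(key, new_value)
    try:
        yield
    finally:
        for key, old_value in old_values.items():
            if old_value is None:
                self.spark.conf.unset(key)
            else:
                self.spark.conf.set(key, old_value)
```

Usage would then look like `with self.sql_conf({"spark.sql.execution.arrow.fallback.enabled": False}): ...`, as in the tests quoted below.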
python/pyspark/sql/dataframe.py
Outdated
| "toPandas attempted Arrow optimization because " | ||
| "'spark.sql.execution.arrow.enabled' is set to true; however, " | ||
| "failed unexpectedly:\n" | ||
| " %s" % _exception_message(e)) |
No need to mention fallback mode in the message like above?
docs/sql-programming-guide.md
Outdated
## Upgrading From Spark SQL 2.3 to 2.4

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
- In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched by `spark.sql.execution.arrow.fallback.enabled`.
Not only in migration section, I think we should also document this config in the section like PySpark Usage Guide for Pandas with Apache Arrow.
Yup, added.
"which can be switched by" -> "which can be switched off by", "which can be switched off with", or "which can be turned off with"
msg = (
    "toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true; however, "
    "failed by the reason below:\n %s\n"
"toPandas attempted Arrow optimization because ..." repeats three times here; maybe we can dedup it.
Hm ... I tried making a "toPandas attempted Arrow optimization because ... %s" template and reusing it, but it seems a little bit overkill.
.booleanConf
.createWithDefault(false)

val ARROW_FALLBACK_ENABLE =
ARROW_FALLBACK_ENABLED instead of ARROW_FALLBACK_ENABLE?
Test build #87696 has finished for PR 20678 at commit
Test build #87695 has finished for PR 20678 at commit
Test build #87719 has finished for PR 20678 at commit
python/pyspark/sql/dataframe.py
Outdated
timezone = None

if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
    should_fallback = False
This variable name is a little confusing to me while I'm tracing the code. How about "use_arrow", with the meaning swapped? Because right now, if a user doesn't have Arrow enabled, we skip the Arrow conversion because of the value of should_fallback, which seems ... odd.
python/pyspark/sql/dataframe.py
Outdated
| "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e)) | ||
| raise RuntimeError(msg) | ||
|
|
||
| if not should_fallback: |
So, if I'm tracing the logic correctly: if Arrow optimizations are enabled, an exception occurs while parsing the schema, and fallback is not enabled, we go down this code path; and if Arrow is not enabled at all, we also go down this code path? It might make sense to add a comment here describing when this path is intended to be taken.
Correct, but there's one more case - we also fall back if PyArrow is not installed (or the version is different). Will add some comments to make this easier to read.
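For reference, a condensed sketch of the decision logic under discussion, using the suggested `use_arrow` name (assumed to live inside `DataFrame.toPandas`; the structure and message text are simplified, not the final code in this PR):

```python
use_arrow = False
if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
    try:
        from pyspark.sql.types import to_arrow_schema
        from pyspark.sql.utils import require_minimum_pyarrow_version

        # Both checks run up front, before any distributed computation:
        # 1. PyArrow must be importable with a supported version.
        require_minimum_pyarrow_version()
        # 2. The schema must be convertible to an Arrow schema.
        to_arrow_schema(self.schema)
        use_arrow = True
    except Exception as e:
        if self.sql_ctx.getConf("spark.sql.execution.arrow.fallback.enabled",
                                "true").lower() != "true":
            raise RuntimeError(
                "Arrow optimization failed and "
                "'spark.sql.execution.arrow.fallback.enabled' is disabled:\n  %s" % e)
        # Otherwise fall through with use_arrow == False; the plain
        # self.collect()-based conversion below handles the fallback.

# if use_arrow:  ... Arrow-based conversion ...
# else:          ... pd.DataFrame.from_records(self.collect(), columns=self.columns) ...
```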
holdenk left a comment
Seems like a useful improvement. I've got a few questions as well :)
docs/sql-programming-guide.md
Outdated
the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.

In addition, optimizations enabled by 'spark.sql.execution.arrow.enabled' will fallback automatically
to non-optimized implementations if an error occurs. This can be controlled by
So we need to be clear that we only do this if an error occurs in schema parsing, not any error.
Let me try to rephrase this doc a bit. The point I was trying to make with this fallback (for now) was to only fall back before the actual distributed computation within Spark.
| "toPandas attempted Arrow optimization because " | ||
| "'spark.sql.execution.arrow.enabled' is set to true; however, " | ||
| "failed unexpectedly:\n %s\n" | ||
| "Note that 'spark.sql.execution.arrow.fallback.enabled' does " |
+1 good job having this explanation in the exception
Will try to clean up soon.
- def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLE)
+ def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLED)

  def arrowFallbackEnable: Boolean = getConf(ARROW_FALLBACK_ENABLED)
nit: Have we used this arrowFallbackEnable definition?
  def rangeExchangeSampleSizePerPartition: Int = getConf(RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION)

- def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLE)
+ def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLED)
Actually, it seems we don't use arrowEnable either.
Test build #87785 has finished for PR 20678 at commit
Gentle ping, I believe this is ready for another look.
ueshin left a comment
LGTM except for a nit.
python/pyspark/sql/dataframe.py
Outdated
raise RuntimeError(msg)

# Try to use Arrow optimization when the schema is supported and the required version
# of PyArrow is found, if 'spark.sql.execution.arrow.fallback.enabled' is enabled.
spark.sql.execution.arrow.enabled instead of spark.sql.execution.arrow.fallback.enabled?
BryanCutler left a comment
LGTM
python/pyspark/sql/tests.py
Outdated
self.assertPandasEqual(pdf, pd.DataFrame({u'map': [{u'a': 1}]}))

def test_toPandas_fallback_disabled(self):
    with self.sql_conf({"spark.sql.execution.arrow.fallback.enabled": False}):
Do you still want this since it is disabled in setUpClass? It doesn't hurt to have it, but just thought I'd ask
Hm .. yup, I don't feel strongly. Will remove it.
Test build #87962 has finished for PR 20678 at commit
Test build #88007 has finished for PR 20678 at commit
retest this please
Test build #88016 has finished for PR 20678 at commit
docs/sql-programming-guide.md
Outdated
the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.

In addition, optimizations enabled by 'spark.sql.execution.arrow.enabled' could fallback automatically
to non-optimized implementations if an error occurs before the actual computation within Spark.
very minor nit: non-optimized implementations --> non-Arrow optimization implementation; this matches the description in the paragraph below.
Test build #88065 has finished for PR 20678 at commit
Merged to master.
What changes were proposed in this pull request?
This PR adds a configuration to control the fallback of Arrow optimization for
`toPandas` and `createDataFrame` with Pandas DataFrame.
How was this patch tested?
Manually tested and unit tests added.
You can test this by exercising `createDataFrame` and `toPandas` with Arrow enabled; a sketch follows below.
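The description's original snippets did not survive formatting here; a hedged reconstruction of a manual check might look like the following (assumes a running Spark with PyArrow installed; the dict column is used only because Arrow did not support MapType at the time, which forces the fallback path):

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# createDataFrame: a dict column maps to MapType, which Arrow could not handle,
# so with fallback enabled (the default) this warns and falls back.
pdf = pd.DataFrame({"m": [{"a": 1}]})
df = spark.createDataFrame(pdf)

# toPandas: same idea in the other direction; succeeds via the fallback.
df.toPandas()

# With fallback disabled, the same call should raise instead of falling back.
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")
try:
    df.toPandas()
except Exception as e:
    print("Failed as expected without fallback: %s" % e)
```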