[SPARK-34972][PYTHON] Make pandas-on-Spark doctests work. #32069
Conversation
- Test build #136963 has finished for PR 32069 at commit
- Kubernetes integration test starting
- Kubernetes integration test status failure
- Kubernetes integration test starting
- Kubernetes integration test status failure
- Test build #136964 has finished for PR 32069 at commit
- Kubernetes integration test starting
- Kubernetes integration test status failure
```yaml
- >-
    pyspark-pandas
```
Maybe we can rebalance the pyspark tests after we finish porting unit tests.
```diff
 Examples
 --------
->>> pp.range(1001).style  # doctest: +ELLIPSIS
+>>> pp.range(1001).style  # doctest: +SKIP
```
`style` needs the optional dependency Jinja2, hence the `+SKIP`.
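For context, a minimal sketch of how the `+SKIP` directive behaves (standard-library `doctest` only; not code from this PR): the example stays visible in the rendered docs, but the runner never executes it, so a missing optional dependency such as Jinja2 cannot fail the test run.

```python
# Minimal sketch using only the standard library; the example text below is
# illustrative, not the PR's actual docstring.
import doctest

example = '''
>>> import pyspark.pandas as pp            # doctest: +SKIP
>>> pp.range(1001).style                   # doctest: +SKIP
<pandas.io.formats.style.Styler object at ...>
'''

parser = doctest.DocTestParser()
test = parser.get_doctest(example, globs={}, name="style_example",
                          filename="<example>", lineno=0)
runner = doctest.DocTestRunner()
runner.run(test)

# Every line carries +SKIP, so nothing is imported or evaluated and the
# summary reports zero failures.
print(runner.summarize())
```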
```diff
 ... 'c': [3, 5, 2, 5, 1, 2, 6, 4, 3, 6]},
 ... columns=['a', 'b', 'c'],
-... index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
+... index=[7, 2, 3, 1, 3, 4, 9, 10, 5, 6])
```
This is to reduce the test flakiness.
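As a hedged aside (my illustration, not code from the PR): doctest compares the printed output character by character, so any nondeterminism in the row order coming back from Spark makes an example flaky; making the shown order deterministic, for example by sorting on the index, keeps the output stable.

```python
# Hedged illustration; assumes pyspark.pandas is importable and a local Spark
# can start. Doctest checks the printed text verbatim, so the displayed row
# order must be deterministic.
import pyspark.pandas as pp

df = pp.DataFrame({'a': [4, 7, 1, 8], 'b': [1, 2, 3, 4]},
                  index=[7, 2, 3, 1])

# Sorting by the index pins the row order, so the printed frame is identical
# on every run and an example built on it cannot flake on ordering.
print(df.sort_index())
```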
```diff
 import doctest
 import sys
 from pyspark.sql import SparkSession
+import pyspark.pandas.sql_processor
```
Renamed the file from `sql.py` to `sql_processor.py` to avoid the name conflict between the `sql` module and the `pp.sql` function.
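For reference, a minimal sketch of the per-module doctest entry point this kind of import serves (my approximation of the usual PySpark pattern, not the exact code in this PR): build a local SparkSession, put it in the doctest globals, run the module's doctests, and exit non-zero on failure.

```python
# My approximation of the usual PySpark doctest entry point; not the exact
# code added in this PR.
import doctest
import sys

from pyspark.sql import SparkSession
import pyspark.pandas.sql_processor


def _test():
    globs = pyspark.pandas.sql_processor.__dict__.copy()
    spark = (
        SparkSession.builder
        .master("local[4]")
        .appName("pyspark.pandas.sql_processor doctests")
        .getOrCreate()
    )
    globs["spark"] = spark
    (failure_count, test_count) = doctest.testmod(
        pyspark.pandas.sql_processor,
        globs=globs,
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE,
    )
    spark.stop()
    if failure_count:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
```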
- Test build #136968 has finished for PR 32069 at commit
- Kubernetes integration test starting
- Kubernetes integration test status failure
- Test build #136976 has finished for PR 32069 at commit
Let me merge this. These tests are flaky. All tests passed (considering the previous runs).
…ch could be flaky

### What changes were proposed in this pull request?

This is a follow-up of #32069. Marks some doctests that could be flaky as skipped.

### Why are the changes needed?

Some doctests in the `pyspark.pandas` module enabled at #32069 could be flaky because the result row order is nondeterministic:

- groupby-apply with a UDF which has a return type annotation will lose its index.
- `Index.symmetric_difference` uses `DataFrame.intersect` and `subtract` internally.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32116 from ueshin/issues/SPARK-34972/fix_flaky_tests.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
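As a hedged illustration of one of the flaky cases listed above (my example, not code from #32116): the element order of `Index.symmetric_difference` is not guaranteed, so a doctest has to either skip the example or sort the result before printing.

```python
# Hedged illustration; assumes pyspark.pandas is importable and a local Spark
# can start. The order of the symmetric difference is nondeterministic, so
# sort before printing to keep the doctest output stable.
import pyspark.pandas as pp

idx1 = pp.Index([1, 2, 3, 4])
idx2 = pp.Index([2, 3, 4, 5])

diff = idx1.symmetric_difference(idx2)
print(sorted(diff.to_numpy().tolist()))  # always prints [1, 5]
```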
What changes were proposed in this pull request?
Now that we have merged the Koalas main code into the PySpark code base (#32036), we should enable the doctests on Spark's test infrastructure.
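To make the scope concrete, here is a hedged, self-contained illustration (the function below is hypothetical, not taken from the code base) of the kind of docstring example that this change starts executing as a test:

```python
# Hypothetical illustration, not code from the PR: with doctests enabled, the
# Examples section of each docstring is executed and its output checked.
import pyspark.pandas as pp


def range_shape(n):
    """Return the shape of a small pandas-on-Spark range frame.

    Examples
    --------
    >>> range_shape(3)
    (3, 1)
    """
    return pp.range(n).shape


if __name__ == "__main__":
    import doctest
    import sys

    failures, _ = doctest.testmod()
    if failures:
        sys.exit(-1)
```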
Why are the changes needed?
Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Enabled all the doctests.