[SPARK-34972][PYTHON] Make pandas-on-Spark doctests work. #32069

ueshin · 2021-04-06T18:10:03Z

What changes were proposed in this pull request?

Now that we have merged the Koalas main code into the PySpark code base (#32036), we should enable its doctests on Spark's infrastructure.

Why are the changes needed?

Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Enabled all the doctests.

SparkQA · 2021-04-06T18:12:10Z

Test build #136963 has finished for PR 32069 at commit 06635fa.

This patch fails due to an unknown error code, 255.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class SQLProcessor(object):

SparkQA · 2021-04-06T18:56:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41540/

SparkQA · 2021-04-06T18:56:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41540/

SparkQA · 2021-04-06T19:08:37Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41541/

SparkQA · 2021-04-06T19:08:38Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41541/

SparkQA · 2021-04-06T20:28:57Z

Test build #136964 has finished for PR 32069 at commit e86ce1b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-07T00:13:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41545/

SparkQA · 2021-04-07T00:13:26Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41545/

ueshin · 2021-04-07T00:21:22Z

.github/workflows/build_and_test.yml

+          - >-
+            pyspark-pandas


Maybe we can rebalance the pyspark tests after we finish porting unit tests.

ueshin · 2021-04-07T00:23:57Z

python/pyspark/pandas/frame.py

        Examples
        --------
-        >>> pp.range(1001).style  # doctest: +ELLIPSIS
+        >>> pp.range(1001).style  # doctest: +SKIP


`style` needs Jinja2, which is an optional dependency, so the doctest is skipped.
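As a rough illustration of what the `+SKIP` directive does, here is a minimal sketch using only the standard `doctest` module. The module name `a_missing_optional_dependency_xyz` is a hypothetical stand-in for an optional package such as Jinja2 that may not be installed; it is not taken from the PR.

```python
import doctest

# Hypothetical docstring: the first example needs a package that is not installed.
DOCSTRING = """
>>> import a_missing_optional_dependency_xyz  # doctest: +SKIP
>>> 1 + 1
2
"""


def run(docstring):
    """Parse and run one docstring, returning doctest's TestResults."""
    test = doctest.DocTestParser().get_doctest(
        docstring, globs={}, name="style_example", filename=None, lineno=0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report so the sketch stays quiet.
    return runner.run(test, out=lambda text: None)


# With the directive, the import is never executed, so nothing fails.
skipped = run(DOCSTRING)

# Without the directive, the import raises and counts as one failure.
unskipped = run(DOCSTRING.replace("  # doctest: +SKIP", ""))
```

The trade-off is that a skipped example is no longer verified at all, which is acceptable here only because the dependency is optional.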

ueshin · 2021-04-07T00:24:45Z

python/pyspark/pandas/groupby.py

        ...                    'c': [3, 5, 2, 5, 1, 2, 6, 4, 3, 6]},
        ...                   columns=['a', 'b', 'c'],
-        ...                   index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
+        ...                   index=[7, 2, 3, 1, 3, 4, 9, 10, 5, 6])


This is to reduce the test flakiness.

ueshin · 2021-04-07T00:27:25Z

python/pyspark/pandas/sql_processor.py

+    import doctest
+    import sys
+    from pyspark.sql import SparkSession
+    import pyspark.pandas.sql_processor


Renamed the file from `sql.py` to `sql_processor.py` to avoid the name conflict between the `sql` module and the `pp.sql` function.
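The thread does not spell out the conflict, so the following is only one plausible reading, sketched with a throwaway package. `demo_pkg` and its contents are hypothetical and merely mimic a package whose `__init__.py` re-exports a function with the same name as the submodule that defines it. In that layout the package attribute resolves to the function, so code that expects `demo_pkg.sql` to be a module (for example, a doctest runner reading the module's `__dict__`) gets the function instead.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a hypothetical package: demo_pkg/sql.py defines sql(),
    # and demo_pkg/__init__.py re-exports it under the same name.
    pkg_dir = Path(tmp) / "demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "sql.py").write_text(
        "def sql(query):\n    return 'ran: ' + query\n"
    )
    (pkg_dir / "__init__.py").write_text("from demo_pkg.sql import sql\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_pkg.sql

        # Attribute lookup on the package finds the re-exported function ...
        attr = demo_pkg.sql
        # ... while the real submodule is only reachable through sys.modules.
        real_module = sys.modules["demo_pkg.sql"]
    finally:
        sys.path.remove(tmp)

attr_is_function = isinstance(attr, types.FunctionType)
module_is_module = isinstance(real_module, types.ModuleType)
```

Giving the file a distinct name, as this change does, keeps the module and the function addressable separately.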

SparkQA · 2021-04-07T01:35:27Z

Test build #136968 has finished for PR 32069 at commit dd1e366.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-07T02:16:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41553/

SparkQA · 2021-04-07T02:16:34Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41553/

SparkQA · 2021-04-07T04:14:03Z

Test build #136976 has finished for PR 32069 at commit 2651cfb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-04-07T11:50:06Z

Let me merge this. These tests are flaky. All tests passed (considering the previous runs).

…ch could be flaky

### What changes were proposed in this pull request?

This is a follow-up of #32069.

Makes some doctests which could be flaky skip.

### Why are the changes needed?

Some doctests in `pyspark.pandas` module enabled at #32069 could be flaky because the result row order is nondeterministic.

- groupby-apply with UDF which has a return type annotation will lose its index.
- `Index.symmetric_difference` uses `DataFrame.intersect` and `subtract` internally.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32116 from ueshin/issues/SPARK-34972/fix_flaky_tests.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
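The follow-up skips the affected doctests. A different, general way to keep a doctest stable when result order is not guaranteed is to sort the result before it is printed. The sketch below shows that alternative in plain Python; it is not what the follow-up PR does, and the `symmetric_difference` helper is a hypothetical stand-in, not the pandas-on-Spark method.

```python
import doctest


def symmetric_difference(left, right):
    """Return the items found in exactly one input, in no guaranteed order.

    Sorting inside the example makes the expected output deterministic.

    >>> sorted(symmetric_difference([1, 2, 3], [2, 3, 4]))
    [1, 4]
    """
    return list(set(left) ^ set(right))


# Run the docstring's example directly with the standard doctest machinery.
test = doctest.DocTestParser().get_doctest(
    symmetric_difference.__doc__,
    globs={"symmetric_difference": symmetric_difference},
    name="symmetric_difference",
    filename=None,
    lineno=0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
```

Sorting keeps the example under test, whereas skipping removes it from the run; which is preferable depends on whether the sorted form still documents the behavior clearly.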

Make pandas-on-Spark doctests work.

06635fa

Fix.

e86ce1b

github-actions bot added BUILD CORE INFRA PYTHON labels Apr 6, 2021

ueshin added 2 commits April 6, 2021 15:37

Fix.

dd1e366

Skip if pandas and pyarrow do not satisfy the requirements.

2651cfb

ueshin commented Apr 7, 2021


ueshin marked this pull request as ready for review April 7, 2021 03:14

ueshin requested a review from HyukjinKwon April 7, 2021 03:15

srowen approved these changes Apr 7, 2021


HyukjinKwon approved these changes Apr 7, 2021


HyukjinKwon closed this in 2635c38 Apr 7, 2021

ueshin mentioned this pull request Apr 10, 2021

[SPARK-34972][PYTHON][TEST][FOLLOWUP] Fix pyspark.pandas doctests which could be flaky. #32116

Closed
