@ueshin ueshin commented Apr 6, 2021

What changes were proposed in this pull request?

Now that the Koalas main code has been merged into the PySpark code base (#32036), we should enable its doctests on Spark's infrastructure.

Why are the changes needed?

Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Enabled all the doctests.
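As a rough illustration of what enabling doctests for a module involves, the sketch below runs `doctest.testmod` over a small in-memory module. The module name `demo_module`, the function `double`, and the helper `run_module_doctests` are hypothetical and not taken from this PR; the actual PySpark runner also sets up a Spark session, which is omitted here.

```python
import doctest
import types

# Hypothetical module source with two doctest examples in one docstring.
SOURCE = '''
def double(x):
    """
    >>> double(21)
    42
    >>> list(range(100))
    [0, 1, ..., 99]
    """
    return 2 * x
'''

# Build a throwaway module so the sketch does not depend on any real package.
demo_module = types.ModuleType("demo_module")
exec(SOURCE, demo_module.__dict__)


def run_module_doctests(module):
    """Run every doctest found in `module` and return (failed, attempted)."""
    results = doctest.testmod(
        module,
        # ELLIPSIS lets "..." in expected output match any substring.
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE,
        verbose=False,
    )
    return results.failed, results.attempted


failed, attempted = run_module_doctests(demo_module)
```

A CI entry point would typically exit with a non-zero status when `failed` is non-zero, so that the build is marked as failing.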

SparkQA commented Apr 6, 2021

Test build #136963 has finished for PR 32069 at commit 06635fa.

  • This patch fails due to an unknown error code, 255.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SQLProcessor(object):

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41540/

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41540/

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41541/

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41541/

SparkQA commented Apr 6, 2021

Test build #136964 has finished for PR 32069 at commit e86ce1b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41545/

SparkQA commented Apr 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41545/

Comment on lines +164 to +165
- >-
pyspark-pandas
Member Author

Maybe we can rebalance the pyspark tests after we finish porting unit tests.

Examples
--------
>>> pp.range(1001).style # doctest: +ELLIPSIS
>>> pp.range(1001).style # doctest: +SKIP
Member Author

style needs Jinja2, which is an optional dependency.
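For readers unfamiliar with the `+SKIP` directive, here is a minimal sketch of its effect using only the standard library. The module name `optional_dep_that_is_missing` is made up so that the import fails, standing in for an optional dependency that is not installed; `count_failures` is a hypothetical helper.

```python
import doctest

# Hypothetical docstrings: the module name below is made up so the import fails.
WITH_SKIP = """
>>> import optional_dep_that_is_missing  # doctest: +SKIP
>>> 1 + 1
2
"""
# Same examples, but with the directive removed so the import is executed.
WITHOUT_SKIP = WITH_SKIP.replace("  # doctest: +SKIP", "")


def count_failures(docstring, name):
    """Run one docstring's examples and return how many of them failed."""
    test = doctest.DocTestParser().get_doctest(docstring, {}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None).failed


skip_failures = count_failures(WITH_SKIP, "with_skip")
no_skip_failures = count_failures(WITHOUT_SKIP, "without_skip")
```

With the directive the failing import is never executed, so the docstring passes; without it the same example is reported as a failure.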

... 'c': [3, 5, 2, 5, 1, 2, 6, 4, 3, 6]},
... columns=['a', 'b', 'c'],
... index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
... index=[7, 2, 3, 1, 3, 4, 9, 10, 5, 6])
Member Author

This is to reduce test flakiness.

import doctest
import sys
from pyspark.sql import SparkSession
import pyspark.pandas.sql_processor
Member Author

Renamed the file from sql.py to sql_processor.py to avoid the name conflict between the sql module and the pp.sql function.
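One way such a conflict can arise is sketched below, assuming the package re-exports a function that has the same name as the submodule defining it. The package `demo_pkg` is hypothetical and built in a temporary directory; it is not the actual pyspark.pandas layout.

```python
import importlib
import inspect
import sys
import tempfile
from pathlib import Path

# Build a throwaway package named demo_pkg (hypothetical) on disk.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
# The submodule sql.py defines a function that is also called sql.
(pkg_dir / "sql.py").write_text("def sql(query):\n    return 'ran: ' + query\n")
# The package re-exports the function, rebinding the attribute name "sql".
(pkg_dir / "__init__.py").write_text("from demo_pkg.sql import sql\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg.sql  # noqa: E402

# The attribute on the package is now the function, not the submodule.
attr_is_function = inspect.isfunction(demo_pkg.sql)
# The submodule itself is still loaded, but only reachable via sys.modules.
module_still_loaded = inspect.ismodule(sys.modules["demo_pkg.sql"])
```

In this setup, code that passes `demo_pkg.sql` to a tool expecting a module (for example a doctest runner) receives the function instead. Giving the file a different name than the function removes the ambiguity.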

SparkQA commented Apr 7, 2021

Test build #136968 has finished for PR 32069 at commit dd1e366.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41553/

SparkQA commented Apr 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41553/

@ueshin ueshin marked this pull request as ready for review April 7, 2021 03:14
@ueshin ueshin requested a review from HyukjinKwon April 7, 2021 03:15

SparkQA commented Apr 7, 2021

Test build #136976 has finished for PR 32069 at commit 2651cfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Let me merge this. These tests are flaky. All tests passed (considering the previous runs).

HyukjinKwon pushed a commit that referenced this pull request Apr 11, 2021
…ch could be flaky

### What changes were proposed in this pull request?

This is a follow-up of #32069.

Skips some doctests that could be flaky.

### Why are the changes needed?

Some doctests in the `pyspark.pandas` module enabled in #32069 could be flaky because the result row order is nondeterministic.

- groupby-apply with a UDF that has a return type annotation loses its index.
- `Index.symmetric_difference` uses `DataFrame.intersect` and `subtract` internally.
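The commit above skips the affected doctests. A different technique, shown here only as an illustration and not as what this commit does, is to sort the result inside the doctest so the expected output no longer depends on row order. The function `unordered_rows` is a hypothetical stand-in for a query whose row order is not guaranteed; it uses a seeded shuffle rather than Spark.

```python
import doctest
import random


def unordered_rows(seed):
    """Hypothetical stand-in for a query whose row order is not guaranteed."""
    rows = [(1, "a"), (2, "b"), (3, "c")]
    random.Random(seed).shuffle(rows)
    return rows


# Sorting inside the example makes the expected output order-independent.
STABLE_DOC = """
>>> sorted(unordered_rows(seed))
[(1, 'a'), (2, 'b'), (3, 'c')]
"""


def failures_for(seed):
    """Run the doctest against one particular row ordering."""
    globs = {"unordered_rows": unordered_rows, "seed": seed}
    test = doctest.DocTestParser().get_doctest(STABLE_DOC, globs, "stable", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test, out=lambda text: None).failed


# Try many different orderings; the sorted doctest should pass for each one.
stable_failures = [failures_for(seed) for seed in range(20)]
```

Sorting keeps the example exercised in CI, at the cost of a docstring that no longer shows the raw, unsorted result.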

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32116 from ueshin/issues/SPARK-34972/fix_flaky_tests.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>