[SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator #19204

mgaido91 · 2017-09-12T11:28:10Z

What changes were proposed in this pull request?

Added Python interface for ClusteringEvaluator

How was this patch tested?

Manual test, e.g. the example Python code in the comments.

cc @yanboliang

WeichenXu123 · 2017-09-13T10:13:26Z

Jenkins, test this please.

WeichenXu123

We should add the metric interface to the Python API (the metricName Param plus the setMetricName/getMetricName methods), even though only one metric is supported for now.
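To make the requested interface concrete, here is a minimal plain-Python sketch of a metric-name parameter with a getter and a setter. The class `ClusteringEvaluatorSketch` and its `_param_map` attribute are hypothetical stand-ins for illustration; the actual patch would build on pyspark's `Param` machinery instead.

```python
class ClusteringEvaluatorSketch:
    """Plain-Python stand-in for illustration; not the pyspark class."""

    # The only metric mentioned in this thread.
    DEFAULT_METRIC = "silhouette"

    def __init__(self, metricName=None):
        # Explicitly set values live here; defaults are kept separately.
        self._param_map = {}
        if metricName is not None:
            self.setMetricName(metricName)

    def setMetricName(self, value):
        """Store an explicit value and return self so calls can be chained."""
        self._param_map["metricName"] = value
        return self

    def getMetricName(self):
        """Return the explicit value if one was set, otherwise the default."""
        return self._param_map.get("metricName", self.DEFAULT_METRIC)


evaluator = ClusteringEvaluatorSketch()
print(evaluator.getMetricName())
```

Exposing the parameter now, even with a single supported value, should mean that callers do not have to change their code if more metrics are added later.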

mgaido91 · 2017-09-13T16:17:32Z

Thanks @WeichenXu123, I added it.

gatorsmile · 2017-09-13T16:58:04Z

ok to test

SparkQA · 2017-09-13T17:03:28Z

Test build #81727 has finished for PR 19204 at commit 5a6f9b4.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-09-13T17:45:29Z

Test build #81729 has finished for PR 19204 at commit cd84d66.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-09-13T18:31:01Z

Test build #81732 has finished for PR 19204 at commit e684dbb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123

LGTM except one comment. Thanks!

WeichenXu123 · 2017-09-13T23:19:06Z

python/pyspark/ml/evaluation.py

+    """
+    metricName = Param(Params._dummy(), "metricName",
+                       "metric name in evaluation "
+                       "(silhouette)",


This string literal spans multiple lines, so we should use """ instead of ". Otherwise, move the pieces onto a single line.
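For reference, the two options are not equivalent in Python. The sketch below reuses the description text from the diff above; the variable names are illustrative only. Adjacent quoted literals are joined into one string, while a triple-quoted literal keeps the line break and any indentation that follows it.

```python
# Adjacent string literals inside parentheses are joined into one string.
implicit = ("metric name in evaluation "
            "(silhouette)")

# A triple-quoted literal keeps the newline and the leading spaces.
triple = """metric name in evaluation
            (silhouette)"""

# The same text written on a single line.
single_line = "metric name in evaluation (silhouette)"

print(implicit == single_line)
print("\n" in triple)
```

Moving the text onto one line therefore keeps the description unchanged, whereas switching to a triple-quoted literal would embed a line break in it.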

SparkQA · 2017-09-14T09:31:01Z

Test build #81777 has finished for PR 19204 at commit 7e8bcc7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123

LGTM. Thanks!

mgaido91 · 2017-09-14T14:53:11Z

Thank you for your review and help @WeichenXu123!

yanboliang · 2017-09-17T14:32:15Z

python/pyspark/ml/evaluation.py

+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()


Please don't involve other libraries unless necessary. The doc test is here to show new users how to use ClusteringEvaluator, so we should focus on the evaluator and keep it as simple as possible. You can refer to the other evaluators for how to construct a simple dataset.

yanboliang · 2017-09-17T14:32:33Z

python/pyspark/ml/evaluation.py

+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator


Remove this, it's not necessary.

yanboliang · 2017-09-17T14:39:39Z

python/pyspark/ml/evaluation.py

+        super(ClusteringEvaluator, self).__init__()
+        self._java_obj = self._new_java_obj(
+            "org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
+        self._setDefault(predictionCol="prediction", featuresCol="features",


Remove the default values for predictionCol and featuresCol, as they are already set in HasPredictionCol and HasFeaturesCol.

I sent #19262 to fix the same issue in the other evaluators; please feel free to comment. Thanks.
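The reason the extra defaults are redundant can be sketched in plain Python. The classes below borrow the names of the pyspark mixins, but their bodies are simplified stand-ins written for this illustration, not the library's implementation; `EvaluatorSketch` is a hypothetical name.

```python
class Params:
    """Tiny stand-in for a parameter container (not pyspark's Params)."""

    def __init__(self):
        self._defaults = {}

    def _setDefault(self, **kwargs):
        self._defaults.update(kwargs)
        return self

    def getOrDefault(self, name):
        return self._defaults[name]


class HasPredictionCol(Params):
    def __init__(self):
        super().__init__()
        # The mixin registers its own default.
        self._setDefault(predictionCol="prediction")


class HasFeaturesCol(Params):
    def __init__(self):
        super().__init__()
        # The mixin registers its own default.
        self._setDefault(featuresCol="features")


class EvaluatorSketch(HasPredictionCol, HasFeaturesCol):
    def __init__(self):
        # Cooperative super() calls run both mixin initializers.
        super().__init__()
        # Only the parameter owned by this class needs a default here.
        self._setDefault(metricName="silhouette")


evaluator = EvaluatorSketch()
print(evaluator.getOrDefault("predictionCol"), evaluator.getOrDefault("featuresCol"))
```

Because each mixin initializer already registers its default, repeating those values in the subclass only duplicates them and risks the two copies drifting apart.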

yanboliang · 2017-09-17T14:46:43Z

python/pyspark/ml/evaluation.py

+    ...     for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...    StructField("features", VectorUDT(), True),
+    ...    StructField("cluster_id", IntegerType(), True)])


cluster_id -> prediction, to emphasize that this is the predicted value, not the ground truth.
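To illustrate why no ground-truth label is involved, here is a rough plain-Python sketch of a mean silhouette score computed from (features, prediction) pairs only. It assumes squared Euclidean distance and a score of 0 for singleton clusters; the function names and the tiny dataset are made up for this example, and this is not the Spark implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_silhouette(rows):
    """rows: list of (features, prediction) pairs with at least two clusters."""
    scores = []
    for i, (point, cluster) in enumerate(rows):
        # Accumulate distances from this point to every other point, per cluster.
        sums, counts = {}, {}
        for j, (other, other_cluster) in enumerate(rows):
            if i == j:
                continue
            dist = squared_distance(point, other)
            sums[other_cluster] = sums.get(other_cluster, 0.0) + dist
            counts[other_cluster] = counts.get(other_cluster, 0) + 1
        if cluster not in counts:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sums[cluster] / counts[cluster]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in counts if c != cluster)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: each row is (features, prediction); there is no label column.
feature_and_predictions = [
    ([0.0, 0.5], 0), ([0.5, 0.0], 0),
    ([10.0, 11.0], 1), ([10.5, 11.5], 1),
]
score = mean_silhouette(feature_and_predictions)
print(score)
```

The score depends only on the features and the assigned cluster of each row, which is why the column is better named prediction than something that suggests a known label.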

SparkQA · 2017-09-17T16:20:08Z

Test build #81858 has finished for PR 19204 at commit a980e6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang

One minor issue, otherwise, LGTM. Thanks.

yanboliang · 2017-09-19T14:57:31Z

python/pyspark/ml/evaluation.py

+    columns: prediction and features.
+
+    >>> from pyspark.ml.linalg import Vectors
+    >>> scoreAndLabels = map(lambda x: (Vectors.dense(x[0]), x[1]),


scoreAndLabels -> featureAndPredictions. The dataset here is different from those of the other evaluators, so we should use a more accurate name. Thanks.

SparkQA · 2017-09-21T17:31:41Z

Test build #82040 has finished for PR 19204 at commit 8735c4c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-09-22T05:13:19Z

Merged into master, thanks.

Added python interface for ClusteringEvaluator

31b3c6c

WeichenXu123 reviewed Sep 13, 2017

Add metricName

5a6f9b4

fix pylint error

cd84d66

fix metricName

e684dbb

WeichenXu123 reviewed Sep 13, 2017

address comment

7e8bcc7

WeichenXu123 approved these changes Sep 14, 2017

yanboliang reviewed Sep 17, 2017

Address review comments

a980e6b

yanboliang reviewed Sep 19, 2017

rename example variable

8735c4c

asfgit closed this in 5ac9685 Sep 22, 2017

mgaido91 deleted the SPARK-21981 branch November 4, 2017 08:48
