Conversation

@mgaido91 (Contributor) commented Sep 12, 2017

What changes were proposed in this pull request?

Added a Python interface for ClusteringEvaluator.

How was this patch tested?

Manual test, e.g. the example Python code in the comments.

cc @yanboliang

@WeichenXu123 (Contributor)

Jenkins, test this please.

@WeichenXu123 (Contributor) left a comment


We should also add the metric interface to the Python API (the metricName Param and the setMetricName/getMetricName methods), even though there is only one metric for now.
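For readers unfamiliar with the pattern, here is a minimal standalone sketch of the getter/setter interface described in this comment. The class and attribute names are hypothetical illustrations only; this does not use the real pyspark.ml.param machinery, it just mirrors the shape of the interface.

```python
# Standalone sketch of the metricName getter/setter pattern suggested above.
# DemoClusteringEvaluator is a hypothetical stand-in, not the real evaluator.
class DemoClusteringEvaluator:
    def __init__(self, metricName="silhouette"):
        self._paramMap = {}
        self.setMetricName(metricName)

    def setMetricName(self, value):
        # Only one metric exists for now, but the interface stays generic.
        if value != "silhouette":
            raise ValueError("Unsupported metric: %s" % value)
        self._paramMap["metricName"] = value
        return self  # setters return self so calls can be chained

    def getMetricName(self):
        return self._paramMap["metricName"]
```

PySpark's real shared params also carry a Param descriptor and documentation string, but the set/get shape is the same.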

@mgaido91 (Contributor, Author)

Thanks @WeichenXu123, I added it.

@gatorsmile (Member)

ok to test


SparkQA commented Sep 13, 2017

Test build #81727 has finished for PR 19204 at commit 5a6f9b4.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 13, 2017

Test build #81729 has finished for PR 19204 at commit cd84d66.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 13, 2017

Test build #81732 has finished for PR 19204 at commit e684dbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor) left a comment


LGTM except one comment. Thanks!

"""
metricName = Param(Params._dummy(), "metricName",
"metric name in evaluation "
"(silhouette)",

The string spans multiple lines, so we should use """ instead of adjacent "..." literals; otherwise, move it onto a single line.
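The point about implicit concatenation can be illustrated with plain Python string literals (a standalone sketch, not the evaluator's actual code):

```python
# Adjacent string literals are concatenated implicitly; split across lines,
# the join point (and any missing space) is easy to misread:
desc_concat = ("metric name in evaluation "
               "(silhouette)")

# A single-line literal says the same thing unambiguously:
desc_single = "metric name in evaluation (silhouette)"
```

Both forms produce the identical string; the review comment is about readability, not behavior.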


SparkQA commented Sep 14, 2017

Test build #81777 has finished for PR 19204 at commit 7e8bcc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor) left a comment


LGTM. Thanks!

@mgaido91 (Contributor, Author)

Thank you for your review and help @WeichenXu123!

>>> from pyspark.ml.linalg import Vectors, VectorUDT
>>> from pyspark.ml.evaluation import ClusteringEvaluator
...
>>> iris = datasets.load_iris()

Please don't involve other libraries unless necessary. The doctest here shows new users how to use ClusteringEvaluator, so we should focus on the evaluator and keep it as simple as possible. You can refer to the other evaluators to construct a simple dataset.
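As background on what the evaluator measures, here is a plain-Python sketch of the textbook silhouette coefficient on a tiny hand-built dataset. This is illustrative only: the function and variable names are made up, and Spark's ClusteringEvaluator has its own implementation whose distance measure may differ.

```python
from math import dist  # Euclidean distance, Python 3.8+

def mean_silhouette(points, labels):
    """Textbook silhouette: s(i) = (b - a) / max(a, b), where a is the mean
    distance from point i to its own cluster and b is the smallest mean
    distance to any other cluster. Assumes every cluster has >= 2 points."""
    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        other = {}
        for j, q in enumerate(points):
            if labels[j] != labels[i]:
                other.setdefault(labels[j], []).append(q)
        a = sum(dist(p, q) for q in same) / len(same)
        b = min(sum(dist(p, q) for q in qs) / len(qs)
                for qs in other.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to 1.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
print(mean_silhouette(pts, [0, 0, 1, 1]))
```

A dataset this small is all the doctest needs to exercise the evaluator, which is the reviewer's point about dropping the iris dependency.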

>>> from sklearn import datasets
>>> from pyspark.sql.types import *
>>> from pyspark.ml.linalg import Vectors, VectorUDT
>>> from pyspark.ml.evaluation import ClusteringEvaluator

Remove this, it's not necessary.

super(ClusteringEvaluator, self).__init__()
self._java_obj = self._new_java_obj(
    "org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
self._setDefault(predictionCol="prediction", featuresCol="features",

Remove the default values for predictionCol and featuresCol; they are already set in HasPredictionCol and HasFeaturesCol.


I sent #19262 to fix the same issue in the other evaluators; please feel free to comment. Thanks.

... for i, x in enumerate(iris.data)]
>>> schema = StructType([
...     StructField("features", VectorUDT(), True),
...     StructField("cluster_id", IntegerType(), True)])

cluster_id -> prediction, to emphasize that this is the prediction value, not the ground truth.


SparkQA commented Sep 17, 2017

Test build #81858 has finished for PR 19204 at commit a980e6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment


One minor issue; otherwise LGTM. Thanks.

columns: prediction and features.
>>> from pyspark.ml.linalg import Vectors
>>> scoreAndLabels = map(lambda x: (Vectors.dense(x[0]), x[1]),

scoreAndLabels -> featureAndPredictions; the dataset here is different from the other evaluators', so we should use a more accurate name. Thanks.


SparkQA commented Sep 21, 2017

Test build #82040 has finished for PR 19204 at commit 8735c4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor)

Merged into master, thanks.

@asfgit asfgit closed this in 5ac9685 Sep 22, 2017
@mgaido91 mgaido91 deleted the SPARK-21981 branch November 4, 2017 08:48