[SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator #19204
Conversation
Jenkins, test this please.
We'd better add a metric interface to the Python API (including the metricName Param and the setMetricName/getMetricName methods), although there is only one metric for now.
Thanks @WeichenXu123, I added it.
ok to test
Test build #81727 has finished for PR 19204 at commit
Test build #81729 has finished for PR 19204 at commit
Test build #81732 has finished for PR 19204 at commit
LGTM except one comment. Thanks!
python/pyspark/ml/evaluation.py
Outdated
    """
    metricName = Param(Params._dummy(), "metricName",
                       "metric name in evaluation "
                       "(silhouette)",
The string spans multiple lines, so we should use """ instead of "". Otherwise, move it onto a single line.
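For readers unfamiliar with the two styles being compared, here is a small standalone sketch in plain Python (no Spark needed). The variable names are hypothetical and chosen only for illustration; the string content is copied from the snippet under review.

```python
# Adjacent string literals inside parentheses are joined by the parser.
# No newline is inserted between the pieces.
implicit = ("metric name in evaluation "
            "(silhouette)")

# A triple-quoted string keeps the line break exactly as written.
triple = """metric name in evaluation
(silhouette)"""

# The single-line form the reviewer offers as the alternative.
single_line = "metric name in evaluation (silhouette)"

print(implicit == single_line)  # the implicit form equals the one-line form
print("\n" in triple)           # the triple-quoted form contains a newline
```

The two multi-line forms are therefore not interchangeable: the triple-quoted version changes the resulting description text unless the line break is removed.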
Test build #81777 has finished for PR 19204 at commit
LGTM. Thanks!
Thank you for your review and help @WeichenXu123!
python/pyspark/ml/evaluation.py
Outdated
>>> from pyspark.ml.linalg import Vectors, VectorUDT
>>> from pyspark.ml.evaluation import ClusteringEvaluator
...
>>> iris = datasets.load_iris()
Please don't involve other libraries unless necessary. The doctest here shows new users how to use ClusteringEvaluator, so we should focus on the evaluator and keep it as simple as possible. You can refer to the other evaluators to see how to construct a simple dataset.
python/pyspark/ml/evaluation.py
Outdated
>>> from sklearn import datasets
>>> from pyspark.sql.types import *
>>> from pyspark.ml.linalg import Vectors, VectorUDT
>>> from pyspark.ml.evaluation import ClusteringEvaluator
Remove this, it's not necessary.
python/pyspark/ml/evaluation.py
Outdated
super(ClusteringEvaluator, self).__init__()
self._java_obj = self._new_java_obj(
    "org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
self._setDefault(predictionCol="prediction", featuresCol="features",
Remove the default value settings for predictionCol and featuresCol, as they have already been set in HasPredictionCol and HasFeaturesCol.
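The point about inherited defaults can be sketched in plain Python. Everything below is a hypothetical stand-in (the `*Sketch` classes and their methods are not the PySpark classes); it only illustrates how a default registered in a mixin's `__init__` is already present by the time the subclass constructor runs, so setting it again is redundant. The `metricName` default shown is an assumption for illustration.

```python
class ParamsSketch:
    """Hypothetical stand-in for a params container; not the PySpark class."""

    def __init__(self):
        self._defaults = {}

    def _set_default(self, **kwargs):
        # Record default values by parameter name.
        self._defaults.update(kwargs)
        return self

    def get_or_default(self, name):
        return self._defaults[name]


class HasPredictionColSketch(ParamsSketch):
    def __init__(self):
        super().__init__()
        self._set_default(predictionCol="prediction")


class HasFeaturesColSketch(ParamsSketch):
    def __init__(self):
        super().__init__()
        self._set_default(featuresCol="features")


class ClusteringEvaluatorSketch(HasPredictionColSketch, HasFeaturesColSketch):
    def __init__(self):
        # Cooperative super() calls run both mixin constructors first.
        super().__init__()
        # Only the evaluator-specific default needs to be set here.
        self._set_default(metricName="silhouette")


evaluator = ClusteringEvaluatorSketch()
print(evaluator.get_or_default("predictionCol"))
print(evaluator.get_or_default("featuresCol"))
```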
I sent #19262 to fix the same issue for other evaluators; please feel free to comment. Thanks.
python/pyspark/ml/evaluation.py
Outdated
... for i, x in enumerate(iris.data)]
>>> schema = StructType([
...     StructField("features", VectorUDT(), True),
...     StructField("cluster_id", IntegerType(), True)])
cluster_id -> prediction, to emphasize this is the prediction value, not ground truth.
Test build #81858 has finished for PR 19204 at commit
One minor issue, otherwise, LGTM. Thanks.
python/pyspark/ml/evaluation.py
Outdated
columns: prediction and features.
>>> from pyspark.ml.linalg import Vectors
>>> scoreAndLabels = map(lambda x: (Vectors.dense(x[0]), x[1]),
scoreAndLabels -> featureAndPredictions. The dataset here is different from the ones used by other evaluators, so we should use a more accurate name. Thanks.
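To make the naming point concrete: a silhouette score is computed from feature vectors paired with predicted cluster ids, not from (score, label) pairs. Below is a minimal pure-Python sketch of a mean silhouette. The squared Euclidean distance, the helper names, and the toy data are all assumptions made for illustration; this is not the Spark implementation.

```python
from collections import defaultdict


def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_silhouette(feature_and_predictions):
    """Mean silhouette over (features, prediction) pairs.

    Assumes at least two distinct clusters are present.
    """
    clusters = defaultdict(list)
    for features, prediction in feature_and_predictions:
        clusters[prediction].append(features)

    scores = []
    for features, prediction in feature_and_predictions:
        own = clusters[prediction]
        if len(own) == 1:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # Mean distance to the other members of the point's own cluster
        # (the distance to itself is 0, so it does not affect the sum).
        a = sum(squared_euclidean(features, o) for o in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(squared_euclidean(features, o) for o in members) / len(members)
            for label, members in clusters.items()
            if label != prediction
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (illustrative values): each item is (features, prediction).
feature_and_predictions = [
    ([0.0, 0.5], 0), ([0.5, 0.0], 0), ([1.0, 1.0], 0),
    ([10.0, 11.0], 1), ([10.5, 11.5], 1), ([8.0, 6.0], 1),
]
score = mean_silhouette(feature_and_predictions)
print(round(score, 4))
```

The input is a list rather than a `map` object because the sketch iterates over it twice.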
Test build #82040 has finished for PR 19204 at commit
Merged into master, thanks.
What changes were proposed in this pull request?
Added Python interface for ClusteringEvaluator
How was this patch tested?
Manual test, e.g. the example Python code in the comments.
cc @yanboliang