Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

In regression/clustering/ovr/als, if an output column name is empty, igore it. And if all names are empty, log a warning msg, then do nothing.

How was this patch tested?

existing tests

@SparkQA
Copy link

SparkQA commented Jun 4, 2019

Test build #106149 has finished for PR 24793 at commit 6c2a81a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 4, 2019

Test build #106150 has finished for PR 24793 at commit a0e4d54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

This pr can help to avoid uncessary computation. For example, in GMM, current impl always predict twice, one for probabilityCol and one for probabilityCol. if we only need probability, the first col can be skipped.

@zhengruifeng
Copy link
Contributor Author

ping @srowen , would you mind help reviewing this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this happens given the check at the start of the method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise how would this be empty, unless the user set it to nothing? in which case, this seems not worth worrying about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed here to keep in line with other algs like LDA.transform
Or leave alone algs with only one output column, and remove the check in algs like LDA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'd say we don't need this check anywhere that the user would have to explicitly set no prediction column to get no output, and in that case, I don't think it's worth checking and warning. I'm neutral on removing the other checks, but not against it.

Some checks are OK like the ones above as it might be easier to accidentally get into this situation because there are multiple prediction cols.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS we should fix these messages. "called as NOOP" is cryptic. Just say "transform() does nothing because ..."

@zhengruifeng zhengruifeng force-pushed the aft_iso_check_empty_outputCol branch from a0e4d54 to f02051b Compare July 12, 2019 03:38
@SparkQA
Copy link

SparkQA commented Jul 12, 2019

Test build #107577 has finished for PR 24793 at commit 90eb0ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng force-pushed the aft_iso_check_empty_outputCol branch from 90eb0ac to f010ff9 Compare July 15, 2019 04:57
@SparkQA
Copy link

SparkQA commented Jul 15, 2019

Test build #107665 has finished for PR 24793 at commit f010ff9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we avoid this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will revert these place.

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

import org.apache.spark.mllib.tree.model.{DecisionTreeModel => OldDecisionTreeModel}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@SparkQA
Copy link

SparkQA commented Jul 16, 2019

Test build #107711 has finished for PR 24793 at commit 388e5fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 16, 2019

Merged to master

@srowen srowen closed this in 282a12d Jul 16, 2019
@zhengruifeng zhengruifeng deleted the aft_iso_check_empty_outputCol branch July 17, 2019 01:48
vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
…names

## What changes were proposed in this pull request?
In regression/clustering/ovr/als, if an output column name is empty, igore it. And if all names are empty, log a warning msg, then do nothing.

## How was this patch tested?
existing tests

Closes apache#24793 from zhengruifeng/aft_iso_check_empty_outputCol.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants