Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Feb 18, 2020

What changes were proposed in this pull request?

This PR reverts #26325 and #26347

It also removes a test added by SPARK-30047. We can't use interval in aggregate now.

Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of 1 month and 2 month is 1 month 15 days, which assumes 1 month has 30 days and we should avoid this assumption.

Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

How was this patch tested?

N/A

@cloud-fan
Copy link
Contributor Author

cc @yaooqinn @hvanhovell @maryannxue

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@yaooqinn yaooqinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Feb 18, 2020

Test build #118633 has finished for PR 27619 at commit 8ffc9eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 18, 2020

Test build #118642 has finished for PR 27619 at commit b25e491.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

Merging to master/branch-3.0.

hvanhovell pushed a commit that referenced this pull request Feb 18, 2020
### What changes were proposed in this pull request?

This PR reverts #26325 and #26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption.

### Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

### How was this patch tested?

N/A

Closes #27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: herman <[email protected]>
(cherry picked from commit 1b67d54)
Signed-off-by: herman <[email protected]>
}
}

test("calendar interval agg support hash aggregate") {
Copy link
Member

@dongjoon-hyun dongjoon-hyun Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @cloud-fan .
Although this is a correct removal, this is a part of [SPARK-30047][SQL] Support interval types in UnsafeRow which is not aimed in the PR title and description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support interval in UnsafeRow is fine. It has perf benefits, not just for hash aggregate, so we shouldn't revert it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm not asking to revert SPARK-30047. I'm saying that it would be great if this PR title and description mentioned this clearly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to the PR description.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This PR reverts apache#26325 and apache#26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption.

### Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

### How was this patch tested?

N/A

Closes apache#27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: herman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants