Skip to content

Conversation

@yaooqinn
Copy link
Member

What changes were proposed in this pull request?

avg aggregate support interval type values

Why are the changes needed?

Part of SPARK-27764 Feature Parity between PostgreSQL and Spark

Does this PR introduce any user-facing change?

yes, we can do avg on intervals

How was this patch tested?

add ut

@SparkQA
Copy link

SparkQA commented Oct 31, 2019

Test build #113031 has finished for PR 26347 at commit a62d9db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class IntervalDivide(left: Expression, right: Expression)

@SparkQA
Copy link

SparkQA commented Nov 1, 2019

Test build #113058 has finished for PR 26347 at commit 91792ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

yaooqinn commented Nov 1, 2019

cc @cloud-fan @wangyum @gatorsmile

@SparkQA
Copy link

SparkQA commented Nov 4, 2019

Test build #113206 has finished for PR 26347 at commit 0102f37.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Nov 4, 2019

also cc: @MaxGekk

@maropu
Copy link
Member

maropu commented Nov 4, 2019

@yaooqinn Can you fix the build failure first?

@yaooqinn
Copy link
Member Author

yaooqinn commented Nov 5, 2019

@maropu thanks for your review, I have had it rebased with newest master branch, thanks. I appreciate for your time to take another look.

@maropu
Copy link
Member

maropu commented Nov 5, 2019

btw, this pr includes SPARK-29387, too? If so, how about fixing more basic operators (div and mul) for intervals then implementing avg in follow-up?

@yaooqinn
Copy link
Member Author

yaooqinn commented Nov 5, 2019

Yes, avg needs interval divide feature support.

@yaooqinn
Copy link
Member Author

yaooqinn commented Nov 5, 2019

#26345 I proposed this to support multiply and divide without noticing spark-29378@maropu,intended to make these step by step

@maropu
Copy link
Member

maropu commented Nov 5, 2019

I will revisit this after #26132 merged.

@SparkQA
Copy link

SparkQA commented Nov 5, 2019

Test build #113232 has finished for PR 26347 at commit 3c76add.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Decimal(interval.days) / divisor
val milliseconds = days.remainder(Decimal.ONE) * Decimal(DateTimeUtils.MICROS_PER_DAY) +
Decimal(interval.microseconds) / divisor
new CalendarInterval(months.toInt, days.toInt, milliseconds.toLong)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

milliseconds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

microseconds. thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't be needed when your pr will be merged. I'd like to leave it for a while.

@SparkQA
Copy link

SparkQA commented Nov 7, 2019

Test build #113385 has finished for PR 26347 at commit 161d742.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DateTimeConstants
  • case class LocalShuffleReaderExec(child: SparkPlan) extends UnaryExecNode
  • class ContinuousRecordEndpoint(buckets: Seq[Seq[UnsafeRow]], lock: Object)

@SparkQA
Copy link

SparkQA commented Nov 7, 2019

Test build #113386 has finished for PR 26347 at commit 6ac44e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

yaooqinn commented Nov 8, 2019

ping @maropu @MaxGekk @cloud-fan, this have been rebased with lastest interval divide behavior change, thanks very much for review one more time, sorry for the delay.


-- average with interval type
-- null
select avg(cast(v as interval)) from VALUES ('1 seconds'), ('2 seconds'), (null) t(v) where v is null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it the same with select avg(cast(v as interval)) from VALUES (null) t(v)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I changed it in your way. thanks.

@SparkQA
Copy link

SparkQA commented Nov 8, 2019

Test build #113425 has finished for PR 26347 at commit 3443c18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private lazy val resultType = child.dataType match {
case DecimalType.Fixed(p, s) =>
DecimalType.bounded(p + 4, s + 4)
case CalendarIntervalType => CalendarIntervalType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: case interval: CalendarIntervalType => interval?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will change this and the one in the sumDataType

@yaooqinn
Copy link
Member Author

yaooqinn commented Nov 8, 2019

@cloud-fan @maropu thanks for your review, I will update with the minor revision in a minute. Sorry for the delay, focused on the spark unnest implementation just now

@SparkQA
Copy link

SparkQA commented Nov 8, 2019

Test build #113452 has finished for PR 26347 at commit 2e06305.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e7f7990 Nov 8, 2019
hvanhovell pushed a commit that referenced this pull request Feb 18, 2020
### What changes were proposed in this pull request?

This PR reverts #26325 and #26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption.

### Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

### How was this patch tested?

N/A

Closes #27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: herman <[email protected]>
hvanhovell pushed a commit that referenced this pull request Feb 18, 2020
### What changes were proposed in this pull request?

This PR reverts #26325 and #26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption.

### Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

### How was this patch tested?

N/A

Closes #27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: herman <[email protected]>
(cherry picked from commit 1b67d54)
Signed-off-by: herman <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This PR reverts apache#26325 and apache#26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption.

### Does this PR introduce any user-facing change?

yes, remove 2 features added in 3.0

### How was this patch tested?

N/A

Closes apache#27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: herman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants