[SPARK-29688][SQL] Support average for interval type values #26347
|
Test build #113031 has finished for PR 26347 at commit
|
|
Test build #113058 has finished for PR 26347 at commit
|
|
Test build #113206 has finished for PR 26347 at commit
|
|
also cc: @MaxGekk |
|
@yaooqinn Can you fix the build failure first? |
|
@maropu thanks for your review. I have rebased it onto the newest master branch. I would appreciate it if you could take another look. |
|
btw, does this PR include SPARK-29387 too? If so, how about fixing the more basic operators (div and mul) for intervals first, then implementing avg in a follow-up? |
|
Yes, avg needs interval divide feature support. |
|
@maropu I proposed #26345 to support multiply and divide without noticing SPARK-29387; I intended to do these step by step. |
|
I will revisit this after #26132 is merged. |
|
Test build #113232 has finished for PR 26347 at commit
|
  Decimal(interval.days) / divisor
val milliseconds = days.remainder(Decimal.ONE) * Decimal(DateTimeUtils.MICROS_PER_DAY) +
  Decimal(interval.microseconds) / divisor
new CalendarInterval(months.toInt, days.toInt, milliseconds.toLong)
milliseconds?
microseconds. thanks.
This won't be needed once your PR is merged. I'd like to leave it as is for now.
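The fractional carry-over discussed in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: `Interval` is a hypothetical stand-in for Spark's `CalendarInterval`, `BigDecimal` replaces Spark's `Decimal`, and the 30-days-per-month factor used to carry fractional months into days is an assumption, since the diff excerpt above does not show that line.

```scala
object IntervalDivideSketch {
  // Hypothetical stand-in for Spark's CalendarInterval (illustration only).
  final case class Interval(months: Int, days: Int, microseconds: Long)

  // Assumption: fractional months are carried into days at 30 days per month.
  val DaysPerMonth: BigDecimal = BigDecimal(30)
  val MicrosPerDay: BigDecimal = BigDecimal(24L * 60 * 60 * 1000 * 1000)

  // Drop the fractional part, rounding toward zero.
  private def truncate(x: BigDecimal): BigDecimal =
    x.setScale(0, BigDecimal.RoundingMode.DOWN)

  def divide(interval: Interval, divisor: BigDecimal): Interval = {
    require(divisor.signum != 0, "divide by zero")
    val months = BigDecimal(interval.months) / divisor
    // The fractional part of the months spills into days.
    val days = (months - truncate(months)) * DaysPerMonth +
      BigDecimal(interval.days) / divisor
    // The fractional part of the days spills into microseconds.
    val microseconds = (days - truncate(days)) * MicrosPerDay +
      BigDecimal(interval.microseconds) / divisor
    Interval(truncate(months).toInt, truncate(days).toInt, truncate(microseconds).toLong)
  }
}
```

Naming the last value `microseconds` matches the unit of `MICROS_PER_DAY`, which is the point of the review comment above.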
|
Test build #113385 has finished for PR 26347 at commit
|
|
Test build #113386 has finished for PR 26347 at commit
|
|
ping @maropu @MaxGekk @cloud-fan, this has been rebased onto the latest interval divide behavior change. Thanks very much for reviewing one more time, and sorry for the delay. |
|
|
-- average with interval type
-- null
select avg(cast(v as interval)) from VALUES ('1 seconds'), ('2 seconds'), (null) t(v) where v is null;
Is it the same as select avg(cast(v as interval)) from VALUES (null) t(v)?
Yes, I changed it as you suggested. Thanks.
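The equivalence raised in this thread can be illustrated with a small sketch of null-aware averaging. It is not Spark code: `None` plays the role of SQL NULL, the values are plain seconds as `Double`, and the object and function names are hypothetical.

```scala
object NullAwareAvgSketch {
  // Average of the non-null inputs; None stands in for SQL NULL.
  // With no non-null input the result is None, as with SQL's AVG.
  def avg(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum / present.size)
  }

  // Rows analogous to VALUES ('1 seconds'), ('2 seconds'), (null).
  val rows: Seq[Option[Double]] = Seq(Some(1.0), Some(2.0), None)

  // Analogue of the filter `where v is null`: only the null row survives.
  val onlyNulls: Seq[Option[Double]] = rows.filter(_.isEmpty)
}
```

Filtering down to the null rows and then averaging gives the same result as averaging a single null value, which is why the simpler query tests the same case.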
|
Test build #113425 has finished for PR 26347 at commit
|
private lazy val resultType = child.dataType match {
  case DecimalType.Fixed(p, s) =>
    DecimalType.bounded(p + 4, s + 4)
  case CalendarIntervalType => CalendarIntervalType
nit: case interval: CalendarIntervalType => interval?
ok, I will change this and the one in the sumDataType
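The two pattern styles from this nit can be compared in a self-contained sketch. The types below are hypothetical stand-ins that only imitate the shape of Spark's `DataType` hierarchy (a class plus a companion case object for the interval type); the `DoubleType` fallback and the cap of 38 in `bounded` are assumptions added to make the example complete.

```scala
object ResultTypeSketch {
  sealed trait DataType
  case object DoubleType extends DataType

  final case class DecimalType(precision: Int, scale: Int) extends DataType
  object DecimalType {
    val MaxPrecision = 38 // assumed cap for both precision and scale
    def bounded(precision: Int, scale: Int): DecimalType =
      DecimalType(math.min(precision, MaxPrecision), math.min(scale, MaxPrecision))
  }

  // A class with a singleton companion, so both match styles below apply.
  class CalendarIntervalType private () extends DataType
  case object CalendarIntervalType extends CalendarIntervalType

  // Style in the diff: match the singleton object and name it again.
  def resultTypeByObject(child: DataType): DataType = child match {
    case DecimalType(p, s) => DecimalType.bounded(p + 4, s + 4)
    case CalendarIntervalType => CalendarIntervalType
    case _ => DoubleType
  }

  // Style suggested in review: match on the type and reuse the bound value.
  def resultTypeByType(child: DataType): DataType = child match {
    case DecimalType(p, s) => DecimalType.bounded(p + 4, s + 4)
    case interval: CalendarIntervalType => interval
    case _ => DoubleType
  }
}
```

Both functions return the same result for every input here; the typed pattern simply avoids repeating the object name on the right-hand side.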
|
@cloud-fan @maropu thanks for your review, I will update with the minor revision in a minute. Sorry for the delay, focused on the spark |
|
Test build #113452 has finished for PR 26347 at commit
|
|
thanks, merging to master! |
### What changes were proposed in this pull request?
This PR reverts #26325 and #26347

### Why are the changes needed?
When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type of interval. Also the semantic is unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days and we should avoid this assumption.

### Does this PR introduce any user-facing change?
yes, remove 2 features added in 3.0

### How was this patch tested?
N/A

Closes #27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: herman <[email protected]>
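Both concerns in the revert message can be reproduced with a small sketch. It is not Spark's implementation: `Interval` is a hypothetical stand-in for `CalendarInterval`, the averaging is simplified to integer arithmetic, and the 30-days-per-month constant is exactly the assumption the revert objects to.

```scala
object IntervalAvgCaveats {
  // Hypothetical stand-in for Spark's CalendarInterval (illustration only).
  final case class Interval(months: Int, days: Int, microseconds: Long)

  val DaysPerMonth = 30 // the questionable assumption

  // Field-wise sum; each field overflows independently of the others.
  // Math.addExact throws ArithmeticException instead of wrapping around.
  def add(a: Interval, b: Interval): Interval =
    Interval(
      Math.addExact(a.months, b.months),
      Math.addExact(a.days, b.days),
      Math.addExact(a.microseconds, b.microseconds))

  // Simplified average: leftover months are converted to days at 30 days
  // per month; sub-day remainders are ignored for brevity.
  def avg(intervals: Seq[Interval]): Interval = {
    require(intervals.nonEmpty, "avg of no intervals is undefined here")
    val n = intervals.size
    val sum = intervals.reduce(add)
    val carriedDays = (sum.months % n) * DaysPerMonth / n
    Interval(sum.months / n, sum.days / n + carriedDays, sum.microseconds / n)
  }
}
```

Averaging `Interval(1, 0, 0)` and `Interval(2, 0, 0)` yields `Interval(1, 15, 0)`, the `1 month 15 days` result mentioned above. Adding `Interval(0, Int.MaxValue, 0)` and `Interval(0, 1, 0)` fails on the days field alone while the months field is still far from its limit, since there is no wider interval type to hold the sum.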
What changes were proposed in this pull request?
Make the avg aggregate support interval type values.
Why are the changes needed?
Part of SPARK-27764 Feature Parity between PostgreSQL and Spark
Does this PR introduce any user-facing change?
yes, we can do avg on intervals
How was this patch tested?
Added unit tests.