[SPARK-31899][SQL] Forbid datetime pattern letter u #28719

cloud-fan · 2020-06-03T18:59:08Z

What changes were proposed in this pull request?

This fixes the silent change in results between Spark 2.4 and 3.0 reported in #28692.

Since we can't find a way to simulate the behavior that pattern u has in the legacy formatter API, this PR proposes to forbid u; users should use e or E instead, according to their needs. That way the difference at least surfaces as an explicit error rather than a silent change in results.
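As a rough illustration of "forbid u", a pattern can be scanned for the letter outside quoted literals and rejected up front. This is only a sketch: the class `PatternLetterCheck` and the method `rejectLetters` are hypothetical names invented here, and this is not Spark's actual implementation.

```java
final class PatternLetterCheck {
    /**
     * Hypothetical helper (not Spark's code): throws if any character of
     * `forbidden` appears in `pattern` outside single-quoted literal text.
     */
    static void rejectLetters(String pattern, String forbidden) {
        boolean inQuote = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\'') {
                // An escaped quote ('') toggles twice, so the state is unchanged.
                inQuote = !inQuote;
            } else if (!inQuote && forbidden.indexOf(c) >= 0) {
                throw new IllegalArgumentException(
                    "Illegal pattern character: " + c);
            }
        }
    }

    public static void main(String[] args) {
        rejectLetters("yyyy-MM-dd", "u");       // accepted
        rejectLetters("yyyy 'u' MM", "u");      // accepted: u is quoted text
        try {
            rejectLetters("yyyy-MM-dd u", "u"); // rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at validation time is what turns the behavior difference into an explicit error instead of a different result.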

Why are the changes needed?

To avoid silent changes in results in Spark 3.0.

Does this PR introduce any user-facing change?

Yes. Queries now fail if u appears in the datetime pattern.

How was this patch tested?

updated test

cloud-fan · 2020-06-03T19:01:04Z

cc @yaooqinn @xuanyuanking @MaxGekk @srowen

SparkQA · 2020-06-04T00:14:49Z

Test build #123496 has finished for PR 28719 at commit a68288f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yaooqinn · 2020-06-04T02:08:48Z

After all, we are trying our best. This seems to be the best choice left.

Notice that only e and ee are localized; eee and eeee are not. We should also disable the narrow text style eeeee, as we do for other letters.

yaooqinn · 2020-06-04T03:46:49Z

Hmm... Bad news: I am afraid I just realized that the concept of THE FIRST DAY OF WEEK is not only about the letter 'u' itself!!!

It affects all week-based patterns.

For example, the date 2019-12-29 (a Sunday) belongs to week-based-year 2020 in a Sunday-start system, but to 2019 in a Monday-start system. The week-of-week-based-year (w) is affected too.

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07

Don't pay too much attention to the CSV function; I only use it to mimic a change of the default locale, which changes the rule for THE FIRST DAY OF WEEK.
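The same effect can be reproduced outside Spark with java.time alone. The sketch below assumes that `WeekFields.SUNDAY_START` and `WeekFields.ISO` are acceptable stand-ins for the en-US and en-GB week rules used in the queries above; the class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.WeekFields;
import java.util.Locale;

final class WeekBasedFieldsDemo {
    /** Formats week-based-year, week-of-week-based-year and localized day-of-week. */
    static String yearWeekDay(LocalDate date, WeekFields rule) {
        int year = date.get(rule.weekBasedYear());
        int week = date.get(rule.weekOfWeekBasedYear());
        int day = date.get(rule.dayOfWeek());
        return String.format(Locale.ROOT, "%04d-%02d-%02d", year, week, day);
    }

    public static void main(String[] args) {
        LocalDate[] dates = {LocalDate.of(2019, 12, 29), LocalDate.of(2020, 1, 5)};
        for (LocalDate d : dates) {
            // Sunday-start weeks, first week needs only 1 day in the new year.
            System.out.println(d + " Sunday start: " + yearWeekDay(d, WeekFields.SUNDAY_START));
            // ISO-8601: Monday-start weeks, first week needs at least 4 days.
            System.out.println(d + " ISO:          " + yearWeekDay(d, WeekFields.ISO));
        }
    }
}
```

Both dates are Sundays, so the week rule alone decides which week, and in the first case which week-based-year, they fall into.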

cloud-fan · 2020-06-04T11:57:31Z

This is bad news, and hopefully #28727 can fix it.

Anyway, this PR is still necessary. The new letter e is different from the legacy letter u because it becomes text when there are more than 2 letters.
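To make the difference concrete, here is a small comparison of the legacy letter u (`java.text.SimpleDateFormat`) with the new letter e (`java.time.format.DateTimeFormatter`) for a Sunday. It is a sketch using only the JDK formatters, not Spark's formatter wrappers; the helper names are invented for illustration and `Locale.US` is assumed throughout.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

final class DayOfWeekLetters {
    /** Formats the given date with the legacy SimpleDateFormat. */
    static String legacy(String pattern, int year, int month, int day) {
        // Calendar months are zero-based; Calendar.JANUARY is 0.
        Date date = new GregorianCalendar(year, month - 1 + Calendar.JANUARY, day).getTime();
        return new SimpleDateFormat(pattern, Locale.US).format(date);
    }

    /** Formats the given date with the java.time DateTimeFormatter. */
    static String modern(String pattern, int year, int month, int day) {
        return DateTimeFormatter.ofPattern(pattern, Locale.US)
            .format(LocalDate.of(year, month, day));
    }

    public static void main(String[] args) {
        // 2019-12-29 is a Sunday.
        for (String p : new String[] {"u", "uu", "uuu"}) {
            System.out.println("legacy " + p + " -> " + legacy(p, 2019, 12, 29));
        }
        for (String p : new String[] {"e", "ee", "eee", "eeee"}) {
            System.out.println("new    " + p + " -> " + modern(p, 2019, 12, 29));
        }
    }
}
```

The legacy u is always a number (Monday = 1, padded to the letter count), while e is a locale-dependent number for one or two letters and text from three letters on, so e cannot simply be substituted for u in every pattern.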

SparkQA · 2020-06-04T14:33:51Z

Test build #123531 has finished for PR 28719 at commit 1b780c9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-04T14:48:08Z

Test build #123532 has finished for PR 28719 at commit 043d4b7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

…ormatting too

### What changes were proposed in this pull request?

After all these attempts (#28692, #28719 and #28727), each has limitations, as mentioned in their discussions. Maybe the only way is to forbid them all.

### Why are the changes needed?

These week-based fields need a Locale to express their semantics, because the first day of the week varies from country to country.

From the Java doc of WeekFields:

```java
/**
 * Gets the first day-of-week.
 * <p>
 * The first day-of-week varies by culture.
 * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
 * This method returns the first day using the standard {code DayOfWeek} enum.
 *
 * return the first day-of-week, not null
 */
public DayOfWeek getFirstDayOfWeek() {
    return firstDayOfWeek;
}
```

But for SimpleDateFormat, the day-of-week is not localized:

```
u    Day number of week (1 = Monday, ..., 7 = Sunday)    Number    1
```

Currently, the default locale we use is the US, so the result moves backward by a day, a week, or a year.

For example, the date `2019-12-29` (a Sunday) belongs to week-based-year 2020 in a Sunday-start system (e.g. en-US), but to 2019 in a Monday-start system (en-GB). The week-of-week-based-year (w) is affected too.

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?

With this change, users can no longer use the letters 'Y', 'w', 'u' or 'W'; 'e' can be used in place of 'u'. This at least prevents a silent data change.

### How was this patch tested?

Added unit tests.

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao
Signed-off-by: Wenchen Fan

…ormatting too

(cherry picked from commit 9d5b5d0)
Signed-off-by: Wenchen Fan

probot-autolabeler bot added the SQL label Jun 3, 2020

HyukjinKwon changed the title ~~[SPARK-31899][SQL] forbid datetime pattern letter u~~ [SPARK-31899][SQL] Forbid datetime pattern letter u Jun 4, 2020

This was referenced Jun 4, 2020

[SPARK-31879][SQL] Using GB as default Locale for datetime formatters #28692

Closed

[SPARK-31879][SQL][test-java11] Preset monday start(iso) for the first day of week by reflection #28727

Closed

cloud-fan force-pushed the u branch from a68288f to 1b780c9 Compare June 4, 2020 11:55

forbid pattern letter u

043d4b7

cloud-fan force-pushed the u branch from 1b780c9 to 043d4b7 Compare June 4, 2020 12:04

probot-autolabeler bot added the DOCS label Jun 4, 2020

yaooqinn mentioned this pull request Jun 4, 2020

[SPARK-31879][SQL][test-java11] Make week-based pattern invalid for formatting too #28728

Closed

cloud-fan closed this Jun 11, 2020
