Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is to fix the silent result changing between Spark 2.4 and 3.0 reported by #28692

Since we can't find a way to simulate the behavior of pattern u in the legacy formatter API, this PR proposes to forbid u, and users should use e or E instead, according to their needs. Then at least it's an explicit error instead of a silent result changing.

Why are the changes needed?

To avoid silent result changing in Spark 3.0.

Does this PR introduce any user-facing change?

Yes, now query will fail if u exists in the datetime pattern.

How was this patch tested?

updated test

@cloud-fan
Copy link
Contributor Author

@HyukjinKwon HyukjinKwon changed the title [SPARK-31899][SQL] forbid datetime pattern letter u [SPARK-31899][SQL] Forbid datetime pattern letter u Jun 4, 2020
@SparkQA
Copy link

SparkQA commented Jun 4, 2020

Test build #123496 has finished for PR 28719 at commit a68288f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member

yaooqinn commented Jun 4, 2020

After all, we are trying our best. This seems to be the best choice left.

Notice that only e and ee are localized, and e*3and e*4 are not. And we should disable its narrow form text style e*5 too like others

@yaooqinn
Copy link
Member

yaooqinn commented Jun 4, 2020

Hmm... Bad news, I am afraid I just the concept of THE FIRST DAY OF WEEK is not only about the letter 'u' itself!!!

It affects all week-based patterns.

i.e., For the date 2019-12-29(Sunday), in the Sunday Start system, it belongs to 2020 of week-based-year, in the Monday Start system, it goes to 2019. the week-of-week-based-year(w) will be affected too.

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07

Don't pay too much attention to the CSV function, I just use it to mock the default locale changing which changes the rule of THE FIRST DAY OF WEEK.

@cloud-fan
Copy link
Contributor Author

This is bad news, and hopefully #28727 can fix it.

Anyway, this PR is still necessary. The new letter e is different from the legacy letter u because it becomes text when there are more than 2 letters.

@SparkQA
Copy link

SparkQA commented Jun 4, 2020

Test build #123531 has finished for PR 28719 at commit 1b780c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 4, 2020

Test build #123532 has finished for PR 28719 at commit 043d4b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Jun 5, 2020
…ormatting too

# What changes were proposed in this pull request?

After all these attempts #28692 and #28719 an #28727.
they all have limitations as mentioned in their discussions.

Maybe the only way is to forbid them all

### Why are the changes needed?

These week-based fields need Locale to express their semantics, the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {code DayOfWeek} enum.
     *
     * return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result moved a day or a year or a week backward.

e.g.

For the date `2019-12-29(Sunday)`, in the Sunday Start system(e.g. en-US), it belongs to 2020 of week-based-year, in the Monday Start system(en-GB), it goes to 2019. the week-of-week-based-year(w) will be affected too

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?
With this change, user can not use 'YwuW',  but 'e' for 'u' instead. This can at least turn this not to be a silent data change.

### How was this patch tested?

add unit tests

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jun 5, 2020
…ormatting too

After all these attempts apache#28692 and apache#28719 an apache#28727.
they all have limitations as mentioned in their discussions.

Maybe the only way is to forbid them all

These week-based fields need Locale to express their semantics, the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {code DayOfWeek} enum.
     *
     * return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result moved a day or a year or a week backward.

e.g.

For the date `2019-12-29(Sunday)`, in the Sunday Start system(e.g. en-US), it belongs to 2020 of week-based-year, in the Monday Start system(en-GB), it goes to 2019. the week-of-week-based-year(w) will be affected too

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

With this change, user can not use 'YwuW',  but 'e' for 'u' instead. This can at least turn this not to be a silent data change.

add unit tests

Closes apache#28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 9d5b5d0)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan cloud-fan closed this Jun 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants