[SPARK-30808][SQL] Enable Java 8 time API in Thrift server #27552
Conversation
@cloud-fan I tried to set the SQL config before the collect action in sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (lines 54 to 59 in aa0d136).
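A rough, hypothetical sketch of that attempt (the referenced HiveResult.scala lines are not reproduced in this transcript): enable the Java 8 time API before collecting and restore the previous config value afterwards.

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.internal.SQLConf

// Hypothetical sketch, not the actual HiveResult.scala code: flip the config on
// before collecting, in the hope that the collected rows carry java.time values,
// then restore the previous value.
def collectWithJava8Api(executedPlan: SparkPlan): Seq[Seq[Any]] = {
  val conf = SQLConf.get
  val previous = conf.getConf(SQLConf.DATETIME_JAVA8API_ENABLED)
  conf.setConf(SQLConf.DATETIME_JAVA8API_ENABLED, true)
  try {
    executedPlan.executeCollectPublic().map(_.toSeq).toSeq
  } finally {
    conf.setConf(SQLConf.DATETIME_JAVA8API_ENABLED, previous)
  }
}
```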
Test build #118319 has finished for PR 27552 at commit
Test build #118347 has finished for PR 27552 at commit
It's probably because the …
Test build #118352 has finished for PR 27552 at commit
@cloud-fan Is it possible to create/restore a Dataset from an executedPlan?
@MaxGekk it's possible from a logical plan, e.g. |
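The inline example isn't preserved in this transcript, but a hedged sketch of rebuilding a DataFrame from an existing Dataset's logical plan (as in the debugging snippet further down) could look like this:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Rebuild a DataFrame from the logical plan of an existing Dataset.
// Note: Dataset.ofRows is private[sql], so this only compiles inside the
// org.apache.spark.sql package (as in the Spark sources touched by this PR).
def recreate(ds: Dataset[_]): DataFrame =
  Dataset.ofRows(ds.sparkSession, ds.queryExecution.logical)
```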
Merge into hive-thriftserver-java8-time-api. Conflicts:
- sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
- sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala
- sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala
Test build #118509 has finished for PR 27552 at commit
Test build #118508 has finished for PR 27552 at commit
Test build #118510 has finished for PR 27552 at commit
jenkins, retest this, please
Test build #118511 has finished for PR 27552 at commit
Test build #118549 has finished for PR 27552 at commit
Test build #118552 has finished for PR 27552 at commit
So many HiveComparisonTest-related tests failed; I will revert cdb322d.
This reverts commit cdb322d.
Test build #118569 has finished for PR 27552 at commit
@cloud-fan Something wrong is going on here. Commands issued from HiveComparisonTest are executed twice, it seems.
@MaxGekk yea, creating the df again may execute the command again. Let's keep the lazy val.
I would prefer to cut off this PR at this point #27552 (comment), and implement moving settings of … Making the dataset a lazy val doesn't help me, so I'm stuck for now.
While debugging the …, I tried:
```scala
val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.queryExecution.logical)
  .queryExecution
  .executedPlan
  .executeCollectPublic().map(_.toSeq).toSeq
```
This causes side effects here: …
@MaxGekk I see the problem now. We should use …
Test build #118579 has finished for PR 27552 at commit
thanks, merging to master/3.0!
### What changes were proposed in this pull request?
- Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call.
- Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting (sketched below).
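A minimal sketch of the conversion in the second bullet (illustrative only; the helper name is not from the PR):

```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Map collected Java 8 date-time values back to the legacy types that the
// Hive-based string formatting accepts; everything else passes through.
def toHiveCompatible(value: Any): Any = value match {
  case i: Instant    => Timestamp.from(i)
  case ld: LocalDate => Date.valueOf(ld)
  case other         => other
}
```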
### Why are the changes needed?
Because the textual representation of timestamps/dates before the year 1582 is incorrect:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
It must be 1001-01-01 00:**00:00**.
### Does this PR introduce any user-facing change?
Yes. After the changes:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```
### How was this patch tested?
By running hive-thriftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```
Closes #27552 from MaxGekk/hive-thriftserver-java8-time-api.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit afaeb29)
Signed-off-by: Wenchen Fan <[email protected]>
```scala
sessionWithJava8DatetimeEnabled.withActive {
  // We cannot collect the original dataset because its encoders could be created
  // with disabled Java 8 date-time API.
  val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.logicalPlan)
```
Found a problem: `Dataset.ofRows` will set the input session as active, so we should write `Dataset.ofRows(sessionWithJava8DatetimeEnabled, ...)` and remove the outer `sessionWithJava8DatetimeEnabled.withActive`.
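A sketch of the suggested change, reusing the names from the diff above:

```scala
// Pass the cloned session directly: Dataset.ofRows makes it the active session,
// so the outer sessionWithJava8DatetimeEnabled.withActive { ... } wrapper can go.
val result: Seq[Seq[Any]] =
  Dataset.ofRows(sessionWithJava8DatetimeEnabled, ds.logicalPlan)
    .queryExecution
    .executedPlan
    .executeCollectPublic().map(_.toSeq).toSeq
```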
```diff
   // Get answer, but also get rid of the #1234 expression ids that show up in explain plans
   val answer = SQLExecution.withNewExecutionId(df.queryExecution, Some(sql)) {
-    hiveResultString(df.queryExecution.executedPlan).map(replaceNotIncludedMsg)
+    hiveResultString(df).map(replaceNotIncludedMsg)
```
In ThriftServerQueryTestSuite, we get the result by JDBC, so there is no DataFrame created.
We should follow pgsql and return Java 8 datetime values when the config is enabled: https://jdbc.postgresql.org/documentation/head/8-date-time.html
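For reference, a hedged illustration of the pgjdbc convention linked above: with a JDBC 4.2 driver, callers request java.time values explicitly through `getObject(column, class)` (the column names here are hypothetical):

```scala
import java.sql.ResultSet
import java.time.{LocalDate, OffsetDateTime}

// pgjdbc maps `date` to LocalDate and `timestamp with time zone` to
// OffsetDateTime when they are requested explicitly via the JDBC 4.2 API.
def readJava8Datetime(rs: ResultSet): (LocalDate, OffsetDateTime) = (
  rs.getObject("event_date", classOf[LocalDate]),
  rs.getObject("event_ts", classOf[OffsetDateTime])
)
```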
```scala
// Convert date-time instances to types that are acceptable by Hive libs
// used in conversions to strings.
val resultRow = row.map {
  case i: Instant => Timestamp.from(i)
```
There seem to be no java8 datetime values added to the row buffer here by SparkExecuteStatementOperation#addNonNullColumnValue:
https://github.com/apache/spark/pull/27552/files#diff-72dcd8f81a51c8a815159fdf0332acdcR84-R116
can you help fix it? I think we should output java8 datetime values if the config is enabled.
We are limited by the hive-jdbc module (see https://github.com/apache/hive/blob/a7e704c679a00db68db9b9f921d133d79a32cfcc/jdbc/src/java/org/apache/hive/jdbc/HiveBaseResultSet.java#L427-L457); we might need our own JDBC driver implementation to achieve this.
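Roughly what the linked hive-jdbc code does (a paraphrase, not the exact source): the server ships date/timestamp columns as strings, and the client-side result set rebuilds legacy `java.sql` objects from them, which is why Java 8 types cannot pass through unchanged.

```scala
import java.sql.{Date, Timestamp}

// Paraphrased client-side reconstruction in HiveBaseResultSet: string payloads
// are turned back into legacy java.sql values, never java.time ones.
def rebuildColumnValue(hiveType: String, raw: String): Any = hiveType match {
  case "date"      => Date.valueOf(raw)
  case "timestamp" => Timestamp.valueOf(raw)
  case _           => raw
}
```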
```scala
case _ =>
  val sessionWithJava8DatetimeEnabled = {
    val cloned = ds.sparkSession.cloneSession()
    cloned.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)
```
why is this always true?
The old Date/Timestamp doesn't follow the new calendar and may produce a wrong string for some date/timestamp values.
oh wait, we format Date/Timestamp by our own formatter, so this should be no problem.
```
-- !query
set spark.sql.datetime.java8API.enabled
-- !query schema
struct<key:string,value:string>
-- !query output
spark.sql.datetime.java8API.enabled false

-- !query
set set spark.sql.session.timeZone=America/Los_Angeles
-- !query schema
struct<key:string,value:string>
-- !query output
set spark.sql.session.timeZone America/Los_Angeles

-- !query
SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20')
-- !query schema
struct<date_trunc(MILLENNIUM, CAST(DATE '1970-03-20' AS TIMESTAMP)):timestamp>
-- !query output
1001-01-01 00:00:00
```
I removed this line and ran SQLQueryTestSuite with the cases above; the results are the same. Or does this problem only exist for the spark-sql script?
> Or does this problem only exist for the spark-sql script?

Only when thrift-server is involved in the loop.
I also passed these tests through ThriftServerQueryTestSuite.
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
spark-sql> select version();
3.1.0 b3dcb63a682bc31827a86cf381f157a81e9e07ac
```
Also correct with bin/spark-sql.
Yea, I tested it too and it looks fine. Maybe some refactoring of how we format the old Date/Timestamp fixed it already.
@yaooqinn can you send a PR to revert it? Let's see if all tests pass.
OK
This reverts commit afaeb29.
### What changes were proposed in this pull request?
Based on the result and comment from #27552 (comment): in the hive module, the server side provides datetime values simply using `value.toString`, and the client side regenerates the results in `HiveBaseResultSet` with `java.sql.Date(Timestamp).valueOf`. There will be an inconsistency between client and server if we use Java 8 APIs.
### Why are the changes needed?
The change is still not clear enough.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Nah
Closes #27733 from yaooqinn/SPARK-30808.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>