Conversation
@juliuszsompolski juliuszsompolski commented May 28, 2020

What changes were proposed in this pull request?

Timestamp literals in Spark are interpreted as timestamps in the local timezone given by spark.sql.session.timeZone.

If a JDBC client is, for example, in timezone UTC-7, sets spark.sql.session.timeZone to PST, and sends the query "SELECT timestamp '2020-05-20 12:00:00'", while the JVM timezone of the Spark cluster is, for example, UTC+2, then what currently happens is:

  • The timestamp literal in the query is interpreted as 12:00:00 UTC-7, i.e. 19:00:00 UTC.
  • When it is returned from the query, it is collected as a java.sql.Timestamp object with Dataset.collect() and put into a Thriftserver RowSet.
  • Before it is sent over the wire, the Timestamp is converted to a String. This happens explicitly in ColumnValue for RowBasedSet, and implicitly in ColumnBuffer for ColumnBasedSet (all non-primitive types are converted with toString() there). The toString() conversion uses the JVM timezone, which results in a "21:00:00" (UTC+2) string representation.
  • The client JDBC application parses this and gets a "21:00:00" Timestamp back (in its JVM timezone; if the JDBC application cares about the correct UTC internal value, it should set spark.sql.session.timeZone consistently with its JVM timezone).

The problem is caused by the conversion happening in the Thriftserver RowSet with the generic toString() function, instead of using HiveResult.toHiveString(), which takes care of correct, timezone-respecting conversions. This PR fixes it by converting the Timestamp values to String earlier, in SparkExecuteStatementOperation, using that function. This fixes SPARK-31861.
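To make the difference concrete, here is a minimal standalone Scala sketch (not code from this PR) showing the same instant rendered via the generic toString() versus a formatter pinned to the session timezone; the zone id "America/Los_Angeles" and the formatter pattern are illustrative assumptions:

```scala
import java.sql.Timestamp
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object TimestampRenderingSketch {
  def main(args: Array[String]): Unit = {
    // The literal "2020-05-20 12:00:00" interpreted in the session timezone
    // (PST, UTC-7) corresponds to this UTC instant.
    val instant = Instant.parse("2020-05-20T19:00:00Z")
    val ts = Timestamp.from(instant)

    // Generic conversion used by the RowSet: renders in the JVM default timezone,
    // e.g. "2020-05-20 21:00:00.0" when the cluster JVM runs in UTC+2.
    println(ts.toString)

    // Timezone-respecting rendering in the spirit of HiveResult.toHiveString:
    // format the same instant in the session timezone, giving "2020-05-20 12:00:00".
    val sessionZone = ZoneId.of("America/Los_Angeles")
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(sessionZone)
    println(fmt.format(instant))
  }
}
```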

Thriftserver also did not work with spark.sql.datetime.java8API.enabled, because the conversions in RowSet expected a Timestamp object instead of an Instant object. Using HiveResult.toHiveString() also fixes that. For the same reason, we convert Date values in SparkExecuteStatementOperation as well, so that HiveResult.toHiveString() also handles LocalDate. This fixes SPARK-31859.
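For reference, a small hedged example of what the java8API flag changes on the collect path (assuming an already-created SparkSession named `spark`; the literal values are arbitrary):

```scala
// With spark.sql.datetime.java8API.enabled=true, collected datetime values come
// back as java.time types instead of java.sql types, which is what the old
// RowSet conversions could not handle.
spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
val row = spark.sql("SELECT timestamp '2020-05-20 12:00:00' AS ts, date '2020-05-20' AS d").head()
row.get(0)  // java.time.Instant   (java.sql.Timestamp when the flag is false)
row.get(1)  // java.time.LocalDate (java.sql.Date when the flag is false)
```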

Thriftserver also did not correctly set the active SparkSession. Because of that, configuration obtained using SQLConf.get was not the correct session configuration; in particular, the correct spark.sql.session.timeZone could not be picked up. This is fixed by extending the use of SparkExecuteStatementOperation.withSchedulerPool to also set the correct active SparkSession. With the correct session set, we also no longer need to maintain the pool mapping in a sessionToActivePool map; the scheduler pool can simply be retrieved correctly from the session config. "withSchedulerPool" is renamed to "withLocalProperties" and moved into a mixin helper trait, because it should be applied to every operation. This fixes SPARK-31863.
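A rough sketch of what such a mixin can look like is below. It is not the PR's actual trait; the trait name and the use of SparkSession directly are assumptions, while "spark.sql.thriftserver.scheduler.pool" and the "spark.scheduler.pool" local property are the existing Spark names.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the mixin idea described above (names and structure are illustrative).
trait WithLocalProperties {
  protected def session: SparkSession

  protected def withLocalProperties[T](body: => T): T = {
    val previous = SparkSession.getActiveSession
    // Make SQLConf.get (and hence spark.sql.session.timeZone) resolve against
    // this operation's session configuration.
    SparkSession.setActiveSession(session)
    val pool = session.conf.getOption("spark.sql.thriftserver.scheduler.pool")
    pool.foreach(p => session.sparkContext.setLocalProperty("spark.scheduler.pool", p))
    try body
    finally {
      pool.foreach(_ => session.sparkContext.setLocalProperty("spark.scheduler.pool", null))
      previous match {
        case Some(s) => SparkSession.setActiveSession(s)
        case None => SparkSession.clearActiveSession()
      }
    }
  }
}
```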

I used the opportunity to move some repetitive code from the operations to the mixin helper trait.


@SparkQA

SparkQA commented May 28, 2020

Test build #123253 has finished for PR 28671 at commit a2b9114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski juliuszsompolski left a comment

Comment on lines +301 to +304
public void close() throws HiveSQLException {
  setState(OperationState.CLOSED);
  cleanupOperationLog();
}

I needed to add this concrete base implementation here; otherwise the abstract override of close() in the trait didn't work in SparkExecuteStatementOperation, because close() didn't have a concrete implementation in any of the subclasses. On the other hand, if I tried to implement it concretely in the trait, the compiler complained about not being able to call protected methods of a Java class from the trait.

This base implementation does what all of our operations were already doing in their close(). It may also have been another tiny bug that SparkExecuteStatementOperation did not call cleanupOperationLog; the other operations did, via their super call.
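For context, a minimal self-contained Scala sketch (toy class names, not Spark's or Hive's hierarchy) of why the abstract override needs a concrete close() in the class it is stacked on:

```scala
// Toy illustration: an "abstract override" in a trait must chain to a concrete
// method via super, so the class it is mixed into needs a concrete
// implementation somewhere in its hierarchy.
abstract class BaseOperation {
  def close(): Unit // without a concrete implementation below, the stacked trait cannot be used
}

class ConcreteOperation extends BaseOperation {
  override def close(): Unit = {
    // concrete base behaviour, analogous to setState(CLOSED) + cleanupOperationLog()
    println("base close")
  }
}

trait CloseWrapper extends BaseOperation {
  abstract override def close(): Unit = {
    println("wrapper before close")
    super.close() // resolves to ConcreteOperation.close() at mix-in time
  }
}

object StackableTraitSketch {
  def main(args: Array[String]): Unit = {
    val op = new ConcreteOperation with CloseWrapper
    op.close()
    // "new BaseOperation with CloseWrapper" would not compile: close() has no
    // concrete implementation for super.close() to reach.
  }
}
```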


protected def sqlContext: SQLContext

protected var statementId = getHandle().getHandleIdentifier().getPublicId().toString()

Instead of creating a new UUID, I use the one that is already created for the operation.
I found it confusing when debugging that there were two UUIDs used for operations: the Spark UI and some logs from Spark classes used the ones we created, while log lines from the imported Hive code used this one, and the two couldn't be linked together. Now they will be the same id, which should make debugging easier.

@SparkQA

SparkQA commented May 29, 2020

Test build #123298 has finished for PR 28671 at commit 29bd36f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2020

Test build #123300 has finished for PR 28671 at commit 988a29d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

nice fixes for the thriftserver! thanks, merging to master/3.0

@cloud-fan cloud-fan closed this in af35691 May 30, 2020
cloud-fan pushed a commit that referenced this pull request May 30, 2020
…zone issues

Closes #28671 from juliuszsompolski/SPARK-31861.

Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit af35691)
Signed-off-by: Wenchen Fan <[email protected]>
Comment on lines +816 to +817
withJdbcStatement() { statement =>
  withJdbcStatement() { st =>
Member

@juliuszsompolski Did you use this nested st intentionally? Looks like statement is not used.

Contributor Author

This was not intentional; it looks like a copy-paste error on my part. I raised #28735 to fix it.

HyukjinKwon pushed a commit that referenced this pull request Jul 8, 2020
…ncel and close should not transiently ERROR

### What changes were proposed in this pull request?
#28671 changed the order in which the CANCELED state is set for SparkExecuteStatementOperation. Before the state was set to CANCELED, `cleanup()` was called, which kills the jobs and causes an exception to be thrown inside `execute()`. This makes the state transiently become ERROR before it is set to CANCELED. This PR fixes the order.

### Why are the changes needed?
Bug: wrong operation state is set.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test in SparkExecuteStatementOperationSuite.scala.

Closes #28912 from alismess-db/execute-statement-operation-cleanup-order.

Authored-by: Ali Smesseim <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 8, 2020
…ncel and close should not transiently ERROR

Authored-by: Ali Smesseim <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 8b0a54e)
Signed-off-by: HyukjinKwon <[email protected]>