Conversation
@juliuszsompolski juliuszsompolski commented May 28, 2020

What changes were proposed in this pull request?

Timestamp literals in Spark are interpreted as timestamps in the local timezone given by spark.sql.session.timeZone.

If a JDBC client is, for example, in timezone UTC-7, sets spark.sql.session.timeZone to PST, and sends the query "SELECT timestamp '2020-05-20 12:00:00'", while the JVM timezone of the Spark cluster is, for example, UTC+2, then what currently happens is:

  • The timestamp literal in the query is interpreted as 12:00:00 UTC-7, i.e. 19:00:00 UTC.
  • When it is returned from the query, it is collected as a java.sql.Timestamp object with Dataset.collect() and put into a Thriftserver RowSet.
  • Before it is sent over the wire, the Timestamp is converted to a String. This happens explicitly in ColumnValue for RowBasedSet, and implicitly in ColumnBuffer for ColumnBasedSet (all non-primitive types are converted with toString() there). The toString() conversion uses the JVM timezone, which results in a "21:00:00" (UTC+2) string representation.
  • The client JDBC application parses this and gets a "21:00:00" Timestamp back (in its JVM timezone; if the JDBC application cares about the correct UTC internal value, it should set spark.sql.session.timeZone consistently with its JVM timezone).

The problem is caused by the conversion happening in the Thriftserver RowSet with the generic toString() function, instead of using HiveResult.toHiveString(), which takes care of correct, timezone-respecting conversions. This PR fixes it by converting the Timestamp values to String earlier, in SparkExecuteStatementOperation, using that function. This fixes SPARK-31861.
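To make the difference concrete, here is a minimal standalone Scala sketch (not code from this PR) showing the same instant rendered via the generic toString() versus a formatter pinned to the session timezone; the zone id "America/Los_Angeles" and the formatter pattern are illustrative assumptions:

```scala
import java.sql.Timestamp
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object TimestampRenderingSketch {
  def main(args: Array[String]): Unit = {
    // The literal "2020-05-20 12:00:00" interpreted in the session timezone
    // (PST, UTC-7) corresponds to this UTC instant.
    val instant = Instant.parse("2020-05-20T19:00:00Z")
    val ts = Timestamp.from(instant)

    // Generic conversion used by the RowSet: renders in the JVM default timezone,
    // e.g. "2020-05-20 21:00:00.0" when the cluster JVM runs in UTC+2.
    println(ts.toString)

    // Timezone-respecting rendering in the spirit of HiveResult.toHiveString:
    // format the same instant in the session timezone, giving "2020-05-20 12:00:00".
    val sessionZone = ZoneId.of("America/Los_Angeles")
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(sessionZone)
    println(fmt.format(instant))
  }
}
```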

Thriftserver also did not work with spark.sql.datetime.java8API.enabled, because the conversions in RowSet expected a Timestamp object instead of an Instant object. Using HiveResult.toHiveString() also fixes that. For the same reason, we convert Date values in SparkExecuteStatementOperation as well, so that HiveResult.toHiveString() also handles LocalDate. This fixes SPARK-31859.
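For reference, a small hedged example of what the java8API flag changes on the collect path (assuming an already-created SparkSession named `spark`; the literal values are arbitrary):

```scala
// With spark.sql.datetime.java8API.enabled=true, collected datetime values come
// back as java.time types instead of java.sql types, which is what the old
// RowSet conversions could not handle.
spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
val row = spark.sql("SELECT timestamp '2020-05-20 12:00:00' AS ts, date '2020-05-20' AS d").head()
row.get(0)  // java.time.Instant   (java.sql.Timestamp when the flag is false)
row.get(1)  // java.time.LocalDate (java.sql.Date when the flag is false)
```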

Thriftserver also did not correctly set the active SparkSession. Because of that, configuration obtained using SQLConf.get was not the correct session configuration; in particular, the correct spark.sql.session.timeZone could not be picked up. This is fixed by extending the use of SparkExecuteStatementOperation.withSchedulerPool to also set the correct active SparkSession. With the correct session set, we also no longer need to maintain the pool mapping in a sessionToActivePool map; the scheduler pool can simply be retrieved correctly from the session config. "withSchedulerPool" is renamed to "withLocalProperties" and moved into a mixin helper trait, because it should be applied to every operation. This fixes SPARK-31863.
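A rough sketch of what such a mixin can look like is below. It is not the PR's actual trait; the trait name and the use of SparkSession directly are assumptions, while "spark.sql.thriftserver.scheduler.pool" and the "spark.scheduler.pool" local property are the existing Spark names.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the mixin idea described above (names and structure are illustrative).
trait WithLocalProperties {
  protected def session: SparkSession

  protected def withLocalProperties[T](body: => T): T = {
    val previous = SparkSession.getActiveSession
    // Make SQLConf.get (and hence spark.sql.session.timeZone) resolve against
    // this operation's session configuration.
    SparkSession.setActiveSession(session)
    val pool = session.conf.getOption("spark.sql.thriftserver.scheduler.pool")
    pool.foreach(p => session.sparkContext.setLocalProperty("spark.scheduler.pool", p))
    try body
    finally {
      pool.foreach(_ => session.sparkContext.setLocalProperty("spark.scheduler.pool", null))
      previous match {
        case Some(s) => SparkSession.setActiveSession(s)
        case None => SparkSession.clearActiveSession()
      }
    }
  }
}
```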

I used the opportunity to move some repetitive code from the operations to the mixin helper trait.


@SparkQA

SparkQA commented May 28, 2020

Test build #123253 has finished for PR 28671 at commit a2b9114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski juliuszsompolski left a comment

Comment on lines +301 to +304
public void close() throws HiveSQLException {
  setState(OperationState.CLOSED);
  cleanupOperationLog();
}

I needed to add this concrete base implementation here; otherwise the abstract override of close() in the trait didn't work in SparkExecuteStatementOperation, because close() didn't have a concrete implementation in any of the subclasses. On the other hand, if I tried to implement it concretely in the trait, the compiler complained about not being able to call protected methods of a Java class from the trait.

This base implementation does what all of our operations were already doing in their close(). It may also have been another tiny bug that SparkExecuteStatementOperation did not call cleanupOperationLog; the other operations did, via their super call.
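For context, a minimal self-contained Scala sketch (toy class names, not Spark's or Hive's hierarchy) of why the abstract override needs a concrete close() in the class it is stacked on:

```scala
// Toy illustration: an "abstract override" in a trait must chain to a concrete
// method via super, so the class it is mixed into needs a concrete
// implementation somewhere in its hierarchy.
abstract class BaseOperation {
  def close(): Unit // without a concrete implementation below, the stacked trait cannot be used
}

class ConcreteOperation extends BaseOperation {
  override def close(): Unit = {
    // concrete base behaviour, analogous to setState(CLOSED) + cleanupOperationLog()
    println("base close")
  }
}

trait CloseWrapper extends BaseOperation {
  abstract override def close(): Unit = {
    println("wrapper before close")
    super.close() // resolves to ConcreteOperation.close() at mix-in time
  }
}

object StackableTraitSketch {
  def main(args: Array[String]): Unit = {
    val op = new ConcreteOperation with CloseWrapper
    op.close()
    // "new BaseOperation with CloseWrapper" would not compile: close() has no
    // concrete implementation for super.close() to reach.
  }
}
```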


protected def sqlContext: SQLContext

protected var statementId = getHandle().getHandleIdentifier().getPublicId().toString()

Instead of creating a new UUID, I use the one that is already created for the operation.
I found it confusing when debugging that there were two UUIDs used for operations: the Spark UI and some logs from Spark classes used the ones we created, while log lines from the imported Hive code used this one, and the two couldn't be linked together. Now they will be the same id, which should make debugging easier.

@SparkQA

SparkQA commented May 29, 2020

Test build #123298 has finished for PR 28671 at commit 29bd36f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 29, 2020

Test build #123300 has finished for PR 28671 at commit 988a29d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

nice fixes for the thriftserver! thanks, merging to master/3.0

@cloud-fan cloud-fan closed this in af35691 May 30, 2020
cloud-fan pushed a commit that referenced this pull request May 30, 2020
…zone issues

Closes #28671 from juliuszsompolski/SPARK-31861.

Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit af35691)
Signed-off-by: Wenchen Fan <[email protected]>
Comment on lines +816 to +817
withJdbcStatement() { statement =>
  withJdbcStatement() { st =>
Member

@juliuszsompolski Did you use this nested st intentionally? Looks like statement is not used.

Contributor Author

This was not intentional; it looks like a copy-paste error on my part. I raised #28735 to fix it.

HyukjinKwon pushed a commit that referenced this pull request Jul 8, 2020
…ncel and close should not transiently ERROR

### What changes were proposed in this pull request?
#28671 changed the order in which the CANCELED state is set for SparkExecuteStatementOperation. Before the state was set to CANCELED, `cleanup()` was called, which kills the jobs and causes an exception to be thrown inside `execute()`. This makes the state transiently become ERROR before it is set to CANCELED. This PR fixes the order.

### Why are the changes needed?
Bug: wrong operation state is set.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test in SparkExecuteStatementOperationSuite.scala.

Closes #28912 from alismess-db/execute-statement-operation-cleanup-order.

Authored-by: Ali Smesseim <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 8, 2020
…ncel and close should not transiently ERROR

Authored-by: Ali Smesseim <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 8b0a54e)
Signed-off-by: HyukjinKwon <[email protected]>