[SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax #26736
Conversation
```scala
    .createWithDefault(false)

  val LEGACY_RESPECT_HIVE_DEFAULT_PROVIDER_ENABLED =
    buildConf("spark.sql.legacy.respectHiveDefaultProvider.enabled")
```
how about spark.sql.legacy.createHiveTableByDefault.enabled
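If the rename suggested above were adopted, the definition could look roughly like this, following the `buildConf` pattern in the diff; the `.doc` text and modifiers here are only assumptions for illustration:

```scala
// Hypothetical sketch of the renamed config, mirroring the existing buildConf pattern.
val LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED =
  buildConf("spark.sql.legacy.createHiveTableByDefault.enabled")
    .internal()
    .doc("When true, CREATE TABLE without a USING clause creates a Hive serde table, " +
      "matching the pre-3.0 behavior.")
    .booleanConf
    .createWithDefault(false)
```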
Test build #114727 has finished for PR 26736 at commit

fix Conflicting files.

Test build #114905 has finished for PR 26736 at commit

Test build #114911 has finished for PR 26736 at commit

Test build #114910 has finished for PR 26736 at commit

Test build #114912 has finished for PR 26736 at commit
```scala
        |INTO 2 BUCKETS
        |AS SELECT key, value, cast(key % 3 as string) as p FROM src
        """.stripMargin)
      """.stripMargin)
```
nit: indent.
| test("SPARK-30098: create table without provider should " + | ||
| "use default data source under non-legacy mode") { | ||
| withSQLConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.key -> "false") { |
let's remove this withSQLConf to show that it's the default behavior.
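A rough sketch of how the test could read once the `withSQLConf` wrapper is dropped; the table name, schema, and assertion are assumptions, and suite helpers such as `sql`, `withTable`, and `TableIdentifier` are assumed to be in scope:

```scala
// Hypothetical shape of the test exercising the default (non-legacy) behavior directly.
test("SPARK-30098: create table without provider should " +
  "use default data source under non-legacy mode") {
  withTable("t") {
    sql("CREATE TABLE t(a INT, b STRING)")
    val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
    // With the new default, the provider comes from spark.sql.sources.default, not hive.
    assert(table.provider == Some(spark.sessionState.conf.defaultDataSourceName))
  }
}
```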
```scala
    try {
      TestHive.setConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED, true)
      withTable("t1") {
        val createTable = "CREATE TABLE `t1`(`a` STRUCT<`b`: STRING>)"
```
shall we just add using hive?
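That is, presumably something along these lines:

```scala
// Sketch of the suggestion: make the Hive serde explicit with USING hive.
val createTable = "CREATE TABLE `t1`(`a` STRUCT<`b`: STRING>) USING hive"
```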
Test build #114931 has finished for PR 26736 at commit

retest this please

Test build #114933 has finished for PR 26736 at commit

thanks, merging to master!

thanks a lot! @cloud-fan
… `false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior from `Apache Spark 4.0.0` while the legacy functionality will be preserved during the Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was merged at `Apache Spark 3.0.0` activity in SPARK-30098 and reverted officially during the `3.0.0 RC` period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration by reusing the JIRA ID, SPARK-30098.

- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark's future direction.

- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as a part of the Apache Spark 4.0.0 idea.

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
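For reference, a minimal sketch of how a user could keep the legacy behavior during the 4.x period, as the commit message describes; the table name is illustrative, and setting the conf at the session level is assumed to be permitted:

```scala
// Opt back into the legacy behavior for this session: CREATE TABLE without
// a USING clause then creates a Hive serde table again.
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
spark.sql("CREATE TABLE legacy_t(id INT)")
```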
What changes were proposed in this pull request?
In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for the `CREATE TABLE` syntax in Spark 3.0, instead of `hive`. To help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.

Why are the changes needed?
Currently, the `CREATE TABLE` syntax uses the `hive` provider to create tables, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent.

Users may get confused in some cases. For example:
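The DDL pair being discussed is roughly of this shape; the table and column names here are only illustrative:

```scala
// Two DDLs that differ only in whether the USING clause is spelled out.
spark.sql("CREATE TABLE t1(i INT) USING parquet")
spark.sql("CREATE TABLE t2(i INT)")  // before this PR, this falls back to the hive provider
```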
In these two DDLs, users may think that `t2` should also use parquet as the default provider, since Spark always advertises parquet as the default format. However, it is `hive` in this case.

On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`:
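A sketch of the CTAS case, assuming `spark.sql.hive.convertCTAS` is already enabled in the session (the table name is illustrative):

```scala
// With spark.sql.hive.convertCTAS=true, a CTAS without USING is converted to a
// data source table and picks up spark.sql.sources.default (parquet) as its provider.
spark.sql("CREATE TABLE t3 AS SELECT 1 AS i")
```

And these two cases together can be really confusing.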
spark.sql.hive.convertCATS=true:And these two cases together can be really confusing.
Does this PR introduce any user-facing change?
Yes, before this PR, using
CREATE TABLEsyntax will use hive provider. But now, it use the value ofspark.sql.source.defaultas its provider.How was this patch tested?
Added tests in `DDLParserSuite` and `HiveDDLSuite`.