[SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE syntax #26736
Conversation
```scala
    .createWithDefault(false)

  val LEGACY_RESPECT_HIVE_DEFAULT_PROVIDER_ENABLED =
    buildConf("spark.sql.legacy.respectHiveDefaultProvider.enabled")
```
how about spark.sql.legacy.createHiveTableByDefault.enabled
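If the rename suggested above were adopted, the definition could look roughly like this, following the `buildConf` pattern in the diff; the `.doc` text and modifiers here are only assumptions for illustration:

```scala
// Hypothetical sketch of the renamed config, mirroring the existing buildConf pattern.
val LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED =
  buildConf("spark.sql.legacy.createHiveTableByDefault.enabled")
    .internal()
    .doc("When true, CREATE TABLE without a USING clause creates a Hive serde table, " +
      "matching the pre-3.0 behavior.")
    .booleanConf
    .createWithDefault(false)
```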
Test build #114727 has finished for PR 26736 at commit

fix Conflicting files.

Test build #114905 has finished for PR 26736 at commit

Test build #114911 has finished for PR 26736 at commit

Test build #114910 has finished for PR 26736 at commit

Test build #114912 has finished for PR 26736 at commit
```scala
        |INTO 2 BUCKETS
        |AS SELECT key, value, cast(key % 3 as string) as p FROM src
        """.stripMargin)
      """.stripMargin)
```
nit: indent.
| test("SPARK-30098: create table without provider should " + | ||
| "use default data source under non-legacy mode") { | ||
| withSQLConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED.key -> "false") { |
let's remove this withSQLConf to show that it's the default behavior.
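A rough sketch of how the test could read once the `withSQLConf` wrapper is dropped; the table name, schema, and assertion are assumptions, and suite helpers such as `sql`, `withTable`, and `TableIdentifier` are assumed to be in scope:

```scala
// Hypothetical shape of the test exercising the default (non-legacy) behavior directly.
test("SPARK-30098: create table without provider should " +
  "use default data source under non-legacy mode") {
  withTable("t") {
    sql("CREATE TABLE t(a INT, b STRING)")
    val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
    // With the new default, the provider comes from spark.sql.sources.default, not hive.
    assert(table.provider == Some(spark.sessionState.conf.defaultDataSourceName))
  }
}
```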
```scala
    try {
      TestHive.setConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED, true)
      withTable("t1") {
        val createTable = "CREATE TABLE `t1`(`a` STRUCT<`b`: STRING>)"
```
shall we just add using hive?
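That is, presumably something along these lines:

```scala
// Sketch of the suggestion: make the Hive serde explicit with USING hive.
val createTable = "CREATE TABLE `t1`(`a` STRUCT<`b`: STRING>) USING hive"
```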
Test build #114931 has finished for PR 26736 at commit

retest this please

Test build #114933 has finished for PR 26736 at commit

thanks, merging to master!

thanks a lot! @cloud-fan
… `false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior from `Apache Spark 4.0.0` while the legacy functionality will be preserved during the Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was merged at `Apache Spark 3.0.0` activity in SPARK-30098 and reverted officially during the `3.0.0 RC` period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration by reusing the JIRA ID, SPARK-30098.

- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark's future direction.

- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as a part of the Apache Spark 4.0.0 idea.

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
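For reference, a minimal sketch of how a user could keep the legacy behavior during the 4.x period, as the commit message describes; the table name is illustrative, and setting the conf at the session level is assumed to be permitted:

```scala
// Opt back into the legacy behavior for this session: CREATE TABLE without
// a USING clause then creates a Hive serde table again.
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
spark.sql("CREATE TABLE legacy_t(id INT)")
```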
What changes were proposed in this pull request?
In this PR, we propose to use the value of `spark.sql.sources.default` as the provider for the `CREATE TABLE` syntax in Spark 3.0, instead of `hive`. To help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.

Why are the changes needed?
Currently, the `CREATE TABLE` syntax uses the `hive` provider to create tables, while the `DataFrameWriter.saveAsTable` API uses the value of `spark.sql.sources.default` as the provider. It would be better to make them consistent.

Users may get confused in some cases. For example:
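The DDL pair being discussed is roughly of this shape; the table and column names here are only illustrative:

```scala
// Two DDLs that differ only in whether the USING clause is spelled out.
spark.sql("CREATE TABLE t1(i INT) USING parquet")
spark.sql("CREATE TABLE t2(i INT)")  // before this PR, this falls back to the hive provider
```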
In these two DDLs, users may think that `t2` should also use parquet as the default provider, since Spark always advertises parquet as the default format. However, it is `hive` in this case.

On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCTAS=true`:
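A sketch of the CTAS case, assuming `spark.sql.hive.convertCTAS` is already enabled in the session (the table name is illustrative):

```scala
// With spark.sql.hive.convertCTAS=true, a CTAS without USING is converted to a
// data source table and picks up spark.sql.sources.default (parquet) as its provider.
spark.sql("CREATE TABLE t3 AS SELECT 1 AS i")
```

And these two cases together can be really confusing.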
spark.sql.hive.convertCATS=true:And these two cases together can be really confusing.
Does this PR introduce any user-facing change?
Yes, before this PR, using
CREATE TABLEsyntax will use hive provider. But now, it use the value ofspark.sql.source.defaultas its provider.How was this patch tested?
Added tests in `DDLParserSuite` and `HiveDDLSuite`.