[SPARK-19583][SQL]CTAS for data source table with a created location should succeed #16938
Conversation
Test build #72932 has finished for PR 16938 at commit
I don't think we should treat it as a bug just because Hive supports it; we should think more. Does it make sense to specify an existing directory in CTAS?
We need to define a consistent rule in the Catalog for handling the scenario where the to-be-created directory already exists. So far, in most DDL scenarios, when we try to create a directory that already exists, we simply use the existing directory without an error message.
From what I understand, this change is applicable to EXTERNAL tables only. There are two main uses of EXTERNAL tables I am aware of (repost from #16868 (comment)):
The ability to point to an arbitrary location (which already has data) and create an EXTERNAL table over it is important for supporting EXTERNAL tables. If we don't allow this PR, then the options left to users are:
@cloud-fan: I don't think Spark's interpretation of EXTERNAL tables is different from Hive's. If it is, can you share the differences? I think we should allow this. If you have specific concerns, let's discuss those.
I think CTAS does not allow the table to already exist, but is not strict about whether the path exists. In DataFrameWriter.save with ErrorIfExists mode, an existing path is not allowed.
@cloud-fan @gatorsmile @tejasapatil let's discuss this together?
OK, let's discuss it case by case:
One more case for managed tables: after we finish the TABLE-level DDLs, we also need to do the same for DATABASE-level DDLs and PARTITION-level DDLs.
Updated (tested on Hive 2.0.0, adding EXTERNAL). Summary:
1. CREATE TABLE ... LOCATION path
2. CREATE TABLE ... LOCATION path AS SELECT ...
3. ALTER TABLE ... SET LOCATION path
4. CREATE TABLE ...
5. CREATE TABLE ... AS SELECT ...
Updated (tested on Hive 2.0.0, adding EXTERNAL). Summary:
1. CREATE TABLE ... PARTITIONED BY ... LOCATION path
2. CREATE TABLE ... PARTITIONED BY ... LOCATION path AS SELECT ...
3. ALTER TABLE ... PARTITION(...) ... SET LOCATION path
4. CREATE TABLE ... PARTITIONED BY ...
5. CREATE TABLE ... PARTITIONED BY ... AS SELECT ...
@windpiger Thank you for your efforts! What you did above needs to be written as test cases. Could you do it as a separate PR? In addition, all the cases you tried are only for Hive serde tables, right?
Could you check the behaviors for both data source tables and Hive serde tables? Later, we also need to check the behaviors of InMemoryCatalog for data source tables without enabling Hive support.
@gatorsmile I will add Hive serde tables for Spark with HiveExternalCatalog, and Parquet tables with InMemoryCatalog soon.
@gatorsmile @cloud-fan @tejasapatil
I think Hive's behavior makes more sense. Users may want to insert data into this table and put the data in a specified location, even if it doesn't exist at the beginning.
The same reason applies here too.
When users don't specify the location, they mostly expect a fresh table, and the table path should not exist.
@cloud-fan situation 3: CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
@windpiger yes for both questions.
Based on the doc, Hive does not support CTAS when the target table is external.
1. CREATE TABLE ... LOCATION path
When you test the above cases to check Hive's behaviors, you need to change the syntax a little by manually adding EXTERNAL:
1. CREATE EXTERNAL TABLE ... LOCATION path
Could you update the comparison? Thanks!
@gatorsmile sorry, I made a mistake on this; I have updated the comparison test above.
@cloud-fan @gatorsmile @tejasapatil As we discussed above, we have three actions to take:
1. CREATE TABLE ... (PARTITIONED BY ...) LOCATION path (situation: path does not exist)
2. CREATE TABLE ... (PARTITIONED BY ...) LOCATION path AS SELECT (situation: path exists)
3. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... (situation: default warehouse table path exists)
Please help confirm the actions above. If they are OK, situation 2 is what this PR resolves, and I will make another PR to resolve situations 1 and 3, thanks~
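The decision implied by these three situations can be modeled as a small sketch. This is purely illustrative (the real code is Scala inside Spark's catalog and command classes); all names below are hypothetical and only encode the rule discussed in this thread:

```python
# Hypothetical model (not Spark source) of the three CTAS-related
# situations listed above; the constants are illustrative labels only.
CREATE_WITH_LOCATION = 1    # CREATE TABLE ... LOCATION path
CTAS_WITH_LOCATION = 2      # CREATE TABLE ... LOCATION path AS SELECT ...
CTAS_DEFAULT_LOCATION = 3   # CREATE TABLE ... AS SELECT ... (managed)

def path_may_already_exist(situation: int) -> bool:
    """Per the discussion: a pre-existing path is tolerated when the user
    specified the location explicitly (situations 1 and 2), while a managed
    CTAS expects the default warehouse path not to exist yet."""
    return situation in (CREATE_WITH_LOCATION, CTAS_WITH_LOCATION)
```

Under this rule, situation 2 (an explicitly specified LOCATION that already exists) is the one this PR makes succeed.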
I found you also changed the following cases: Actually, they are managed tables. You do not need to update them. Can you roll back the changes? Thanks!
Basically, the rules you proposed can be summarized below.
Is my understanding right?
Oh, you are right~ thanks! I have rolled back the changes. Yes, the summary is consistent with what we discussed.
Thank you for your work! Maybe the last question. In the above case, you used
@windpiger: I realised that you are checking the Hive behavior against Hive 2.0.0. Spark is expected to support semantics for Hive 1.2.1:
I am not up to date with the differences between those two releases of Hive with respect to this discussion. Can you confirm whether the observations reported earlier in the discussion are valid against Hive 1.2.1?
Test build #73399 has started for PR 16938 at commit
Test build #73400 has finished for PR 16938 at commit
Test build #73405 has finished for PR 16938 at commit
Test build #73411 has finished for PR 16938 at commit
Test build #73424 has finished for PR 16938 at commit
@gatorsmile @cloud-fan could you help review this PR? Thanks :)
saveDataIntoTable(
  sparkSession, table, table.storage.locationUri, query, mode, tableExists = true)
saveDataIntoTable(sparkSession, table, table.storage.locationUri, query, mode,
shall we just pass APPEND mode instead of creating a new overwrite parameter?
We should also explain the behavior in the PR description: when the path already exists, do we append to it or overwrite it?
Whether CreateTableAsSelectCommand overwrites the table path has no relation to the SaveMode of the CTAS. That is, if the table already exists, we will overwrite it whatever the SaveMode is, except that we should still check whether the path exists for a managed table (another PR will resolve this). So I think overwrite can't be replaced by SaveMode.
@cloud-fan do I understand right? thanks~
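The distinction being argued here can be sketched in a few lines. This is plain Python with hypothetical names, not the actual Scala patch; it only models the claim that SaveMode gates the CTAS at the table level, while the path-level overwrite decision is made by the command itself:

```python
# Hypothetical stand-ins for the SaveMode values.
ERROR_IF_EXISTS, IGNORE, APPEND, OVERWRITE = "error", "ignore", "append", "overwrite"

def ctas_proceeds(table_exists: bool, mode: str) -> bool:
    # Table-level gate: SaveMode decides whether the CTAS proceeds at all
    # when the target table already exists.
    return (not table_exists) or mode in (APPEND, OVERWRITE)

def overwrites_path(table_exists: bool, mode: str) -> bool:
    # Path-level action: for a brand-new table the write overwrites whatever
    # sits at the location, regardless of which SaveMode was given; only an
    # Append into an existing table appends.
    return True if not table_exists else mode == OVERWRITE
```

In this model, `overwrites_path(False, ERROR_IF_EXISTS)` is still true, which is why the overwrite flag cannot simply be derived from the user-facing SaveMode.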
ping @cloud-fan ~
!exists
case (s, exists) =>
  throw new IllegalStateException(s"unsupported save mode $s ($exists)")
if (overwrite) {
It is simple for InsertIntoHadoopFsRelationCommand: it just cares whether it should overwrite the original, already-existing path.
}
val result = saveDataIntoTable(
  sparkSession, table, tableLocation, query, mode, tableExists = false)
sparkSession, table, tableLocation, query, mode, overwrite = true, tableExists = false)
In this branch the table does not exist, and the expected behavior is to create the directory, or overwrite it if it already exists. So we can just pass Overwrite mode here, and everything should work. Did I miss something?
Test build #73587 has finished for PR 16938 at commit
s"got: ${allPaths.mkString(", ")}")
}
if (pathExists) {
Why do we move this check from InsertIntoHadoopFsRelationCommand to here?
I think InsertIntoHadoopFsRelationCommand should just care about the overwrite logic, like InsertIntoHiveTableCommand: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L80
But in InsertIntoHadoopFsRelationCommand we still need to check path existence; I do not quite agree with this refactor.
Anyway, let's revert it and put the refactor in a new PR, to unblock this bug-fix patch.
ok, thanks~
Test build #73634 has finished for PR 16938 at commit
Test build #73648 has finished for PR 16938 at commit
I am modifying the hacky code.
Test build #73680 has finished for PR 16938 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Failed with the error message:
while the same DDL on a Hive table is OK, so we should fix it for data source tables.
The reason is that the SaveMode check is placed in InsertIntoHadoopFsRelationCommand, and there the SaveMode check actually applies to the path. This is fine when we use DataFrameWriter.save(), because in that situation the SaveMode acts on the path. But when we use CreateDataSourceTableAsSelectCommand, the SaveMode acts on the table, and we have already done the SaveMode check in CreateDataSourceTableAsSelectCommand for the table. So we should not repeat the SaveMode check for the path in the subsequent logic in InsertIntoHadoopFsRelationCommand; that is redundant and wrong logic for CreateDataSourceTableAsSelectCommand.
After this PR, the following DDL will succeed: when the location has already been created, we will append to it or overwrite it.
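The core of the fix can be sketched as follows. This is a simplified illustration in plain Python (the actual patch is Scala inside Spark; the function names below are hypothetical): the ErrorIfExists-style check should be evaluated against the table in the CTAS command, not against the path in the file-write command:

```python
# Hypothetical stand-in for the SaveMode value in question.
ERROR_IF_EXISTS = "error"

def old_path_check_fails(mode: str, path_exists: bool) -> bool:
    # Before the fix (simplified): the write path rejected the query
    # whenever the target directory already existed, even though the
    # *table* was new and the user had merely pre-created the LOCATION.
    return mode == ERROR_IF_EXISTS and path_exists

def new_table_check_fails(mode: str, table_exists: bool) -> bool:
    # After the fix (simplified): only a pre-existing *table* aborts the
    # CTAS; a pre-created location directory is appended to or overwritten.
    return mode == ERROR_IF_EXISTS and table_exists
```

So a CTAS into a fresh table name with a pre-created location, which `old_path_check_fails` would reject, passes `new_table_check_fails`.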
How was this patch tested?
Unit tests added.