-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-22601][SQL] Data load is getting displayed successful on providing non existing nonlocal file path #19823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5b247a8
to
bed0e65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still kind of wonder whether this causes a problem in some scenario where the path will be created later after this is evaluated, but I don't know this path well enough to say for sure
val hadoopConf = sparkSession.sessionState.newHadoopConf() | ||
val srcPath = new Path(path) | ||
val fs = srcPath.getFileSystem(hadoopConf) | ||
if(!fs.exists(srcPath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: space after if
@@ -341,6 +341,12 @@ case class LoadDataCommand( | |||
} else { | |||
val uri = new URI(path) | |||
if (uri.getScheme() != null && uri.getAuthority() != null) { | |||
val hadoopConf = sparkSession.sessionState.newHadoopConf() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably lift this out of the if-else and use it in the other branch too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Test build #3991 has finished for PR 19823 at commit
|
@@ -2624,7 +2624,13 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { | |||
val e = intercept[AnalysisException](sql("SELECT nvl(1, 2, 3)")) | |||
assert(e.message.contains("Invalid number of arguments")) | |||
} | |||
|
|||
test("load command invalid path validation ") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to DDLSuite
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: also blank lines
withTable("tbl") { | ||
sql("CREATE TABLE tbl(i INT, j STRING) USING parquet") | ||
val e = intercept[AnalysisException](sql("load data inpath 'hdfs://localhost/doesnotexist.csv' into table tbl")) | ||
assert(e.message.contains("LOAD DATA input path does not exist")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indents between 2629 and 2631
val srcPath = new Path(path) | ||
val fs = srcPath.getFileSystem(hadoopConf) | ||
if(!fs.exists(srcPath)) { | ||
throw new AnalysisException(s"LOAD DATA input path does not exist: $path") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two space indents.
@@ -341,6 +341,12 @@ case class LoadDataCommand( | |||
} else { | |||
val uri = new URI(path) | |||
if (uri.getScheme() != null && uri.getAuthority() != null) { | |||
val hadoopConf = sparkSession.sessionState.newHadoopConf() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks for fixing this! |
@@ -341,6 +341,12 @@ case class LoadDataCommand( | |||
} else { | |||
val uri = new URI(path) | |||
if (uri.getScheme() != null && uri.getAuthority() != null) { | |||
val hadoopConf = sparkSession.sessionState.newHadoopConf() | |||
val srcPath = new Path(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Let's use new Path(uri)
. I think we better validate uri
in this scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But will this lose the normalization in new Path(path)
? Or normalization in URI
covers it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here path
is an URI here otherwise val uri = new URI(path)
should fail, and we return uri
. So, I think we should check if the uri
is valid.
bed0e65
to
1540c16
Compare
val srcPath = new Path(uri) | ||
val fs = srcPath.getFileSystem(hadoopConf) | ||
if (!fs.exists(srcPath)) { | ||
throw new AnalysisException(s"LOAD DATA input path does not exist: $path") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny nit: double spaces
fa4b63d
to
cc1b176
Compare
Thanks for the comments guys, i am working on it.,will update the PR based on comments. |
Basically this validation stands good for both cases where scheme can come as null and not null, i will update the logic as Sean told. Thanks |
retest this please |
@gatorsmile @HyukjinKwon @srowen Please review as i modified the code as per provided comments. thanks |
Test build #3996 has finished for PR 19823 at commit
|
aca9c27
to
a315bf3
Compare
@@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { | |||
} | |||
} | |||
} | |||
test("load command invalid path validation ") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sujith71955 according to Jenkins, there's a whitespace at end of line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated it. In my local build it was working fine. Thanks for the feedback
@@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { | |||
} | |||
} | |||
} | |||
test("load command invalid path validation ") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: insert an empty line before the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. thanks
a315bf3
to
a60843c
Compare
retest this please |
Test build #84283 has finished for PR 19823 at commit
|
a60843c
to
2a84256
Compare
This is another problem downloading from an Apache mirror. I can add retry logic, but, hm, this seems to be a kind of fragile thing, to download several huge tarballs in unit tests. CC @cloud-fan |
How about we make it a runnable class and only run it occasionally or before the release? |
BTW this PR LGTM |
Test build #3999 has finished for PR 19823 at commit
|
Test build #4000 has finished for PR 19823 at commit
|
@@ -2392,5 +2392,14 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { | |||
} | |||
} | |||
} | |||
|
|||
test("load command for non local invalid path validation") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this test case out of
Seq(true, false).foreach { caseSensitive =>
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right gatorsmile, this is why test case was failing, one more change i have done,Since i am dealing with hive table i moved this test case to HiveDDLSuite.scala, hope its fine. Thanks for the review
2a84256
to
f6eb4ad
Compare
…ding non existing hdfs file path ## What changes were proposed in this pull request? When user tries to load data with a non existing hdfs file path system is not validating it and the load command operation is getting successful. This is misleading to the user. already there is a validation in the scenario of local file path. This PR has added validation in the scenario of hdfs file path ## How was this patch tested? existing tests are present to verify the impact and manually the scenario is been verified in hdfs cluster
ok to test |
LGTM pending Jenkins |
Test build #84356 has finished for PR 19823 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another lgtm
Thanks all for the review and guidance. |
Thanks! Merged to master/2.2 |
…ding non existing nonlocal file path ## What changes were proposed in this pull request? When user tries to load data with a non existing hdfs file path system is not validating it and the load command operation is getting successful. This is misleading to the user. already there is a validation in the scenario of none existing local file path. This PR has added validation in the scenario of nonexisting hdfs file path ## How was this patch tested? UT has been added for verifying the issue, also snapshots has been added after the verification in a spark yarn cluster Author: sujith71955 <[email protected]> Closes #19823 from sujith71955/master_LoadComand_Issue. (cherry picked from commit 16adaf6) Signed-off-by: gatorsmile <[email protected]>
…ding non existing nonlocal file path ## What changes were proposed in this pull request? When user tries to load data with a non existing hdfs file path system is not validating it and the load command operation is getting successful. This is misleading to the user. already there is a validation in the scenario of none existing local file path. This PR has added validation in the scenario of nonexisting hdfs file path ## How was this patch tested? UT has been added for verifying the issue, also snapshots has been added after the verification in a spark yarn cluster Author: sujith71955 <[email protected]> Closes apache#19823 from sujith71955/master_LoadComand_Issue. (cherry picked from commit 16adaf6) Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?
When user tries to load data with a non existing hdfs file path system is not validating it and the load command operation is getting successful.
This is misleading to the user. already there is a validation in the scenario of none existing local file path. This PR has added validation in the scenario of nonexisting hdfs file path
How was this patch tested?
UT has been added for verifying the issue, also snapshots has been added after the verification in a spark yarn cluster