sujith71955
Contributor

What changes were proposed in this pull request?

When a user tries to load data from a nonexistent HDFS file path, the system does not validate the path and the LOAD command reports success.
This is misleading to the user. Validation already exists for a nonexistent local file path; this PR adds the same validation for a nonexistent HDFS file path.
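
The validation described above can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the code in this PR: the object `LoadPathCheck`, the helper `validateInputPath`, and the injected `exists` predicate are hypothetical names, and an in-memory set of known paths stands in for Hadoop's `FileSystem.exists` so the snippet runs without a cluster. Only the error message is taken from the diff under review.

```scala
import java.net.URI

object LoadPathCheck {
  // Hypothetical helper: fail fast when the input path is missing instead of
  // letting the load report success. `exists` is injected so that a real
  // filesystem lookup could be substituted for the stand-in used in main.
  def validateInputPath(path: String, exists: URI => Boolean): URI = {
    val uri = new URI(path) // throws URISyntaxException for malformed input
    if (!exists(uri)) {
      // A plain exception is used here; the PR itself throws AnalysisException.
      throw new IllegalArgumentException(s"LOAD DATA input path does not exist: $path")
    }
    uri
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a fake "filesystem" that knows about exactly one file.
    val known = Set("/data/part-0.csv")
    val exists: URI => Boolean = uri => known.contains(uri.getPath)

    println(scala.util.Try(validateInputPath("hdfs://localhost/data/part-0.csv", exists)))
    println(scala.util.Try(validateInputPath("hdfs://localhost/doesnotexist.csv", exists)))
  }
}
```

Injecting the existence check keeps the sketch testable; in Spark the equivalent lookup would go through the Hadoop filesystem resolved for the path.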

How was this patch tested?

A unit test has been added to verify the fix, and snapshots have been added after verification on a Spark YARN cluster.

@sujith71955
Contributor Author

loaddataissue_verificationresult

@sujith71955 sujith71955 force-pushed the master_LoadComand_Issue branch from 5b247a8 to bed0e65 Compare November 26, 2017 18:34
Member

@srowen srowen left a comment

I still kind of wonder whether this causes a problem in some scenario where the path will be created later after this is evaluated, but I don't know this path well enough to say for sure

val hadoopConf = sparkSession.sessionState.newHadoopConf()
val srcPath = new Path(path)
val fs = srcPath.getFileSystem(hadoopConf)
if(!fs.exists(srcPath)) {
Member

Nit: space after if

@@ -341,6 +341,12 @@ case class LoadDataCommand(
} else {
val uri = new URI(path)
if (uri.getScheme() != null && uri.getAuthority() != null) {
val hadoopConf = sparkSession.sessionState.newHadoopConf()
Member

You can probably lift this out of the if-else and use it in the other branch too

Member

+1

@SparkQA

SparkQA commented Nov 26, 2017

Test build #3991 has finished for PR 19823 at commit bed0e65.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2624,7 +2624,13 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
val e = intercept[AnalysisException](sql("SELECT nvl(1, 2, 3)"))
assert(e.message.contains("Invalid number of arguments"))
}

test("load command invalid path validation ") {
Member

Move this to DDLSuite?

nit: also blank lines

withTable("tbl") {
sql("CREATE TABLE tbl(i INT, j STRING) USING parquet")
val e = intercept[AnalysisException](sql("load data inpath 'hdfs://localhost/doesnotexist.csv' into table tbl"))
assert(e.message.contains("LOAD DATA input path does not exist"))
Member

indents between 2629 and 2631

val srcPath = new Path(path)
val fs = srcPath.getFileSystem(hadoopConf)
if(!fs.exists(srcPath)) {
throw new AnalysisException(s"LOAD DATA input path does not exist: $path")
Member

two space indents.

@@ -341,6 +341,12 @@ case class LoadDataCommand(
} else {
val uri = new URI(path)
if (uri.getScheme() != null && uri.getAuthority() != null) {
val hadoopConf = sparkSession.sessionState.newHadoopConf()
Member

+1

@gatorsmile
Member

Thanks for fixing this!

@@ -341,6 +341,12 @@ case class LoadDataCommand(
} else {
val uri = new URI(path)
if (uri.getScheme() != null && uri.getAuthority() != null) {
val hadoopConf = sparkSession.sessionState.newHadoopConf()
val srcPath = new Path(path)
Member

nit: Let's use new Path(uri). I think we'd better validate uri in this scope.

But will this lose the normalization done by new Path(path)? Or does the normalization in URI cover it?

Member

I think path is a URI here; otherwise val uri = new URI(path) would fail, and we return uri. So I think we should check whether the uri is valid.
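
To make the URI discussion in this thread concrete, here is a small sketch of how `java.net.URI` treats the kinds of input being discussed. It uses only the JDK; the object `UriInspect` and the method `describe` are names made up for this illustration, and it deliberately does not show how Hadoop's `Path` normalizes paths, which is the open question raised above.

```scala
import java.net.{URI, URISyntaxException}

object UriInspect {
  // Returns (scheme, authority, path), or None when the string is not a valid URI.
  // Scheme and authority are null when the input does not contain them.
  def describe(s: String): Option[(String, String, String)] =
    try {
      val uri = new URI(s)
      Some((uri.getScheme, uri.getAuthority, uri.getPath))
    } catch {
      case _: URISyntaxException => None
    }

  def main(args: Array[String]): Unit = {
    // Fully qualified path: scheme and authority are both present.
    println(describe("hdfs://localhost/doesnotexist.csv"))
    // Bare path: scheme and authority are both null.
    println(describe("/tmp/data.csv"))
    // A space is illegal in the single-argument URI constructor.
    println(describe("not a uri"))
    // URI.normalize() resolves "." and ".." segments in the path.
    println(new URI("hdfs://localhost/a/../b.csv").normalize())
  }
}
```

The `uri.getScheme() != null && uri.getAuthority() != null` condition in the diff corresponds to the first case; the bare-path case is the one that falls through to the other branch.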

@sujith71955 sujith71955 force-pushed the master_LoadComand_Issue branch from bed0e65 to 1540c16 Compare November 27, 2017 11:25
@sujith71955 sujith71955 changed the title from "[SPARK-22601][SQL] Data load is getting displayed successful on providing non existing hdfs file path" to "[WIP][SPARK-22601][SQL] Data load is getting displayed successful on providing non existing hdfs file path" Nov 27, 2017
val srcPath = new Path(uri)
val fs = srcPath.getFileSystem(hadoopConf)
if (!fs.exists(srcPath)) {
throw new AnalysisException(s"LOAD DATA input path does not exist: $path")
Member

tiny nit: double spaces

@sujith71955 sujith71955 force-pushed the master_LoadComand_Issue branch 2 times, most recently from fa4b63d to cc1b176 Compare November 27, 2017 12:18
@sujith71955
Contributor Author

Thanks for the comments, guys. I am working on it and will update the PR based on the comments.

@sujith71955
Contributor Author

Basically, this validation holds for both cases, whether the scheme is null or not null. I will update the logic as Sean suggested. Thanks.

@sujith71955
Contributor Author

retest this please

@sujith71955 sujith71955 changed the title from "[WIP][SPARK-22601][SQL] Data load is getting displayed successful on providing non existing hdfs file path" to "[SPARK-22601][SQL] Data load is getting displayed successful on providing non existing hdfs file path" Nov 27, 2017
@sujith71955
Contributor Author

@gatorsmile @HyukjinKwon @srowen Please review; I have modified the code as per the comments provided. Thanks.

@SparkQA

SparkQA commented Nov 28, 2017

Test build #3996 has finished for PR 19823 at commit cc1b176.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sujith71955 sujith71955 force-pushed the master_LoadComand_Issue branch 2 times, most recently from aca9c27 to a315bf3 Compare November 28, 2017 17:44
@@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
}
}
}
test("load command invalid path validation ") {
Contributor

@sujith71955 according to Jenkins, there's a whitespace at end of line

Contributor Author

updated it. In my local build it was working fine. Thanks for the feedback

@@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
}
}
}
test("load command invalid path validation ") {
Contributor

nit: insert an empty line before the test

Contributor Author

fixed. thanks

@sujith71955 sujith71955 force-pushed the master_LoadComand_Issue branch from a315bf3 to a60843c Compare November 29, 2017 04:46
@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Nov 29, 2017

Test build #84283 has finished for PR 19823 at commit a60843c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sujith71955 sujith71955 force-pushed the master_LoadComand_Issue branch from a60843c to 2a84256 Compare November 29, 2017 11:26
@srowen
Member

srowen commented Nov 29, 2017

This is another problem downloading from an Apache mirror. I can add retry logic, but, hm, this seems to be a kind of fragile thing, to download several huge tarballs in unit tests. CC @cloud-fan

@cloud-fan
Contributor

How about we make it a runnable class and only run it occasionally or before the release?

@cloud-fan
Contributor

BTW this PR LGTM

@SparkQA

SparkQA commented Nov 29, 2017

Test build #3999 has finished for PR 19823 at commit 2a84256.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2017

Test build #4000 has finished for PR 19823 at commit 2a84256.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2392,5 +2392,14 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
}
}
}

test("load command for non local invalid path validation") {
Member

Please move this test case out of

Seq(true, false).foreach { caseSensitive =>
  ...
}
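
One way a fixed-name test can misbehave inside such a loop: the loop body runs once per element, so the same test name is registered twice, and ScalaTest rejects duplicate test names within a suite. The toy registry below mimics that behavior in plain Scala. `ToySuite` and its `test` method are hypothetical stand-ins, not ScalaTest's API, and whether this was the exact failure seen on this PR is an assumption.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a suite's test registration step (not ScalaTest's API).
class ToySuite {
  private val tests = mutable.LinkedHashMap.empty[String, () => Unit]

  // Registers a named test body; a name may be registered only once.
  def test(name: String)(body: => Unit): Unit = {
    if (tests.contains(name)) {
      throw new IllegalStateException(s"Duplicate test name: $name")
    }
    tests(name) = () => body
  }

  def registered: List[String] = tests.keys.toList
}

object ToySuiteDemo {
  def main(args: Array[String]): Unit = {
    val suite = new ToySuite

    // Parameterized names differ per iteration, so both registrations succeed.
    Seq(true, false).foreach { caseSensitive =>
      suite.test(s"parameterized check - caseSensitive $caseSensitive") {}
    }
    println(suite.registered)

    // A fixed name inside the same kind of loop is registered twice and fails.
    val attempt = scala.util.Try {
      Seq(true, false).foreach { _ => suite.test("fixed name") {} }
    }
    println(attempt.isFailure)
  }
}
```

Moving the test outside the loop, as requested, registers it exactly once.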

Contributor Author

Right, gatorsmile, this is why the test case was failing. One more change I have made: since I am dealing with a Hive table, I moved this test case to HiveDDLSuite.scala. I hope that's fine. Thanks for the review.

@sujith71955 sujith71955 force-pushed the master_LoadComand_Issue branch from 2a84256 to f6eb4ad Compare November 30, 2017 20:03
@sujith71955 sujith71955 changed the title from "[SPARK-22601][SQL] Data load is getting displayed successful on providing non existing hdfs file path" to "[SPARK-22601][SQL] Data load is getting displayed successful on providing non existing nonlocal file path" Nov 30, 2017
…ding non existing hdfs file path

## What changes were proposed in this pull request?
When a user tries to load data from a nonexistent HDFS file path, the system does not validate the path and the LOAD command reports success.
This is misleading to the user. Validation already exists for a local file path; this PR adds validation for an HDFS file path.
## How was this patch tested?
Existing tests verify the impact, and the scenario has been verified manually on an HDFS cluster.
@HyukjinKwon
Member

ok to test

@gatorsmile
Member

LGTM pending Jenkins

@SparkQA

SparkQA commented Dec 1, 2017

Test build #84356 has finished for PR 19823 at commit f6eb4ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

another lgtm

@sujith71955
Contributor Author

Thanks all for the review and guidance.

@gatorsmile
Member

Thanks! Merged to master/2.2

asfgit pushed a commit that referenced this pull request Dec 1, 2017
…ding non existing nonlocal file path

## What changes were proposed in this pull request?
When a user tries to load data from a nonexistent HDFS file path, the system does not validate the path and the LOAD command reports success.
This is misleading to the user. Validation already exists for a nonexistent local file path; this PR adds the same validation for a nonexistent HDFS file path.
## How was this patch tested?
A unit test has been added to verify the fix, and snapshots have been added after verification on a Spark YARN cluster.

Author: sujith71955 <[email protected]>

Closes #19823 from sujith71955/master_LoadComand_Issue.

(cherry picked from commit 16adaf6)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in 16adaf6 Dec 1, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018