[SPARK-21917][CORE][YARN] Supporting adding http(s) resources in yarn mode #19130
Conversation
Test build #81402 has finished for PR 19130 at commit
Jenkins, retest this please.
Test build #81410 has finished for PR 19130 at commit
Test build #81447 has finished for PR 19130 at commit
@vanzin @tgravescs, would you please help to review? Thanks!
sorry, probably won't get to this today, will look tomorrow.
There is a FileSystem.getFileSystemClass function we could use here instead of calling it with a dummy URI.
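For reference, a minimal sketch of the suggested call (hedged; the actual call site and configuration in the PR may differ):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Resolve the FileSystem implementation class for a scheme directly, rather
// than constructing a dummy URI and calling FileSystem.get on it. Throws
// java.io.IOException ("No FileSystem for scheme: ...") if unsupported.
val fsClass = FileSystem.getFileSystemClass("hdfs", new Configuration())
```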
Do we want to make this configurable per scheme? Right now it's basically http/https; in the future we might want to handle other filesystems that Hadoop doesn't support. Making this a settable config would make that easier.
@tgravescs, thanks for your comments. Can you review again and see if it is what you expected?
Test build #81658 has finished for PR 19130 at commit
Jenkins, retest this please.
Test build #81665 has finished for PR 19130 at commit
"When running in YARN cluster manager, ?"
Sorry for the broken comment, my bad, I will fix it.
Try { ... }.isSuccess? You could also avoid this call if the scheme is in the blacklist.
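As a sketch of that shape (helper and parameter names here are placeholders, not the PR's actual identifiers):

```scala
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Placeholder helper: short-circuit on the blacklist first, so the Hadoop
// probe is skipped entirely for blacklisted schemes; otherwise use
// Try { ... }.isSuccess instead of an explicit try/catch.
def isSchemeSupported(scheme: String,
                      blacklist: Set[String],
                      hadoopConf: Configuration): Boolean = {
  !blacklist.contains(scheme) &&
    Try(FileSystem.getFileSystemClass(scheme, hadoopConf)).isSuccess
}
```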
docs/running-on-yarn.md
Better wording:
Comma-separated list of schemes for which files will be downloaded to the local disk prior to being added to YARN's distributed cache. For use in cases where the YARN service does not support schemes that are supported by Spark.
I was going to say this is missing spark.yarn.dist.files and .jars, but later those properties seem to be set based on args.files and args.jars.
Which kinda raises the question of what happens when the user sets both. From the documentation it sounds like that should work (both sets of files get added), but from the code it seems --files and --jars would overwrite the spark.yarn.* configs...
In any case, that's not the fault of your change.
From the code, --files and --jars have overwritten spark.yarn.* for a long time, AFAIK. What I think is that we should make spark.yarn.* internal configurations to reduce the discrepancy.
It would be better if tests could avoid this... you could start a local http server, but that feels like a lot of work. Is there some way to mock the behavior instead?
Yes, that's my concern too; let me think of another way to handle this.
...still are...
Also I'm not sure I understand the comment.
It seems you have 3 different tests in this block (at least); could you break them into separate tests?
@vanzin, it is a little difficult to mock the download behavior, so here I check whether "spark.testing" is configured and return a dummy local path if it is. What do you think of this approach?
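A rough sketch of that approach (the "spark.testing" property is from the comment; everything else here is assumed):

```scala
// Sketch: when "spark.testing" is set, skip the real network fetch and
// return a dummy local path, so unit tests don't need a live HTTP server.
def downloadFile(path: String): String = {
  if (sys.props.contains("spark.testing")) {
    // Dummy local path used only under test.
    "/tmp/" + new java.io.File(new java.net.URI(path).getPath).getName
  } else {
    ??? // real download logic, elided in this sketch
  }
}
```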
Sounds good.
Test build #81775 has finished for PR 19130 at commit
Test build #81776 has finished for PR 19130 at commit
A few minor things to address.
Code like this (breaking a comma-separated string into a list) is copy & pasted in so many places that it probably deserves a method in Utils. There's one in ConfigHelpers.stringToSeq, but that class is private to its package.
I added a helper method in Utils and changed the related code in SparkSubmit. There are still some other places that require changes, but I will not touch them in this PR.
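The helper presumably looks something like this (a sketch modeled on ConfigHelpers.stringToSeq, not necessarily the exact code added):

```scala
// Split a comma-separated string into a trimmed, non-empty sequence,
// mirroring ConfigHelpers.stringToSeq.
def stringToSeq(str: String): Seq[String] =
  str.split(",").map(_.trim).filter(_.nonEmpty)

// e.g. stringToSeq("http, https,,ftp") == Seq("http", "https", "ftp")
```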
break multi-line args one per line.
Better: "force download from blacklisted schemes"
This should already be set by the build scripts; was it not working for you?
I haven't tried; I saw some UTs also set this configuration. Let me check whether it is explicitly required or not.
Test build #81804 has finished for PR 19130 at commit
Test build #81805 has finished for PR 19130 at commit
LGTM
is it a problem only for YARN? Do standalone and Mesos have this problem?
This is currently a problem only for YARN, because YARN uses the dist cache to distribute resources to the cluster, and the dist cache requires a supported Hadoop FS to copy resources. If a resource's scheme is http, it will try to find an http FS to handle the resource, which fails since current Hadoop has no http FS support.
In standalone and Mesos clusters, we use Spark's internal logic to handle http resources; this logic handles http(s) resources well, so there should be no issue for standalone and Mesos mode.
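Concretely, this is the kind of call that breaks on pre-2.9 Hadoop (a sketch; in the PR the failure surfaces inside YARN#client's dist cache upload):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// On Hadoop before 2.9 this throws:
//   java.io.IOException: No FileSystem for scheme: http
// which is why the dist cache cannot copy http(s) resources.
val fs = FileSystem.get(new URI("http://example.com/some.jar"), new Configuration())
```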
why make it a function? Can't we just inline it?
It can be, let me change the code.
shall we explicitly list "http" | "https" | "ftp"?
No, it is not required, because the shouldDownload logic will handle this. If 1) the resource scheme is blacklisted, or 2) it is not supported by Hadoop, then Spark will handle the resource through the downloadFile method. Since "http" | "https" | "ftp" are not supported by Hadoop before 2.9, resources with such schemes will be handled by Spark itself.
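Putting the two conditions together, a hedged sketch of the dispatch (shouldDownload and downloadFile are the names from this thread; the signatures and surrounding code are assumptions):

```scala
import java.net.URI

// Stubs standing in for the PR's helpers; signatures are assumed.
def shouldDownload(scheme: String): Boolean = ???
def downloadFile(path: String): String = ???

def resolveResource(uri: URI): URI = {
  if (shouldDownload(uri.getScheme)) {
    // Blacklisted, or no Hadoop FS for the scheme: Spark downloads the
    // resource itself and the local copy goes into YARN's dist cache.
    new URI(downloadFile(uri.toString))
  } else {
    // Hadoop can handle the scheme (e.g. http(s) on Hadoop 2.9+), so the
    // dist cache copies it directly.
    uri
  }
}
```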
can you give an example of these schemes?
OK, sure.
Test build #81905 has finished for PR 19130 at commit
docs/running-on-yarn.md
update here too
shall we make these 3 the default value of this config?
It is not necessary; we still want to leverage Hadoop's http(s) FS to distribute resources by default when running on Hadoop 2.9+ (https://issues.apache.org/jira/browse/HADOOP-14383).
ah got it
Test build #81912 has finished for PR 19130 at commit
Test build #81914 has finished for PR 19130 at commit
LGTM, just one question: why do we need …
Hi @cloud-fan, the main purpose of …
thanks, merging to master!
What changes were proposed in this pull request?
In the current Spark, submitting an application on YARN with remote resources, e.g.

./bin/spark-shell --jars http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar --master yarn-client -v

fails, because YARN#client assumes resources are on a Hadoop-compatible FS. To fix this problem, this PR proposes to download remote http(s) resources to local disk and add the downloaded local copies to the dist cache. This solution has one downside: remote resources are downloaded and then uploaded again. However, it is restricted to remote http(s) resources only, and the overhead is not big. The advantage of this solution is that it is simple and the code changes are restricted to SparkSubmit.
How was this patch tested?
Unit tests added; also verified in a local cluster.
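For illustration, the scheme blacklist discussed in this review could be declared with Spark's internal ConfigBuilder as sketched below; the key spark.yarn.dist.forceDownloadSchemes matches the "force download from blacklisted schemes" wording above, but treat the exact name and default as assumptions rather than the merged code.

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Sketch of the config entry discussed in this review; the key name and
// default are assumptions based on the thread.
private[spark] object YarnConfigSketch {
  val FORCE_DOWNLOAD_SCHEMES =
    ConfigBuilder("spark.yarn.dist.forceDownloadSchemes")
      .doc("Comma-separated list of schemes for which files will be downloaded " +
        "to the local disk prior to being added to YARN's distributed cache.")
      .stringConf
      .toSequence
      .createWithDefault(Nil)
}
```

At submit time a user would then pass something like --conf spark.yarn.dist.forceDownloadSchemes=http,https to force local downloads for those schemes.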