[LIVY-435] Allow floating point numbers to be used for the driver and executor core counts #74
Conversation
AFAIK, Spark itself also doesn't support non-integer cores. How do you know that Spark supports non-integer cores, and what's the meaning of 0.1 cores? CC @ajbozarth
Spark does not support non-Integer values. Those values are passed into Spark as Strings and converted to Integers, but even then, passing it a double would fail the spark-submit. I'm not sure where you got the misconception, but I looked in the Livy and Spark code, and the values you changed here definitely have to be Integers for the spark-submit to succeed.
They are Ints in the YARN implementations, but Doubles for Mesos and Kubernetes. In k8s, the cores values get translated into Kubernetes Pod resource requests, where fractional cores are a thing that makes sense :) This isn't something you would generally do in real jobs (unless the job wasn't CPU bound) but it matters for local development.
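To make that concrete, here's a minimal sketch (plain Scala, not the actual Spark k8s code) of how a fractional core count could be rendered as a Kubernetes CPU quantity for a Pod resource request; Kubernetes accepts either decimal cores ("0.5") or millicores ("500m"):

```scala
// Sketch only: turning a Double cores value into a k8s CPU quantity string.
object CpuQuantitySketch {
  def toCpuQuantity(cores: Double): String = {
    val millis = math.round(cores * 1000) // 0.5 cores -> 500 millicores
    s"${millis}m"
  }

  def main(args: Array[String]): Unit = {
    println(toCpuQuantity(0.5)) // "500m"
    println(toCpuQuantity(2.0)) // "2000m" (equivalent to "2")
  }
}
```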
And I can 100% promise you this works with the k8s backend, it's the only way I can start a job in my 2 CPU dev VM (Kubernetes doesn't allow overcommit on requests, it will refuse to schedule things instead).
OK, I see, let me first check the code of Spark on Mesos and k8s.
https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L109 is the relevant code there :) And I think https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosFineGrainedSchedulerBackend.scala#L62 is the relevant bit in Mesos land, but I don't know that side very well.
Thanks for those links. Given those usages are in forks of Spark and not the main repo I'm a little hesitant, but I know we have talked about formally extending support for Mesos in the future, so I'd be OK with this change. I think we should confirm how Livy would handle Spark throwing a spark-submit error for a non-Int when using YARN/Standalone, though. If the error is handled cleanly then I'm fine with this; otherwise we would need to address that in this PR.
Related to my above statement, if we're adding tests we should also add a negative test to check that passing a string or some other non-numeric value would not work.
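Something along these lines, perhaps (a hedged sketch; the real Livy request class and suite names differ, and `Req` here is a stand-in):

```scala
import com.fasterxml.jackson.databind.{JsonMappingException, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.scalatest.FunSuite

// Stand-in for the real Livy request class.
case class Req(driverCores: Double)

class CoreCountNegativeSuite extends FunSuite {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  test("a non-numerical core count should fail to decode") {
    // "lots" cannot be coerced to a number, so decoding should throw.
    intercept[JsonMappingException] {
      mapper.readValue("""{"driverCores": "lots"}""", classOf[Req])
    }
  }
}
```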
Heh, the k8s branch/fork is getting upstreamed; that subsystem is actually already in: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L94
Wow, no idea how I missed that in my search results; on second look it's definitely there. We should support this then, but I stand by my comments about testing. We should check how Livy handles YARN and Standalone failing on double input, as well as add a test that makes sure Livy won't accept non-numerical input.
I may need some pointers on where to add tests and how :) The few that I added were just tweaks to some existing test code; I am still very much a n00b at this codebase.
So for the string unit test, I figured you'd write it just like the ones you added. As for the other testing, you should try out your code change using Livy with YARN and Standalone, give it a Double or a String, and see how it handles the error. In the case of a String, that error is what you're checking for in the above unit test. In the case of a Double, you should make sure that the error is handled safely and doesn't cause anything to crash or hang; if it does, you'll need to address that in this PR. Does that help you with next steps?
…rigger an exception).
Okay, I added another pair of unit tests for the string case. I'm kind of surprised that Jackson is even willing to decode that kind of invalid value; I could make it more explicitly fail at decoding like is already done for the …

As for how this will impact other modes, I did some digging through the Spark code and I think the answer is "not much". On the submit side, all CLI parsing into the SparkConf is handled purely as strings. The only type checks come when retrieving values from the conf, so as long as you don't actually use a fractional core value with YARN, it should be fine (technically, if you tried to assign more than 2^53 cores it would cause a problem, because that can't be represented in a double with exact precision, but I think we're probably okay on that one). Local mode actually ignores the values entirely and will happily accept …
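To make the surprising parts concrete, a small sketch (same stand-in `Req` class as above, not Livy's actual request type):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class Req(driverCores: Double) // stand-in for the real request class

object CoercionSketch {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

    // Jackson coerces the quoted string "0.5" into a number by default,
    // so this decodes instead of failing.
    println(mapper.readValue("""{"driverCores": "0.5"}""", classOf[Req])) // Req(0.5)

    // Spark holds conf values as Strings; a YARN-side read amounts to
    // something like conf.get("spark.executor.cores").toInt, so a
    // fractional value only blows up at retrieval time:
    // "0.5".toInt // throws NumberFormatException

    // And the precision caveat: doubles represent integers exactly only
    // up to 2^53.
    val limit = math.pow(2, 53)
    println(limit + 1 == limit) // true: 2^53 + 1 is not representable
  }
}
```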
Thanks for the updates. As a follow-up, though: even if it's "not much", what exactly would happen if you give a fractional core value to Livy when using YARN with your changes?
For batch requests, …
Ack, didn't mean to close, sorry.
So it will cause the initial submit to error and the Livy session would then be ended as FAILED?
Yep, definitely for batch sessions, probably (99%, unless I missed something in the code) for interactive.
Ok then I believe this LGTM, @jerryshao want to take one last look?
I will review again later today.
Spark itself has issues using fractional core numbers, see apache/spark#20460. Let's wait until Spark itself fixes such issues.
Nice catch @jerryshao, I agree we should wait until Spark has that fix in.
From the issues in apache/spark#20460 I think we should close this out and pursue a different tack :) Probably leaving the main driver/executor cores as ints and using secondary properties to allow more fine-grained control over k8s resource requests.
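Roughly this shape, say (the property names here are hypothetical, just to illustrate the split):

```scala
import org.apache.spark.SparkConf

// Sketch of the proposal: integer core counts stay as-is, while
// hypothetical k8s-only properties carry the fractional requests.
object SecondaryPropsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false)
      .set("spark.executor.cores", "1")                      // stays an int
      .set("spark.kubernetes.executor.request.cores", "0.5") // hypothetical fractional request
    println(conf.get("spark.kubernetes.executor.request.cores")) // "0.5"
  }
}
```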
https://issues.apache.org/jira/browse/LIVY-435 is the ticket.
I added unit tests for the improved struct parsing, but SparkProcessBuilder has no tests currently, so I wasn't sure how to add them (if needed; the changes there are relatively minor).