Advice on application architecture for render farm

I realize that this sort of thing belongs on Stack Overflow, but I've had a lot of bad luck there recently; I seem to have a talent for asking questions that no one is able to answer.

Anyway, I work for a company that makes web-based tools for computer-animated films, and I want to use rethinkdb-job-queue as the basis of our new render farm. The essential idea is that there will be a Kubernetes cluster where individual rendering tasks are controlled by a scheduler daemon using the queue.

As I see it, there would be two queues: one for Jobs, and one for Tasks within those Jobs. A Job might consist of a rendering step, a compositing step, and then running ffmpeg to create the final movie. Each of the steps is highly parallel: if your movie has 96 frames there's no reason not to run 96 machines, given that for a professional film project it may take several hours to render a single frame.
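To make the fan-out concrete, here is a minimal sketch in plain JavaScript. It does not use rethinkdb-job-queue at all; `expandJob`, the task id format, and the field names are hypothetical. It also assumes that compositing frame N depends only on rendering frame N, and that the ffmpeg step depends on every composited frame, which may not match a real recipe.

```javascript
// Hypothetical helper: expand one Job into per-frame Task records.
// Assumption: composite(frame N) depends only on render(frame N),
// and the final encode (ffmpeg) step depends on all composited frames.
function expandJob(jobId, frameCount) {
  const tasks = [];
  const compositeIds = [];
  for (let frame = 1; frame <= frameCount; frame++) {
    const renderId = `${jobId}:render:${frame}`;
    const compositeId = `${jobId}:composite:${frame}`;
    tasks.push({ id: renderId, step: 'render', frame, dependsOn: [] });
    tasks.push({ id: compositeId, step: 'composite', frame, dependsOn: [renderId] });
    compositeIds.push(compositeId);
  }
  // One encode task that waits for every composited frame.
  tasks.push({ id: `${jobId}:encode`, step: 'encode', dependsOn: compositeIds });
  return tasks;
}

const expandedTasks = expandJob('job-1', 96);
// The 96 render tasks have no dependencies, so all of them could start at once.
const runnableNow = expandedTasks.filter(t => t.dependsOn.length === 0);
```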

OK, so Queue.process() gets called when a Job is first submitted. The callback would then look at the recipe file for that Job and create a bunch of task objects. The tasks are organized in a dependency graph, so some tasks will be ready to run while others need to wait.
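The "ready versus waiting" split could be computed from the stored task records alone. The following is a rough sketch with made-up record shapes (`status`, `dependsOn`) and a hypothetical `findReadyTasks` helper, not anything provided by the queue library.

```javascript
// Returns the tasks that are still waiting and whose dependencies are all done.
function findReadyTasks(tasks) {
  const doneIds = new Set(
    tasks.filter(t => t.status === 'done').map(t => t.id)
  );
  return tasks.filter(
    t => t.status === 'waiting' && t.dependsOn.every(dep => doneIds.has(dep))
  );
}

// Hypothetical snapshot of a two-frame job partway through rendering.
const exampleTasks = [
  { id: 'render:1', status: 'done', dependsOn: [] },
  { id: 'render:2', status: 'running', dependsOn: [] },
  { id: 'composite:1', status: 'waiting', dependsOn: ['render:1'] },
  { id: 'composite:2', status: 'waiting', dependsOn: ['render:2'] },
  { id: 'encode', status: 'waiting', dependsOn: ['composite:1', 'composite:2'] },
];

const readyIds = findReadyTasks(exampleTasks).map(t => t.id);
```

In this snapshot only `composite:1` is ready, because `render:2` is still running and `encode` needs both composited frames.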

OK, so what happens to the Job while all this is going on? We want to put the Job into some sort of quiescent state: it's not done, but it's waiting for the tasks to complete. Whenever anything 'interesting' happens (like a task finishing) we want to process the job again and see if there's more work that needs to be done.
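One way to picture "process the job again" is a function that is re-run on every interesting event and derives the Job's state purely from its task records. This is only a sketch; `rollUpJobState` and the state names `waiting`, `done`, and `failed` are my own placeholders, not states defined by rethinkdb-job-queue.

```javascript
// Derive the Job state from its tasks. No in-memory state is kept between calls,
// so this can be re-run after any event (or after a scheduler restart).
function rollUpJobState(tasks) {
  if (tasks.some(t => t.status === 'failed')) return 'failed';
  if (tasks.length > 0 && tasks.every(t => t.status === 'done')) return 'done';
  return 'waiting'; // quiescent: not finished, nothing to conclude yet
}

const stillRendering = rollUpJobState([
  { id: 'render:1', status: 'done' },
  { id: 'render:2', status: 'running' },
]);

const finished = rollUpJobState([
  { id: 'render:1', status: 'done' },
  { id: 'render:2', status: 'done' },
]);
```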

The tasks are in a similar boat, except that what they are waiting for are external processes: essentially worker processes running in Docker containers, which signal when they are done running (most likely via a mutation to RethinkDB of some kind). So again, the tasks need to go to sleep and only get woken up when their external process completes.

One thing that's very important is the ability for the scheduler to go down without interrupting rendering jobs. Given how long it takes to render a frame, it would be bad if taking down the scheduler for maintenance meant abandoning many hours (and dollars) worth of work.

What that means is that when the scheduler comes up again, it needs to be able to revisit every job and task still in the queue, in order to see whether any of them completed while the scheduler was offline.
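The restart pass might look something like the sketch below. Everything here is assumed for illustration: `reconcile` is a hypothetical helper, and `workerResults` is a plain `Map` standing in for whatever records the workers wrote to the database while the scheduler was down.

```javascript
// storedTasks: task records as last persisted by the scheduler.
// workerResults: hypothetical Map of taskId -> 'done' | 'failed',
// representing completion signals written while the scheduler was offline.
function reconcile(storedTasks, workerResults) {
  return storedTasks.map(task => {
    if (task.status !== 'running') return task; // only in-flight tasks need checking
    const result = workerResults.get(task.id);
    // No result yet means the worker is presumably still rendering; leave it alone.
    return result ? { ...task, status: result } : task;
  });
}

const storedTasks = [
  { id: 'render:1', status: 'running' },
  { id: 'render:2', status: 'running' },
  { id: 'render:3', status: 'done' },
];
const workerResults = new Map([['render:1', 'done']]);

const reconciledStatuses = reconcile(storedTasks, workerResults).map(t => t.status);
```

After this pass `render:1` is marked done, `render:2` is left running, and no work already in progress is abandoned.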

So the key points are (a) no persistent in-memory state, and (b) jobs and tasks are neither I/O bound nor compute bound; they are mostly just waiting for stuff to happen on other machines. In such a case, I wouldn't have a task wait on a promise, but would rather use a state-machine-based approach, since I want the state to be durable.
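As a rough illustration of the state-machine idea, each task record could carry a `status` field, with every change going through an explicit transition table. The states and event names below are invented for this sketch, and the returned record would still have to be written back to the database to be durable.

```javascript
// Hypothetical transition table: current status -> event -> next status.
const TRANSITIONS = {
  waiting: { DEPENDENCIES_MET: 'ready' },
  ready: { WORKER_STARTED: 'running' },
  running: { WORKER_SUCCEEDED: 'done', WORKER_FAILED: 'failed' },
  done: {},
  failed: { RETRY: 'ready' },
};

// Pure function: returns a new record instead of mutating or awaiting anything.
// The caller is responsible for persisting the result.
function transition(task, event) {
  const next = TRANSITIONS[task.status]?.[event];
  if (!next) {
    throw new Error(`Invalid transition: ${task.status} + ${event}`);
  }
  return { ...task, status: next, updatedAt: new Date().toISOString() };
}

const started = transition({ id: 'render:1', status: 'ready' }, 'WORKER_STARTED');
const completed = transition(started, 'WORKER_SUCCEEDED');
```

Because nothing is held in a pending promise, a scheduler restart loses nothing: the next event simply picks up from whatever `status` was last stored.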

I think rethinkdb-job-queue can do all this, but I'm not entirely sure what the right approach is. For example, to put jobs to sleep I can set their delay values to an infinite time in the future, and then restore them later.

In my initial experiments, however, I noticed that once Queue.process() has been called for a job, it doesn't get called again, even if I never call next(). So I think my concept of the execution model must be wrong.
