feature request: Increase slurm queues/remove artificial 50 count limit. #6923

@gwolski

Description

I use ParallelCluster/Slurm for chip design EDA workloads. We need to run everything from small jobs (think simulations) on 8 GB/1-core machines up to 192-core/1.5 TB machines and beyond (physical design verification).

I also make heavy use of Spot for my smaller-machine jobs, the simulation regression runs. These run constantly, and I make multiple instance types available so that if ParallelCluster finds insufficient capacity for one type, it just moves on to the next.

I make extensive use of license-count tracking, because you don't want machines starting up without licenses available. That in turn requires all jobs that use a given license to be in the same cluster - manageable.
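
For reference, the license tracking here is just Slurm's built-in license counting - a minimal sketch, with a hypothetical "vcs" feature and placeholder counts/script names:

    # slurm.conf: licenses Slurm is allowed to hand out in this cluster
    # ("vcs" and the count of 10 are placeholders for real license features)
    Licenses=vcs:10

    # Jobs request seats, so Slurm won't start them (or power up nodes) without one:
    sbatch -L vcs:1 run_regression.sh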

But now I've got overlapping/redundant queues in multiple clusters. It's ok, just more to manage.

For example, I could have a "sim" cluster for all my jobs that run simulations - and take advantage of many Spot machines. Then I have a "pd" cluster for all my physical design jobs.

So now I have to bounce between the different clusters to see how they are being utilized/performing, etc. I also have to manage/train users to use the different clusters.

This is not something you need to do with LSF.

So it would be nice if all my Slurm queues were in one cluster.

Can you remove the hard limit of 50? Is there really something that prevents you from making this dynamic? Or can you set it to some nice big number, e.g. 250, so we don't hit it?

Here's an example of all the machines I presently have in my cluster. You can see that some of the machines I only request as Spot. I've had to comment out some machines because I'd go over the 50 queue limit.
This is from a specification file for Allan Carter's https://github.com/aws-samples/aws-eda-slurm-cluster deployment tool that I rely on:

      InstanceTypes:     # can't have more than 50 instances
        - r7a.medium                        # 0.07608 cpu=1 mem=8
        - m6a.large:   {UseOnDemand: false} # 0.0864  cpu=1 mem=8 hyper-threading turned off
        - m6i.large:   {UseOnDemand: false} # 0.096   cpu=1 mem=8 hyper-threading turned off
        - m7i.large:   {UseOnDemand: false} # 0.1008  cpu=1 mem=8 hyper-threading turned off
        - r6a.large                         # 0.1134  cpu=1 mem=16 hyper-threading turned off
        - r6i.large:   {UseOnDemand: false} # 0.126   cpu=1 mem=16 hyper-threading turned off
        - r7i.large:   {UseOnDemand: false} # 0.1323  cpu=1 mem=16 hyper-threading turned off
#        - m7a.large                         # 0.11592 cpu=2 mem=8
#        - c7i.xlarge: {UseOnDemand: false}  # 0.1785  cpu=2 mem=8 hyper-threading turned off
#        - c6a.xlarge: {UseOnDemand: false}  # 0.153   cpu=2 mem=8 hyper-threading turned off
        - r7a.large                         # 0.15215 cpu=2 mem=16
        - m6a.xlarge:  {UseOnDemand: false} # 0.1728  cpu=2 mem=16 hyper-threading turned off
#        - m6i.xlarge:  {UseOnDemand: false} # 0.192   cpu=2 mem=16 hyper-threading turned off
        - m7i.xlarge:  {UseOnDemand: false} # 0.2016  cpu=2 mem=16 hyper-threading turned off
        - r6a.xlarge                        # 0.2268  cpu=2 mem=32 hyperthreading turned off
        - r6i.xlarge:  {UseOnDemand: false} # 0.252   cpu=2 mem=32 hyperthreading turned off
        - r7i.xlarge:  {UseOnDemand: false} # 0.2646  cpu=2 mem=32 hyperthreading turned off
        - m7a.xlarge                        # 0.23184 cpu=4 mem=16
        - c6a.2xlarge: {UseOnDemand: false} # 0.306   cpu=4 mem=16 hyperthreading turned off
        - c7i.2xlarge: {UseOnDemand: false} # 0.357   cpu=4 mem=16 hyperthreading turned off
        - r7a.xlarge                        # 0.3043  cpu=4 mem=32
        - m6a.2xlarge: {UseOnDemand: false} # 0.3456  cpu=4 mem=32 hyperthreading turned off
        - m7i.2xlarge: {UseOnDemand: false} # 0.4032  cpu=4 mem=32 hyperthreading turned off
        - r6a.2xlarge                       # 0.4536  cpu=4 mem=64 hyperthreading turned off
        - r6i.2xlarge: {UseOnDemand: false} # 0.504   cpu=4 mem=64 hyperthreading turned off
        - r7i.2xlarge: {UseOnDemand: false} # 0.5292  cpu=4 mem=64 hyperthreading turned off
#        - c7a.2xlarge  # 0.41056 cpu=8 mem=16
#        - m7a.2xlarge  # 0.46368 cpu=8 mem=32
#        - c6a.4xlarge  # 0.612   cpu=8 mem=32 hyperthreading turned off
#        - c7i.4xlarge  # 0.714   cpu=8 mem=32 hyperthreading turned off
#        - r7a.2xlarge  # 0.6086  cpu=8 mem=64
#        - m6a.4xlarge  # 0.6912  cpu=8 mem=64  hyper-threading turned off
#        - m7i.4xlarge  # 0.8064  cpu=8 mem=64  hyper-threading turned off
        - r7i.4xlarge  # 1.0548  cpu=8 mem=128  hyper-threading turned off
        - r6a.4xlarge: {UseOnDemand: false} # 0.9072  cpu=8 mem=128  hyper-threading turned off
#        - c7a.4xlarge  # 0.82112 cpu=16 mem=32
#        - m7a.4xlarge  # 0.92736 cpu=16 mem=64
#        - c7i.8xlarge  # 1.428 cpu=16 mem=64 hyper-threading turned off
        - r7a.4xlarge: {UseSpot: false}  # 1.2172  cpu=16 mem=128
        - r7i.8xlarge                    # 2.1168  cpu=16 mem=256  hyper-threading turned off.
        - r6a.8xlarge: {UseOnDemand: false} # 1.8144  cpu=16 mem=256  hyper-threading turned off.
        - r7i.12xlarge                   # 3.1752  cpu=24 mem=384
        - c7a.8xlarge: {UseSpot: false}  # 1.64224 cpu=32 mem=64
        - m7a.8xlarge: {UseSpot: false}  # 1.85472 cpu=32 mem=128
        - r7a.8xlarge: {UseSpot: false}  # 2.4344  cpu=32 mem=256
        - r7i.16xlarge                   # 4.2336  cpu=32 mem=512
        - r7a.12xlarge                   # 3.6516  cpu=48 mem=384
        - c7a.16xlarge                   # 3.28448 cpu=64 mem=128
        - r7a.16xlarge: {UseSpot: false} # 4.8688  cpu=64 mem=512
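
As I understand it, each instance type (and purchase option) above ends up as its own Slurm queue/compute resource in the generated ParallelCluster config, which is how the 50 limit gets hit. A minimal sketch of what one generated entry looks like - the queue name, counts, and subnet ID are placeholders, not my real config:

    Scheduling:
      Scheduler: slurm
      SlurmQueues:
        - Name: r7a-medium-spot          # one queue per instance type / purchase option
          CapacityType: SPOT
          Networking:
            SubnetIds:
              - subnet-xxxxxxxx          # placeholder
          ComputeResources:
            - Name: r7a-medium
              InstanceType: r7a.medium
              MinCount: 0
              MaxCount: 10               # placeholder count
        # ... repeated for every instance type above, which is what runs into the 50 limit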
