Description
I use ParallelCluster/Slurm for chip-design EDA workloads. We need to run everything from small jobs (think simulations) on 1-core/8 GB machines up through 192-core/1.5 TB machines and beyond (physical design verification).
I also make heavy use of Spot for my smaller-machine jobs, the simulation regression runs. These run constantly; I keep multiple instance types available, and if ParallelCluster (APC) finds insufficient capacity for one type, it just moves on to the next.
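For context, that fallback behavior is what I'm relying on. A minimal sketch of a Spot queue that can fall back across instance types, assuming a ParallelCluster 3.3+ style config (the queue name, counts, and subnet ID are placeholders, not my real values):
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: sim-spot                      # placeholder queue name
      CapacityType: SPOT
      AllocationStrategy: lowest-price    # try the cheapest Spot pool first, fall back to the others
      ComputeResources:
        - Name: small-1core
          Instances:                      # any of these types can satisfy a node in this resource
            - InstanceType: m6a.large
            - InstanceType: m6i.large
            - InstanceType: m7i.large
          MinCount: 0
          MaxCount: 100                   # placeholder
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0      # placeholder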
I make extensive use of license-count tracking, because you don't want machines starting up without licenses available. That then requires all jobs that use a given license to be in the same cluster, which is manageable.
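For reference, the tracking itself is just Slurm's local license counting. A minimal sketch, assuming ParallelCluster 3.6+ where CustomSlurmSettings is available and Licenses is not on its deny-list (the license names and counts here are made up); jobs then request a license at submit time with sbatch -L:
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      # Appended to slurm.conf; jobs request counts with e.g. `sbatch -L vcs:1`
      - Licenses: "vcs:25,xcelium:10"     # hypothetical license names/counts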
But now I've got overlapping/redundant queues in multiple clusters. It's OK, just more to manage.
For example, I could have a "sim" cluster for all my jobs that run simulations, taking advantage of many Spot machines, and then a "pd" cluster for all my physical design jobs.
So now I have to bounce between the different clusters to see whether they are being utilized, how they are performing, etc. I also have to manage and train users to use the different clusters.
This is not something you need to do with LSF.
So, it would be nice if all my Slurm queues were in one cluster.
Can you remove the hard limit of 50 queues? Is there really something that prevents you from making this dynamic? Or could you set it to some comfortably large number, e.g. 250, so we don't hit it?
Here's an example of all the machines I presently have in my cluster. You can see that some of the machines I only request as Spot. I've had to comment out some machines because I would otherwise go over the 50-queue limit.
This is from the configuration file for Allan Carter's https://github.com/aws-samples/aws-eda-slurm-cluster deployment tool, which I rely on:
InstanceTypes: # can't have more than 50 instances
- r7a.medium # 0.07608 cpu=1 mem=8
- m6a.large: {UseOnDemand: false} # 0.0864 cpu=1 mem=8 hyper-threading turned off
- m6i.large: {UseOnDemand: false} # 0.096 cpu=1 mem=8 hyper-threading turned off
- m7i.large: {UseOnDemand: false} # 0.1008 cpu=1 mem=8 hyper-threading turned off
- r6a.large # 0.1134 cpu=1 mem=16 hyper-threading turned off
- r6i.large: {UseOnDemand: false} # 0.126 cpu=1 mem=16 hyper-threading turned off
- r7i.large: {UseOnDemand: false} # 0.1323 cpu=1 mem=16 hyper-threading turned off
# - m7a.large # 0.11592 cpu=2 mem=8
# - c7i.xlarge: {UseOnDemand: false} # 0.1785 cpu=2 mem=8 hyper-threading turned off
# - c6a.xlarge: {UseOnDemand: false} # 0.153 cpu=2 mem=8 hyper-threading turned off
- r7a.large # 0.15215 cpu=2 mem=16
- m6a.xlarge: {UseOnDemand: false} # 0.1728 cpu=2 mem=16 hyper-threading turned off
# - m6i.xlarge: {UseOnDemand: false} # 0.192 cpu=2 mem=16 hyper-threading turned off
- m7i.xlarge: {UseOnDemand: false} # 0.2016 cpu=2 mem=16 hyper-threading turned off
- r6a.xlarge # 0.2268 cpu=2 mem=32 hyperthreading turned off
- r6i.xlarge: {UseOnDemand: false} # 0.252 cpu=2 mem=32 hyperthreading turned off
- r7i.xlarge: {UseOnDemand: false} # 0.2646 cpu=2 mem=32 hyperthreading turned off
- m7a.xlarge # 0.23184 cpu=4 mem=16
- c6a.2xlarge: {UseOnDemand: false} # 0.306 cpu=4 mem=16 hyperthreading turned off
- c7i.2xlarge: {UseOnDemand: false} # 0.357 cpu=4 mem=16 hyperthreading turned off
- r7a.xlarge # 0.3043 cpu=4 mem=32
- m6a.2xlarge: {UseOnDemand: false} # 0.3456 cpu=4 mem=32 hyperthreading turned off
- m7i.2xlarge: {UseOnDemand: false} # 0.4032 cpu=4 mem=32 hyperthreading turned off
- r6a.2xlarge # 0.4536 cpu=4 mem=64 hyperthreading turned off
- r6i.2xlarge: {UseOnDemand: false} # 0.504 cpu=4 mem=64 hyperthreading turned off
- r7i.2xlarge: {UseOnDemand: false} # 0.5292 cpu=4 mem=64 hyperthreading turned off
# - c7a.2xlarge # 0.41056 cpu=8 mem=16
# - m7a.2xlarge # 0.46368 cpu=8 mem=32
# - c6a.4xlarge # 0.612 cpu=8 mem=32 hyperthreading turned off
# - c7i.4xlarge # 0.714 cpu=8 mem=32 hyperthreading turned off
# - r7a.2xlarge # 0.6086 cpu=8 mem=64
# - m6a.4xlarge # 0.6912 cpu=8 mem=64 hyper-threading turned off
# - m7i.4xlarge # 0.8064 cpu=8 mem=64 hyper-threading turned off
- r7i.4xlarge # 1.0548 cpu=8 mem=128 hyper-threading turned off
- r6a.4xlarge: {UseOnDemand: false} # 0.9072 cpu=8 mem=128 hyper-threading turned off
# - c7a.4xlarge # 0.82112 cpu=16 mem=32
# - m7a.4xlarge # 0.92736 cpu=16 mem=64
# - c7i.8xlarge # 1.428 cpu=16 mem=64 hyper-threading turned off
- r7a.4xlarge: {UseSpot: false} # 1.2172 cpu=16 mem=128
- r7i.8xlarge # 2.1168 cpu=16 mem=256 hyper-threading turned off.
- r6a.8xlarge: {UseOnDemand: false} # 1.8144 cpu=16 mem=256 hyper-threading turned off.
- r7i.12xlarge # 3.1752 cpu=24 mem=384
- c7a.8xlarge: {UseSpot: false} # 1.64224 cpu=32 mem=64
- m7a.8xlarge: {UseSpot: false} # 1.85472 cpu=32 mem=128
- r7a.8xlarge: {UseSpot: false} # 2.4344 cpu=32 mem=256
- r7i.16xlarge # 4.2336 cpu=32 mem=512
- r7a.12xlarge # 3.6516 cpu=48 mem=384
- c7a.16xlarge # 3.28448 cpu=64 mem=128
- r7a.16xlarge: {UseSpot: false} # 4.8688 cpu=64 mem=512
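To show why the list above runs into the limit: assuming the deployment tool ends up generating roughly one ParallelCluster queue per instance type (which is what the "can't have more than 50 instances" comment above implies), the generated Scheduling section grows like the sketch below, and 40-plus types leave almost no headroom under the 50-queue cap. Queue names, counts, and the subnet ID are placeholders.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: r7a-medium                    # illustrative: one queue per instance type
      CapacityType: SPOT
      ComputeResources:
        - Name: r7a-medium
          Instances:
            - InstanceType: r7a.medium
          MinCount: 0
          MaxCount: 10                    # placeholder
      Networking:
        SubnetIds: [subnet-0123456789abcdef0]   # placeholder
    - Name: r7a-4xlarge
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: r7a-4xlarge
          Instances:
            - InstanceType: r7a.4xlarge
          MinCount: 0
          MaxCount: 10
      Networking:
        SubnetIds: [subnet-0123456789abcdef0]
    # ...one entry per remaining instance type, which is where the 50-queue limit bites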