fix(vllm_performance): Avoid multiple experiments using the same kubernetes deployment at the same time #268

christian-pinto · 2025-11-28T15:39:50Z

In case a of BS>1 if the entity samples require the same deployment type and there is one running, they will all use it at the same time. This spoils the test results as the experiments interfere with each other.

This PR makes the following changes:

Experiments using the same deployment can run in parallel only if there are as many parallel K8s environments available, either already existing or creating new ones.
K8s deployments using the same model, and starting at the same time, will elect one as the leader (we first one to get in) let it start and then continue with their processing. This is to avoid multiple vLLM instances downloading the same model model in the shared HF cache. This mechanism is to avoid corrupting the cache.

I have done the following tests:

Operation: Grouped Random walk on all entities
Actuator config: out of cluster with 3 max parallel environments (Why 3? Why not?)
Space: 16 entities in total

Batch size	K8s deployment
2	all same deployment
3	all same deployment
6	all same deployment
2	4 deployment types
3	4 deployment types
6	4 deployment types

All tests successful.

I have also tested artificially failing one deployment while downloading a model for the first time and other deployments waiting. New Leader kicks in and the process continues

@michael-johnston and/or @AlessandroPomponio please try on your environment.

Example space with 16 entities all requesting the same K8s deployment

entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.0-2b-instruct
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024]
  - identifier: "request_rate"
    propertyDomain:
      values: [1, 2, 4, 8]
  - identifier: n_cpus
    propertyDomain:
      values: [2]
  - identifier: memory
    propertyDomain:
      values: ["128Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [256]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "num_prompts"
    propertyDomain:
      values: [1, 2, 3, 4]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments_2_entities_same_deployment

Example space with 16 entities requesting 4 K8s deployment

entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.0-2b-instruct
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024]
  - identifier: "request_rate"
    propertyDomain:
      values: [1]
  - identifier: n_cpus
    propertyDomain:
      values: [2]
  - identifier: memory
    propertyDomain:
      values: ["128Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [32, 64 , 128, 256]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "num_prompts"
    propertyDomain:
      values: [1, 2, 3, 4]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments_2_entities_same_deployment

Example space with 16 entities requesting 4 K8s deployment using two different models

entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.0-2b-instruct
        - ibm-granite/granite-3.0-8b-instruct
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024]
  - identifier: "request_rate"
    propertyDomain:
      values: [1]
  - identifier: n_cpus
    propertyDomain:
      values: [2]
  - identifier: memory
    propertyDomain:
      values: ["128Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [32, 64]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "num_prompts"
    propertyDomain:
      values: [1, 2, 3, 4]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments_2_entities_same_deployment

sample random walk

metadata:
  name: randomwalk-grouped-vllm-performance-full
spaces:
  - your-space
actuatorConfigurationIdentifiers:
  - your-actuator-config

operation:
  module:
    moduleClass: RandomWalk
  parameters:
    numberEntities: all
    batchSize: 2
    singleMeasurement: False
    samplerConfig:
      mode: 'sequentialgrouped'
      samplerType: 'generator'
      grouping: #A unique combination of these properties is a new vLLM deployment
        - model
        - image
        - memory
        - max_batch_tokens
        - max_num_seq
        - n_gpus
        - gpu_type
        - n_cpus

…ments in parallel Signed-off-by: Christian Pinto <[email protected]>

…onments Signed-off-by: Christian Pinto <[email protected]>

…oyment Signed-off-by: Christian Pinto <[email protected]>

DRL-NextGen · 2025-11-28T15:55:33Z

Checks Summary

Last run: 2025-11-28T15:55:31.673Z

Code Risk Analyzer vulnerability scan found 2 vulnerabilities:

Severity	Identifier	Package	Details	Fix
🔷Medium	CVE-2025-50181	urllib3	urllib3 redirects are not disabled when retries are disabled on PoolManager instantiation GHSA-pq67-6m6q-mj2v urllib3:2.3.0->kubernetes:34.1.0	2.5.0
🔷Medium	CVE-2025-50182	urllib3	urllib3 does not control redirects in browsers and Node.js GHSA-48p4-8xcf-vxj5 urllib3:2.3.0->kubernetes:34.1.0	2.5.0

AlessandroPomponio · 2025-11-28T16:19:24Z

plugins/actuators/vllm_performance/ado_actuators/vllm_performance/deployment_management.py

+        if (
+            model not in self.model_already_downloaded
+            and model not in self.deployments_to_wait_for
+        ):
+            self.deployments_to_wait_for[model] = DeploymentWaiter(
+                identifier=identifier
+            )
+            return True
+        return False


By inverting the if you can reduce nesting and improve readability:

Suggested change

if (

model not in self.model_already_downloaded

and model not in self.deployments_to_wait_for

):

self.deployments_to_wait_for[model] = DeploymentWaiter(

identifier=identifier

)

return True

return False

if (

model in self.model_already_downloaded

or model in self.deployments_to_wait_for

):

return False

self.deployments_to_wait_for[model] = DeploymentWaiter(identifier=identifier)

return True

AlessandroPomponio · 2025-11-28T16:19:46Z

plugins/actuators/vllm_performance/ado_actuators/vllm_performance/k8s/create_environment.py

+    :param check_interval: wait interval
+    :param timeout: timeout


It's probably worth mentioning here that these values are in seconds

AlessandroPomponio · 2025-11-28T16:21:37Z

plugins/actuators/vllm_performance/ado_actuators/vllm_performance/deployment_management.py

+            return True
+        return False
+
+    async def wait(self, request_id: str, identifier: str, model: str) -> None:


The identifier variable name is a bit too generic - what is represented by this identifier?

AlessandroPomponio · 2025-11-28T16:23:27Z

plugins/actuators/vllm_performance/ado_actuators/vllm_performance/deployment_management.py

+                console.put.remote(
+                    message=RichConsoleSpinnerMessage(
+                        id=request_id,
+                        label=f"({request_id}) Waiting for conflicting K8s deployment ({waiter.identifier}) to be started",


Is conflicting the right term here?

AlessandroPomponio · 2025-11-28T16:25:14Z

plugins/actuators/vllm_performance/ado_actuators/vllm_performance/deployment_management.py

+                        state="start",
+                    )
+                )
+                await waiter.wait_event.wait()


Maybe wait_event should be called models_downloaded_updated_event so that it reads waiter.models_downloaded_updated_event.wait() or something along the lines (basically, let's make wait_event more descriptive)

AlessandroPomponio · 2025-11-28T16:39:08Z