
Conversation

@christian-pinto (Member) commented Nov 28, 2025

In the case of BS>1, if the entity samples require the same deployment type and one is already running, they will all use it at the same time. This spoils the test results, as the experiments interfere with each other.

This PR makes the following changes:

  • Experiments using the same deployment type can run in parallel only if there are as many parallel K8s environments available, either by reusing existing ones or by creating new ones.
  • K8s deployments using the same model, and starting at the same time, elect one as the leader (the first one to get in), let it start, and then continue with their processing (see the sketch after this list). This avoids multiple vLLM instances downloading the same model into the shared HF cache at once, which could corrupt the cache.
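
Roughly, the leader-election mechanism works like this (a minimal sketch only, written against the `DeploymentWaiter` / `deployments_to_wait_for` / `model_already_downloaded` names visible in the review snippets below; the coordinator class and its method names are assumptions):

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class DeploymentWaiter:
    identifier: str  # deployment that was elected leader for this model
    wait_event: asyncio.Event = field(default_factory=asyncio.Event)


class ModelDownloadCoordinator:
    """Hypothetical coordinator; field names follow the diff below."""

    def __init__(self) -> None:
        self.model_already_downloaded: set[str] = set()
        self.deployments_to_wait_for: dict[str, DeploymentWaiter] = {}

    def register_leader(self, identifier: str, model: str) -> bool:
        # The first deployment asking for a model not yet cached becomes the leader
        if model in self.model_already_downloaded or model in self.deployments_to_wait_for:
            return False
        self.deployments_to_wait_for[model] = DeploymentWaiter(identifier=identifier)
        return True

    async def wait_for_leader(self, model: str) -> None:
        # Followers block here until the leader reports that the model is in the HF cache
        waiter = self.deployments_to_wait_for.get(model)
        if waiter is not None:
            await waiter.wait_event.wait()

    def leader_finished(self, model: str) -> None:
        # The leader marks the model as downloaded and releases all waiting deployments
        self.model_already_downloaded.add(model)
        waiter = self.deployments_to_wait_for.pop(model, None)
        if waiter is not None:
            waiter.wait_event.set()
```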

I have done the following tests:

  • Operation: Grouped Random walk on all entities
  • Actuator config: out of cluster with 3 max parallel environments (Why 3? Why not?)
  • Space: 16 entities in total
| Batch size | K8s deployment |
| --- | --- |
| 2 | all same deployment |
| 3 | all same deployment |
| 6 | all same deployment |
| 2 | 4 deployment types |
| 3 | 4 deployment types |
| 6 | 4 deployment types |

All tests successful.

I have also tested artificially failing one deployment while it was downloading a model for the first time, with other deployments waiting on it. A new leader kicks in and the process continues.

@michael-johnston and/or @AlessandroPomponio please try on your environment.

Example space with 16 entities all requesting the same K8s deployment

entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.0-2b-instruct
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024]
  - identifier: "request_rate"
    propertyDomain:
      values: [1, 2, 4, 8]
  - identifier: n_cpus
    propertyDomain:
      values: [2]
  - identifier: memory
    propertyDomain:
      values: ["128Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [256]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "num_prompts"
    propertyDomain:
      values: [1, 2, 3, 4]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments_2_entities_same_deployment

Example space with 16 entities requesting 4 K8s deployments

entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.0-2b-instruct
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024]
  - identifier: "request_rate"
    propertyDomain:
      values: [1]
  - identifier: n_cpus
    propertyDomain:
      values: [2]
  - identifier: memory
    propertyDomain:
      values: ["128Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [32, 64, 128, 256]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "num_prompts"
    propertyDomain:
      values: [1, 2, 3, 4]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments_2_entities_same_deployment

Example space with 16 entities requesting 4 K8s deployments using two different models

entitySpace:
  - identifier: model
    propertyDomain:
      values:
        - ibm-granite/granite-3.0-2b-instruct
        - ibm-granite/granite-3.0-8b-instruct
  - identifier: "number_input_tokens"
    propertyDomain:
      values: [1024]
  - identifier: "request_rate"
    propertyDomain:
      values: [1]
  - identifier: n_cpus
    propertyDomain:
      values: [2]
  - identifier: memory
    propertyDomain:
      values: ["128Gi"]
  - identifier: "max_batch_tokens"
    propertyDomain:
      values: [8192]
  - identifier: "max_num_seq"
    propertyDomain:
      values: [32, 64]
  - identifier: "n_gpus"
    propertyDomain:
      values: [1]
  - identifier: "num_prompts"
    propertyDomain:
      values: [1, 2, 3, 4]
  - identifier: "gpu_type"
    propertyDomain:
      values: ["NVIDIA-A100-80GB-PCIe"]
experiments:
  - actuatorIdentifier: vllm_performance
    experimentIdentifier: test-deployment-v1
metadata:
  description: A space of vllm deployment configurations
  name: vllm_deployments_2_entities_same_deployment

Sample grouped random walk operation

metadata:
  name: randomwalk-grouped-vllm-performance-full
spaces:
  - your-space
actuatorConfigurationIdentifiers:
  - your-actuator-config

operation:
  module:
    moduleClass: RandomWalk
  parameters:
    numberEntities: all
    batchSize: 2
    singleMeasurement: False
    samplerConfig:
      mode: 'sequentialgrouped'
      samplerType: 'generator'
      grouping: # A unique combination of these properties is a new vLLM deployment (see the grouping sketch below)
        - model
        - image
        - memory
        - max_batch_tokens
        - max_num_seq
        - n_gpus
        - gpu_type
        - n_cpus
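
To illustrate what the grouping means in practice, here is a small sketch (the `group_entities` helper and the dict-based entity representation are illustrative assumptions, not the sampler's actual API): a unique combination of the listed properties maps to exactly one vLLM deployment, and all entities sharing that combination can reuse it.

```python
from collections import defaultdict

# Properties whose unique combination corresponds to one vLLM deployment
GROUPING = ["model", "image", "memory", "max_batch_tokens",
            "max_num_seq", "n_gpus", "gpu_type", "n_cpus"]


def group_entities(entities: list[dict]) -> dict[tuple, list[dict]]:
    """Group entity samples so each group can share a single deployment."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for entity in entities:
        key = tuple(entity.get(prop) for prop in GROUPING)
        groups[key].append(entity)
    return groups


# Example: two entities differing only in request_rate fall into the same group
entities = [
    {"model": "ibm-granite/granite-3.0-2b-instruct", "n_gpus": 1, "request_rate": 1},
    {"model": "ibm-granite/granite-3.0-2b-instruct", "n_gpus": 1, "request_rate": 2},
]
assert len(group_entities(entities)) == 1
```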

@DRL-NextGen (Member)

Checks Summary

Last run: 2025-11-28T15:55:31.673Z

Code Risk Analyzer vulnerability scan found 2 vulnerabilities:

| Severity | Identifier | Package | Details | Fix |
| --- | --- | --- | --- | --- |
| 🔷 Medium | CVE-2025-50181 | urllib3 (urllib3:2.3.0 -> kubernetes:34.1.0) | urllib3 redirects are not disabled when retries are disabled on PoolManager instantiation (GHSA-pq67-6m6q-mj2v) | 2.5.0 |
| 🔷 Medium | CVE-2025-50182 | urllib3 (urllib3:2.3.0 -> kubernetes:34.1.0) | urllib3 does not control redirects in browsers and Node.js (GHSA-48p4-8xcf-vxj5) | 2.5.0 |

Comment on lines +26 to +34
if (
    model not in self.model_already_downloaded
    and model not in self.deployments_to_wait_for
):
    self.deployments_to_wait_for[model] = DeploymentWaiter(
        identifier=identifier
    )
    return True
return False

By inverting the if you can reduce nesting and improve readability:

Suggested change
if (
    model not in self.model_already_downloaded
    and model not in self.deployments_to_wait_for
):
    self.deployments_to_wait_for[model] = DeploymentWaiter(
        identifier=identifier
    )
    return True
return False

if (
    model in self.model_already_downloaded
    or model in self.deployments_to_wait_for
):
    return False
self.deployments_to_wait_for[model] = DeploymentWaiter(identifier=identifier)
return True

Comment on lines +69 to +70
:param check_interval: wait interval
:param timeout: timeout

It's probably worth mentioning here that these values are in seconds

return True
return False

async def wait(self, request_id: str, identifier: str, model: str) -> None:

The identifier variable name is a bit too generic - what is represented by this identifier?

console.put.remote(
    message=RichConsoleSpinnerMessage(
        id=request_id,
        label=f"({request_id}) Waiting for conflicting K8s deployment ({waiter.identifier}) to be started",

Is conflicting the right term here?

state="start",
)
)
await waiter.wait_event.wait()

Maybe wait_event should be called models_downloaded_updated_event so that it reads waiter.models_downloaded_updated_event.wait(), or something along those lines (basically, let's make wait_event more descriptive).
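
For example, the waiter could expose the event under a more descriptive name (sketch only; the dataclass layout is an assumption based on the snippets in this PR):

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class DeploymentWaiter:
    identifier: str
    # Renamed from wait_event so call sites read naturally:
    #   await waiter.model_downloaded_event.wait()
    model_downloaded_event: asyncio.Event = field(default_factory=asyncio.Event)
```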

if len(self.free_environments) > 0:
    # There are unused environments, let's evict one

    # Gets the oldest env in the dict

Suggested change
# Gets the oldest env in the dict
# Gets the oldest env in the list

# There are unused environments, let's evict one

# Gets the oldest env in the dict
venv_to_evict = self.free_environments[0]

What does "venv" mean? Just environment?
I'd suggest calling this oldest_free_environment, oldest_unused_environment or oldest_environment_not_in_use to be more descriptive
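
Applied to the snippet above, the rename could look like this (sketch only; the removal and teardown calls are assumptions about the surrounding code):

```python
if len(self.free_environments) > 0:
    # There are unused environments, let's evict one.
    # free_environments is ordered, so index 0 is the oldest entry.
    oldest_free_environment = self.free_environments[0]
    self.free_environments.remove(oldest_free_environment)
    self._tear_down_environment(oldest_free_environment)  # hypothetical teardown helper
```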

Comment on lines +163 to +164
except ApiException as e:
    logger.error(f"Error deleting deployment or service {e}")

Here you delete everything in a single try and do not raise anything in case of errors - is that fine?
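
If partial failures matter, one option is to delete each resource in its own try/except and surface the errors afterwards; a sketch, assuming the kubernetes client objects (apps_v1, core_v1) and the deployment/service names used elsewhere in the actuator:

```python
from kubernetes.client.exceptions import ApiException

errors: list[ApiException] = []

try:
    # Delete the vLLM deployment first
    apps_v1.delete_namespaced_deployment(name=deployment_name, namespace=namespace)
except ApiException as e:
    logger.error(f"Error deleting deployment {deployment_name}: {e}")
    errors.append(e)

try:
    # Delete the matching service independently, even if the deployment deletion failed
    core_v1.delete_namespaced_service(name=service_name, namespace=namespace)
except ApiException as e:
    logger.error(f"Error deleting service {service_name}: {e}")
    errors.append(e)

if errors:
    # Surface the failure instead of silently continuing
    raise errors[0]
```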

Comment on lines +168 to 173
else:
    # No room for creating a new environment
    logger.debug(
        f"There are already {self.max_concurrent} actively in use, and I can't create a new one"
    )
    return None

Inverting the if would remove a lot of nesting.

"""
Report test completion
:param definition: environment definition
:param wipe: flag to indicate the environment iis to be completely removed and not freed for later use

Suggested change
:param wipe: flag to indicate the environment iis to be completely removed and not freed for later use
:param wipe: flag to indicate the environment is to be completely removed and not freed for later use

I'd call this something like reclaim_on_completion
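
For example, the docstring and signature could become (sketch only; the method name and remaining parameters are assumptions):

```python
def report_test_completion(self, definition, reclaim_on_completion: bool = False) -> None:
    """
    Report test completion.

    :param definition: environment definition
    :param reclaim_on_completion: if True, the environment is completely removed
        and not freed for later use
    """
    ...
```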

@michael-johnston (Member) commented Dec 1, 2025

I see this behaviour with 2 environments and batch size 3, using the "Example space with 16 entities" space with only one model.

The leader starts the deployment, the other two wait (they all need the same deployment).
Screenshot 2025-12-01 at 21 29 11

They then all appear to execute concurrently.

Screenshot 2025-12-01 at 21 30 35

Two things:

  • Since I have max_environments=2, I expected two of the three to start creating environments
  • In the case of one, I did not expect them all to report they were running (is this a logging bug?)

@michael-johnston (Member) commented Dec 1, 2025

Also there is an issue, not related to this change, where (in the case of one max environment):

  • you lose connection to the cluster while the deployment is spinning up and the code is waiting
  • the max retries for checking the deployment is exceeded, raising K8SConnectionError
  • however, the deployment cannot be destroyed as there is no connection
  • the experiment is marked as invalid -> new experiments queue
  • the connection comes back, however nothing can proceed as there is one "stray" deployment that will never be garbage collected.
