fix(vllm_performance): Avoid multiple experiments using the same kubernetes deployment at the same time #268
Conversation
```python
if (
    model not in self.model_already_downloaded
    and model not in self.deployments_to_wait_for
):
    self.deployments_to_wait_for[model] = DeploymentWaiter(
        identifier=identifier
    )
    return True
return False
```
By inverting the if you can reduce nesting and improve readability:
```suggestion
if (
    model in self.model_already_downloaded
    or model in self.deployments_to_wait_for
):
    return False
self.deployments_to_wait_for[model] = DeploymentWaiter(identifier=identifier)
return True
```
```python
:param check_interval: wait interval
:param timeout: timeout
```
It's probably worth mentioning here that these values are in seconds
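For instance (a sketch only; the method name and default values here are placeholders, not the PR's actual signature):

```python
async def wait_for_deployment(
    self,
    check_interval: float = 5.0,  # hypothetical default
    timeout: float = 600.0,       # hypothetical default
) -> None:
    """
    Wait for the deployment to become ready.

    :param check_interval: wait interval between readiness checks, in seconds
    :param timeout: maximum time to wait for readiness, in seconds
    """
```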
```python
    return True
return False

async def wait(self, request_id: str, identifier: str, model: str) -> None:
```
The `identifier` variable name is a bit too generic - what does this identifier represent?
```python
console.put.remote(
    message=RichConsoleSpinnerMessage(
        id=request_id,
        label=f"({request_id}) Waiting for conflicting K8s deployment ({waiter.identifier}) to be started",
        state="start",
    )
)
```
Is conflicting the right term here?
| state="start", | ||
| ) | ||
| ) | ||
| await waiter.wait_event.wait() |
Maybe `wait_event` should be called `models_downloaded_updated_event` so that it reads `waiter.models_downloaded_updated_event.wait()`, or something along those lines (basically, let's make `wait_event` more descriptive).
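A sketch of that rename (any `DeploymentWaiter` fields beyond what the diff shows are assumptions):

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class DeploymentWaiter:
    identifier: str
    # renamed from wait_event so the name says what the event signals
    models_downloaded_updated_event: asyncio.Event = field(default_factory=asyncio.Event)

# the call site then reads almost like a sentence:
# await waiter.models_downloaded_updated_event.wait()
```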
```python
if len(self.free_environments) > 0:
    # There are unused environments, let's evict one

    # Gets the oldest env in the dict
```
```suggestion
# Gets the oldest env in the list
```
```python
    # There are unused environments, let's evict one

    # Gets the oldest env in the dict
    venv_to_evict = self.free_environments[0]
```
What does "venv" mean? Just environment?
I'd suggest calling this oldest_free_environment, oldest_unused_environment or oldest_environment_not_in_use to be more descriptive
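A sketch of the rename (using `pop(0)` to both select and remove the eviction candidate is an assumption about the surrounding logic; the diff only shows the indexing):

```python
# free_environments is assumed to be kept in creation order,
# so index 0 is the oldest entry
oldest_free_environment = self.free_environments.pop(0)
```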
```python
except ApiException as e:
    logger.error(f"Error deleting deployment or service {e}")
```
Here you delete everything in a single try block and don't raise anything in case of errors - is that fine?
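If it isn't, one alternative is a per-resource try/except that logs each failure and re-raises at the end. A hedged sketch (the function, resource names, and API objects are assumptions; only `ApiException` and the log message appear in the diff):

```python
import logging

from kubernetes import client
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)

def delete_environment(deployment_name: str, service_name: str, namespace: str) -> None:
    apps_api = client.AppsV1Api()
    core_api = client.CoreV1Api()
    errors: list[ApiException] = []

    # delete each resource independently so one failure doesn't skip the others
    try:
        apps_api.delete_namespaced_deployment(name=deployment_name, namespace=namespace)
    except ApiException as e:
        logger.error(f"Error deleting deployment {deployment_name}: {e}")
        errors.append(e)

    try:
        core_api.delete_namespaced_service(name=service_name, namespace=namespace)
    except ApiException as e:
        logger.error(f"Error deleting service {service_name}: {e}")
        errors.append(e)

    # surface failures to the caller instead of swallowing them
    if errors:
        raise errors[0]
```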
```python
else:
    # No room for creating a new environment
    logger.debug(
        f"There are already {self.max_concurrent} actively in use, and I can't create a new one"
    )
    return None
```
Inverting the if would remove a lot of nesting.
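A sketch of the early-return form (the guard condition, including the `environments_in_use` name, is an assumption about the branch this `else` belongs to):

```python
# bail out early when at capacity, so environment creation
# continues at the top level, one indent less
if len(self.environments_in_use) >= self.max_concurrent:
    # No room for creating a new environment
    logger.debug(
        f"There are already {self.max_concurrent} actively in use, and I can't create a new one"
    )
    return None
```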
| """ | ||
| Report test completion | ||
| :param definition: environment definition | ||
| :param wipe: flag to indicate the environment iis to be completely removed and not freed for later use |
```suggestion
:param wipe: flag to indicate the environment is to be completely removed and not freed for later use
```
I'd call this something like `reclaim_on_completion`.
Also, there is an issue, not related to this change, where (in the case of one max environment):

[two screenshots illustrating the issue]
In case of BS>1, if the entity samples require the same deployment type and there is one already running, they will all use it at the same time. This spoils the test results, as the experiments interfere with each other.
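Conceptually, the fix makes the first experiment that needs a model the "leader" that triggers the download, while the others wait on an event. A minimal sketch of that pattern (`DeploymentWaiter`, `deployments_to_wait_for` and `model_already_downloaded` appear in the diff; everything else here, including the class and method names, is an assumption):

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class DeploymentWaiter:
    identifier: str
    wait_event: asyncio.Event = field(default_factory=asyncio.Event)

class DeploymentCoordinator:
    def __init__(self) -> None:
        self.model_already_downloaded: set[str] = set()
        self.deployments_to_wait_for: dict[str, DeploymentWaiter] = {}

    def try_become_leader(self, model: str, identifier: str) -> bool:
        # the first caller for a model registers a waiter and becomes leader
        if model in self.model_already_downloaded or model in self.deployments_to_wait_for:
            return False
        self.deployments_to_wait_for[model] = DeploymentWaiter(identifier=identifier)
        return True

    async def wait_for_leader(self, model: str) -> None:
        # followers block here until the leader signals completion
        waiter = self.deployments_to_wait_for.get(model)
        if waiter is not None:
            await waiter.wait_event.wait()

    def leader_done(self, model: str) -> None:
        # leader marks the model downloaded and wakes every follower
        self.model_already_downloaded.add(model)
        waiter = self.deployments_to_wait_for.pop(model, None)
        if waiter is not None:
            waiter.wait_event.set()
```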
This PR makes the following changes:
I have done the following tests:
- Example space with 16 entities all requesting the same K8s deployment
- Example space with 16 entities requesting 4 K8s deployments
- Example space with 16 entities requesting 4 K8s deployments using two different models
- Sample random walk

All tests were successful.
I have also tested artificially failing one deployment while downloading a model for the first time, with other deployments waiting. A new leader kicks in and the process continues.
@michael-johnston and/or @AlessandroPomponio please try on your environment.