-
Notifications
You must be signed in to change notification settings - Fork 831
[Jobs][Consolidation] Fix signal file creation when using Postgres #7717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand.
sky/server/common.py
Outdated
| # if pg is used, so the warning might not correctly show up. This is | ||
| # ok for now since realistically, pg is only used with helm deployment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ok for now since realistically, pg is only used with helm deployment.
Wait, why? We still need the warning for a remote API server, I think. Do you just mean that it's okay to skip the warning in the kubernetes logs, since it will still show up on the client side?
We definitely still need the value of is_consolidation_mode to be correct, since it affects how many server workers are started. This comment makes it sound like it will be incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So its a little bit complicated. This is the e2e workflow:
- In the helm deployment, we call the cli
sky api startto start an API server.
skypilot/charts/skypilot/templates/api-deployment.yaml
Lines 215 to 217 in bcdfc27
if sky api start -h | grep -q -- "--foreground"; then exec sky api start {{ include "skypilot.apiArgs" . }} --foreground else - This cli calls into the
_start_api_serverfunction insidesky/server/common.py. - in this function, we are still in the "client side", meaning the env vars
ENV_VAR_IS_SKYPILOT_SERVERis not set. This indicates that the config is loaded as a client and thus the db connection string is ignored. This will make thejobs.utils.is_consolidation_modenot read from the config inside the pg. But this is okay as the value ofis_consolidation_modeis only used for printing a warning, not determining the number of workers.
skypilot/sky/skypilot_config.py
Lines 485 to 499 in bcdfc27
def reload_config(init_db: bool = False) -> None: internal_config_path = os.environ.get(ENV_VAR_SKYPILOT_CONFIG) if internal_config_path is not None: # {ENV_VAR_SKYPILOT_CONFIG} is used internally. # When this environment variable is set, the config loading # behavior is not defined in the public interface. # SkyPilot reserves the right to change the config loading behavior # at any time when this environment variable is set. _reload_config_from_internal_file(internal_config_path) return if os.environ.get(constants.ENV_VAR_IS_SKYPILOT_SERVER) is not None: _reload_config_as_server(init_db=init_db) else: _reload_config_as_client() - we set this env var in the function and start a new process for the API server.
Lines 566 to 579 in bcdfc27
if foreground: # Replaces the current process with the API server os.environ[constants.ENV_VAR_IS_SKYPILOT_SERVER] = 'true' _set_metrics_env_var(os.environ, metrics, deploy) if enable_basic_auth: os.environ[constants.ENV_VAR_ENABLE_BASIC_AUTH] = 'true' os.execvp(args[0], args) log_path = os.path.expanduser(constants.API_SERVER_LOGS) os.makedirs(os.path.dirname(log_path), exist_ok=True) # For spawn mode, copy the environ to avoid polluting the SDK process. server_env = os.environ.copy() server_env[constants.ENV_VAR_IS_SKYPILOT_SERVER] = 'true' - here, inside the new process, we calculate number of workers. This time, since the env vars is set, the config is loaded from pg and the value of
is_consolidation_modeis correct.
Lines 2063 to 2064 in bcdfc27
config = server_config.compute_server_config(cmd_args.deploy, max_db_connections)
What this PR do:
- Previously, there is two ways of starting API server:
sky api startor directly runningpython -m sky.server.server. - We want both ways to create the signal file on api server start, but we dont want it get manipulated twice. so inside
_start_api_serverwe pass in an extra argument--start-with-pythonto skip the file creation inside thesky/server/server.py. - However, as the
_start_api_serveris client side, it does not read config from pg. so it will not create a signal file. but it still passes the--start-with-pythonarg, so insidesky/server/server.pyit will not create signal file as well. - This pr remove the arg and lets the
sky/server/server.pyto check and create a signal file anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please lmk if that make sense to you and if you understand, i'll start to polish the comments :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, makes sense. Thanks for the explanation. I think the main thing is that it's confusing that this is run before the server is started, and that we may not have PG. We can also have more detail about how this is only used for the memory check warning.
If you have consolidation mode, there will also be a warning in the server logs if the memory is too low, I assume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, do you think we can remove this warning for now? This is indeed very confusing and there will be a similar log in API server anyway.
Lines 163 to 166 in cf381be
| logger.warning( | |
| 'SkyPilot API server will run in low resource mode because ' | |
| 'the available memory is less than ' | |
| f'{server_constants.MIN_AVAIL_MEM_GB}GB.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this message is valuable for local API server, where the user may see the message here but probably not the one in the API server logs.
We could skip the message entirely if postgres is used, since that's almost certainly a remote API server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revamped the comments. PTAL again ;)
sky/server/server.py
Outdated
| managed_job_utils.is_consolidation_mode(on_api_restart=True) | ||
| # Maybe touch the signal file on API server startup. Do it again here even | ||
| # if we already touched it in the sky/server/common.py::_start_api_server. | ||
| # This is because the above function call is in client side and when pg is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the above function call
Which? All of this code is server-side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you mean _start_api_server - but we should clarify that it's running on the server but outside of where the server process is started. Maybe we can avoid the terms "server side" and "client side" in this case, in favor of "within the skypilot API server process tree" or something like that.
|
/smoke-test --managed-jobs |
When using Postgres, the config needs to be load from the pg db which is only be done from the server side of code. Previously we take a look at the config from client side first and then create the signal file. Later we skip this step in server side.
When using pg, the config will be different from client side and server side. This PR adds back re-check in server side so it correctly creates the signal file.
Tested (run the relevant ones):
bash format.sh/smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)