Collaborator

@cblmemo cblmemo commented Oct 23, 2025

When using Postgres, the config needs to be loaded from the pg db, which can only be done from the server side of the code. Previously, we looked at the config on the client side first and created the signal file there; we then skipped this step on the server side.

With pg, the config seen on the client side differs from the one seen on the server side. This PR adds back the re-check on the server side so that the signal file is created correctly.
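
A minimal sketch of the server-side re-check described above, not SkyPilot's actual code. The function names, the `consolidation_mode` key, and the file path are all hypothetical; the only point is that the server decides from the config it loaded itself, and that touching the file a second time is harmless.

```python
import pathlib
import tempfile


def touch_signal_file(signal_path: pathlib.Path) -> None:
    """Create the signal file if it is missing; a repeated call is a no-op."""
    signal_path.parent.mkdir(parents=True, exist_ok=True)
    signal_path.touch(exist_ok=True)


def recheck_on_server(server_config: dict, signal_path: pathlib.Path) -> bool:
    """Hypothetical re-check that uses the config the server process loaded."""
    # 'consolidation_mode' is an illustrative key, not SkyPilot's real schema.
    enabled = bool(server_config.get('consolidation_mode', False))
    if enabled:
        touch_signal_file(signal_path)
    return enabled


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / 'signals' / 'consolidation'
        recheck_on_server({'consolidation_mode': True}, path)
        # Calling it again (e.g. launcher and server both check) is safe.
        recheck_on_server({'consolidation_mode': True}, path)
        print(path.exists())
```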

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@cblmemo cblmemo requested review from aylei and cg505 October 23, 2025 17:16
Collaborator

@cg505 cg505 left a comment

I do not understand.

Comment on lines 543 to 544
# if pg is used, so the warning might not correctly show up. This is
# ok for now since realistically, pg is only used with helm deployment.
Collaborator

This is ok for now since realistically, pg is only used with helm deployment.

Wait, why? We still need the warning for a remote API server, I think. Do you just mean that it's okay to skip the warning in the kubernetes logs, since it will still show up on the client side?

We definitely still need the value of is_consolidation_mode to be correct, since it affects how many server workers are started. This comment makes it sound like it will be incorrect.

Collaborator Author

So it's a little bit complicated. This is the e2e workflow:

  1. In the helm deployment, we call the CLI sky api start to start an API server.
    if sky api start -h | grep -q -- "--foreground"; then
      exec sky api start {{ include "skypilot.apiArgs" . }} --foreground
    else
  2. This CLI calls into the _start_api_server function inside sky/server/common.py.
  3. In this function, we are still on the "client side", meaning the env var ENV_VAR_IS_SKYPILOT_SERVER is not set. The config is therefore loaded as a client and the db connection string is ignored, so jobs.utils.is_consolidation_mode does not read the config stored in pg. This is okay because, at this point, the value of is_consolidation_mode is only used for printing a warning, not for determining the number of workers.
    def reload_config(init_db: bool = False) -> None:
        internal_config_path = os.environ.get(ENV_VAR_SKYPILOT_CONFIG)
        if internal_config_path is not None:
            # {ENV_VAR_SKYPILOT_CONFIG} is used internally.
            # When this environment variable is set, the config loading
            # behavior is not defined in the public interface.
            # SkyPilot reserves the right to change the config loading behavior
            # at any time when this environment variable is set.
            _reload_config_from_internal_file(internal_config_path)
            return
        if os.environ.get(constants.ENV_VAR_IS_SKYPILOT_SERVER) is not None:
            _reload_config_as_server(init_db=init_db)
        else:
            _reload_config_as_client()
  4. We set this env var in the function and start a new process for the API server.
    if foreground:
        # Replaces the current process with the API server
        os.environ[constants.ENV_VAR_IS_SKYPILOT_SERVER] = 'true'
        _set_metrics_env_var(os.environ, metrics, deploy)
        if enable_basic_auth:
            os.environ[constants.ENV_VAR_ENABLE_BASIC_AUTH] = 'true'
        os.execvp(args[0], args)
    log_path = os.path.expanduser(constants.API_SERVER_LOGS)
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    # For spawn mode, copy the environ to avoid polluting the SDK process.
    server_env = os.environ.copy()
    server_env[constants.ENV_VAR_IS_SKYPILOT_SERVER] = 'true'
  5. Here, inside the new process, we calculate the number of workers. This time, since the env var is set, the config is loaded from pg and the value of is_consolidation_mode is correct.

    skypilot/sky/server/server.py

    Lines 2063 to 2064 in bcdfc27

    config = server_config.compute_server_config(cmd_args.deploy,
    max_db_connections)
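
The env-var switch in the workflow above can be sketched roughly as follows. This is an illustration under assumptions, not the SkyPilot implementation: `IS_SERVER_ENV_VAR` stands in for constants.ENV_VAR_IS_SKYPILOT_SERVER, and `load_consolidation_flag` and `build_server_env` are hypothetical helpers. It shows why the launcher and the spawned server can disagree about the same setting.

```python
import os
from typing import Dict, Mapping

# Hypothetical stand-in for constants.ENV_VAR_IS_SKYPILOT_SERVER.
IS_SERVER_ENV_VAR = 'EXAMPLE_IS_SERVER'


def load_consolidation_flag(env: Mapping[str, str], local_value: bool,
                            db_value: bool) -> bool:
    """Return the flag as seen by a process running with the given env."""
    if env.get(IS_SERVER_ENV_VAR) is not None:
        # Server view: the config comes from the database.
        return db_value
    # Client view: the db connection string is ignored.
    return local_value


def build_server_env(parent_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy the environment for the child so the parent is not polluted."""
    server_env = dict(parent_env)
    server_env[IS_SERVER_ENV_VAR] = 'true'
    return server_env


if __name__ == '__main__':
    parent_env = {
        k: v for k, v in os.environ.items() if k != IS_SERVER_ENV_VAR
    }
    child_env = build_server_env(parent_env)
    # The launcher sees the local value; the server sees the db value.
    print(load_consolidation_flag(parent_env, local_value=False,
                                  db_value=True))
    print(load_consolidation_flag(child_env, local_value=False,
                                  db_value=True))
```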

What this PR does:

  1. Previously, there were two ways of starting the API server: sky api start, or directly running python -m sky.server.server.
  2. We want both ways to create the signal file on API server start, but we don't want it to be manipulated twice. So inside _start_api_server we pass an extra argument, --start-with-python, to skip the file creation inside sky/server/server.py.
  3. However, since _start_api_server is client side, it does not read the config from pg, so it will not create a signal file. It still passes the --start-with-python arg, though, so sky/server/server.py will not create the signal file either.
  4. This PR removes the arg and lets sky/server/server.py check and create the signal file regardless.
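
The before/after behavior in the list above can be modeled with a small sketch. The `Views` class and both functions are hypothetical and only encode the logic as described in this comment; they do not mirror the real code paths.

```python
from dataclasses import dataclass


@dataclass
class Views:
    """What each process believes about consolidation mode."""
    launcher_sees_enabled: bool  # config loaded without the pg connection
    server_sees_enabled: bool    # config loaded from pg


def signal_file_created_before(views: Views) -> bool:
    """Old flow: the launcher decides, then tells the server to skip."""
    created = views.launcher_sees_enabled
    server_skips_check = True  # effect of always passing the extra argument
    if not server_skips_check and views.server_sees_enabled:
        created = True
    return created


def signal_file_created_after(views: Views) -> bool:
    """New flow: the server always re-checks with its own config."""
    return views.launcher_sees_enabled or views.server_sees_enabled


if __name__ == '__main__':
    with_pg = Views(launcher_sees_enabled=False, server_sees_enabled=True)
    print(signal_file_created_before(with_pg))
    print(signal_file_created_after(with_pg))
```

With pg, the old flow never creates the file because the only process that could see the setting was told to skip the check.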

Collaborator Author

Please lmk if that makes sense to you. If it does, I'll start polishing the comments :))

Collaborator

Yes, makes sense. Thanks for the explanation. I think the main thing is that it's confusing that this is run before the server is started, and that we may not have PG. We can also have more detail about how this is only used for the memory check warning.
If you have consolidation mode, there will also be a warning in the server logs if the memory is too low, I assume.

Collaborator Author

Actually, do you think we can remove this warning for now? This is indeed very confusing and there will be a similar log in API server anyway.

logger.warning(
    'SkyPilot API server will run in low resource mode because '
    'the available memory is less than '
    f'{server_constants.MIN_AVAIL_MEM_GB}GB.')

Collaborator

I think this message is valuable for local API server, where the user may see the message here but probably not the one in the API server logs.
We could skip the message entirely if postgres is used, since that's almost certainly a remote API server.
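
One way to read this suggestion, as a hedged sketch only: `should_warn_low_memory` and its parameters are hypothetical, and the threshold is passed in rather than taken from server_constants.MIN_AVAIL_MEM_GB.

```python
def should_warn_low_memory(avail_mem_gb: float, min_avail_mem_gb: float,
                           uses_postgres: bool) -> bool:
    """Decide whether the launcher should print the low-memory warning."""
    if uses_postgres:
        # Assumed to be a remote API server; its own logs carry the warning.
        return False
    return avail_mem_gb < min_avail_mem_gb


if __name__ == '__main__':
    # The 4.0 threshold is an arbitrary example value.
    print(should_warn_low_memory(1.5, 4.0, uses_postgres=False))
    print(should_warn_low_memory(1.5, 4.0, uses_postgres=True))
```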

Collaborator Author

Revamped the comments. PTAL again ;)

managed_job_utils.is_consolidation_mode(on_api_restart=True)
# Maybe touch the signal file on API server startup. Do it again here even
# if we already touched it in the sky/server/common.py::_start_api_server.
# This is because the above function call is in client side and when pg is
Collaborator

the above function call

Which? All of this code is server-side.

Collaborator

I guess you mean _start_api_server - but we should clarify that it's running on the server but outside of where the server process is started. Maybe we can avoid the terms "server side" and "client side" in this case, in favor of "within the skypilot API server process tree" or something like that.

Collaborator Author

cblmemo commented Oct 29, 2025

/smoke-test --managed-jobs
/quicktest-core

@cblmemo cblmemo merged commit 542ef36 into master Oct 30, 2025
20 checks passed
@cblmemo cblmemo deleted the fix-signal-file-on-pg branch October 30, 2025 22:32