[API server] handle logs request in coroutine #5366
Closes #4767

This PR includes the minimal changes that move `/logs` handling to a coroutine in uvicorn's event loop. Though the task is now executed directly in the uvicorn process, we still maintain a request record for each logs request to keep the behavior consistent: the user can still cancel a log request with `sky api cancel` and retrieve the log again with `sky api logs`.

Follow-ups:
- Apply the same approach to `sky jobs logs`
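The coroutine-plus-request-record pattern described above can be sketched with plain asyncio. This is a hypothetical illustration, not SkyPilot's actual code: the names `request_table`, `tail_logs`, and `handle_logs_request` are made up for the example. The key idea is that the log task runs on the event loop, while a table keeps a handle to the task so a later cancel request can find it.

```python
import asyncio

# Hypothetical request record: maps a request ID to its running task so that
# a cancel request (analogous to `sky api cancel`) can cancel it later.
request_table: dict = {}

async def tail_logs(lines):
    """Stream log lines cooperatively on the event loop."""
    streamed = []
    for line in lines:
        await asyncio.sleep(0)  # yield to the event loop between lines
        streamed.append(line)
    return streamed

async def handle_logs_request(request_id, lines):
    """Run the log task as a coroutine, recording it for cancellation."""
    task = asyncio.create_task(tail_logs(lines))
    request_table[request_id] = task
    try:
        return await task
    finally:
        # Drop the record once the request finishes (or is cancelled).
        request_table.pop(request_id, None)

result = asyncio.run(handle_logs_request("req-1", ["a", "b", "c"]))
print(result)  # ['a', 'b', 'c']
```

Cancelling would amount to `request_table[request_id].cancel()`, which raises `CancelledError` inside the coroutine at its next `await` point.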
Benchmark

`python tests/load_tests/test_load_on_server.py -n 100 --apis tail_logs -c kubernetes` under low server concurrency, on a 1c2g machine (1 long worker + 2 short workers):

There is a 7x improvement on average. The remaining bottleneck in this PR is that each log task runs in a dedicated thread and there is only 1 uvicorn worker process, so GIL contention keeps the 100 log threads from being fully concurrent.
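The bottleneck described above can be illustrated with a minimal sketch (function names are hypothetical, not SkyPilot's code): each log task does blocking file I/O, so the coroutine offloads it to a dedicated thread via `run_in_executor`. With a single uvicorn worker process, all such threads share one GIL, so CPU-bound portions of the work serialize even though the threads are nominally concurrent.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_tail(contents):
    # Stand-in for blocking reads of a log file; the CPU-bound part of
    # this work is what serializes on the GIL across threads.
    return sum(len(line) for line in contents)

async def tail_in_thread(executor, contents):
    # Offload the blocking call to a dedicated thread from the coroutine.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_tail, contents)

async def main():
    executor = ThreadPoolExecutor()
    # Four fake "log files", each with 10 lines of 6 characters.
    logs = [["line-%d" % i] * 10 for i in range(4)]
    return await asyncio.gather(
        *(tail_in_thread(executor, c) for c in logs))

totals = asyncio.run(main())
print(totals)
```

True I/O waits (e.g. `read()` on a socket or file) do release the GIL, which is why the threaded design still helps at low concurrency; it is the Python-level work per line that contends at 100 concurrent tails.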
`python tests/load_tests/test_load_on_server.py -n 100 --apis tail_logs -c aws` under unlimited-concurrency local mode (burstable workers), on a 4c16g machine:

Resources: about 10x memory efficiency. However, the test found that tailing logs on an AWS instance is significantly slower than on a Kubernetes instance (I switched the benchmark env to AWS EC2 for accurate resource usage accounting). This might be related to more RPCs/CPU cycles on the AWS code path; I leave this as a follow-up since it is not actually relevant to this PR.
Tests

Tested (run the relevant ones):

- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)