-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[core] Make ReleaseUnusedBundles Fault Tolerant #57786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[core] Make ReleaseUnusedBundles Fault Tolerant #57786
Conversation
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
@Sparks0219 merge conflict, and could you please separate the worker interface refactoring into another PR? |
992e1bb
to
240ef63
Compare
f896cd6
to
240ef63
Compare
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
|
||
TEST_F(NodeManagerTest, TestHandleRequestWorkerLeaseInfeasibleIdempotent) { | ||
auto lease_spec = BuildLeaseSpec({{"CPU", 1}}); | ||
auto lease_spec = BuildLeaseSpec({{"CPU", 11}}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I needed to expand the cpu resources for this test from 0 to some positive number (10) for the bundle test, but then the infeasible lease test failed since 1 cpu is no longer infeasible. I'll change this to a constexpr and not some magic number so it's more clear
bundle_spec_map_; | ||
|
||
friend bool IsBundleRegistered(const PlacementGroupResourceManager &manager, | ||
const BundleID &bundle_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀
no way around? + is it necessary to assert on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm I think we need some way of looking into pgmanager state since ReleaseUnusedBundles is calling pgmanager methods and I don't think there's anything I can use for this in the class currently :(
*local_lease_manager_); | ||
|
||
placement_group_resource_manager_ = | ||
std::make_unique<NewPlacementGroupResourceManager>(*cluster_resource_scheduler_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can save for the refactor pr, but why is called new 😵💫
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 no clue
monkeypatch.setenv( | ||
"RAY_testing_rpc_failure", | ||
"NodeManagerService.grpc_client.ReleaseUnusedBundles=1:100:0" | ||
+ ",NodeManagerService.grpc_client.CancelResourceReserve=100:100:0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just do -1 if you want to kill it
but I don't love this
So like it's possible to happen without injecting failures on CancelResourceReserve because the gcs could be die before all the Cancel's happen. You should leave a comment describing that otherwise it's p hard to figure out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea exactly, but its inherently flaky so we need to perma block CancelResourceReserve to make it deterministic. I'll leave a comment + put -1 instead to be mroe clear about my intentions
Signed-off-by: joshlee <[email protected]>
Description
Making ReleaseUnusedBundles fault tolerant and enabling retries on network failures. Added cpp test to verify idempotency and created a python integration test. Also created a fake worker class since I needed to noop the underlying connection which is used in DestroyWorker and didn't want to modify the mock class.
Related issues
Types of change
Checklist
Does this PR introduce breaking changes?
Testing:
Code Quality:
git commit -s
)Documentation:
doc/source/
(if applicable)Additional context