Sparks0219
Contributor

@Sparks0219 Sparks0219 commented Oct 16, 2025

Description

Makes ReleaseUnusedBundles fault tolerant and enables retries on network failures. Added a C++ test to verify idempotency and a Python integration test. Also added a fake worker class: I needed to no-op the underlying connection used in DestroyWorker and didn't want to modify the mock class.
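
To illustrate why idempotency matters once retries are enabled, here is a minimal sketch in Python. It is a toy model, not Ray's implementation: `FakeRaylet`, `call_with_retries`, and `drop_first_reply` are hypothetical names, and the bundle IDs are made up. The handler applies the request but the first reply is lost, so the client retries; because the handler is idempotent, the second call leaves the state unchanged.

```python
class FakeRaylet:
    """Toy stand-in for the node-side handler; not Ray's actual NodeManager."""

    def __init__(self, bundles):
        self.bundles = set(bundles)
        self.release_calls = 0

    def release_unused_bundles(self, bundles_in_use):
        # Idempotent: keep only the bundles still in use. Running this a
        # second time with the same argument changes nothing.
        self.release_calls += 1
        in_use = set(bundles_in_use)
        released = self.bundles - in_use
        self.bundles &= in_use
        return released


def call_with_retries(fn, *args, max_attempts=3):
    """Retry fn on ConnectionError, re-raising the last error if all attempts fail."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn(*args)
        except ConnectionError as err:
            last_error = err
    raise last_error


def drop_first_reply(handler):
    """Simulate a network failure where the request is applied but the reply is lost."""
    state = {"dropped": False}

    def wrapper(*args):
        result = handler(*args)  # the server side applies the request
        if not state["dropped"]:
            state["dropped"] = True
            raise ConnectionError("reply lost")  # the client never sees the reply
        return result

    return wrapper


raylet = FakeRaylet({"pg1-0", "pg1-1", "pg2-0"})
flaky_release = drop_first_reply(raylet.release_unused_bundles)
result = call_with_retries(flaky_release, {"pg1-0"})
```

After the retry, the handler has run twice, yet only the in-use bundle remains and the second call releases nothing further.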

Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested review from dayshah and edoakes October 16, 2025 06:14
@Sparks0219 Sparks0219 added the go label (add ONLY when ready to merge, run all tests) Oct 16, 2025
@Sparks0219 Sparks0219 marked this pull request as ready for review October 16, 2025 06:15
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 16, 2025 06:15
Signed-off-by: joshlee <[email protected]>
@ray-gardener ray-gardener bot added the core label (Issues that should be addressed in Ray Core) Oct 16, 2025
@edoakes
Collaborator

edoakes commented Oct 16, 2025

@Sparks0219 merge conflict, and could you please separate the worker interface refactoring into another PR?

@Sparks0219 Sparks0219 force-pushed the joshlee/make-release-unused-bundles-fault-tolerant branch from 992e1bb to 240ef63 Compare October 16, 2025 18:54
@Sparks0219 Sparks0219 force-pushed the joshlee/make-release-unused-bundles-fault-tolerant branch 2 times, most recently from f896cd6 to 240ef63 Compare October 16, 2025 19:55
Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>

TEST_F(NodeManagerTest, TestHandleRequestWorkerLeaseInfeasibleIdempotent) {
-  auto lease_spec = BuildLeaseSpec({{"CPU", 1}});
+  auto lease_spec = BuildLeaseSpec({{"CPU", 11}});
Contributor

why this change?

Contributor Author

Ah, I needed to raise the CPU resources for this test from 0 to a positive number (10) for the bundle test, but then the infeasible lease test failed because 1 CPU is no longer infeasible. I'll change this to a constexpr instead of a magic number so it's clearer.

bundle_spec_map_;

friend bool IsBundleRegistered(const PlacementGroupResourceManager &manager,
const BundleID &bundle_id);
Contributor

👀
no way around this? Also, is it necessary to assert on this?

Contributor Author

@Sparks0219 Sparks0219 Oct 16, 2025

mmm I think we need some way of looking into pgmanager state since ReleaseUnusedBundles is calling pgmanager methods and I don't think there's anything I can use for this in the class currently :(

*local_lease_manager_);

placement_group_resource_manager_ =
std::make_unique<NewPlacementGroupResourceManager>(*cluster_resource_scheduler_);
Contributor

can save for the refactor PR, but why is it called new 😵‍💫

Contributor Author

👀 no clue

monkeypatch.setenv(
"RAY_testing_rpc_failure",
"NodeManagerService.grpc_client.ReleaseUnusedBundles=1:100:0"
+ ",NodeManagerService.grpc_client.CancelResourceReserve=100:100:0",
Contributor

just do -1 if you want to kill it

but I don't love this

So it's possible for this to happen without injecting failures on CancelResourceReserve, because the GCS could die before all the Cancels happen. You should leave a comment describing that, otherwise it's pretty hard to figure out.

Contributor Author

Yeah exactly, but it's inherently flaky, so we need to permanently block CancelResourceReserve to make it deterministic. I'll leave a comment and put -1 instead to be more clear about my intentions.
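
A rough sketch of what the commented setup could look like, under stated assumptions: `build_rpc_failure_spec` and `RPC_FAILURES` are hypothetical names (not part of Ray), the sketch assumes -1 replaces the first field of the CancelResourceReserve spec, and the remaining fields are copied from the snippet above without interpreting them. It sets the variable through `os.environ` so it runs standalone; in the test itself this would go through `monkeypatch.setenv`.

```python
import os


def build_rpc_failure_spec(specs: dict[str, str]) -> str:
    """Hypothetical helper: join per-method specs into "Method=spec,Method=spec"."""
    return ",".join(f"{method}={spec}" for method, spec in specs.items())


RPC_FAILURES = {
    # Inject a failure on ReleaseUnusedBundles so the retry path is exercised
    # (values copied from the snippet above).
    "NodeManagerService.grpc_client.ReleaseUnusedBundles": "1:100:0",
    # Leaked bundles can occur without any injection: the GCS could die before
    # all the CancelResourceReserve calls happen. That depends on timing and is
    # inherently flaky, so block the RPC permanently (-1, as suggested in
    # review) to make the test deterministic.
    "NodeManagerService.grpc_client.CancelResourceReserve": "-1:100:0",
}

os.environ["RAY_testing_rpc_failure"] = build_rpc_failure_spec(RPC_FAILURES)
```

Keeping the explanation next to each entry means the reason for blocking CancelResourceReserve stays attached to the value it justifies.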

Signed-off-by: joshlee <[email protected]>
Labels

core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)

4 participants