feat: Smart Cordon, Restart Requests #595
Pull Request Overview
This PR implements a Smart Cordon and Restart Requests system to improve cluster upgrade efficiency by allowing graceful node restarts with minimal work loss. The system enables background tasks to yield when nodes are cordoned, schedules finalize tasks during batch operations to reduce waste, fixes WebUI JSON-RPC reconnection issues, and adds a restart request mechanism for cordoned nodes.
- Background tasks on cordoned nodes now yield to other nodes automatically
- Finalize tasks can be scheduled while batch tasks are still running to minimize wasted work
- WebUI JSON-RPC client now properly handles reconnections with backoff and error handling
- New restart request system allows operators to request cordoned node restarts that auto-uncordon upon completion
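The yielding and restart rules above can be sketched roughly as follows. This is a minimal illustration, not the code from `harmony/harmonytask`: the `Task` and `Node` types, their fields, and the `ShouldYield` and `ReadyToRestart` helpers are all hypothetical names, and the rule that a restart waits until only yieldable work remains is an assumption made for the sketch.

```go
package main

import "fmt"

// Task is a hypothetical, simplified view of a scheduled task.
type Task struct {
	Name       string
	Background bool // long-running background task
	Yieldable  bool // explicitly marked as safe to hand off to another node
}

// Node is a hypothetical, simplified view of a cluster node.
type Node struct {
	Cordoned         bool
	RestartRequested bool
	Running          []Task
}

// ShouldYield reports whether a task should give up its slot so that
// another node can pick it up. Only background or yieldable tasks on a
// cordoned node yield.
func ShouldYield(n Node, t Task) bool {
	return n.Cordoned && (t.Background || t.Yieldable)
}

// ReadyToRestart reports whether a requested restart can proceed.
// Assumption: the node must be cordoned, a restart must have been
// requested, and every remaining task must be one that would yield.
func ReadyToRestart(n Node) bool {
	if !n.Cordoned || !n.RestartRequested {
		return false
	}
	for _, t := range n.Running {
		if !ShouldYield(n, t) {
			return false
		}
	}
	return true
}

func main() {
	n := Node{
		Cordoned:         true,
		RestartRequested: true,
		Running: []Task{
			{Name: "background-poll", Background: true},
			{Name: "batch-seal"},
		},
	}
	for _, t := range n.Running {
		fmt.Printf("%s yields: %v\n", t.Name, ShouldYield(n, t))
	}
	fmt.Println("ready to restart:", ReadyToRestart(n))
}
```

In this sketch the non-background `batch-seal` task keeps the node from restarting, which mirrors the stated goal of restarting with minimal work loss.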
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/static/lib/jsonrpc.mjs | Improved WebSocket reconnection logic with proper error handling and backoff |
| web/static/cluster-machines.mjs | Added restart/abort restart UI controls and status display |
| web/api/webrpc/cluster.go | Added restart request database operations and API endpoints |
| tasks/seal/task_finalize.go | Added scheduling overrides to allow finalize during batch tasks |
| tasks/proofshare/task_client_poll.go | Marked task as yieldable for cordoned nodes |
| tasks/f3/f3_task.go | Changed error message to "yield" for background task yielding |
| harmony/taskhelp/common.go | Added documentation about background task yielding requirement |
| harmony/harmonytask/task_type_handler.go | Implemented yielding logic for background and yieldable tasks |
| harmony/harmonytask/metrics.go | Added uptime metric collection with version tagging |
| harmony/harmonytask/harmonytask.go | Core implementation of restart requests and scheduling overrides |
| harmony/harmonydb/sql/20230719-harmony.sql | Added database schema comments for new fields |
| documentation/en/curio-service.md | Added systemd configuration for restart exit status |
| apt/curio.service | Added systemd configuration for restart exit status |
The JS side mixes camelCase and snake_case quite a bit. I am assuming most of it comes from AI. Can we make the naming consistent there? The rest looks good.
- cordon: yield bg tasks
- make PSClientPoll yield as well
- harmonytask: Scheduling Overrides
- metrics: record version
- fix build
- smart restart
- webui: Jsonrpc reconnect fixes
- webui: restart requests
- make gen
- jrpc reject on conn fail
- missing schema file
- rm snake case
This PR does a bunch of things, massively improving QoL and efficiency when upgrading clusters.