@magik6k (Collaborator) commented Aug 18, 2025

This PR does a bunch of things that massively improve QoL and efficiency when upgrading clusters.

  • Background tasks on cordoned nodes now automatically yield to other nodes.
  • The Finalize task can now be scheduled while a batch task is still in progress.
    • This doesn't affect the non-supraseal pipeline nearly as much, so it isn't implemented there.
    • In the supraseal pipeline this means that nodes can be restarted with nearly zero wasted work.
  • Made the webui jsonrpc client actually reconnect correctly.
  • Implemented a restart-request harmonytask mechanism, which lets node operators request that a cordoned node restart. The node only restarts once it has no tasks running, and after restarting it auto-uncordons itself.
    • In the UI a restart can only be requested when the node is already cordoned, which prevents accidental clicks while keeping it easy to mass-trigger.
    • Node restart is done by having the process exit with exit code 100, which generally has no standard meaning. Systemd (even with the service files we shipped prior to this PR) will restart curio after an exit like that. I did add explicit restart handling for this exit code to the service files for completeness, but it's not strictly necessary.
  • The node build version is now also reported in Prometheus, which makes it possible to build a Grafana dashboard for tracking node upgrade progress.
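
For illustration, the restart condition described in the list above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from this PR: `nodeState`, `shouldRestart`, and `maybeRestart` are hypothetical names, and only the exit code 100 and the "cordoned, requested, no running tasks" rule come from the description. The auto-uncordon step after restart is not modeled.

```go
package main

import "fmt"

// restartExitCode is the exit status named in the PR description.
// systemd is expected to restart the service after such an exit.
const restartExitCode = 100

// nodeState is a hypothetical snapshot of a node's scheduling state.
type nodeState struct {
	Cordoned         bool // node is cordoned
	RestartRequested bool // an operator requested a restart
	RunningTasks     int  // tasks still in flight on this node
}

// shouldRestart reports whether the node may exit for a restart:
// it must be cordoned, a restart must be requested, and it must be idle.
func shouldRestart(s nodeState) bool {
	return s.Cordoned && s.RestartRequested && s.RunningTasks == 0
}

// maybeRestart calls exit with restartExitCode when shouldRestart holds.
// exit is injected so the logic can be exercised without ending the
// process; a real daemon would pass os.Exit.
func maybeRestart(s nodeState, exit func(int)) bool {
	if !shouldRestart(s) {
		return false
	}
	exit(restartExitCode)
	return true
}

func main() {
	report := func(code int) { fmt.Println("would exit with code", code) }

	busy := nodeState{Cordoned: true, RestartRequested: true, RunningTasks: 2}
	idle := nodeState{Cordoned: true, RestartRequested: true, RunningTasks: 0}

	fmt.Println("busy node restarts:", maybeRestart(busy, report))
	fmt.Println("idle node restarts:", maybeRestart(idle, report))
}
```

Checking the condition in one place keeps the "never interrupt running work" rule easy to audit, which is the property that makes mass-triggering restarts safe.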

@magik6k requested a review from Copilot on August 18, 2025, 10:37
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements a Smart Cordon and Restart Requests system to improve cluster upgrade efficiency by allowing graceful node restarts with minimal work loss. The system enables background tasks to yield when nodes are cordoned, schedules finalize tasks during batch operations to reduce waste, fixes WebUI JSON-RPC reconnection issues, and adds a restart request mechanism for cordoned nodes.

  • Background tasks on cordoned nodes now yield to other nodes automatically
  • Finalize tasks can be scheduled while batch tasks are still running to minimize wasted work
  • WebUI JSON-RPC client now properly handles reconnections with backoff and error handling
  • New restart request system allows operators to request cordoned node restarts that auto-uncordon upon completion

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| web/static/lib/jsonrpc.mjs | Improved WebSocket reconnection logic with proper error handling and backoff |
| web/static/cluster-machines.mjs | Added restart/abort restart UI controls and status display |
| web/api/webrpc/cluster.go | Added restart request database operations and API endpoints |
| tasks/seal/task_finalize.go | Added scheduling overrides to allow finalize during batch tasks |
| tasks/proofshare/task_client_poll.go | Marked task as yieldable for cordoned nodes |
| tasks/f3/f3_task.go | Changed error message to "yield" for background task yielding |
| harmony/taskhelp/common.go | Added documentation about background task yielding requirement |
| harmony/harmonytask/task_type_handler.go | Implemented yielding logic for background and yieldable tasks |
| harmony/harmonytask/metrics.go | Added uptime metric collection with version tagging |
| harmony/harmonytask/harmonytask.go | Core implementation of restart requests and scheduling overrides |
| harmony/harmonydb/sql/20230719-harmony.sql | Added database schema comments for new fields |
| documentation/en/curio-service.md | Added systemd configuration for restart exit status |
| apt/curio.service | Added systemd configuration for restart exit status |

@magik6k marked this pull request as ready for review on August 18, 2025, 12:35
@magik6k requested a review from LexLuthr on August 18, 2025, 12:36
@LexLuthr (Contributor) left a comment

There is a mix of camel case and snake case on the JS side. I am assuming most of it comes from AI. Can we have consistent naming there? The rest is good.

@magik6k merged commit de0c36e into main on Aug 18, 2025 (17 checks passed)
@magik6k deleted the feat/smart-cordon branch on August 18, 2025, 14:23
@magik6k mentioned this pull request on Aug 22, 2025 (2 tasks)
rvagg pushed a commit that referenced this pull request on Sep 12, 2025:
* cordon: yield bg tasks

* make PSClientPoll yield as well

* harmonytask: Scheduling Overrides

* metrics: record version

* fix build

* smart restart

* webui: Jsonrpc reconnect fixes

* webui: restart requests

* make gen

* jrpc reject on conn fail

* missing schema file

* rm snake case