-
Notifications
You must be signed in to change notification settings - Fork 11
Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction #310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes various reboot manager issues by refining the configuration parameters, error messages, and control flow to better handle edge cases in reboot timing and notifications. Key changes include:
- Refactoring of the Reboot Manager to use private members and improved logging.
- Updating tests to reflect changes in method visibility and error messages.
- Revising constant values and usage in the reboot logic to prevent premature failure.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/core/tests/library/RuntimeCompositor.py | Adjusts method redirections to use the private reboot method for testing consistency. |
| src/core/tests/Test_RebootManager.py | Updates test calls to access the private reboot method and validates the new error message. |
| src/core/src/core_logic/RebootManager.py | Refactors reboot settings, logging, and maintains dynamic wait logic; includes a bug. |
| src/core/src/core_logic/PatchInstaller.py | Uses the new method to check if the maintenance window was exceeded. |
| src/core/src/bootstrap/Constants.py | Introduces updated constants for reboot buffer and timeout configurations. |
Comments suppressed due to low confidence (1)
src/core/src/core_logic/RebootManager.py:124
- The variable 'reboot_pending' is not defined in this branch, which could cause a runtime error. Consider removing it or replacing it with a relevant state indicator.
self.composite_logger.log_error('[RM] Bug-check: Unexpected code branch reached. [RebootSetting={0}][RebootPending={1}]'.format(str(self.__reboot_setting), str(reboot_pending)))
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #310 +/- ##
==========================================
+ Coverage 93.13% 93.15% +0.01%
==========================================
Files 103 103
Lines 17567 17608 +41
==========================================
+ Hits 16361 16402 +41
Misses 1206 1206
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
rane-rajasi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments inline
this release includes: [x] [Bugfix: Restore patch mode config on Image Default #301](#301) [x] [Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator #304](#304) [x] [Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction #310](#310) [x] [Bugfix: Fix sudo check logs and wordings #315](#315) [x] [Bugfix: Mitigation for environments with Python Unbuffered I/O #320](#320)
this release includes: [x] Bugfix: Restore patch mode config on Image Default (#301) [x] Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator ( #304) [x] Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction (#310) [x] Bugfix: Fix sudo check logs and wordings (#315) [x] Bugfix: Mitigation for environments with Python Unbuffered I/O (#320) [x] Bugfix: Correct reboot management parameters(#322)
Fixing issues seen while testing:
Bugfix: Mitigate external Ubuntu Pro Client issues
The main issue was that my test infra for that code failed to complete patch installation with this error message:
"Reboot failed to proceed on the machine in a timely manner."
The machine then failed to accept ssh logins for 20+ minutes. I thought the machine was dead, but it eventually came back suggesting a much longer reboot time than expected. Tracing this further in production, it suggested there were 500+ operations per week hitting the same issue and getting reported as a terminal failure.
This prompted a closer look at all the Reboot Manager code, and PR addresses every issue identified with it.
The changes in this PR:
Differentiating between all of the following:
(a) Reboot buffer in minutes = minimum time required to consider a reboot. This was being overloaded incorrectly to also broadcast the time delay to starting the reboot, which could cause silent maintenance window exceeds.
(b) Reboot notify timeout in minutes = introduced new to be deliberate about the duration of the notification window.
(c) Reboot wait timeout in minutes (min & max) = time duration to wait before considering an attempt to reboot a failure.
The effective reboot wait timeout is dynamic now - it's sits somewhere between the min and max allowed and uses as much as the remaining time in the maintenance window allows to allow for success.
Reboot manager code has been cleaned up to meet current coding standards for the code base.
Reduce likelihood of timeout at the Compute RP by refreshing the status as long as we're still waiting for a reboot and our process is still running.
The configured values may be adjusted in the future to account for what is seen at scale.