kjohn-msft (Collaborator) commented Apr 16, 2025

Fixing issues seen while testing:
Bugfix: Mitigate external Ubuntu Pro Client issues

The main issue was that my test infra for that code failed to complete patch installation with this error message:
"Reboot failed to proceed on the machine in a timely manner."

The machine then refused SSH logins for 20+ minutes. I thought the machine was dead, but it eventually came back, which suggests the reboot took much longer than expected. Tracing this further in production suggested that 500+ operations per week were hitting the same issue and being reported as terminal failures.

This prompted a closer look at all of the Reboot Manager code, and this PR addresses every issue identified in it.

The changes in this PR:

  1. Differentiating between all of the following:
    (a) Reboot buffer in minutes = minimum time required to consider a reboot. This was incorrectly overloaded to also serve as the broadcast delay before the reboot starts, which could cause the maintenance window to be silently exceeded.
    (b) Reboot notify timeout in minutes = newly introduced, to be deliberate about the duration of the notification window.
    (c) Reboot wait timeout in minutes (min & max) = how long to wait before considering a reboot attempt a failure.

  2. The effective reboot wait timeout is now dynamic: it sits somewhere between the allowed min and max, and uses as much of the remaining maintenance window as it can to give the reboot a chance to succeed.

  3. Reboot manager code has been cleaned up to meet current coding standards for the code base.

  4. Reduce likelihood of timeout at the Compute RP by refreshing the status as long as we're still waiting for a reboot and our process is still running.
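
As a rough illustration of items 1 and 2 above, the reboot buffer can be treated as a go/no-go gate while the effective wait timeout is the remaining window time clamped between the min and max. This is a minimal sketch and not the code in RebootManager.py: the function names, parameter names, and the sample values are all hypothetical.

```python
def is_reboot_possible(remaining_window_minutes, reboot_buffer_minutes):
    """Hypothetical gate: only consider a reboot if the buffer fits in the window."""
    return remaining_window_minutes >= reboot_buffer_minutes


def effective_wait_timeout(remaining_window_minutes, wait_min_minutes, wait_max_minutes):
    """Hypothetical clamp: use the remaining window time, bounded by [min, max]."""
    return max(wait_min_minutes, min(wait_max_minutes, remaining_window_minutes))


# Illustrative values only; the real constants live in Constants.py.
timeout = effective_wait_timeout(remaining_window_minutes=30,
                                 wait_min_minutes=5,
                                 wait_max_minutes=40)
```

Keeping the buffer check separate from the timeout calculation mirrors the point of item 1: each value answers one question instead of being overloaded.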

The configured values may be adjusted in the future to account for what is seen at scale.
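
Item 4 above (refreshing status while waiting) could look roughly like the loop below. This is a sketch under stated assumptions, not the actual implementation: `wait_for_reboot`, `refresh_status`, the poll interval, and the injectable `clock`/`sleep` parameters are hypothetical names introduced for illustration.

```python
import time


def wait_for_reboot(timeout_minutes, refresh_status, poll_seconds=60,
                    clock=time.time, sleep=time.sleep):
    """Poll until the timeout elapses, refreshing status on every iteration.

    If the reboot succeeds, the process is terminated before this returns.
    Returning at all therefore means the reboot did not happen in time.
    Returns the number of status refreshes performed.
    """
    deadline = clock() + timeout_minutes * 60
    refreshes = 0
    while clock() < deadline:
        refresh_status()   # hypothetical callback that rewrites the status file
        refreshes += 1
        sleep(poll_seconds)
    return refreshes
```

Passing `clock` and `sleep` as parameters is only a testing convenience here, so the loop can be exercised without real delays.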

@kjohn-msft kjohn-msft added the bug Something isn't working label Apr 16, 2025
@kjohn-msft kjohn-msft requested a review from feng-j678 April 16, 2025 21:00
@kjohn-msft kjohn-msft self-assigned this Apr 16, 2025
Copilot AI review requested due to automatic review settings April 16, 2025 21:00

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes various reboot manager issues by refining the configuration parameters, error messages, and control flow to better handle edge cases in reboot timing and notifications. Key changes include:

  • Refactoring of the Reboot Manager to use private members and improved logging.
  • Updating tests to reflect changes in method visibility and error messages.
  • Revising constant values and usage in the reboot logic to prevent premature failure.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/core/tests/library/RuntimeCompositor.py | Adjusts method redirections to use the private reboot method for testing consistency. |
| src/core/tests/Test_RebootManager.py | Updates test calls to access the private reboot method and validates the new error message. |
| src/core/src/core_logic/RebootManager.py | Refactors reboot settings, logging, and maintains dynamic wait logic; includes a bug. |
| src/core/src/core_logic/PatchInstaller.py | Uses the new method to check if the maintenance window was exceeded. |
| src/core/src/bootstrap/Constants.py | Introduces updated constants for reboot buffer and timeout configurations. |

Comments suppressed due to low confidence (1)

src/core/src/core_logic/RebootManager.py:124

  • The variable 'reboot_pending' is not defined in this branch, which could cause a runtime error. Consider removing it or replacing it with a relevant state indicator.
self.composite_logger.log_error('[RM] Bug-check: Unexpected code branch reached. [RebootSetting={0}][RebootPending={1}]'.format(str(self.__reboot_setting), str(reboot_pending)))
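
One way to act on the suggestion above is to build the bug-check message only from state that is defined in that branch. The sketch below is hypothetical and not the fix that was merged: `build_bug_check_message` and the sample setting value are illustrative names, and the message text simply mirrors the quoted line without the undefined variable.

```python
def build_bug_check_message(reboot_setting):
    """Hypothetical helper: format the bug-check message without 'reboot_pending'."""
    return '[RM] Bug-check: Unexpected code branch reached. [RebootSetting={0}]'.format(str(reboot_setting))


# 'IfRequired' is an illustrative value, not taken from the PR.
message = build_bug_check_message('IfRequired')
```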

codecov bot commented Apr 16, 2025

Codecov Report

Attention: Patch coverage is 95.83333% with 4 lines in your changes missing coverage. Please review.

Project coverage is 93.15%. Comparing base (1183fcf) to head (0a6b07f).
Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/core/src/core_logic/RebootManager.py | 92.72% | 4 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #310      +/-   ##
==========================================
+ Coverage   93.13%   93.15%   +0.01%     
==========================================
  Files         103      103              
  Lines       17567    17608      +41     
==========================================
+ Hits        16361    16402      +41     
  Misses       1206     1206              
| Flag | Coverage Δ |
| --- | --- |
| python27 | 93.14% <95.78%> (+0.01%) ⬆️ |
| python312 | 93.15% <95.83%> (+0.01%) ⬆️ |


@kjohn-msft kjohn-msft added the OE PR is considered near complete due to OE sign-off. label Apr 17, 2025
feng-j678 previously approved these changes Apr 17, 2025
rane-rajasi (Contributor) left a comment

Comments inline

feng-j678 previously approved these changes Apr 23, 2025
@feng-j678 feng-j678 merged commit 97667b9 into master Apr 29, 2025
8 checks passed
@feng-j678 feng-j678 deleted the kjohn-reboottimeout branch April 29, 2025 21:00
@feng-j678 feng-j678 mentioned this pull request May 9, 2025
feng-j678 added a commit that referenced this pull request May 9, 2025
this release includes:
[x] [Bugfix: Restore patch mode config on Image Default #301](#301)
[x] [Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator #304](#304)
[x] [Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction #310](#310)
[x] [Bugfix: Fix sudo check logs and wordings #315](#315)
[x] [Bugfix: Mitigation for environments with Python Unbuffered I/O #320](#320)
@feng-j678 feng-j678 mentioned this pull request May 12, 2025
kjohn-msft pushed a commit that referenced this pull request May 12, 2025
this release includes:
[x] Bugfix: Restore patch mode config on Image Default (#301)
[x] Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator (#304)
[x] Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction (#310)
[x] Bugfix: Fix sudo check logs and wordings (#315)
[x] Bugfix: Mitigation for environments with Python Unbuffered I/O (#320)
[x] Bugfix: Correct reboot management parameters (#322)