Bugfix: Mitigation for environments with Python Unbuffered I/O #320
Pull Request Overview
This PR addresses a bug where process execution hangs on certain SUSE environments by modifying the process launch logic.
- Changed the error handling from an ambiguous return to an explicit "return None"
- Introduced a loop for attempting process launch with diagnostic parameters
- Added an explicit return at the end of the function to ensure uniform exit handling
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #320   +/-   ##
=======================================
  Coverage   93.77%   93.77%
=======================================
  Files         103      103
  Lines       17923    17925    +2
=======================================
+ Hits        16808    16810    +2
  Misses       1115     1115
this release includes:
- [x] [Bugfix: Restore patch mode config on Image Default #301](#301)
- [x] [Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator #304](#304)
- [x] [Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction #310](#310)
- [x] [Bugfix: Fix sudo check logs and wordings #315](#315)
- [x] [Bugfix: Mitigation for environments with Python Unbuffered I/O #320](#320)
this release includes:
- [x] Bugfix: Restore patch mode config on Image Default (#301)
- [x] Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator (#304)
- [x] Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction (#310)
- [x] Bugfix: Fix sudo check logs and wordings (#315)
- [x] Bugfix: Mitigation for environments with Python Unbuffered I/O (#320)
- [x] Bugfix: Correct reboot management parameters (#322)
A breaking change in the SUSE fork of the Azure Linux Agent causes LinuxPatchExtension (and some other extensions) to hang and fail when run under it. A machine is affected if either its provisioning agent or its guest agent is at version 2.12.0.4.
Symptoms & Conditions:
- VMExtensionProvisioningErrors occur on SUSE 15, which also affects all patch operations on those machines.
- This happens 100% of the time on the latest image SUSE released (April 17th), because it has 2.12.0.4 baked in and auto-upgrades to 2.13.1.1 immediately (PA = 2.12.0.4, GA = 2.13.1.1).
- The immediately preceding image (Jan 6th) starts on 2.9.1.1 and stays on it, unaffected. If it is manually upgraded directly to 2.13.1.1, everything works. If it is upgraded to 2.12.0.4, things break.
- The problem is scoped to long-running processes that appear to start successfully (they get a PID, etc.) but never actually do anything, specifically along the PA -> GA -> handler -> Core chain.
Customer Self-Mitigation Steps
Until this fix is available on their machines, customers can self-mitigate by fully upgrading their agent (both the provisioning agent and the guest agent) to 2.13.1.1.
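The condition described above can be expressed as a small check. This is only an illustrative sketch: the function and constant names are hypothetical, and how the two version strings are obtained on a given machine is left out.

```python
# Hypothetical helper, not part of the extension. The affected version
# number comes from the description above.
AFFECTED_AGENT_VERSION = "2.12.0.4"


def is_affected(provisioning_agent_version, guest_agent_version):
    """Return True if either agent component is at the affected version."""
    versions = (provisioning_agent_version.strip(), guest_agent_version.strip())
    return AFFECTED_AGENT_VERSION in versions
```

For example, the April 17th image (PA = 2.12.0.4, GA = 2.13.1.1) would be reported as affected, while a machine with both components on 2.13.1.1 would not.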
What does the mitigation here entail?
The extension handler was designed to capture the stdout and stderr streams of the core process in case it unexpectedly failed to launch. It now launches the core process without capturing those streams on the first attempt (which always succeeds on all machines today) and, only if that attempt fails, retries with the streams captured to collect data for troubleshooting.
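A minimal sketch of that launch-then-retry pattern follows. It is not the code from this PR: the function name, the `grace_seconds` and `log` parameters, and the rule that a process which exits within the grace period counts as a failed launch are all assumptions made for illustration.

```python
import subprocess
import time


def launch_core_process(command, grace_seconds=1.0, log=print):
    """Start `command`; capture stdout/stderr only on the retry.

    Returns the running Popen object, or None if both attempts fail.
    """
    # Attempt 1 leaves the streams alone; attempt 2 pipes them for diagnostics.
    for capture_streams in (False, True):
        stream = subprocess.PIPE if capture_streams else None
        try:
            process = subprocess.Popen(command, stdout=stream, stderr=stream)
        except OSError as error:
            log("launch failed (capture={0}): {1}".format(capture_streams, error))
            continue

        # Assumption: a healthy core process is still alive after the grace period.
        time.sleep(grace_seconds)
        if process.poll() is None:
            return process

        if capture_streams:
            output, errors = process.communicate()
            log("exit code {0}; stdout={1!r}; stderr={2!r}".format(
                process.returncode, output, errors))
        else:
            log("exit code {0}; retrying with stream capture".format(
                process.returncode))
    return None
```

Keeping the uncaptured launch as the default path means the diagnostic path, which is the one that misbehaves on affected machines, is only exercised when there is already a failure to investigate.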
Are other extensions affected?
Yes. Discussions with SUSE are ongoing, so customers who hit provisioning errors with other extensions are encouraged to mitigate the issue themselves.