@kjohn-msft
Collaborator

A breaking change in the SUSE fork of the Azure Linux Agent causes LinuxPatchExtension (and some other extensions) to hang and fail when run on affected machines. A machine is affected if either the provisioning agent (PA) or the guest agent (GA) is at version 2.12.0.4.

Symptoms & Conditions:
VMExtensionProvisioningErrors on SUSE 15, which also cause all patch operations on the machine to fail.

  • Doesn’t happen if the agent version is 2.9.1.1 (both provisioning agent, and guest agent)
  • Doesn’t happen if it is 2.13.1.1 (both provisioning agent and guest agent)
  • Happens if PA is 2.12.0.4 (even if GA is 2.12.0.4 or higher, like the latest 2.13.1.1)

This happens 100% of the time on the latest image SUSE released (April 17th), because that image has 2.12.0.4 baked in and the guest agent auto-upgrades to 2.13.1.1 immediately, leaving PA = 2.12.0.4 and GA = 2.13.1.1.

The immediately preceding image (Jan 6th) starts with 2.9.1.1 and stays on it, unaffected. If it is manually upgraded directly to 2.13.1.1, everything works; if it is upgraded to 2.12.0.4, things break.
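The conditions above can be summarized as a small check. This is only a sketch derived from the observations in this description: `is_affected` and `AFFECTED_VERSION` are hypothetical names, not part of the extension or the agent, and the version strings would have to be obtained separately.

```python
# Agent version reported as problematic in this description.
AFFECTED_VERSION = "2.12.0.4"


def is_affected(pa_version, ga_version):
    """Return True if either agent is at the problematic version.

    pa_version: provisioning agent version string (assumed already known).
    ga_version: guest agent version string (assumed already known).
    """
    return AFFECTED_VERSION in (pa_version, ga_version)


if __name__ == "__main__":
    # Latest SUSE image: PA baked in at 2.12.0.4, GA auto-upgraded.
    print(is_affected("2.12.0.4", "2.13.1.1"))
    # Previous image: both agents at 2.9.1.1.
    print(is_affected("2.9.1.1", "2.9.1.1"))
```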

The problem is scoped to long-running processes that appear to start successfully (a PID is assigned, etc.) but never actually do any work. Specifically, this affects the launch chain PA -> GA -> handler -> Core.

Customer Self-Mitigation Steps
Until this fix is available on their machines, customers can self-mitigate by fully upgrading the agent (both provisioning agent and guest agent) to 2.13.1.1.

What does the mitigation here entail?
The extension handler was designed to capture the stdout and stderr streams of the core process so that an unforeseen launch failure could be diagnosed. It now launches the core process without capturing those streams on the first attempt (which always succeeds on all machines today) and, only if that attempt fails, retries with the streams captured to collect data for troubleshooting.
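The retry pattern described above might look roughly like the following. This is a minimal sketch and not the actual handler code: `launch_core`, `settle_seconds`, and the rule used to decide that a launch "failed" (the command cannot be started, or it exits non-zero within the settle window) are all assumptions made for illustration. It also uses Python 3 only features (`timeout=`), whereas the extension itself is tested on Python 2.7 as well.

```python
import subprocess
import sys


def launch_core(command, settle_seconds=1.0):
    """Start `command`; capture stdout/stderr only on the retry attempt.

    Returns the Popen object if the process is still running (or exited
    with code 0) after `settle_seconds`, otherwise None.
    """
    for capture in (False, True):
        # First attempt: inherit the parent's streams (no pipes).
        # Retry attempt: attach pipes so failure output can be reported.
        stream = subprocess.PIPE if capture else None
        try:
            process = subprocess.Popen(command, stdout=stream, stderr=stream)
        except OSError as error:
            sys.stderr.write("Launch failed (capture=%s): %s\n" % (capture, error))
            continue

        try:
            out, err = process.communicate(timeout=settle_seconds)
        except subprocess.TimeoutExpired:
            # Still running after the settle window: treat as launched.
            # On the retry path the caller must drain the pipes itself.
            return process

        if process.returncode == 0:
            return process

        if capture:
            # Only the retry attempt has output available for troubleshooting.
            sys.stderr.write("Exit code %d\nstdout: %s\nstderr: %s\n" % (
                process.returncode,
                out.decode(errors="replace"),
                err.decode(errors="replace")))

    return None
```

The point of ordering the attempts this way is that the common path never attaches pipes to the child process, so it does not depend on how the environment handles captured streams; the diagnostic path only runs when something has already gone wrong.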

Are other extensions affected?
Yes. Discussions with SUSE are ongoing; in the meantime, customers who hit provisioning errors with other extensions are encouraged to self-mitigate using the agent upgrade described above.

Copilot AI review requested due to automatic review settings May 9, 2025 18:00
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses a bug where process execution hangs on certain SUSE environments by modifying the process launch logic.

  • Changed the error handling from an ambiguous return to an explicit "return None"
  • Introduced a loop for attempting process launch with diagnostic parameters
  • Added an explicit return at the end of the function to ensure uniform exit handling

@kjohn-msft kjohn-msft requested a review from feng-j678 May 9, 2025 18:00
@codecov

codecov bot commented May 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.77%. Comparing base (f86301d) to head (0a5ad70).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #320   +/-   ##
=======================================
  Coverage   93.77%   93.77%           
=======================================
  Files         103      103           
  Lines       17923    17925    +2     
=======================================
+ Hits        16808    16810    +2     
  Misses       1115     1115           
Flag Coverage Δ
python27 93.77% <100.00%> (+<0.01%) ⬆️
python312 93.77% <100.00%> (+<0.01%) ⬆️

@kjohn-msft kjohn-msft self-assigned this May 9, 2025
@kjohn-msft kjohn-msft merged commit 480369e into master May 9, 2025
8 checks passed
@kjohn-msft kjohn-msft deleted the kjohn-suse branch May 9, 2025 21:23
@feng-j678 feng-j678 mentioned this pull request May 9, 2025
feng-j678 added a commit that referenced this pull request May 9, 2025
This release includes:
[x] Bugfix: Restore patch mode config on Image Default (#301)
[x] Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator (#304)
[x] Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction (#310)
[x] Bugfix: Fix sudo check logs and wordings (#315)
[x] Bugfix: Mitigation for environments with Python Unbuffered I/O (#320)
@feng-j678 feng-j678 mentioned this pull request May 12, 2025
kjohn-msft pushed a commit that referenced this pull request May 12, 2025
This release includes:
[x] Bugfix: Restore patch mode config on Image Default (#301)
[x] Clean-up: Envlayer dead-code clean-up for Environment Recorder and Emulator (#304)
[x] Bugfix: Reboot Manager behavior - multiple bugfixes & error rate reduction (#310)
[x] Bugfix: Fix sudo check logs and wordings (#315)
[x] Bugfix: Mitigation for environments with Python Unbuffered I/O (#320)
[x] Bugfix: Correct reboot management parameters (#322)

4 participants