Bugfix: Retry Handler to Prevent Unbounded Retries while trying to Mitigate YUM Update Errors #303
Conversation
Added robust retry handler logic to prevent unbounded retries while trying to mitigate errors
Added: Raise exception (optional param) on error mitigation failure
Modified: Test cases to reflect raising an exception instead of logging
…tigate YUM Update Errors (#302) - Bring changes from Forked repo branch to main repo branch
Pull Request Overview
This PR fixes a bug in the YUM package manager where unbounded retries during error mitigation caused the extension to time out. Key changes include:
- Adding a retry limit mechanism with retry_count and max_retries to prevent infinite recursion.
- Introducing a seen_errors set to avoid duplicate error mitigation attempts.
- Adding 4 new unit tests to verify that the retry limit is enforced.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/core/tests/library/LegacyEnvLayerExtensions.py | Adds a test scenario for issue mitigation retries |
| src/core/tests/Test_YumPackageManager.py | Adds 4 unit tests |
| src/core/src/package_managers/YumPackageManager.py | Updates error mitigation logic with retry control and error logging |
| src/core/src/bootstrap/Constants.py | Introduces the MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION constant |
@microsoft-github-policy-service agree company="Microsoft"
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #303 +/- ##
==========================================
+ Coverage 92.49% 92.51% +0.02%
==========================================
Files 100 100
Lines 16304 16351 +47
==========================================
+ Hits 15080 15127 +47
Misses 1224 1224
Added: New UT to test retry condition
Added: 3 more UTs
feng-j678 left a comment
The PR description is awesome and very detailed
Fix: Modified test cases to cover sad path
This release includes:
- [x] Engg. hygiene: Remove TelemetryWriter related log noise [#312](#312)
- [x] Bugfix: Mitigate external Ubuntu Pro Client issue [#308](#308)
- [x] Feature: Adding support for Azure Linux 2.0 in Tdnf Package Manager [#311](#311)
- [x] Eng. sys: Upgrade CICD pipeline from Python 3.9 to Python 3.12 [#309](#309)
- [x] Coverage: Increase code coverage - TimerManager and ServiceManager [#307](#307)
- [x] Bugfix: Unit tests broken in Python 3.12 [#306](#306)
- [x] Feature: Adding Azure Linux 3.0 Base Support [#293](#293)
- [x] Bugfix: Retry Handler to Prevent Unbounded Retries while trying to Mitigate YUM Update Errors [#303](#303)
- [x] BugFix: CentOS VMs not installing patches during Auto Patching [#298](#298)
- [x] Bugfix: Auto-assessment - Restricting execution permissions to root user/ owner [#299](#299)
Issue
Bug 30347555: [AzGPS][Linux][Guest] Limit number of retries on Known Errors - YUM - RHEL
This bugfix task was created due to a customer-reported issue.
What's Broken:
In YumPackageManager.py:
a. try_mitigate_issues_if_any() invokes self.check_known_issues_and_attempt_fix() to self-mitigate the issue.
b. self.check_known_issues_and_attempt_fix() finds the error signature "Failed to download metadata for repo" in its list, and invokes "sudo yum update -y --disablerepo='*' --enablerepo='*microsoft*'".
c. This command executes successfully, and try_mitigate_issues_if_any() performs a check to validate whether this fixed the failure.
d. However, the "Failed to download metadata for repo" error is thrown again.
e. try_mitigate_issues_if_any() invokes itself recursively to resolve this issue.

What's the Fix:
- Retry Limit Control: Added a retry_count and max_retries mechanism to prevent infinite loops by limiting the number of attempts to fix errors.
- Tracking Seen Errors: Introduced a seen_errors set to track previously encountered errors (out) and preemptively exit the retry loop, avoiding repeated fix attempts for the same error.
- Duplicate Error Detection: If the current error (out) is already in seen_errors, the function logs the issue as a customer environment error and exits gracefully.
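A minimal sketch of this retry-control flow, based on the description above. The method names (try_mitigate_issues_if_any, check_known_issues_and_attempt_fix) and the constant name come from this PR; the parameter names, return values, and logger calls are assumptions, not the exact repository code:

```python
# Sketch only - the real implementation lives in YumPackageManager.py and Constants.py.
class Constants(object):
    MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION = 10  # upper bound on mitigation attempts (see FAQ below)


class YumPackageManagerSketch(object):
    def __init__(self, composite_logger):
        self.composite_logger = composite_logger  # assumed to expose log_error(message)

    def try_mitigate_issues_if_any(self, command, code, out, retry_count=0, seen_errors=None, raise_on_failure=False):
        """Attempt known-issue mitigation, bounded by a retry limit and a set of already-seen errors."""
        if seen_errors is None:
            seen_errors = set()

        # 1. Stop if the retry budget is exhausted.
        if retry_count >= Constants.MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION:
            return self.__handle_mitigation_failure(
                "Reached max retry attempts for error mitigation. [Error={0}]".format(out),
                code, out, raise_on_failure)

        # 2. Stop if the same error was already seen - mitigation did not help,
        #    so treat it as a customer environment error and exit gracefully.
        if out in seen_errors:
            return self.__handle_mitigation_failure(
                "Error persists after mitigation attempt; likely a customer environment error. [Error={0}]".format(out),
                code, out, raise_on_failure)
        seen_errors.add(out)

        # 3. Attempt the known-issue fix, then recurse on any new error with an incremented retry count.
        new_code, new_out = self.check_known_issues_and_attempt_fix(command, code, out)
        if new_code != 0:
            return self.try_mitigate_issues_if_any(command, new_code, new_out,
                                                   retry_count=retry_count + 1,
                                                   seen_errors=seen_errors,
                                                   raise_on_failure=raise_on_failure)
        return new_code, new_out

    def __handle_mitigation_failure(self, message, code, out, raise_on_failure):
        # Optionally raise instead of only logging (per "Added: Raise exception (optional param)").
        if raise_on_failure:
            raise Exception(message)
        self.composite_logger.log_error(message)
        return code, out

    def check_known_issues_and_attempt_fix(self, command, code, out):
        # In the real package manager this matches known error signatures such as
        # "Failed to download metadata for repo" and runs a repair command like
        # "sudo yum update -y --disablerepo='*' --enablerepo='*microsoft*'".
        raise NotImplementedError("Sketch only")
```

The key design point is that both retry_count and seen_errors are threaded through the recursive call, so either the "too many attempts" condition or the "same error seen again" condition can stop the recursion.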
Unit Tests: Added 4 unit tests.
Result:

How to interpret the above?
1. Output error received to try_mitigate_issues_if_any().
2. Tries to fix the error in self.check_known_issues_and_attempt_fix() via the disablerepo command.
3. Performs post-mitigation check, hits an error & recursively invokes try_mitigate_issues_if_any() with this error.
4. Error signature is already present in seen_errors, logs the error message (highlighted in yellow) & exits.
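An illustrative unit-test sketch for the duplicate-error exit path walked through in steps 1-4 above. The test and class names are hypothetical, and it reuses the YumPackageManagerSketch class from the earlier sketch rather than the real YumPackageManager:

```python
import unittest


class FakeLogger(object):
    """Captures error log calls so the test can assert on them."""
    def __init__(self):
        self.errors = []

    def log_error(self, message):
        self.errors.append(message)


class RepeatingErrorPackageManager(YumPackageManagerSketch):
    """Mitigation 'succeeds', but the same error signature comes right back."""
    def check_known_issues_and_attempt_fix(self, command, code, out):
        return 1, out  # same error returned again, as in the output above


class TestErrorMitigationRetry(unittest.TestCase):
    def test_duplicate_error_exits_instead_of_recursing(self):
        logger = FakeLogger()
        package_manager = RepeatingErrorPackageManager(logger)
        code, out = package_manager.try_mitigate_issues_if_any(
            "sudo yum update -y", 1, "Failed to download metadata for repo")
        self.assertEqual(code, 1)                # error is surfaced, not swallowed
        self.assertEqual(len(logger.errors), 1)  # logged once; no unbounded retries


if __name__ == '__main__':
    unittest.main()
```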
FAQ: Why is MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION = 10 (a very high value)?
Answer: TL;DR - To give self.check_known_issues_and_attempt_fix() a chance to fix things on its own, especially if the error stack is large.
For example:
Error 1 -> try to mitigate, which raises Error 2 -> call self.check_known_issues_and_attempt_fix(), which raises Error 3 -> call self.check_known_issues_and_attempt_fix(), which raises Error 4 -> call self.check_known_issues_and_attempt_fix(), which raises Error 5 -> call self.check_known_issues_and_attempt_fix(), after which there are no other errors.
If we set MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION to a lower value like 3, it would pre-emptively exit the loop even though there might be no real fault in the first place.
If we see the same error again and again, the check against seen_errors should exit the loop as expected.
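A hypothetical trace of this chained-error scenario, again reusing the YumPackageManagerSketch from above: each mitigation surfaces a new, distinct error until Error 5, after which yum succeeds, all well within a limit of 10.

```python
class QuietLogger(object):
    def log_error(self, message):
        pass  # discard logs for this illustration


class ChainedErrorPackageManager(YumPackageManagerSketch):
    def __init__(self, composite_logger):
        super(ChainedErrorPackageManager, self).__init__(composite_logger)
        self.next_error = 2

    def check_known_issues_and_attempt_fix(self, command, code, out):
        if self.next_error > 5:
            return 0, "yum update completed"           # error chain exhausted, success
        new_out = "Error {0}".format(self.next_error)  # each fix surfaces the next distinct error
        self.next_error += 1
        return 1, new_out


package_manager = ChainedErrorPackageManager(QuietLogger())
code, out = package_manager.try_mitigate_issues_if_any("sudo yum update -y", 1, "Error 1")
print(code, out)  # 0 yum update completed - five distinct errors fit comfortably within the limit of 10
```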