Bugfix: Retry Handler to Prevent Unbounded Retries while trying to Mitigate YUM Update Errors #303
Conversation
Added robust retry handler logic to prevent unbounded retries while trying to mitigate errors
Added: Raise exception (optional param) on error mitigation failure
Modified: Test cases to reflect raising an exception instead of logging
…tigate YUM Update Errors (#302) - Bring changes from Forked repo branch to main repo branch
Pull Request Overview
This PR fixes a bug in the YUM package manager where unbounded retries during error mitigation caused the extension to time out. Key changes include:
- Adding a retry limit mechanism with retry_count and max_retries to prevent infinite recursion.
- Introducing a seen_errors set to avoid duplicate error mitigation attempts.
- Adding 4 new unit tests to verify that the retry limit is enforced.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/core/tests/library/LegacyEnvLayerExtensions.py | Adds a test scenario for issue mitigation retries |
| src/core/tests/Test_YumPackageManager.py | Adds 4 unit tests |
| src/core/src/package_managers/YumPackageManager.py | Updates error mitigation logic with retry control and error logging |
| src/core/src/bootstrap/Constants.py | Introduces the MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION constant |
@microsoft-github-policy-service agree company="Microsoft"
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #303 +/- ##
==========================================
+ Coverage 92.49% 92.51% +0.02%
==========================================
Files 100 100
Lines 16304 16351 +47
==========================================
+ Hits 15080 15127 +47
Misses 1224 1224
Added: New UT to test retry condition
Added: 3 more UTs
feng-j678 left a comment
The PR description is awesome and very detailed
Fix: Modified test cases to cover sad path
This release includes:
- [x] Engg. hygiene: Remove TelemetryWriter related log noise [#312](#312)
- [x] Bugfix: Mitigate external Ubuntu Pro Client issue [#308](#308)
- [x] Feature: Adding support for Azure Linux 2.0 in Tdnf Package Manager [#311](#311)
- [x] Eng. sys: Upgrade CICD pipeline from Python 3.9 to Python 3.12 [#309](#309)
- [x] Coverage: Increase code coverage - TimerManager and ServiceManager [#307](#307)
- [x] Bugfix: Unit tests broken in Python 3.12 [#306](#306)
- [x] Feature: Adding Azure Linux 3.0 Base Support [#293](#293)
- [x] Bugfix: Retry Handler to Prevent Unbounded Retries while trying to Mitigate YUM Update Errors [#303](#303)
- [x] BugFix: CentOS VMs not installing patches during Auto Patching [#298](#298)
- [x] Bugfix: Auto-assessment - Restricting execution permissions to root user/ owner [#299](#299)
Issue
Bug 30347555: [AzGPS][Linux][Guest] Limit number of retries on Known Errors - YUM - RHEL
This bugfix task was created due to a customer-reported issue.
What's Broken:
In YumPackageManager.py:
a. try_mitigate_issues_if_any() invokes self.check_known_issues_and_attempt_fix() to self-mitigate the issue.
b. self.check_known_issues_and_attempt_fix() finds the error signature "Failed to download metadata for repo" in its list, and invokes "sudo yum update -y --disablerepo='*' --enablerepo='*microsoft*'".
c. This command executes successfully, and try_mitigate_issues_if_any() performs a check to validate whether this fixed the failure.
d. However, the "Failed to download metadata for repo" error is thrown again.
e. try_mitigate_issues_if_any() invokes itself recursively to resolve this issue.

What's the Fix:
- Retry Limit Control: Added a retry_count and max_retries mechanism to prevent infinite loops by limiting the number of attempts to fix errors.
- Tracking Seen Errors: Introduced a seen_errors set to track previously encountered errors (out) and preemptively exit the retry loop, avoiding repeated fix attempts for the same error.
- Duplicate Error Detection: If the current error (out) is already in seen_errors, the function logs the issue as a customer environment error and exits gracefully.
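A minimal sketch of this retry-control flow, based on the description above. The method names (try_mitigate_issues_if_any, check_known_issues_and_attempt_fix) and the constant name come from this PR; the parameter names, return values, and logger calls are assumptions, not the exact repository code:

```python
# Sketch only - the real implementation lives in YumPackageManager.py and Constants.py.
class Constants(object):
    MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION = 10  # upper bound on mitigation attempts (see FAQ below)


class YumPackageManagerSketch(object):
    def __init__(self, composite_logger):
        self.composite_logger = composite_logger  # assumed to expose log_error(message)

    def try_mitigate_issues_if_any(self, command, code, out, retry_count=0, seen_errors=None, raise_on_failure=False):
        """Attempt known-issue mitigation, bounded by a retry limit and a set of already-seen errors."""
        if seen_errors is None:
            seen_errors = set()

        # 1. Stop if the retry budget is exhausted.
        if retry_count >= Constants.MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION:
            return self.__handle_mitigation_failure(
                "Reached max retry attempts for error mitigation. [Error={0}]".format(out),
                code, out, raise_on_failure)

        # 2. Stop if the same error was already seen - mitigation did not help,
        #    so treat it as a customer environment error and exit gracefully.
        if out in seen_errors:
            return self.__handle_mitigation_failure(
                "Error persists after mitigation attempt; likely a customer environment error. [Error={0}]".format(out),
                code, out, raise_on_failure)
        seen_errors.add(out)

        # 3. Attempt the known-issue fix, then recurse on any new error with an incremented retry count.
        new_code, new_out = self.check_known_issues_and_attempt_fix(command, code, out)
        if new_code != 0:
            return self.try_mitigate_issues_if_any(command, new_code, new_out,
                                                   retry_count=retry_count + 1,
                                                   seen_errors=seen_errors,
                                                   raise_on_failure=raise_on_failure)
        return new_code, new_out

    def __handle_mitigation_failure(self, message, code, out, raise_on_failure):
        # Optionally raise instead of only logging (per "Added: Raise exception (optional param)").
        if raise_on_failure:
            raise Exception(message)
        self.composite_logger.log_error(message)
        return code, out

    def check_known_issues_and_attempt_fix(self, command, code, out):
        # In the real package manager this matches known error signatures such as
        # "Failed to download metadata for repo" and runs a repair command like
        # "sudo yum update -y --disablerepo='*' --enablerepo='*microsoft*'".
        raise NotImplementedError("Sketch only")
```

The key design point is that both retry_count and seen_errors are threaded through the recursive call, so either the "too many attempts" condition or the "same error seen again" condition can stop the recursion.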
Unit Tests: Added 4 unit tests.
Result:

How to interpret the above?
1. Output error received to try_mitigate_issues_if_any().
2. Tries to fix the error in self.check_known_issues_and_attempt_fix() via the disablerepo command.
3. Performs post-mitigation check, hits an error & recursively invokes try_mitigate_issues_if_any() with this error.
4. Error signature is already present in seen_errors, logs the error message (highlighted in yellow) & exits.
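An illustrative unit-test sketch for the duplicate-error exit path walked through in steps 1-4 above. The test and class names are hypothetical, and it reuses the YumPackageManagerSketch class from the earlier sketch rather than the real YumPackageManager:

```python
import unittest


class FakeLogger(object):
    """Captures error log calls so the test can assert on them."""
    def __init__(self):
        self.errors = []

    def log_error(self, message):
        self.errors.append(message)


class RepeatingErrorPackageManager(YumPackageManagerSketch):
    """Mitigation 'succeeds', but the same error signature comes right back."""
    def check_known_issues_and_attempt_fix(self, command, code, out):
        return 1, out  # same error returned again, as in the output above


class TestErrorMitigationRetry(unittest.TestCase):
    def test_duplicate_error_exits_instead_of_recursing(self):
        logger = FakeLogger()
        package_manager = RepeatingErrorPackageManager(logger)
        code, out = package_manager.try_mitigate_issues_if_any(
            "sudo yum update -y", 1, "Failed to download metadata for repo")
        self.assertEqual(code, 1)                # error is surfaced, not swallowed
        self.assertEqual(len(logger.errors), 1)  # logged once; no unbounded retries


if __name__ == '__main__':
    unittest.main()
```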
FAQ: Why is MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION = 10 (a very high value)?
Answer: TL;DR - To give self.check_known_issues_and_attempt_fix() a chance to fix things on its own, especially if the error stack is large.
For example:
Error 1 -> try to mitigate, which raises Error 2 -> call self.check_known_issues_and_attempt_fix(), which raises Error 3 -> call self.check_known_issues_and_attempt_fix(), which raises Error 4 -> call self.check_known_issues_and_attempt_fix(), which raises Error 5 -> call self.check_known_issues_and_attempt_fix(), after which there are no other errors.
If we set MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION to a lower value like 3, it would pre-emptively exit the loop even though there might be no real fault in the first place.
If we see the same error again and again, the check against seen_errors should exit the loop as expected.
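A hypothetical trace of this chained-error scenario, again reusing the YumPackageManagerSketch from above: each mitigation surfaces a new, distinct error until Error 5, after which yum succeeds, all well within a limit of 10.

```python
class QuietLogger(object):
    def log_error(self, message):
        pass  # discard logs for this illustration


class ChainedErrorPackageManager(YumPackageManagerSketch):
    def __init__(self, composite_logger):
        super(ChainedErrorPackageManager, self).__init__(composite_logger)
        self.next_error = 2

    def check_known_issues_and_attempt_fix(self, command, code, out):
        if self.next_error > 5:
            return 0, "yum update completed"           # error chain exhausted, success
        new_out = "Error {0}".format(self.next_error)  # each fix surfaces the next distinct error
        self.next_error += 1
        return 1, new_out


package_manager = ChainedErrorPackageManager(QuietLogger())
code, out = package_manager.try_mitigate_issues_if_any("sudo yum update -y", 1, "Error 1")
print(code, out)  # 0 yum update completed - five distinct errors fit comfortably within the limit of 10
```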