Skip to content

Conversation

@rirakshi-dev
Copy link
Contributor

@rirakshi-dev rirakshi-dev commented Apr 1, 2025

Issue

Bug 30347555: [AzGPS][Linux][Guest] Limit number of retries on Known Errors - YUM - RHEL

This bugfix task was created due to a customer reported issue.

What's Broken:

  1. Yum Update hits the following error: Failed to download metadata for repo 'EfficiOS-RHEL7-x86-64': repomd.xml parser error: Parse error at line: 33 (xmlParseStartTag: invalid element name
  2. Inside YumPackageManager.py
    a. try_mitigate_issues_if_any() invokes self.check_known_issues_and_attempt_fix() to self-mitigate the issue.
    b. self.check_known_issues_and_attempt_fix() finds the error signature, "Failed to download metadata for repo" in its list, and invokes "sudo yum update -y --disablerepo='*' --enablerepo='*microsoft*'"
    c. This command executes successfully, and try_mitigate_issues_if_any() performs a check to validate if this fixed the failure.
    d. However, "Failed to download metadata for repo" error is again thrown.
    e. try_mitigate_issues_if_any() invokes itself recursively, to resolve this issue.
  3. Since there is no upper bound on the # of retry attempts that can be made to fix the issue, this goes into an unbounded loop, ultimately resulting in timeout of the extension.
  4. Customer Impact: Auto Assessment gets stuck while showing ‘In Progress’ on the portal.

What's the Fix:

  • Retry Limit Control: Added a retry_count and max_retries mechanism to prevent infinite loops by limiting the number of attempts to fix errors.

  • Tracking Seen Errors: Introduced a seen_errors set to track previously encountered errors (out) and preemptively exit the retry loop to avoid retrying fixes for the same error repeatedly.

  • Duplicate Error Detection: If the current error (out) is already in seen_errors, the function logs the issue as a customer environment error and exits gracefully.

Unit Tests:

Added 4 unit tests.

Result:
image

How to interpret the above?
1. Output error received to try_mitigate_issues_if_any() .
2. Tries to fix the error in self.check_known_issues_and_attempt_fix() via disablerepo command.
3. Performs post mitigation check, hits an error & recursively invokes try_mitigate_issues_if_any() with this error.
4. Error signature is already present in seen_errors, logs error message (highlighted in yellow) & exits.

FAQ: Why is MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION = 10 (A very high value)?

Answer: TL;DR - To give self.check_known_issues_and_attempt_fix() a chance to fix things on its own, especially if the error stack is large.

For example,
Error 1 -> Try to mitigate, which raises Error 2 -> call self.check_known_issues_and_attempt_fix() , which raises Error 3 -> call self.check_known_issues_and_attempt_fix() , which raises Error 4 -> call self.check_known_issues_and_attempt_fix() , which raises Error 5 -> call self.check_known_issues_and_attempt_fix() , after which there are no other errors.

If we set MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION to a lower value like 3, it will pre-emptively exit the loop, although there might be no fault in the first place.

If we see the same error again and again, the check against seen_errors should exit the loop as expected.

rirakshi-dev and others added 7 commits January 22, 2025 22:28
Added a robust retry handler logic to prevent unbounded retries while trying to mitigate errors
Added: Raise exception (optional param) on error mitigation failure
Modified: TC to reflect raising exception instead of logging
…tigate YUM Update Errors (#302)

- Bring changes from Forked repo branch to main repo branch
Copilot AI review requested due to automatic review settings April 1, 2025 21:25
@rirakshi-dev rirakshi-dev enabled auto-merge (squash) April 1, 2025 21:26
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug in the YUM package manager where unbounded retries during error mitigation caused the extension to timeout. Key changes include:

  • Adding a retry limit mechanism with retry_count and max_retries to prevent infinite recursion.
  • Introducing a seen_errors set to avoid duplicate error mitigation attempts.
  • Adding 4 new unit tests to verify that the retry limit is enforced.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/core/tests/library/LegacyEnvLayerExtensions.py Adds a test scenario for issue mitigation retries
src/core/tests/Test_YumPackageManager.py Adds 4 units tests
src/core/src/package_managers/YumPackageManager.py Updates error mitigation logic with retry control and error logging
src/core/src/bootstrap/Constants.py Introduces the MAX_RETRY_ATTEMPTS_FOR_ERROR_MITIGATION constant

@rirakshi-dev
Copy link
Contributor Author

@microsoft-github-policy-service agree company="Microsoft"

@microsoft-github-policy-service agree company="Microsoft"

@codecov
Copy link

codecov bot commented Apr 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.51%. Comparing base (5faac3e) to head (42ba8b2).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #303      +/-   ##
==========================================
+ Coverage   92.49%   92.51%   +0.02%     
==========================================
  Files         100      100              
  Lines       16304    16351      +47     
==========================================
+ Hits        15080    15127      +47     
  Misses       1224     1224              
Flag Coverage Δ
python27 92.51% <100.00%> (+0.02%) ⬆️
python39 92.51% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Added: New UT to test retry condition
Copy link
Contributor

@feng-j678 feng-j678 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pr description is awesome and very detailed

Fix: Modified test cases to cover sad path
@rirakshi-dev rirakshi-dev merged commit 1c95e78 into master Apr 3, 2025
7 checks passed
@rirakshi-dev rirakshi-dev deleted the rirakshi/bugfix/rhel/limitretries branch April 3, 2025 16:55
@rane-rajasi rane-rajasi mentioned this pull request Apr 25, 2025
rane-rajasi added a commit that referenced this pull request Apr 25, 2025
This release includes:
[x] Engg. hygiene: Remove TelemetryWriter related log noise
[#312](#312)
[x] Bugfix: Mitigate external Ubuntu Pro Client issue
[#308](#308)
[x] Feature: Adding support for Azure Linux 2.0 in Tdnf Package Manager
[#311](#311)
[x] Eng. sys: Upgrade CICD pipeline from Python 3.9 to Python 3.12
[#309](#309)
[x] Coverage: Increase code coverage - TimerManager and ServiceManager
[#307](#307)
[x] Bugfix: Unit tests broken in Python 3.12
[#306](#306)
[x] Feature: Adding Azure Linux 3.0 Base Support
[#293](#293)
[x] Bugfix: Retry Handler to Prevent Unbounded Retries while trying to
Mitigate YUM Update Errors
[#303](#303)
[x] BugFix: CentOS VMs not installing patches during Auto Patching
[#298](#298)
[x] Bugfix: Auto-assessment - Restricting execution permissions to root
user/ owner
[#299](#299)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants