Skip to content

Conversation

@GAURAVRAMRAKHYANI
Copy link
Contributor

@GAURAVRAMRAKHYANI GAURAVRAMRAKHYANI commented Dec 15, 2022

Background: Current methods to query core execution time don’t scale well in Kusto. A single-line log can be queried better at scale across machines.

Added logs to get time taken by assessment and install updates jobs. Added relevant details in logs like patch service used (apt, yum or zypper), machine ram, machine cpu, OS details, etc.

Following are the examples of the added log for assessment and install updates job:

Assessment:

{'job': 'Assessment', 'startTime': '2022-12-15 13:40:34.314455', 'endTime': '2022-12-15 13:40:41.887585', 'patchService': 'AptitudePackageManager', 'machineDetails': {'platform_name': 'Ubuntu', 'platform_version': '20.04', 'machine_cpu': 'Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz', 'machine_arch': 'x86_64', 'disk_type': 'Hard drive'}} [TC=105]

Install Updates:

{'job': 'Install Updates', 'startTime': '2022-12-15 12:36:18.153067', 'endTime': '2022-12-15 12:56:34.743747', 'installed_update_count': '13', 'update_run_successful': 'True', 'maintenance_window': '3:55:00', 'maintenance_window_exceeded': 'False', 'patchService': 'AptitudePackageManager', 'machineDetails': {'platform_name': 'Ubuntu', 'platform_version': '20.04', 'machine_cpu': 'Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz', 'machine_arch': 'x86_64', 'disk_type': 'Hard drive'}} [TC=1216]

Manual testings performed:
Performed Check for updates and One-time update and checked that the new logs added are present in the logs file.

… Added relevant details in logs like patch service used (apt, yum or zypper), machine ram, machine cpu, OS details, etc.
@GAURAVRAMRAKHYANI GAURAVRAMRAKHYANI changed the title Add logs to get time taken by assessment job and install updates job.… Add logs to get time taken by assessment job and install updates job Dec 15, 2022
@codecov
Copy link

codecov bot commented Dec 15, 2022

Codecov Report

Merging #161 (6ca8ce0) into master (d192d35) will increase coverage by 0.22%.
The diff coverage is 98.95%.

@@            Coverage Diff             @@
##           master     #161      +/-   ##
==========================================
+ Coverage   88.52%   88.75%   +0.22%     
==========================================
  Files          86       88       +2     
  Lines       12598    12881     +283     
==========================================
+ Hits        11152    11432     +280     
- Misses       1446     1449       +3     
Flag Coverage Δ
python27 87.23% <98.95%> (+0.38%) ⬆️
python39 88.72% <98.95%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/core_logic/PatchInstaller.py 85.39% <87.50%> (+0.13%) ⬆️
src/core/tests/Test_Stopwatch.py 99.03% <99.03%> (ø)
src/core/src/bootstrap/Constants.py 98.52% <100.00%> (+0.15%) ⬆️
src/core/src/core_logic/MaintenanceWindow.py 87.09% <100.00%> (+5.70%) ⬆️
src/core/src/core_logic/PatchAssessor.py 99.35% <100.00%> (+0.04%) ⬆️
src/core/src/core_logic/Stopwatch.py 100.00% <100.00%> (ø)
src/core/src/service_interfaces/TelemetryWriter.py 74.09% <100.00%> (-1.17%) ⬇️
src/core/tests/Test_MaintenanceWindow.py 98.66% <100.00%> (+0.51%) ⬆️
src/core/tests/Test_PatchAssessor.py 99.20% <100.00%> (+0.16%) ⬆️
src/core/tests/Test_PatchInstaller.py 99.55% <100.00%> (+0.07%) ⬆️
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@rane-rajasi
Copy link
Contributor

A few questions and my thoughts here:

Adding new log statements to determine assessment and install jobs duration, seems a straightforward method. However, we already have the issue of the overall operation being slow because of the agent's restrictions in consuming telemetry events. Even though these are just 2 logs statements, it still contributes to the reduced pace. Most of the information added in these new logs is redundant (machine info is already logged, start and endtime are added in the status file)

Question:

  • Are we only looking at extension telemetry to determine assessment or installation duration? Since this info is already available in status file, can that be used instead? We won't need any code changes in that case

@GAURAVRAMRAKHYANI
Copy link
Contributor Author

GAURAVRAMRAKHYANI commented Jan 3, 2023

A few questions and my thoughts here:

Adding new log statements to determine assessment and install jobs duration, seems a straightforward method. However, we already have the issue of the overall operation being slow because of the agent's restrictions in consuming telemetry events. Even though these are just 2 logs statements, it still contributes to the reduced pace. Most of the information added in these new logs is redundant (machine info is already logged, start and endtime are added in the status file)

Question:

  • Are we only looking at extension telemetry to determine assessment or installation duration? Since this info is already available in status file, can that be used instead? We won't need any code changes in that case

Yes, the information which we are logging here are already there in different places but to get all the related information for a job, currently we need join in kusto queries which is not scalable. We are logging all the required information in one log so that we don't need join in kusto queries. We need efficient kusto queries because we are going to use kusto queries in future to calculate improvements that we are going to do in the script.

@kjohn-msft
Copy link
Collaborator

Please include test code to mock and validate changed functionality - you can do this after nailing down comments in the prod code.

@kjohn-msft
Copy link
Collaborator

A few questions and my thoughts here:

Adding new log statements to determine assessment and install jobs duration, seems a straightforward method. However, we already have the issue of the overall operation being slow because of the agent's restrictions in consuming telemetry events. Even though these are just 2 logs statements, it still contributes to the reduced pace. Most of the information added in these new logs is redundant (machine info is already logged, start and endtime are added in the status file)

Question:

  • Are we only looking at extension telemetry to determine assessment or installation duration? Since this info is already available in status file, can that be used instead? We won't need any code changes in that case

To @rane-rajasi's note here (which I only saw after I wrote other comments), there's no need to write the existing telemetry event for machine_info if what's being written here happens reliably. Rajasi please review if that mitigates some of the valid concern raised.

Copy link
Collaborator

@kjohn-msft kjohn-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are inline in PR.

(a) Creating separate class Stopwatch
(b) Renaming variable names to be consistent
(c) Renaming write_machine_config_info() to set_machine_config_info() as we are not logging machine_info in this function.
(d) Creating constant for strings which are used in logs as they are referred multiple places.
(e) Pre calculating time taken by task and percentage of maintenance window used instead of calculating them in kusto query.
(f) Adding number_of_tries (for assessment) and number_of_rounds (for patch installation)
(g) Removing import TelemetryWriter as it is already available in PatchAssessor and PatchInstaller
Copy link
Contributor

@rane-rajasi rane-rajasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments inline

(a) Using numbered index placeholder
(b) Do not explicitly throw exception if total_time_in_minutes == 0 in get_percentage_maintenance_window_used
…on't want script to stop abruptly just due to some exception in writing performance logs.
GAURAVRAMRAKHYANI added a commit that referenced this pull request Feb 10, 2023
Done code changes for Aptitude package manager and did manual testing.

Code changes for Yum and Zipper is pending should be completed by next week.
Adding unit tests is pending.

Also, due to signature change of the method install_update_and_dependencies, there are some tests failing. I will work on those tests once we confirm the signature of the method.

Using Stopwatch implementation in this PR. The implementation of Stopwatch is in the PR #161 which is going to be merged soon.
@kjohn-msft
Copy link
Collaborator

@GAURAVRAMRAKHYANI - can you please merge the latest from master please? There shouldn't be any conflict but it helps with the PR. @rane-rajasi please consider you're weekly planning and provide an ETA for final review. The ETA from me will be EOD business hours PST. @SathishMSFT, please review and sign off if you're satisfied with the changes.

@GAURAVRAMRAKHYANI
Copy link
Contributor Author

@GAURAVRAMRAKHYANI - can you please merge the latest from master please? There shouldn't be any conflict but it helps with the PR. @rane-rajasi please consider you're weekly planning and provide an ETA for final review. The ETA from me will be EOD business hours PST. @SathishMSFT, please review and sign off if you're satisfied with the changes.

Merged latest master branch

Copy link
Contributor

@rane-rajasi rane-rajasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment on write_assessment_perf_logs()

Copy link
Member

@SathishMSFT SathishMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@GAURAVRAMRAKHYANI GAURAVRAMRAKHYANI merged commit 006d225 into master Feb 15, 2023
@GAURAVRAMRAKHYANI GAURAVRAMRAKHYANI deleted the garamrak-addTimeTakenLogs branch February 15, 2023 05:31
kjohn-msft pushed a commit that referenced this pull request May 18, 2023
* Implement patching in batches to improve performance.
Done code changes for Aptitude package manager and did manual testing.

Code changes for Yum and Zipper is pending should be completed by next week.
Adding unit tests is pending.

Also, due to signature change of the method install_update_and_dependencies, there are some tests failing. I will work on those tests once we confirm the signature of the method.

Using Stopwatch implementation in this PR. The implementation of Stopwatch is in the PR #161 which is going to be merged soon.

* Fixed an issue in is_packages_install_time_available i

* Code changes for yum and zypper package manager for parallel patching.
Did manual testing and checked that packages are getting installed and the installation status of the packages are shown correctly in the portal.

* For yum, add packages with different architecture in list of dependencies.

* Correct function call in Yum for extract_dependencies

* Following changes are done in this iteration. All the existing unit tests are succeeding now with these changes.

(a) updated function name is_packages_install_time_available to is_package_install_time_available

(b) Made reboot_manager mandatory in the function is_packages_install_time_available

(c) Created a function install_update_and_dependencies_and_get_status

* (a) Handled the scenario that if there is not enough remaining time to install packages in  batch then do not stop the complete process of installing patches. The reamaining packages can be installed sequentially and there is already check for maintenance window cutoff time when installing packages sequentially so that the installation is stopped when there is not enough remaining time in maintenance window for installing package.
Tested this scenario.

(b) After batch patching, there is call get_remaining_packages_to_install. The time to execute get_remaining_packages_to_install should not be counted in time taken by nistalling packages in batches. Hence, stopping the stopwatch before calling get_remaining_packages_to_install and writing batch installation details in the telemetry after the call get_remaining_packages_to_install so that we also get the remaining packages to install in the telemetry.
Tested this scenario.

* Added test test_is_package_install_time_available

* Addressed the comments. Some of the changes done are:
(a) using self.execution_config.reboot_setting instead of reboot_manager.reboot_setting
(b) Reordering import of modules
(c) Moving add_arch_dependencies_yum funciton to package_manager and updating its name to add_arch_dependencies

* (a) Added unit tests for code coverage
(b) refactored the method get_dependent_list. Removed that method include_dependencies as it is not required

* include_dependencies method from apt and zypper package managers
Using list() instead of copy() as copy is not supported in python2.7

* updated UA_ESM_REQUIRED to UA_ESM_Required in the test test_skip_package_version_UA_ESM_REQUIRED

* Added two unit tests:
(a) test_dependent_package_excluded
(b) test_arch_dependency_install_success

* Add three new test cases:
test_dependent_package_excluded_and_not_enough_time_for_batch_patching
test_no_updates_to_install
test_get_remaining_packages_to_install

* Added 3 new test cases:
test_include_dependency_apt
test_include_dependency_yum
test_include_dependency_zypper

* Incorporated PR comments. Mostly renaming of variables and edit logs. Using package in new code instead of patch and update except for already present variables and functions.

* Incorporated PR comments. Some of the changes are:
(a) Updated <>_parent_update_count to <>_parent_package_install_count
(b) Removed the log ("\n\nInstalling patches in sequence...")
(c) Added a prefix string in the performance logs for install package, install batch and the process of install packages in batches. This will help in kusto queries.

* Incorporating PR comments

* Call self.package_manager.is_reboot_pending() instead of self.is_reboot_pending() after merge from master because is_reboot_pending() is moved to package_manager

* Incorporate PR comments

* Incorporated PR comments

* The changes contains following:
(a) Incorporated PR comments

(b) Using all_package_versions instead of package_versions while finding version of the dependent package in the function include_dependencies in the file PatchInstaller.py. The list all_package_versions contains versions of all the packages available to install but the list package_versions contains only those packages which are selected by user. There might be dependency on the package which is not selected by user and hence all_package_versions should be used.

(c) Making a list containing not attempted packages and failed packages in batch patching process. In fallback to sequential patching, this list is used as the input packages to install. The method get_remaining_packages_to_install is no longer required. But due to this change, the newly available packages in brief time while batch patching was done are not handled. This was the case before this PR also, the newly added packages while installing the packages were not handled. To make it simple, it is better to not handle that case in this big PR. It is difficult to test this scenario.

* Changes in this commit:
(a) Addressing PR comments
(b) Adding some comments in the code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants