-
Notifications
You must be signed in to change notification settings - Fork 3
thermal: core/ACPI: Fix processor cooling device regression #12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
jamieNguyenNVIDIA
wants to merge
136
commits into
NVIDIA-BaseOS-6:main
Choose a base branch
from
jamieNguyenNVIDIA:thermal-core-patches
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
thermal: core/ACPI: Fix processor cooling device regression #12
jamieNguyenNVIDIA
wants to merge
136
commits into
NVIDIA-BaseOS-6:main
from
jamieNguyenNVIDIA:thermal-core-patches
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Andrea Righi <[email protected]>
With the new annotations schema we don't need to adjust annotations via local-mangle anymore. Same about copying configs via copy-files. Signed-off-by: Andrea Righi <[email protected]>
Include debian.master/config/annotations and run updateconfigs. Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2019126 Properties: no-test-build Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Andrea Righi <[email protected]>
…dversion" This patch is required by Rust and it can potentially break user-space. It is safer to revert this in all the kernel backported to old releases. Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2021604 Properties: no-test-build Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
We don't want to support or build rust in Jammy so override it in the local-mangle. Ignore: yes Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Using the default gcc-11 compiler in Jammy changes some gcc features so update them in the annotations. Ignore: yes Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Ignore: yes Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Replace the micellaneous changelog entries with an earlier revert with proper title and LP bug. Also move "enable rust only in the master kernel for amd64" commit to generic packaging resync. Ignore: yes Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2024539 Properties: no-test-build Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
The build dependencies are configured to install rust, so it will be considered available. This should be as it is set in the parent. Ignore: yes Signed-off-by: Stefan Bader <[email protected]>
Fixup build adjusting the expected config setting for CONFIG_RUST_IS_AVAILABLE. The rust package gets installed by the build dependencies, so it will be available. We just not enable things for HWE kernels. Signed-off-by: Stefan Bader <[email protected]>
Change URL locations in getabis to linux-hwe-6.2, add new entry to the build#2 PPA, and drop the development URLs. Ignore: yes Signed-off-by: Stefan Bader <[email protected]>
Ignore: yes Signed-off-by: Stefan Bader <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2026752 Properties: no-test-build Signed-off-by: Stefan Bader <[email protected]>
Signed-off-by: Stefan Bader <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Stefan Bader <[email protected]>
With updated pahole we have Rust potentially available. Adjust the annotations file to keep it disabled. Ignore: yes Signed-off-by: Stefan Bader <[email protected]>
This feature is now available in Lunar and Jammy so we no longer need an adjustment for the HWE kernel. Ignore: yes Signed-off-by: Stefan Bader <[email protected]>
Signed-off-by: Ian May <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2031584 Signed-off-by: Sourab Gupta <[email protected]> Acked-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]> Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Brad Figg <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033312 Properties: no-test-build Signed-off-by: Brad Figg <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685 Fix warning message from smatch tool: | smatch warnings: | drivers/perf/arm_cspmu/arm_cspmu.c:1075 arm_cspmu_find_cpu_container() | warn: variable dereferenced before check 'cpu_dev' (see line 1073) Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Link: https://lore.kernel.org/r/[email protected]/ Reviewed-by: Suzuki K Poulose <[email protected]> Signed-off-by: Besar Wicaksono <[email protected]> Acked-by: Randy Dunlap <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 16e1583) Acked-by: Jamie Nguyen <[email protected]> Acked-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685 Don't register a resource for the second page unless the dual-page extension flag is actually present to say it's valid. CC: Lorenzo Pieralisi <[email protected]> CC: Sudeep Holla <[email protected]> Reviewed-by: Suzuki K Poulose <[email protected]> Reviewed-by: Hanjun Guo <[email protected]> Reviewed-and-tested-by: Ilkka Koskinen <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Acked-by: Lorenzo Pieralisi <[email protected]> Link: https://lore.kernel.org/r/63b34656e1f7b41bcb458fb6d7496e04db757e0d.1685983270.git.robin.murphy@arm.com Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 87b3b6d) Acked-by: Jamie Nguyen <[email protected]> Acked-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685 Build-wise, the ACPI dependency consists of only a couple of things which could probably stand being factored out into ACPI helpers anyway. However for the immediate concern of working towards Devicetree support here, it's easy enough to make a few tweaks to contain the affected code locally, such that we can relax the Kconfig dependency. Reviewed-and-Tested-by: Suzuki K Poulose <[email protected]> Reviewed-by: Ilkka Koskinen <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Link: https://lore.kernel.org/r/9d126711c7498b199b3e6f5cf48ca60ffb9df54c.1685983270.git.robin.murphy@arm.com Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit f9bd34e) Acked-by: Jamie Nguyen <[email protected]> Acked-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685 The functional paths of the driver need not care about ACPI, so abstract the property of atomic doubleword access as its own flag (repacking the structure for a better fit). We also do not need to go poking directly at the APMT for standard resources which the ACPI layer has already dealt with, so deal with the optional MMIO page and interrupt in the normal firmware-agnostic manner. The few remaining portions of probing that *are* APMT-specific can still easily retrieve the APMT pointer as needed without us having to carry a duplicate copy around everywhere. Reviewed-by: Suzuki K Poulose <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Reviewed-and-tested-by: Ilkka Koskinen <[email protected]> Link: https://lore.kernel.org/r/88f97268603e1aa6016d178982a1dc2861f6770d.1685983270.git.robin.murphy@arm.com Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d2e3bb5) Acked-by: Jamie Nguyen <[email protected]> Acked-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685 Add missing MODULE_DEVICE_TABLE definition to generate modalias, which enables module autoloading. Signed-off-by: Ilkka Koskinen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 7e51d05) Acked-by: Jamie Nguyen <[email protected]> Acked-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2037688 Set the following configs: CONFIG_SPI_TEGRA210_QUAD=y CONFIG_TCG_TIS_SPI=y On Grace systems, the IMA driver emits the following log: ima: No TPM chip found, activating TPM-bypass! This occurs because the IMA driver initializes before we are able to detect the TPM. This will always be the case when the drivers required to communicate with the TPM, spi_tegra210_quad and tpm_tis_spi, are built as modules. Having these drivers as built-ins ensures that the TPM is available before the IMA driver initializes. Signed-off-by: Jamie Nguyen <[email protected]> Acked-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]> Signed-off-by: Brad Figg <[email protected]>
…rsions and add nvidia build depends for nvidia-fs-dkms BugLink: https://bugs.launchpad.net/bugs/2038099 Signed-off-by: Brad Figg <[email protected]> Acked-by: Ian May <[email protected]> Acked-by: Jacob Martin <[email protected]>
Ignore: yes Signed-off-by: Ian May <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <[email protected]>
Ignore: yes Signed-off-by: Ian May <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2038074 Properties: no-test-build Signed-off-by: Ian May <[email protected]>
Ignore: yes Signed-off-by: Ian May <[email protected]>
…port Add support for exposing rprovides data for standalone modules too. Switch to exposing provides as a shared debian/substvar file and use that in the templates. Ignore: yes Signed-off-by: Ian May <[email protected]>
Signed-off-by: Ian May <[email protected]>
The cpufreq policy notifier in the ACPI processor driver may as well be registered before the driver itself, which causes acpi_processor_cpufreq_init to be true (unless the notifier registration fails, which is unlikely at that point) when the ACPI CPU thermal cooling devices are registered, so the processor_get_max_state() result does not change while acpi_processor_driver_init() is running. Change the ordering in acpi_processor_driver_init() accordingly to prevent the max_state value from remaining 0 permanently for all ACPI CPU cooling devices due to setting acpi_processor_cpufreq_init too late. [Note that processor_get_max_state() may still return different values at different times after this change, depending on the cpufreq driver registration time, but that issue needs to be addressed separately.] Fixes: a365105("thermal: sysfs: Reuse cdev->max_state") Reported-by: Wang, Quanxian <[email protected]> Link: https://lore.kernel.org/linux-pm/[email protected] Signed-off-by: Rafael J. Wysocki <[email protected]> Tested-by: Zhang Rui <[email protected]> Reviewed-by: Zhang Rui <[email protected]> (cherry picked from commit c0e0421) Acked-by: Nathan Hartman <[email protected]> Acked-by: Jamie Nguyen <[email protected]>
Introduce a helper function, thermal_cooling_device_present(), for checking if the given cooling device is in the list of registered cooling devices to avoid some code duplication in a subsequent patch. No expected functional impact. Signed-off-by: Rafael J. Wysocki <[email protected]> Tested-by: Zhang Rui <[email protected]> Reviewed-by: Zhang Rui <[email protected]> (cherry picked from commit c43198a) Acked-by: Nathan Hartman <[email protected]> Acked-by: Jamie Nguyen <[email protected]>
Introduce a core thermal API function, thermal_cooling_device_update(), for updating the max_state value for a cooling device and rearranging its statistics in sysfs after a possible change of its ->get_max_state() callback return value. That callback is now invoked only once, during cooling device registration, to populate the max_state field in the cooling device object, so if its return value changes, it needs to be invoked again and the new return value needs to be stored as max_state. Moreover, the statistics presented in sysfs need to be rearranged in general, because there may not be enough room in them to store data for all of the possible states (in the case when max_state grows). The new function takes care of that (and some other minor things related to it), but some extra locking and lockdep annotations are added in several places too to protect against crashes in the cases when the statistics are not present or when a stale max_state value might be used by sysfs attributes. Note that the actual user of the new function will be added separately. Link: https://lore.kernel.org/linux-pm/[email protected]/ Signed-off-by: Rafael J. Wysocki <[email protected]> Tested-by: Zhang Rui <[email protected]> Reviewed-by: Zhang Rui <[email protected]> (cherry picked from commit 790930f) Acked-by: Nathan Hartman <[email protected]> Acked-by: Jamie Nguyen <[email protected]>
…y changes When a cpufreq policy appears or goes away, the CPU cooling devices for the CPUs covered by that policy need to be updated so that the new processor_get_max_state() value is stored as max_state and the statistics in sysfs are rearranged for each of them. Do that accordingly in acpi_thermal_cpufreq_init() and acpi_thermal_cpufreq_exit(). Fixes: a365105("thermal: sysfs: Reuse cdev->max_state") Reported-by: Wang, Quanxian <[email protected]> Link: https://lore.kernel.org/linux-pm/[email protected] Signed-off-by: Rafael J. Wysocki <[email protected]> Tested-by: Zhang Rui <[email protected]> Reviewed-by: Zhang Rui <[email protected]> (cherry picked from commit 22c52fa) Acked-by: Nathan Hartman <[email protected]> Acked-by: Jamie Nguyen <[email protected]>
The lockdep_assert_held() calls added to cooling_device_stats_setup() and cooling_device_stats_destroy() by commit 790930f ("thermal: core: Introduce thermal_cooling_device_update()") trigger false-positive lockdep reports in code paths that are not subject to race conditions (before cooling device registration and after cooling device removal). For this reason, remove the lockdep_assert_held() calls from both cooling_device_stats_setup() and cooling_device_stats_destroy() and add one to thermal_cooling_device_stats_reinit() that has to be called under the cdev lock. Fixes: 790930f ("thermal: core: Introduce thermal_cooling_device_update()") Link: https://lore.kernel.org/linux-acpi/[email protected] Reported-by: Imre Deak <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]> (cherry picked from commit b57841f) Acked-by: Jamie Nguyen <[email protected]>
062608c to
33ce203
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
From: "Rafael J. Wysocki" [email protected]
To: Linux PM [email protected]
Cc: Zhang Rui [email protected],
Linux ACPI [email protected],
LKML [email protected],
Daniel Lezcano [email protected],
Srinivas Pandruvada [email protected],
Viresh Kumar [email protected],
Quanxian Wang [email protected]
Subject: [PATCH v1 0/4] thermal: core/ACPI: Fix processor cooling device regression
Date: Fri, 03 Mar 2023 20:18:30 +0100 [thread overview]
Message-ID: 2148907.irdbgypaU6@kreacher (raw)
Hi All,
As reported by Rui in this thread:
Link: https://lore.kernel.org/linux-pm/[email protected]/
some recent changes in the thermal core cause the CPU cooling devices
registered by the ACPI processor driver to become unusable in some cases
and somewhat crippled in general.
The problem is that the ACPI processor driver changes its ->get_max_state()
callback return value depending on whether or not cpufreq is available and
there is a cpufreq policy for a given CPU. However, the thermal core has
always assumed that the return value of that callback will not change, which
in fact is relied on by the cooling device statistics code. In particular,
when the ->get_max_state() grows, the memory buffer allocated for storing the
statistics will be too small and corruption may ensue as a result.
For this reason, the issue needs to be addressed in the ACPI processor driver
and not in the thermal core, but the core needs to help somewhat too. Namely,
it needs to provide a helper allowing an interested driver to update the
max_state value for an already registered cooling device in certain situations
which will also cause the statistics to be rebuilt.
This series implements the above and for details please refer to the individual
patch chagelogs.
Please also note that it has only been lightly tested, so more testing and of
course review of it is welcome.
Thanks!