Skip to content

Conversation

@jamieNguyenNVIDIA
Copy link
Contributor

From: "Rafael J. Wysocki" [email protected]
To: Linux PM [email protected]
Cc: Zhang Rui [email protected],
Linux ACPI [email protected],
LKML [email protected],
Daniel Lezcano [email protected],
Srinivas Pandruvada [email protected],
Viresh Kumar [email protected],
Quanxian Wang [email protected]
Subject: [PATCH v1 0/4] thermal: core/ACPI: Fix processor cooling device regression
Date: Fri, 03 Mar 2023 20:18:30 +0100 [thread overview]
Message-ID: 2148907.irdbgypaU6@kreacher (raw)

Hi All,

As reported by Rui in this thread:

Link: https://lore.kernel.org/linux-pm/[email protected]/

some recent changes in the thermal core cause the CPU cooling devices
registered by the ACPI processor driver to become unusable in some cases
and somewhat crippled in general.

The problem is that the ACPI processor driver changes its ->get_max_state()
callback return value depending on whether or not cpufreq is available and
there is a cpufreq policy for a given CPU. However, the thermal core has
always assumed that the return value of that callback will not change, which
in fact is relied on by the cooling device statistics code. In particular,
when the ->get_max_state() grows, the memory buffer allocated for storing the
statistics will be too small and corruption may ensue as a result.

For this reason, the issue needs to be addressed in the ACPI processor driver
and not in the thermal core, but the core needs to help somewhat too. Namely,
it needs to provide a helper allowing an interested driver to update the
max_state value for an already registered cooling device in certain situations
which will also cause the statistics to be rebuilt.

This series implements the above and for details please refer to the individual
patch chagelogs.

Please also note that it has only been lightly tested, so more testing and of
course review of it is welcome.

Thanks!

Andrea Righi and others added 30 commits October 9, 2023 16:26
With the new annotations schema we don't need to adjust annotations via
local-mangle anymore. Same about copying configs via copy-files.

Signed-off-by: Andrea Righi <[email protected]>
Include debian.master/config/annotations and run updateconfigs.

Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2019126
Properties: no-test-build
Signed-off-by: Andrea Righi <[email protected]>
…dversion"

This patch is required by Rust and it can potentially break user-space.
It is safer to revert this in all the kernel backported to old releases.

Signed-off-by: Andrea Righi <[email protected]>
Ignore: yes
Signed-off-by: Andrea Righi <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2021604
Properties: no-test-build
Signed-off-by: Andrea Righi <[email protected]>
We don't want to support or build rust in Jammy so override it in the
local-mangle.

Ignore: yes
Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Using the default gcc-11 compiler in Jammy changes some gcc features
so update them in the annotations.

Ignore: yes
Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Ignore: yes
Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Replace the micellaneous changelog entries with an earlier revert
with proper title and LP bug. Also move "enable rust only in the
master kernel for amd64" commit to generic packaging resync.

Ignore: yes
Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2024539
Properties: no-test-build
Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
Signed-off-by: Luke Nowakowski-Krijger <[email protected]>
The build dependencies are configured to install rust, so it will be
considered available. This should be as it is set in the parent.

Ignore: yes
Signed-off-by: Stefan Bader <[email protected]>
Fixup build adjusting the expected config setting for
CONFIG_RUST_IS_AVAILABLE. The rust package gets installed
by the build dependencies, so it will be available. We
just not enable things for HWE kernels.

Signed-off-by: Stefan Bader <[email protected]>
Change URL locations in getabis to linux-hwe-6.2, add new entry to the
build#2 PPA, and drop the development URLs.

Ignore: yes
Signed-off-by: Stefan Bader <[email protected]>
Ignore: yes
Signed-off-by: Stefan Bader <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2026752
Properties: no-test-build
Signed-off-by: Stefan Bader <[email protected]>
With updated pahole we have Rust potentially available. Adjust the
annotations file to keep it disabled.

Ignore: yes
Signed-off-by: Stefan Bader <[email protected]>
This feature is now available in Lunar and Jammy so we no longer need an
adjustment for the HWE kernel.

Ignore: yes
Signed-off-by: Stefan Bader <[email protected]>
ianmay81 and others added 28 commits October 10, 2023 14:30
    BugLink: https://bugs.launchpad.net/bugs/2031584

    Signed-off-by: Sourab Gupta <[email protected]>
    Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Ignore: yes
Signed-off-by: Brad Figg <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685

Fix warning message from smatch tool:
  | smatch warnings:
  | drivers/perf/arm_cspmu/arm_cspmu.c:1075 arm_cspmu_find_cpu_container()
  |    warn: variable dereferenced before check 'cpu_dev' (see line 1073)

Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/
Reviewed-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Besar Wicaksono <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 16e1583)
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685

Don't register a resource for the second page unless the dual-page
extension flag is actually present to say it's valid.

CC: Lorenzo Pieralisi <[email protected]>
CC: Sudeep Holla <[email protected]>
Reviewed-by: Suzuki K Poulose <[email protected]>
Reviewed-by: Hanjun Guo <[email protected]>
Reviewed-and-tested-by: Ilkka Koskinen <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
Acked-by: Lorenzo Pieralisi <[email protected]>
Link: https://lore.kernel.org/r/63b34656e1f7b41bcb458fb6d7496e04db757e0d.1685983270.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 87b3b6d)
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685

Build-wise, the ACPI dependency consists of only a couple of things
which could probably stand being factored out into ACPI helpers anyway.
However for the immediate concern of working towards Devicetree support
here, it's easy enough to make a few tweaks to contain the affected code
locally, such that we can relax the Kconfig dependency.

Reviewed-and-Tested-by: Suzuki K Poulose <[email protected]>
Reviewed-by: Ilkka Koskinen <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
Link: https://lore.kernel.org/r/9d126711c7498b199b3e6f5cf48ca60ffb9df54c.1685983270.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit f9bd34e)
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685

The functional paths of the driver need not care about ACPI, so abstract
the property of atomic doubleword access as its own flag (repacking the
structure for a better fit). We also do not need to go poking directly
at the APMT for standard resources which the ACPI layer has already
dealt with, so deal with the optional MMIO page and interrupt in the
normal firmware-agnostic manner. The few remaining portions of probing
that *are* APMT-specific can still easily retrieve the APMT pointer as
needed without us having to carry a duplicate copy around everywhere.

Reviewed-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
Reviewed-and-tested-by: Ilkka Koskinen <[email protected]>
Link: https://lore.kernel.org/r/88f97268603e1aa6016d178982a1dc2861f6770d.1685983270.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit d2e3bb5)
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2033685

Add missing MODULE_DEVICE_TABLE definition to generate modalias, which
enables module autoloading.

Signed-off-by: Ilkka Koskinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 7e51d05)
Acked-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/2037688

Set the following configs:

  CONFIG_SPI_TEGRA210_QUAD=y
  CONFIG_TCG_TIS_SPI=y

On Grace systems, the IMA driver emits the following log:

  ima: No TPM chip found, activating TPM-bypass!

This occurs because the IMA driver initializes before we are able to detect
the TPM. This will always be the case when the drivers required to
communicate with the TPM, spi_tegra210_quad and tpm_tis_spi, are built as
modules.

Having these drivers as built-ins ensures that the TPM is available before
the IMA driver initializes.

Signed-off-by: Jamie Nguyen <[email protected]>
Acked-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
…rsions and add nvidia build depends for nvidia-fs-dkms

BugLink: https://bugs.launchpad.net/bugs/2038099

Signed-off-by: Brad Figg <[email protected]>
Acked-by: Ian May <[email protected]>
Acked-by: Jacob Martin <[email protected]>
Ignore: yes
Signed-off-by: Ian May <[email protected]>
…port

Add support for exposing rprovides data for standalone modules too.
Switch to exposing provides as a shared debian/substvar file and use
that in the templates.

Ignore: yes
Signed-off-by: Ian May <[email protected]>
The cpufreq policy notifier in the ACPI processor driver may as
well be registered before the driver itself, which causes
acpi_processor_cpufreq_init to be true (unless the notifier
registration fails, which is unlikely at that point) when the
ACPI CPU thermal cooling devices are registered, so the
processor_get_max_state() result does not change while
acpi_processor_driver_init() is running.

Change the ordering in acpi_processor_driver_init() accordingly
to prevent the max_state value from remaining 0 permanently for all
ACPI CPU cooling devices due to setting acpi_processor_cpufreq_init
too late.  [Note that processor_get_max_state() may still return
different values at different times after this change, depending on
the cpufreq driver registration time, but that issue needs to be
addressed separately.]

Fixes: a365105("thermal: sysfs: Reuse cdev->max_state")
Reported-by: Wang, Quanxian <[email protected]>
Link: https://lore.kernel.org/linux-pm/[email protected]
Signed-off-by: Rafael J. Wysocki <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Reviewed-by: Zhang Rui <[email protected]>
(cherry picked from commit c0e0421)
Acked-by: Nathan Hartman <[email protected]>
Acked-by: Jamie Nguyen <[email protected]>
Introduce a helper function, thermal_cooling_device_present(), for
checking if the given cooling device is in the list of registered
cooling devices to avoid some code duplication in a subsequent
patch.

No expected functional impact.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Reviewed-by: Zhang Rui <[email protected]>
(cherry picked from commit c43198a)
Acked-by: Nathan Hartman <[email protected]>
Acked-by: Jamie Nguyen <[email protected]>
Introduce a core thermal API function, thermal_cooling_device_update(),
for updating the max_state value for a cooling device and rearranging
its statistics in sysfs after a possible change of its ->get_max_state()
callback return value.

That callback is now invoked only once, during cooling device
registration, to populate the max_state field in the cooling device
object, so if its return value changes, it needs to be invoked again
and the new return value needs to be stored as max_state.  Moreover,
the statistics presented in sysfs need to be rearranged in general,
because there may not be enough room in them to store data for all
of the possible states (in the case when max_state grows).

The new function takes care of that (and some other minor things
related to it), but some extra locking and lockdep annotations are
added in several places too to protect against crashes in the cases
when the statistics are not present or when a stale max_state value
might be used by sysfs attributes.

Note that the actual user of the new function will be added separately.

Link: https://lore.kernel.org/linux-pm/[email protected]/
Signed-off-by: Rafael J. Wysocki <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Reviewed-by: Zhang Rui <[email protected]>
(cherry picked from commit 790930f)
Acked-by: Nathan Hartman <[email protected]>
Acked-by: Jamie Nguyen <[email protected]>
…y changes

When a cpufreq policy appears or goes away, the CPU cooling devices for
the CPUs covered by that policy need to be updated so that the new
processor_get_max_state() value is stored as max_state and the
statistics in sysfs are rearranged for each of them.

Do that accordingly in acpi_thermal_cpufreq_init() and
acpi_thermal_cpufreq_exit().

Fixes: a365105("thermal: sysfs: Reuse cdev->max_state")
Reported-by: Wang, Quanxian <[email protected]>
Link: https://lore.kernel.org/linux-pm/[email protected]
Signed-off-by: Rafael J. Wysocki <[email protected]>
Tested-by: Zhang Rui <[email protected]>
Reviewed-by: Zhang Rui <[email protected]>
(cherry picked from commit 22c52fa)
Acked-by: Nathan Hartman <[email protected]>
Acked-by: Jamie Nguyen <[email protected]>
The lockdep_assert_held() calls added to cooling_device_stats_setup()
and cooling_device_stats_destroy() by commit 790930f ("thermal:
core: Introduce thermal_cooling_device_update()") trigger false-positive
lockdep reports in code paths that are not subject to race conditions
(before cooling device registration and after cooling device removal).

For this reason, remove the lockdep_assert_held() calls from both
cooling_device_stats_setup() and cooling_device_stats_destroy() and
add one to thermal_cooling_device_stats_reinit() that has to be called
under the cdev lock.

Fixes: 790930f ("thermal: core: Introduce thermal_cooling_device_update()")
Link: https://lore.kernel.org/linux-acpi/[email protected]
Reported-by: Imre Deak <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
(cherry picked from commit b57841f)
Acked-by: Jamie Nguyen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.