forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 10
Tracking: qgroup fixes from Qu #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…linux/kernel/git/sunxi/linux into fixes Allwinner fixes for 4.11, bis Two fixes for the recent A33 cpufreq support, and one to fix a missing register in the A64 USB PHY node. * tag 'sunxi-fixes-for-4.11-2' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux: arm64: allwinner: a64: add pmu0 regs for USB PHY ARM: sun8i: a33: add operating-points-v2 property to all nodes ARM: sun8i: a33: remove highest OPP to fix CPU crashes Signed-off-by: Olof Johansson <[email protected]>
… into fixes mvebu fixes for 4.11 (part 1) Fix build of the board code for orion5x when some parts are configured as module. * tag 'mvebu-fixes-4.11-1' of git://git.infradead.org/linux-mvebu: ARM: orion5x: only call into phylib when available Signed-off-by: Olof Johansson <[email protected]>
Doug Smythies reports oops with KSM in this backtrace, I've been seeing the same: page_vma_mapped_walk+0xe6/0x5b0 page_referenced_one+0x91/0x1a0 rmap_walk_ksm+0x100/0x190 rmap_walk+0x4f/0x60 page_referenced+0x149/0x170 shrink_active_list+0x1c2/0x430 shrink_node_memcg+0x67a/0x7a0 shrink_node+0xe1/0x320 kswapd+0x34b/0x720 Just as observed in commit 4b0ece6 ("mm: migrate: fix remove_migration_pte() for ksm pages"), you cannot use page->index calculations on ksm pages. page_vma_mapped_walk() is relying on __vma_address(), where a ksm page can lead it off the end of the page table, and into whatever nonsense is in the next page, ending as an oops inside check_pte()'s pte_page(). KSM tells page_vma_mapped_walk() exactly where to look for the page, it does not need any page->index calculation: and that's so also for all the normal and file and anon pages - just not for THPs and their subpages. Get out early in most cases: instead of a PageKsm test, move down the earlier not-THP-page test, as suggested by Kirill. I'm also slightly worried that this loop can stray into other vmas, so added a vm_end test to prevent surprises; though I have not imagined anything worse than a very contrived case, in which a page mlocked in the next vma might be reclaimed because it is not mlocked in this vma. Fixes: ace71a1 ("mm: introduce page_vma_mapped_walk()") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Doug Smythies <[email protected]> Tested-by: Doug Smythies <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
fdinfo for userfault file descriptor reports UFFD_API_FEATURES. Up until recently, the UFFD_API_FEATURES was defined as 0, therefore corresponding field in fdinfo always contained zero. Now, with introduction of several additional features, UFFD_API_FEATURES is not longer 0 and it seems better to report actual features requested for the userfaultfd object described by the fdinfo. First, the applications that were using userfault will still see zero at the features field in fdinfo. Next, reporting actual features rather than available features, gives clear indication of what userfault features are used by an application. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Fixes: 11fb998 ("mm: move most file-based accounting to the node") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Polyakov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> [4.8+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
When __{start,end}_ro_after_init is referenced from C code, we run into
the following build errors on blackfin:
kernel/extable.c:169: undefined reference to `__start_ro_after_init'
kernel/extable.c:169: undefined reference to `__end_ro_after_init'
The build error is due to the fact that blackfin is one of the few
arches that prepends an underscore '_' to all symbols defined in C.
Fix this by wrapping __{start,end}_ro_after_init in vmlinux.lds.h with
VMLINUX_SYMBOL(), which adds the necessary prefix for arches that have
HAVE_UNDERSCORE_SYMBOL_PREFIX.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jessica Yu <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Eddie Kovsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
In PT_SEIZED + LISTEN mode STOP/CONT signals cause a wakeup against __TASK_TRACED. If this races with the ptrace_unfreeze_traced at the end of a PTRACE_LISTEN, this can wake the task /after/ the check against __TASK_TRACED, but before the reset of state to TASK_TRACED. This causes it to instead clobber TASK_WAKING, allowing a subsequent wakeup against TRACED while the task is still on the rq wake_list, corrupting it. Oleg said: "The kernel can crash or this can lead to other hard-to-debug problems. In short, "task->state = TASK_TRACED" in ptrace_unfreeze_traced() assumes that nobody else can wake it up, but PTRACE_LISTEN breaks the contract. Obviusly it is very wrong to manipulate task->state if this task is already running, or WAKING, or it sleeps again" [[email protected]: coding-style fixes] Fixes: 9899d11 ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ben Segall <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Setting thp defrag mode of "defer+madvise" actually sets "defer" in the kernel due to the name similarity and the out-of-order way the string is checked in defrag_store(). Check the string in the correct order so that TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG is set appropriately for "defer+madvise". Fixes: 21440d7 ("mm, thp: add new defer+madvise defrag option") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
While running generic/340 in my test setup I hit the following race. It
can happen with kernels that support FS DAX PMDs, so v4.10 thru
v4.11-rc5.
Thread 1 Thread 2
-------- --------
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
'entry' is NULL, can't call lock_slot()
spin_unlock_irq()
radix_tree_preload()
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
...
lock_slot()
spin_unlock_irq()
dax_pmd_insert_mapping()
<inserts a PMD mapping>
spin_lock_irq()
__radix_tree_insert() fails with -EEXIST
<fall back to 4k fault, and die horribly
when inserting a 4k entry where a PMD exists>
The issue is that we have to drop mapping->tree_lock while calling
radix_tree_preload(), but since we didn't have a radix tree entry to
lock (unlike in the pmd_downgrade case) we have no protection against
Thread 2 coming along and inserting a PMD at the same index. For 4k
entries we handled this with a special-case response to -EEXIST coming
from the __radix_tree_insert(), but this doesn't save us for PMDs
because the -EEXIST case can also mean that we collided with a 4k entry
in the radix tree at a different index, but one that is covered by our
PMD range.
So, correctly handle both the 4k and 2M collision cases by explicitly
re-checking the radix tree for an entry at our index once we reacquire
mapping->tree_lock.
This patch has made it through a clean xfstests run with the current
v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
loop. It used to fail within the first 10 iterations.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: <[email protected]> [4.10+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
We got need_resched() warnings in swap_cgroup_swapoff() because swap_cgroup_ctrl[type].length is particularly large. Reschedule when needed. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Set current email address to replace previous employers email addresses. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jeffy Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
per cpu lru caches. This seems more than necessary because both can run
on a single WQ. Both do not block on locks requiring a memory
allocation nor perform any allocations themselves. We will save one
rescuer thread this way.
On the other hand drain_all_pages() queues work on the system wq which
doesn't have rescuer and so this depend on memory allocation (when all
workers are stuck allocating and new ones cannot be created).
Initially we thought this would be more of a theoretical problem but
Hugh Dickins has reported:
: 4.11-rc has been giving me hangs after hours of swapping load. At
: first they looked like memory leaks ("fork: Cannot allocate memory");
: but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
: before looking at /proc/meminfo one time, and the stat_refresh stuck
: in D state, waiting for completion of flush_work like many kworkers.
: kthreadd waiting for completion of flush_work in drain_all_pages().
This worker should be using WQ_RECLAIM as well in order to guarantee a
forward progress. We can reuse the same one as for lru draining and
vmstat.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Suggested-by: Tetsuo Handa <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Tested-by: Yang Li <[email protected]>
Tested-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Merge misc fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <[email protected]>: mm: move pcp and lru-pcp draining into single wq mailmap: update Yakir Yang email address mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff() dax: fix radix tree insertion race mm, thp: fix setting of defer+madvise thp defrag mode ptrace: fix PTRACE_LISTEN race corrupting task->state vmlinux.lds: add missing VMLINUX_SYMBOL macros mm/page_alloc.c: fix print order in show_free_areas() userfaultfd: report actual registered features in fdinfo mm: fix page_vma_mapped_walk() for ksm pages
…udit Pull audit cleanup from Paul Moore: "A week later than I had hoped, but as promised, here is the audit uninline-fix we talked about during the last audit pull request. The patch is slightly different than what we originally discussed as it made more sense to keep the audit_signal_info() function in auditsc.c rather than move it and bunch of other related variables/definitions into audit.c/audit.h. At some point in the future I need to look at how the audit code is organized across kernel/audit*, I suspect we could do things a bit better, but it doesn't seem like a -rc release is a good place for that ;) Regardless, this patch passes our tests without problem and looks good for v4.11" * 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit: audit: move audit_signal_info() into kernel/auditsc.c
Pull KVM fixes from Radim Krčmář:
"ARM:
- Fix a problem with GICv3 userspace save/restore
- Clarify GICv2 userspace save/restore ABI
- Be more careful in clearing GIC LRs
- Add missing synchronization primitive to our MMU handling code
PPC:
- Check for a NULL return from kzalloc
s390:
- Prevent translation exception errors on valid page tables for the
instruction-exection-protection support
x86:
- Fix Page-Modification Logging when running a nested guest"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: PPC: Book3S HV: Check for kmalloc errors in ioctl
KVM: nVMX: initialize PML fields in vmcs02
KVM: nVMX: do not leak PML full vmexit to L1
KVM: arm/arm64: vgic: Fix GICC_PMR uaccess on GICv3 and clarify ABI
KVM: arm64: Ensure LRs are clear when they should be
kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd
KVM: s390: remove change-recording override support
arm/arm64: KVM: Take mmap_sem in kvm_arch_prepare_memory_region
arm/arm64: KVM: Take mmap_sem in stage2_unmap_vm
Pull sparc fixes from David Miller: "Several fixes here, mostly having to due with either build errors or memory corruptions depending upon whether you have THP enabled or not" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: remove unused wp_works_ok macro sparc32: Export vac_cache_size to fix build error sparc64: Fix memory corruption when THP is enabled sparc64: Fix kernel panic due to erroneous #ifdef surrounding pmd_write() arch/sparc: Avoid DCTI Couples sparc64: kern_addr_valid regression sparc64: Add support for 2G hugepages sparc64: Fix size check in huge_pte_alloc
vfs_llseek will check whether the file mode has FMODE_LSEEK, no return failure. But ashmem can be lseek, so add FMODE_LSEEK to ashmem file. Comment From Greg Hackmann: ashmem_llseek() passes the llseek() call through to the backing shmem file. 91360b0 ("ashmem: use vfs_llseek()") changed this from directly calling the file's llseek() op into a VFS layer call. This also adds a check for the FMODE_LSEEK bit, so without that bit ashmem_llseek() now always fails with -ESPIPE. Fixes: 91360b0 ("ashmem: use vfs_llseek()") Signed-off-by: Shuxiao Zhang <[email protected]> Tested-by: Greg Hackmann <[email protected]> Cc: stable <[email protected]> # 3.18+ Signed-off-by: Greg Kroah-Hartman <[email protected]>
pppol2tp_setsockopt() unconditionally overwrites the error value returned by pppol2tp_tunnel_setsockopt() or pppol2tp_session_setsockopt(), thus hiding errors from userspace. Fixes: fd558d1 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
pppol2tp_getsockopt() doesn't take into account the error code returned by pppol2tp_tunnel_getsockopt() or pppol2tp_session_getsockopt(). If error occurs there, pppol2tp_getsockopt() continues unconditionally and reports erroneous values. Fixes: fd558d1 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Guillaume Nault says: ==================== l2tp: fix error handling of PPPoL2TP socket options Fix pppol2tp_[gs]etsockopt() so that they don't ignore errors returned by their helper functions. ==================== Signed-off-by: David S. Miller <[email protected]>
Because TCP_MIB_OUTRSTS is an important count, so always increase it whatever send it successfully or not. Now move the increment of TCP_MIB_OUTRSTS to the top of tcp_send_active_reset to make sure it is increased always even though fail to alloc skb. Signed-off-by: Gao Feng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
A patch documenting how to specify which kernels a particular fix should be backported to (seemingly) inadvertently added a minus sign after the kernel version. This particular stable-tag format had never been used prior to this patch, and was neither present when the patch in question was first submitted (it was added in v2 without any comment). Drop the minus sign to avoid any confusion. Fixes: fdc81b7 ("stable_kernel_rules: Add clause about specification of kernel versions to patch.") Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
ops->show() can return a negative error code. Commit 65da348 ("sysfs: correctly handle short reads on PREALLOC attrs.") (in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors would look like large numbers. As a result, if an error is returned, sysfs_kf_read() will return the value of 'count', typically 4096. Commit 17d0774 ("sysfs: correctly handle read offset on PREALLOC attrs") (in v4.8) extended this error to use the unsigned large 'len' as a size for memmove(). Consequently, if ->show returns an error, then the first read() on the sysfs file will return 4096 and could return uninitialized memory to user-space. If the application performs a subsequent read, this will trigger a memmove() with extremely large count, and is likely to crash the machine is bizarre ways. This bug can currently only be triggered by reading from an md sysfs attribute declared with __ATTR_PREALLOC() during the brief period between when mddev_put() deletes an mddev from the ->all_mddevs list, and when mddev_delayed_delete() - which is scheduled on a workqueue - completes. Before this, an error won't be returned by the ->show() After this, the ->show() won't be called. I can reproduce it reliably only by putting delay like usleep_range(500000,700000); early in mddev_delayed_delete(). Then after creating an md device md0 run echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state The bug can be triggered without the usleep. Fixes: 65da348 ("sysfs: correctly handle short reads on PREALLOC attrs.") Fixes: 17d0774 ("sysfs: correctly handle read offset on PREALLOC attrs") Cc: [email protected] Signed-off-by: NeilBrown <[email protected]> Acked-by: Tejun Heo <[email protected]> Reported-and-tested-by: Miroslav Benes <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
…/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2017-04-04 this is a pull request of two patches for net/master. The first patch by Markus Marb fixes a register read access in the ifi driver. The second patch by Geert Uytterhoeven for the rcar driver remove the printing of a kernel virtual address. ==================== Signed-off-by: David S. Miller <[email protected]>
Separate out kernfs from driver core and add myself as a co-maintainer. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
…tvec Currently, inputting the following command will succeed but actually the value will be truncated: # echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat This is not friendly to the user, so instead, we should report error when the value is larger than UINT_MAX. Fixes: e7d316a ("sysctl: handle error writing UINT_MAX to u32 fields") Signed-off-by: Liping Zhang <[email protected]> Cc: Subash Abhinov Kasiviswanathan <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Eric W. Biederman <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
In the case that compat_get_bitmap fails we do not want to copy the bitmap to the user as it will contain uninitialized stack data and leak sensitive data. Signed-off-by: Chris Salls <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
…el/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 4.11:
Headed to stable:
- disable HFSCR[TM] if TM is not supported, fixes a potential host
kernel crash triggered by a hostile guest, but only in
configurations that no one uses
- don't try to fix up misaligned load-with-reservation instructions
- fix flush_(d|i)cache_range() called from modules on little endian
kernels
- add missing global TLB invalidate if cxl is active
- fix missing preempt_disable() in crc32c-vpmsum
And a fix for selftests build changes that went in this release:
- selftests/powerpc: Fix standalone powerpc build
Thanks to: Benjamin Herrenschmidt, Frederic Barrat, Oliver O'Halloran,
Paul Mackerras"
* tag 'powerpc-4.11-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/crypto/crc32c-vpmsum: Fix missing preempt_disable()
powerpc/mm: Add missing global TLB invalidate if cxl is active
powerpc/64: Fix flush_(d|i)cache_range() called from modules
powerpc: Don't try to fix up misaligned load-with-reservation instructions
powerpc: Disable HFSCR[TM] if TM is not supported
selftests/powerpc: Fix standalone powerpc build
…nel/git/linusw/linux-pinctrl Pull pin control fix from Linus Walleij: "This late fix for pin control is hopefully the last I send this cycle. The problem was detected early in the v4.11 release cycle and there has been some back and forth on how to solve it. Sadly the proper fix arrives late, but at least not too late. An issue was detected with pin control on the Freescale i.MX after the refactorings for more general group and function handling. We now have the proper fix for this" * tag 'pinctrl-v4.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: core: Fix pinctrl_register_and_init() with pinctrl_enable()
Pull block fixes from Jens Axboe:
"Here's a pull request for 4.11-rc, fixing a set of issues mostly
centered around the new scheduling framework. These have been brewing
for a while, but split up into what we absolutely need in 4.11, and
what we can defer until 4.12. These are well tested, on both single
queue and multiqueue setups, and with and without shared tags. They
fix several hangs that have happened in testing.
This is obviously larger than I would have preferred at this point in
time, but I don't think we can shave much off this and still get the
desired results.
In detail, this pull request contains:
- a set of five fixes for NVMe, mostly from Christoph and one from
Roland.
- a series from Bart, fixing issues with dm-mq and SCSI shared tags
and scheduling. Note that one of those patches commit messages may
read like an optimization, but it is in fact an important fix for
queue restarts in particular.
- a series from Omar, most importantly fixing a hang with multiple
hardware queues when we fail to get a driver tag. Another important
fix in there is for resizing hardware queues, which nbd does when
handling multiple sockets for one connection.
- fixing an imbalance in putting the ctx for hctx request allocations
from Minchan"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: Restart a single queue if tag sets are shared
dm rq: Avoid that request processing stalls sporadically
scsi: Avoid that SCSI queues get stuck
blk-mq: Introduce blk_mq_delay_run_hw_queue()
blk-mq: remap queues when adding/removing hardware queues
blk-mq-sched: fix crash in switch error path
blk-mq-sched: set up scheduler tags when bringing up new queues
blk-mq-sched: refactor scheduler initialization
blk-mq: use the right hctx when getting a driver tag fails
nvmet: fix byte swap in nvmet_parse_io_cmd
nvmet: fix byte swap in nvmet_execute_write_zeroes
nvmet: add missing byte swap in nvmet_get_smart_log
nvme: add missing byte swap in nvme_setup_discard
nvme: Correct NVMF enum values to match NVMe-oF rev 1.0
block: do not put mq context in blk_mq_alloc_request_hctx
kdave
pushed a commit
that referenced
this pull request
Aug 29, 2025
These iterations require the read lock, otherwise RCU lockdep will splat: ============================= WARNING: suspicious RCU usage 6.17.0-rc3-00014-g31419c045d64 #6 Tainted: G O ----------------------------- drivers/base/power/main.c:1333 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 5 locks held by rtcwake/547: #0: 00000000643ab418 (sb_writers#6){.+.+}-{0:0}, at: file_start_write+0x2b/0x3a #1: 0000000067a0ca88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x181/0x24b #2: 00000000631eac40 (kn->active#3){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x191/0x24b #3: 00000000609a1308 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0xaf/0x30b #4: 0000000060c0fdb0 (device_links_srcu){.+.+}-{0:0}, at: device_links_read_lock+0x75/0x98 stack backtrace: CPU: 0 UID: 0 PID: 547 Comm: rtcwake Tainted: G O 6.17.0-rc3-00014-g31419c045d64 #6 VOLUNTARY Tainted: [O]=OOT_MODULE Stack: 223721b3a80 6089eac6 00000001 00000001 ffffff00 6089eac6 00000535 6086e528 721b3ac0 6003c294 00000000 60031fc0 Call Trace: [<600407ed>] show_stack+0x10e/0x127 [<6003c294>] dump_stack_lvl+0x77/0xc6 [<6003c2fd>] dump_stack+0x1a/0x20 [<600bc2f8>] lockdep_rcu_suspicious+0x116/0x13e [<603d8ea1>] dpm_async_suspend_superior+0x117/0x17e [<603d980f>] device_suspend+0x528/0x541 [<603da24b>] dpm_suspend+0x1a2/0x267 [<603da837>] dpm_suspend_start+0x5d/0x72 [<600ca0c9>] suspend_devices_and_enter+0xab/0x736 [...] Add the fourth argument to the iteration to annotate this and avoid the splat. Fixes: 0679963 ("PM: sleep: Make async suspend handle suppliers like parents") Fixes: ed18738 ("PM: sleep: Make async resume handle consumers like children") Signed-off-by: Johannes Berg <[email protected]> Link: https://patch.msgid.link/20250826134348.aba79f6e6299.I9ecf55da46ccf33778f2c018a82e1819d815b348@changeid Signed-off-by: Rafael J. Wysocki <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 1, 2025
When a large VM, specifically one that holds a significant number of PTEs, gets abruptly destroyed, the following warning is seen during the page-table walk: sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE Tainted: [O]=OOT_MODULE Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x3c/0xb8 dump_stack+0x18/0x30 resched_latency_warn+0x7c/0x88 sched_tick+0x1c4/0x268 update_process_times+0xa8/0xd8 tick_nohz_handler+0xc8/0x168 __hrtimer_run_queues+0x11c/0x338 hrtimer_interrupt+0x104/0x308 arch_timer_handler_phys+0x40/0x58 handle_percpu_devid_irq+0x8c/0x1b0 generic_handle_domain_irq+0x48/0x78 gic_handle_irq+0x1b8/0x408 call_on_irq_stack+0x24/0x30 do_interrupt_handler+0x54/0x78 el1_interrupt+0x44/0x88 el1h_64_irq_handler+0x18/0x28 el1h_64_irq+0x84/0x88 stage2_free_walker+0x30/0xa0 (P) __kvm_pgtable_walk+0x11c/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 kvm_pgtable_walk+0xc4/0x140 kvm_pgtable_stage2_destroy+0x5c/0xf0 kvm_free_stage2_pgd+0x6c/0xe8 kvm_uninit_stage2_mmu+0x24/0x48 kvm_arch_flush_shadow_all+0x80/0xa0 kvm_mmu_notifier_release+0x38/0x78 __mmu_notifier_release+0x15c/0x250 exit_mmap+0x68/0x400 __mmput+0x38/0x1c8 mmput+0x30/0x68 exit_mm+0xd4/0x198 do_exit+0x1a4/0xb00 do_group_exit+0x8c/0x120 get_signal+0x6d4/0x778 do_signal+0x90/0x718 do_notify_resume+0x70/0x170 el0_svc+0x74/0xd8 el0t_64_sync_handler+0x60/0xc8 el0t_64_sync+0x1b0/0x1b8 The warning is seen majorly on the host kernels that are configured not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this, instead of walking the entire page-table in one go, split it into smaller ranges, by checking for cond_resched() between each range. Since the path is executed during VM destruction, after the page-table structure is unlinked from the KVM MMU, relying on cond_resched_rwlock_write() isn't necessary. Signed-off-by: Raghavendra Rao Ananta <[email protected]> Suggested-by: Oliver Upton <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Oliver Upton <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 5, 2025
When the "proxy" option is enabled on a VXLAN device, the device will suppress ARP requests and IPv6 Neighbor Solicitation messages if it is able to reply on behalf of the remote host. That is, if a matching and valid neighbor entry is configured on the VXLAN device whose MAC address is not behind the "any" remote (0.0.0.0 / ::). The code currently assumes that the FDB entry for the neighbor's MAC address points to a valid remote destination, but this is incorrect if the entry is associated with an FDB nexthop group. This can result in a NPD [1][3] which can be reproduced using [2][4]. Fix by checking that the remote destination exists before dereferencing it. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] CPU: 4 UID: 0 PID: 365 Comm: arping Not tainted 6.17.0-rc2-virtme-g2a89cb21162c #2 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014 RIP: 0010:vxlan_xmit+0xb58/0x15f0 [...] Call Trace: <TASK> dev_hard_start_xmit+0x5d/0x1c0 __dev_queue_xmit+0x246/0xfd0 packet_sendmsg+0x113a/0x1850 __sock_sendmsg+0x38/0x70 __sys_sendto+0x126/0x180 __x64_sys_sendto+0x24/0x30 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x4b/0x53 [2] #!/bin/bash ip address add 192.0.2.1/32 dev lo ip nexthop add id 1 via 192.0.2.2 fdb ip nexthop add id 10 group 1 fdb ip link add name vx0 up type vxlan id 10010 local 192.0.2.1 dstport 4789 proxy ip neigh add 192.0.2.3 lladdr 00:11:22:33:44:55 nud perm dev vx0 bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 10 arping -b -c 1 -s 192.0.2.1 -I vx0 192.0.2.3 [3] BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] CPU: 13 UID: 0 PID: 372 Comm: ndisc6 Not tainted 6.17.0-rc2-virtmne-g6ee90cb26014 #3 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1v996), BIOS 1.17.0-4.fc41 04/01/2x014 RIP: 0010:vxlan_xmit+0x803/0x1600 [...] Call Trace: <TASK> dev_hard_start_xmit+0x5d/0x1c0 __dev_queue_xmit+0x246/0xfd0 ip6_finish_output2+0x210/0x6c0 ip6_finish_output+0x1af/0x2b0 ip6_mr_output+0x92/0x3e0 ip6_send_skb+0x30/0x90 rawv6_sendmsg+0xe6e/0x12e0 __sock_sendmsg+0x38/0x70 __sys_sendto+0x126/0x180 __x64_sys_sendto+0x24/0x30 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f383422ec77 [4] #!/bin/bash ip address add 2001:db8:1::1/128 dev lo ip nexthop add id 1 via 2001:db8:1::1 fdb ip nexthop add id 10 group 1 fdb ip link add name vx0 up type vxlan id 10010 local 2001:db8:1::1 dstport 4789 proxy ip neigh add 2001:db8:1::3 lladdr 00:11:22:33:44:55 nud perm dev vx0 bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 10 ndisc6 -r 1 -s 2001:db8:1::1 -w 1 2001:db8:1::3 vx0 Fixes: 1274e1c ("vxlan: ecmp support for mac fdb entries") Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 5, 2025
Ido Schimmel says:
====================
vxlan: Fix NPDs when using nexthop objects
With FDB nexthop groups, VXLAN FDB entries do not necessarily point to
a remote destination but rather to an FDB nexthop group. This means that
first_remote_{rcu,rtnl}() can return NULL and a few places in the driver
were not ready for that, resulting in NULL pointer dereferences.
Patches #1-#2 fix these NPDs.
Note that vxlan_fdb_find_uc() still dereferences the remote returned by
first_remote_rcu() without checking that it is not NULL, but this
function is only invoked by a single driver which vetoes the creation of
FDB nexthop groups. I will patch this in net-next to make the code less
fragile.
Patch #3 adds a selftests which exercises these code paths and tests
basic Tx functionality with FDB nexthop groups. I verified that the test
crashes the kernel without the first two patches.
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 5, 2025
When transmitting a PTP frame which is timestamp using 2 step, the following warning appears if CONFIG_PROVE_LOCKING is enabled: ============================= [ BUG: Invalid wait context ] 6.17.0-rc1-00326-ge6160462704e #427 Not tainted ----------------------------- ptp4l/119 is trying to lock: c2a44ed4 (&vsc8531->ts_lock){+.+.}-{3:3}, at: vsc85xx_txtstamp+0x50/0xac other info that might help us debug this: context-{4:4} 4 locks held by ptp4l/119: #0: c145f068 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x58/0x1440 #1: c29df974 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_queue_xmit+0x5c4/0x1440 #2: c2aaaad0 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x108/0x350 #3: c2aac170 (&lan966x->tx_lock){+.-.}-{2:2}, at: lan966x_port_xmit+0xd0/0x350 stack backtrace: CPU: 0 UID: 0 PID: 119 Comm: ptp4l Not tainted 6.17.0-rc1-00326-ge6160462704e #427 NONE Hardware name: Generic DT based system Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x7c/0xac dump_stack_lvl from __lock_acquire+0x8e8/0x29dc __lock_acquire from lock_acquire+0x108/0x38c lock_acquire from __mutex_lock+0xb0/0xe78 __mutex_lock from mutex_lock_nested+0x1c/0x24 mutex_lock_nested from vsc85xx_txtstamp+0x50/0xac vsc85xx_txtstamp from lan966x_fdma_xmit+0xd8/0x3a8 lan966x_fdma_xmit from lan966x_port_xmit+0x1bc/0x350 lan966x_port_xmit from dev_hard_start_xmit+0xc8/0x2c0 dev_hard_start_xmit from sch_direct_xmit+0x8c/0x350 sch_direct_xmit from __dev_queue_xmit+0x680/0x1440 __dev_queue_xmit from packet_sendmsg+0xfa4/0x1568 packet_sendmsg from __sys_sendto+0x110/0x19c __sys_sendto from sys_send+0x18/0x20 sys_send from ret_fast_syscall+0x0/0x1c Exception stack(0xf0b05fa8 to 0xf0b05ff0) 5fa0: 00000001 0000000 0000000 0004b47a 0000003a 00000000 5fc0: 00000001 0000000 00000000 00000121 0004af58 00044874 00000000 00000000 5fe0: 00000001 bee9d420 00025a10 b6e75c7c So, instead of using the ts_lock for tx_queue, use the spinlock that skb_buff_head has. Reviewed-by: Vadim Fedorenko <[email protected]> Fixes: 7d272e6 ("net: phy: mscc: timestamping and PHC support") Signed-off-by: Horatiu Vultur <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 12, 2025
Problem description
===================
Lockdep reports a possible circular locking dependency (AB/BA) between
&pl->state_mutex and &phy->lock, as follows.
phylink_resolve() // acquires &pl->state_mutex
-> phylink_major_config()
-> phy_config_inband() // acquires &pl->phydev->lock
whereas all the other call sites where &pl->state_mutex and
&pl->phydev->lock have the locking scheme reversed. Everywhere else,
&pl->phydev->lock is acquired at the top level, and &pl->state_mutex at
the lower level. A clear example is phylink_bringup_phy().
The outlier is the newly introduced phy_config_inband() and the existing
lock order is the correct one. To understand why it cannot be the other
way around, it is sufficient to consider phylink_phy_change(), phylink's
callback from the PHY device's phy->phy_link_change() virtual method,
invoked by the PHY state machine.
phy_link_up() and phy_link_down(), the (indirect) callers of
phylink_phy_change(), are called with &phydev->lock acquired.
Then phylink_phy_change() acquires its own &pl->state_mutex, to
serialize changes made to its pl->phy_state and pl->link_config.
So all other instances of &pl->state_mutex and &phydev->lock must be
consistent with this order.
Problem impact
==============
I think the kernel runs a serious deadlock risk if an existing
phylink_resolve() thread, which results in a phy_config_inband() call,
is concurrent with a phy_link_up() or phy_link_down() call, which will
deadlock on &pl->state_mutex in phylink_phy_change(). Practically
speaking, the impact may be limited by the slow speed of the medium
auto-negotiation protocol, which makes it unlikely for the current state
to still be unresolved when a new one is detected, but I think the
problem is there. Nonetheless, the problem was discovered using lockdep.
Proposed solution
=================
Practically speaking, the phy_config_inband() requirement of having
phydev->lock acquired must transfer to the caller (phylink is the only
caller). There, it must bubble up until immediately before
&pl->state_mutex is acquired, for the cases where that takes place.
Solution details, considerations, notes
=======================================
This is the phy_config_inband() call graph:
sfp_upstream_ops :: connect_phy()
|
v
phylink_sfp_connect_phy()
|
v
phylink_sfp_config_phy()
|
| sfp_upstream_ops :: module_insert()
| |
| v
| phylink_sfp_module_insert()
| |
| | sfp_upstream_ops :: module_start()
| | |
| | v
| | phylink_sfp_module_start()
| | |
| v v
| phylink_sfp_config_optical()
phylink_start() | |
| phylink_resume() v v
| | phylink_sfp_set_config()
| | |
v v v
phylink_mac_initial_config()
| phylink_resolve()
| | phylink_ethtool_ksettings_set()
v v v
phylink_major_config()
|
v
phy_config_inband()
phylink_major_config() caller #1, phylink_mac_initial_config(), does not
acquire &pl->state_mutex nor do its callers. It must acquire
&pl->phydev->lock prior to calling phylink_major_config().
phylink_major_config() caller #2, phylink_resolve() acquires
&pl->state_mutex, thus also needs to acquire &pl->phydev->lock.
phylink_major_config() caller #3, phylink_ethtool_ksettings_set(), is
completely uninteresting, because it only calls phylink_major_config()
if pl->phydev is NULL (otherwise it calls phy_ethtool_ksettings_set()).
We need to change nothing there.
Other solutions
===============
The lock inversion between &pl->state_mutex and &pl->phydev->lock has
occurred at least once before, as seen in commit c718af2 ("net:
phylink: fix ethtool -A with attached PHYs"). The solution there was to
simply not call phy_set_asym_pause() under the &pl->state_mutex. That
cannot be extended to our case though, where the phy_config_inband()
call is much deeper inside the &pl->state_mutex section.
Fixes: 5fd0f1a ("net: phylink: add negotiation of in-band capabilities")
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 19, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round #3 - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when visiting from an MMU notifier - Fixes to the TLB match process and TLB invalidation range for managing the VCNR pseudo-TLB - Prevent SPE from erroneously profiling guests due to UNKNOWN reset values in PMSCR_EL1 - Fix save/restore of host MDCR_EL2 to account for eagerly programming at vcpu_load() on VHE systems - Correct lock ordering when dealing with VGIC LPIs, avoiding scenarios where an xarray's spinlock was nested with a *raw* spinlock - Permit stage-2 read permission aborts which are possible in the case of NV depending on the guest hypervisor's stage-2 translation - Call raw_spin_unlock() instead of the internal spinlock API - Fix parameter ordering when assigning VBAR_EL1
kdave
pushed a commit
that referenced
this pull request
Sep 25, 2025
This fixes the following UAF caused by not properly locking hdev when processing HCI_EV_NUM_COMP_PKTS: BUG: KASAN: slab-use-after-free in hci_conn_tx_dequeue+0x1be/0x220 net/bluetooth/hci_conn.c:3036 Read of size 4 at addr ffff8880740f0940 by task kworker/u11:0/54 CPU: 1 UID: 0 PID: 54 Comm: kworker/u11:0 Not tainted 6.16.0-rc7 #3 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Workqueue: hci1 hci_rx_work Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x230 mm/kasan/report.c:480 kasan_report+0x118/0x150 mm/kasan/report.c:593 hci_conn_tx_dequeue+0x1be/0x220 net/bluetooth/hci_conn.c:3036 hci_num_comp_pkts_evt+0x1c8/0xa50 net/bluetooth/hci_event.c:4404 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 </TASK> Allocated by task 54: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1039 [inline] __hci_conn_add+0x233/0x1b30 net/bluetooth/hci_conn.c:939 le_conn_complete_evt+0x3d6/0x1220 net/bluetooth/hci_event.c:5628 hci_le_enh_conn_complete_evt+0x189/0x470 net/bluetooth/hci_event.c:5794 hci_event_func net/bluetooth/hci_event.c:7474 [inline] hci_event_packet+0x78c/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Freed by task 9572: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2381 [inline] slab_free mm/slub.c:4643 [inline] kfree+0x18e/0x440 mm/slub.c:4842 device_release+0x9c/0x1c0 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 hci_conn_cleanup net/bluetooth/hci_conn.c:175 [inline] hci_conn_del+0x8ff/0xcb0 net/bluetooth/hci_conn.c:1173 hci_abort_conn_sync+0x5d1/0xdf0 net/bluetooth/hci_sync.c:5689 hci_cmd_sync_work+0x210/0x3a0 net/bluetooth/hci_sync.c:332 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Fixes: 134f4b3 ("Bluetooth: add support for skb TX SND/COMPLETION timestamping") Reported-by: Junvyyang, Tencent Zhuque Lab <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 25, 2025
This fixes the following UFA in hci_acl_create_conn_sync where a connection still pending is command submission (conn->state == BT_OPEN) maybe freed, also since this also can happen with the likes of hci_le_create_conn_sync fix it as well: BUG: KASAN: slab-use-after-free in hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861 Write of size 2 at addr ffff88805ffcc038 by task kworker/u11:2/9541 CPU: 1 UID: 0 PID: 9541 Comm: kworker/u11:2 Not tainted 6.16.0-rc7 #3 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Workqueue: hci3 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x230 mm/kasan/report.c:480 kasan_report+0x118/0x150 mm/kasan/report.c:593 hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861 hci_cmd_sync_work+0x210/0x3a0 net/bluetooth/hci_sync.c:332 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 </TASK> Allocated by task 123736: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1039 [inline] __hci_conn_add+0x233/0x1b30 net/bluetooth/hci_conn.c:939 hci_conn_add_unset net/bluetooth/hci_conn.c:1051 [inline] hci_connect_acl+0x16c/0x4e0 net/bluetooth/hci_conn.c:1634 pair_device+0x418/0xa70 net/bluetooth/mgmt.c:3556 hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719 hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:727 sock_write_iter+0x258/0x330 net/socket.c:1131 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x54b/0xa90 fs/read_write.c:686 ksys_write+0x145/0x250 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 103680: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2381 [inline] slab_free mm/slub.c:4643 [inline] kfree+0x18e/0x440 mm/slub.c:4842 device_release+0x9c/0x1c0 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 hci_conn_cleanup net/bluetooth/hci_conn.c:175 [inline] hci_conn_del+0x8ff/0xcb0 net/bluetooth/hci_conn.c:1173 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Last potentially related work creation: kasan_save_stack+0x3e/0x60 mm/kasan/common.c:47 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:548 insert_work+0x3d/0x330 kernel/workqueue.c:2183 __queue_work+0xbd9/0xfe0 kernel/workqueue.c:2345 queue_delayed_work_on+0x18b/0x280 kernel/workqueue.c:2561 pairing_complete+0x1e7/0x2b0 net/bluetooth/mgmt.c:3451 pairing_complete_cb+0x1ac/0x230 net/bluetooth/mgmt.c:3487 hci_connect_cfm include/net/bluetooth/hci_core.h:2064 [inline] hci_conn_failed+0x24d/0x310 net/bluetooth/hci_conn.c:1275 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Fixes: aef2aa4 ("Bluetooth: hci_event: Fix creating hci_conn object on error status") Reported-by: Junvyyang, Tencent Zhuque Lab <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 25, 2025
Ido Schimmel says: ==================== nexthop: Various fixes Patch #1 fixes a NPD that was recently reported by syzbot. Patch #2 fixes an issue in the existing FIB nexthop selftest. Patch #3 extends the selftest with test cases for the bug that was fixed in the first patch. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Sep 30, 2025
We check the version of SPE twice, and we'll add one more check in the next commit so factor out a macro to do this. Change the #3 magic number to the actual SPE version define (V1p2) to make it more readable. No functional changes intended. Tested-by: Leo Yan <[email protected]> Reviewed-by: Leo Yan <[email protected]> Signed-off-by: James Clark <[email protected]> Signed-off-by: Will Deacon <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Oct 13, 2025
The test starts a workload and then opens events. If the events fail to open, for example because of perf_event_paranoid, the gopipe of the workload is leaked and the file descriptor leak check fails when the test exits. To avoid this cancel the workload when opening the events fails. Before: ``` $ perf test -vv 7 7: PERF_RECORD_* events & perf_sample fields: --- start --- test child forked, pid 1189568 Using CPUID GenuineIntel-6-B7-1 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 Attempt to add: software/cpu-clock/ ..after resolving event: software/config=0/ cpu-clock -> software/cpu-clock/ ------------------------------------------------------------ perf_event_attr: type 1 (PERF_TYPE_SOFTWARE) size 136 config 0x9 (PERF_COUNT_SW_DUMMY) sample_type IP|TID|TIME|CPU read_format ID|LOST disabled 1 inherit 1 mmap 1 comm 1 enable_on_exec 1 task 1 sample_id_all 1 mmap2 1 comm_exec 1 ksymbol 1 bpf_event 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 1189569 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 perf_evlist__open: Permission denied ---- end(-2) ---- Leak of file descriptor 6 that opened: 'pipe:[14200347]' ---- unexpected signal (6) ---- iFailed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311 #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0 #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44 #3 0x7f29ce849cc2 in raise raise.c:27 #4 0x7f29ce8324ac in abort abort.c:81 #5 0x565358f662d4 in check_leaks builtin-test.c:226 #6 0x565358f6682e in run_test_child builtin-test.c:344 #7 0x565358ef7121 in start_command run-command.c:128 torvalds#8 0x565358f67273 in start_test builtin-test.c:545 torvalds#9 0x565358f6771d in __cmd_test builtin-test.c:647 torvalds#10 0x565358f682bd in cmd_test builtin-test.c:849 torvalds#11 0x565358ee5ded in run_builtin perf.c:349 #12 0x565358ee6085 in handle_internal_command perf.c:401 #13 0x565358ee61de in run_argv perf.c:448 torvalds#14 0x565358ee6527 in main perf.c:555 #15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74 #16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 #17 0x565358e391c1 in _start perf[851c1] 7: PERF_RECORD_* events & perf_sample fields : FAILED! ``` After: ``` $ perf test 7 7: PERF_RECORD_* events & perf_sample fields : Skip (permissions) ``` Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object") Signed-off-by: Ian Rogers <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Athira Rajeev <[email protected]> Cc: Chun-Tse Shao <[email protected]> Cc: Howard Chu <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kan Liang <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Oct 13, 2025
Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are affected by Shadow Stacks and/or Indirect Branch Tracking when said features are enabled in the guest, as fully emulating CET would require significant complexity for no practical benefit (KVM shouldn't need to emulate branch instructions on modern hosts). Simply doing nothing isn't an option as that would allow a malicious entity to subvert CET protections via the emulator. To detect instructions that are subject to IBT or affect IBT state, use the existing IsBranch flag along with the source operand type to detect indirect branches, and the existing NearBranch flag to detect far JMPs and CALLs, all of which are effectively indirect. Explicitly check for emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches) instead of adding another flag, e.g. IsRet, as it's unlikely the emulator will ever need to check for return-like instructions outside of this one specific flow. Use an allow-list instead of a deny-list because (a) it's a shorter list and (b) so that a missed entry gets a false positive, not a false negative (i.e. reject emulation instead of clobbering CET state). For Shadow Stacks, explicitly track instructions that directly affect the current SSP, as KVM's emulator doesn't have existing flags that can be used to precisely detect such instructions. Alternatively, the em_xxx() helpers could directly check for ShadowStack interactions, but using a dedicated flag is arguably easier to audit, and allows for handling both IBT and SHSTK in one fell swoop. Note! On far transfers, do NOT consult the current privilege level and instead treat SHSTK/IBT as being enabled if they're enabled for User *or* Supervisor mode. On inter-privilege level far transfers, SHSTK and IBT can be in play for the target privilege level, i.e. checking the current privilege could get a false negative, and KVM doesn't know the target privilege level until emulation gets under way. Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with the current SSP, but only to ensure SSP[63:32] == 0. Don't tag FAR JMP as SHSTK, which would be rather confusing and would result in FAR JMP being rejected unnecessarily the vast majority of the time (ignoring that it's unlikely to ever be emulated). A future commit will add the #GP(0) check for the specific FAR JMP scenario. Note #3, task switches also modify SSP and so need to be rejected. That too will be addressed in a future commit. Suggested-by: Chao Gao <[email protected]> Originally-by: Yang Weijiang <[email protected]> Cc: Mathias Krause <[email protected]> Cc: John Allen <[email protected]> Cc: Rick Edgecombe <[email protected]> Reviewed-by: Chao Gao <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Oct 13, 2025
Before disabling SR-IOV via config space accesses to the parent PF, sriov_disable() first removes the PCI devices representing the VFs. Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()") such removal operations are serialized against concurrent remove and rescan using the pci_rescan_remove_lock. No such locking was ever added in sriov_disable() however. In particular when commit 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device removal into sriov_del_vfs() there was still no locking around the pci_iov_remove_virtfn() calls. On s390 the lack of serialization in sriov_disable() may cause double remove and list corruption with the below (amended) trace being observed: PSW: 0704c00180000000 0000000c914e4b38 (klist_put+56) GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001 00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480 0000000000000001 0000000000000000 0000000000000000 0000000180692828 00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8 #0 [3800313fb20] device_del at c9158ad5c #1 [3800313fb88] pci_remove_bus_device at c915105ba #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198 #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0 #4 [3800313fc60] zpci_bus_remove_device at c90fb6104 #5 [3800313fca0] __zpci_event_availability at c90fb3dca #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2 #7 [3800313fd60] crw_collect_info at c91905822 torvalds#8 [3800313fe10] kthread at c90feb390 torvalds#9 [3800313fe68] __ret_from_fork at c90f6aa64 torvalds#10 [3800313fe98] ret_from_fork at c9194f3f2. This is because in addition to sriov_disable() removing the VFs, the platform also generates hot-unplug events for the VFs. This being the reverse operation to the hotplug events generated by sriov_enable() and handled via pdev->no_vf_scan. And while the event processing takes pci_rescan_remove_lock and checks whether the struct pci_dev still exists, the lack of synchronization makes this checking racy. Other races may also be possible of course though given that this lack of locking persisted so long observable races seem very rare. Even on s390 the list corruption was only observed with certain devices since the platform events are only triggered by config accesses after the removal, so as long as the removal finished synchronously they would not race. Either way the locking is missing so fix this by adding it to the sriov_del_vfs() helper. Just like PCI rescan-remove, locking is also missing in sriov_add_vfs() including for the error case where pci_stop_and_remove_bus_device() is called without the PCI rescan-remove lock being held. Even in the non-error case, adding new PCI devices and buses should be serialized via the PCI rescan-remove lock. Add the necessary locking. Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") Signed-off-by: Niklas Schnelle <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Benjamin Block <[email protected]> Reviewed-by: Farhan Ali <[email protected]> Reviewed-by: Julian Ruess <[email protected]> Cc: [email protected] Link: https://patch.msgid.link/[email protected]
kdave
pushed a commit
that referenced
this pull request
Oct 13, 2025
The ns_bpf_qdisc selftest triggers a kernel panic: Oops[#1]: CPU 0 Unable to handle kernel paging request at virtual address 0000000000741d58, era == 90000000851b5ac0, ra == 90000000851b5aa4 CPU: 0 UID: 0 PID: 449 Comm: test_progs Tainted: G OE 6.16.0+ #3 PREEMPT(full) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 2/2/2022 pc 90000000851b5ac0 ra 90000000851b5aa4 tp 90000001076b8000 sp 90000001076bb600 a0 0000000000741ce8 a1 0000000000000001 a2 90000001076bb5c0 a3 0000000000000008 a4 90000001004c4620 a5 9000000100741ce8 a6 0000000000000000 a7 0100000000000000 t0 0000000000000010 t1 0000000000000000 t2 9000000104d24d30 t3 0000000000000001 t4 4f2317da8a7e08c4 t5 fffffefffc002f00 t6 90000001004c4620 t7 ffffffffc61c5b3d t8 0000000000000000 u0 0000000000000001 s9 0000000000000050 s0 90000001075bc800 s1 0000000000000040 s2 900000010597c400 s3 0000000000000008 s4 90000001075bc880 s5 90000001075bc8f0 s6 0000000000000000 s7 0000000000741ce8 s8 0000000000000000 ra: 90000000851b5aa4 __qdisc_run+0xac/0x8d8 ERA: 90000000851b5ac0 __qdisc_run+0xc8/0x8d8 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 00000004 (PPLV0 +PIE -PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000741d58 PRID: 0014c010 (Loongson-64bit, Loongson-3A5000) Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod(OE)] Process test_progs (pid: 449, threadinfo=000000009af02b3a, task=00000000e9ba4956) Stack : 0000000000000000 90000001075bc8ac 90000000869524a8 9000000100741ce8 90000001075bc800 9000000100415300 90000001075bc8ac 0000000000000000 900000010597c400 900000008694a000 0000000000000000 9000000105b59000 90000001075bc800 9000000100741ce8 0000000000000050 900000008513000c 9000000086936000 0000000100094d4c fffffff400676208 0000000000000000 9000000105b59000 900000008694a000 9000000086bf0dc0 9000000105b59000 9000000086bf0d68 9000000085147010 90000001075be788 0000000000000000 9000000086bf0f98 0000000000000001 0000000000000010 9000000006015840 0000000000000000 9000000086be6c40 0000000000000000 0000000000000000 0000000000000000 4f2317da8a7e08c4 0000000000000101 4f2317da8a7e08c4 ... Call Trace: [<90000000851b5ac0>] __qdisc_run+0xc8/0x8d8 [<9000000085130008>] __dev_queue_xmit+0x578/0x10f0 [<90000000853701c0>] ip6_finish_output2+0x2f0/0x950 [<9000000085374bc8>] ip6_finish_output+0x2b8/0x448 [<9000000085370b24>] ip6_xmit+0x304/0x858 [<90000000853c4438>] inet6_csk_xmit+0x100/0x170 [<90000000852b32f0>] __tcp_transmit_skb+0x490/0xdd0 [<90000000852b47fc>] tcp_connect+0xbcc/0x1168 [<90000000853b9088>] tcp_v6_connect+0x580/0x8a0 [<90000000852e7738>] __inet_stream_connect+0x170/0x480 [<90000000852e7a98>] inet_stream_connect+0x50/0x88 [<90000000850f2814>] __sys_connect+0xe4/0x110 [<90000000850f2858>] sys_connect+0x18/0x28 [<9000000085520c94>] do_syscall+0x94/0x1a0 [<9000000083df1fb8>] handle_syscall+0xb8/0x158 Code: 4001ad80 2400873 2400832d <240073cc> 001137ff 001133ff 6407b41f 001503cc 0280041d ---[ end trace 0000000000000000 ]--- The bpf_fifo_dequeue prog returns a skb which is a pointer. The pointer is treated as a 32bit value and sign extend to 64bit in epilogue. This behavior is right for most bpf prog types but wrong for struct ops which requires LoongArch ABI. So let's sign extend struct ops return values according to the LoongArch ABI ([1]) and return value spec in function model. [1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html Cc: [email protected] Fixes: 6abf17d ("LoongArch: BPF: Add struct ops support for trampoline") Signed-off-by: Hengqi Chen <[email protected]> Signed-off-by: Huacai Chen <[email protected]>
asj
pushed a commit
to asj/btrfs-linux
that referenced
this pull request
Oct 15, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round btrfs#3 - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when visiting from an MMU notifier - Fixes to the TLB match process and TLB invalidation range for managing the VCNR pseudo-TLB - Prevent SPE from erroneously profiling guests due to UNKNOWN reset values in PMSCR_EL1 - Fix save/restore of host MDCR_EL2 to account for eagerly programming at vcpu_load() on VHE systems - Correct lock ordering when dealing with VGIC LPIs, avoiding scenarios where an xarray's spinlock was nested with a *raw* spinlock - Permit stage-2 read permission aborts which are possible in the case of NV depending on the guest hypervisor's stage-2 translation - Call raw_spin_unlock() instead of the internal spinlock API - Fix parameter ordering when assigning VBAR_EL1 [Pull into kvm/master to fix conflicts. - Paolo]
asj
pushed a commit
to asj/btrfs-linux
that referenced
this pull request
Oct 15, 2025
Write combining is an optimization feature in CPUs that is frequently used by modern devices to generate 32 or 64 byte TLPs at the PCIe level. These large TLPs allow certain optimizations in the driver to HW communication that improve performance. As WC is unpredictable and optional the HW designs all tolerate cases where combining doesn't happen and simply experience a performance degradation. Unfortunately many virtualization environments on all architectures have done things that completely disable WC inside the VM with no generic way to detect this. For example WC was fully blocked in ARM64 KVM until commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device"). Trying to use WC when it is known not to work has a measurable performance cost (~5%). Long ago mlx5 developed an boot time algorithm to test if WC is available or not by using unique mlx5 HW features to measure how many large TLPs the device is receiving. The SW generates a large number of combining opportunities and if any succeed then WC is declared working. In mlx5 the WC optimization feature is never used by the kernel except for the boot time test. The WC is only used by userspace in rdma-core. Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining implementation that is very unreliable compared to pretty much everything prior. This is being fixed architecturally in new CPUs with a new ST64B instruction, but current shipping devices suffer this problem. Unreliable means the SW can present thousands of combining opportunities and the HW will not combine for any of them, which creates a performance degradation, and critically fails the mlx5 boot test. However, the CPU is very sensitive to the instruction sequence used, with the better options being sufficiently good that the performance loss from the unreliable CPU is not measurable. Broadly there are several options, from worst to best: 1) A C loop doing a u64 memcpy. This was used prior to commit ef30228 ("IB/mlx5: Use __iowrite64_copy() for write combining stores") and failed almost all the time on Grace CPUs. 2) ARM64 assembly with consecutive 8 byte stores. This was implemented as an arch-generic __iowriteXX_copy() family of functions suitable for performance use in drivers for WC. commit ead7911 ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the ARM implementation. 3) ARM64 assembly with consecutive 16 byte stores. This was rejected from kernel use over fears of virtualization failures. Common ARM VMMs will crash if STP is used against emulated memory. 4) A single NEON store instruction. Userspace has used this option for a very long time, it performs well. 5) For future silicon the new ST64B instruction is guaranteed to generate a 64 byte TLP 100% of the time The past upgrade from btrfs#1 to btrfs#2 was thought to be sufficient to solve this problem. However, more testing on more systems shows that btrfs#3 is still problematic at a low frequency and the kernel test fails. Thus, make the mlx5 use the same instructions as userspace during the boot time WC self test. This way the WC test matches the userspace and will properly detect the ability of HW to support the WC workload that userspace will generate. While btrfs#4 still has imperfect combining performance, it is substantially better than btrfs#2, and does actually give a performance win to applications. Self-test failures with btrfs#2 are like 3/10 boots, on some systems, btrfs#4 has never seen a boot failure. There is no real general use case for a NEON based WC flow in the kernel. This is not suitable for any performance path work as getting into/out of a NEON context is fairly expensive compared to the gain of WC. Future CPUs are going to fix this issue by using an new ARM instruction and __iowriteXX_copy() will be updated to use that automatically, probably using the ALTERNATES mechanism. Since this problem is constrained to mlx5's unique situation of needing a non-performance code path to duplicate what mlx5 userspace is doing as a matter of self-testing, implement it as a one line inline assembly in the driver directly. Lastly, this was concluded from the discussion with ARM maintainers which confirms that this is the best approach for the solution: https://lore.kernel.org/r/[email protected] Signed-off-by: Patrisious Haddad <[email protected]> Reviewed-by: Michael Guralnik <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Oct 16, 2025
Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.
If the device list is too long, this will lock more device mutexes than
lockdep can handle:
unshare -n \
bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'
BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48 max: 48!
48 locks held by kworker/u16:1/69:
#0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
#1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
#2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
#3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
#4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]
Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.
Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.
Fixes: 7e4d784 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Florian Westphal <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Oct 20, 2025
Expand the prefault memory selftest to add a regression test for a KVM bug where KVM's retry logic would result in (breakable) deadlock due to the memslot deletion waiting on prefaulting to release SRCU, and prefaulting waiting on the memslot to fully disappear (KVM uses a two-step process to delete memslots, and KVM x86 retries page faults if a to-be-deleted, a.k.a. INVALID, memslot is encountered). To exercise concurrent memslot remove, spawn a second thread to initiate memslot removal at roughly the same time as prefaulting. Test memslot removal for all testcases, i.e. don't limit concurrent removal to only the success case. There are essentially three prefault scenarios (so far) that are of interest: 1. Success 2. ENOENT due to no memslot 3. EAGAIN due to INVALID memslot For all intents and purposes, #1 and #2 are mutually exclusive, or rather, easier to test via separate testcases since writing to non-existent memory is trivial. But for #3, making it mutually exclusive with #1 _or_ #2 is actually more complex than testing memslot removal for all scenarios. The only requirement to let memslot removal coexist with other scenarios is a way to guarantee a stable result, e.g. that the "no memslot" test observes ENOENT, not EAGAIN, for the final checks. So, rather than make memslot removal mutually exclusive with the ENOENT scenario, simply restore the memslot and retry prefaulting. For the "no memslot" case, KVM_PRE_FAULT_MEMORY should be idempotent, i.e. should always fail with ENOENT regardless of how many times userspace attempts prefaulting. Pass in both the base GPA and the offset (instead of the "full" GPA) so that the worker can recreate the memslot. Signed-off-by: Yan Zhao <[email protected]> Co-developed-by: Sean Christopherson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Oct 27, 2025
The original code causes a circular locking dependency found by lockdep. ====================================================== WARNING: possible circular locking dependency detected 6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 Tainted: G S U ------------------------------------------------------ xe_fault_inject/5091 is trying to acquire lock: ffff888156815688 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}, at: __flush_work+0x25d/0x660 but task is already holding lock: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&devcd->mutex){+.+.}-{3:3}: mutex_lock_nested+0x4e/0xc0 devcd_data_write+0x27/0x90 sysfs_kf_bin_write+0x80/0xf0 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (kn->active#236){++++}-{0:0}: kernfs_drain+0x1e2/0x200 __kernfs_remove+0xae/0x400 kernfs_remove_by_name_ns+0x5d/0xc0 remove_files+0x54/0x70 sysfs_remove_group+0x3d/0xa0 sysfs_remove_groups+0x2e/0x60 device_remove_attrs+0xc7/0x100 device_del+0x15d/0x3b0 devcd_del+0x19/0x30 process_one_work+0x22b/0x6f0 worker_thread+0x1e8/0x3d0 kthread+0x11c/0x250 ret_from_fork+0x26c/0x2e0 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}: __lock_acquire+0x1661/0x2860 lock_acquire+0xc4/0x2f0 __flush_work+0x27a/0x660 flush_delayed_work+0x5d/0xa0 dev_coredump_put+0x63/0xa0 xe_driver_devcoredump_fini+0x12/0x20 [xe] devm_action_release+0x12/0x30 release_nodes+0x3a/0x120 devres_release_all+0x8a/0xd0 device_unbind_cleanup+0x12/0x80 device_release_driver_internal+0x23a/0x280 device_driver_detach+0x14/0x20 unbind_store+0xaf/0xc0 drv_attr_store+0x21/0x50 sysfs_kf_write+0x4a/0x80 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&devcd->del_wk)->work) --> kn->active#236 --> &devcd->mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&devcd->mutex); lock(kn->active#236); lock(&devcd->mutex); lock((work_completion)(&(&devcd->del_wk)->work)); *** DEADLOCK *** 5 locks held by xe_fault_inject/5091: #0: ffff8881129f9488 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x72/0xf0 #1: ffff88810c755078 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x123/0x220 #2: ffff8881054811a0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x55/0x280 #3: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0 #4: ffffffff8359e020 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x72/0x660 stack backtrace: CPU: 14 UID: 0 PID: 5091 Comm: xe_fault_inject Tainted: G S U 6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 PREEMPT_{RT,(lazy)} Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A DDR4(MS-7D25), BIOS 1.10 12/13/2021 Call Trace: <TASK> dump_stack_lvl+0x91/0xf0 dump_stack+0x10/0x20 print_circular_bug+0x285/0x360 check_noncircular+0x135/0x150 ? register_lock_class+0x48/0x4a0 __lock_acquire+0x1661/0x2860 lock_acquire+0xc4/0x2f0 ? __flush_work+0x25d/0x660 ? mark_held_locks+0x46/0x90 ? __flush_work+0x25d/0x660 __flush_work+0x27a/0x660 ? __flush_work+0x25d/0x660 ? trace_hardirqs_on+0x1e/0xd0 ? __pfx_wq_barrier_func+0x10/0x10 flush_delayed_work+0x5d/0xa0 dev_coredump_put+0x63/0xa0 xe_driver_devcoredump_fini+0x12/0x20 [xe] devm_action_release+0x12/0x30 release_nodes+0x3a/0x120 devres_release_all+0x8a/0xd0 device_unbind_cleanup+0x12/0x80 device_release_driver_internal+0x23a/0x280 ? bus_find_device+0xa8/0xe0 device_driver_detach+0x14/0x20 unbind_store+0xaf/0xc0 drv_attr_store+0x21/0x50 sysfs_kf_write+0x4a/0x80 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 ? __f_unlock_pos+0x15/0x20 ? __x64_sys_getdents64+0x9b/0x130 ? __pfx_filldir64+0x10/0x10 ? do_syscall_64+0x1a2/0xb60 ? clear_bhb_loop+0x30/0x80 ? clear_bhb_loop+0x30/0x80 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x76e292edd574 Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 RSP: 002b:00007fffe247a828 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000076e292edd574 RDX: 000000000000000c RSI: 00006267f6306063 RDI: 000000000000000b RBP: 000000000000000c R08: 000076e292fc4b20 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00006267f6306063 R13: 000000000000000b R14: 00006267e6859c00 R15: 000076e29322a000 </TASK> xe 0000:03:00.0: [drm] Xe device coredump has been deleted. Fixes: 01daccf ("devcoredump : Serialize devcd_del work") Cc: Mukesh Ojha <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Johannes Berg <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: [email protected] Cc: [email protected] # v6.1+ Signed-off-by: Maarten Lankhorst <[email protected]> Cc: Matthew Brost <[email protected]> Acked-by: Mukesh Ojha <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Nov 10, 2025
On completion of i915_vma_pin_ww(), a synchronous variant of dma_fence_work_commit() is called. When pinning a VMA to GGTT address space on a Cherry View family processor, or on a Broxton generation SoC with VTD enabled, i.e., when stop_machine() is then called from intel_ggtt_bind_vma(), that can potentially lead to lock inversion among reservation_ww and cpu_hotplug locks. [86.861179] ====================================================== [86.861193] WARNING: possible circular locking dependency detected [86.861209] 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 Tainted: G U [86.861226] ------------------------------------------------------ [86.861238] i915_module_loa/1432 is trying to acquire lock: [86.861252] ffffffff83489090 (cpu_hotplug_lock){++++}-{0:0}, at: stop_machine+0x1c/0x50 [86.861290] but task is already holding lock: [86.861303] ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.862233] which lock already depends on the new lock. [86.862251] the existing dependency chain (in reverse order) is: [86.862265] -> #5 (reservation_ww_class_mutex){+.+.}-{3:3}: [86.862292] dma_resv_lockdep+0x19a/0x390 [86.862315] do_one_initcall+0x60/0x3f0 [86.862334] kernel_init_freeable+0x3cd/0x680 [86.862353] kernel_init+0x1b/0x200 [86.862369] ret_from_fork+0x47/0x70 [86.862383] ret_from_fork_asm+0x1a/0x30 [86.862399] -> #4 (reservation_ww_class_acquire){+.+.}-{0:0}: [86.862425] dma_resv_lockdep+0x178/0x390 [86.862440] do_one_initcall+0x60/0x3f0 [86.862454] kernel_init_freeable+0x3cd/0x680 [86.862470] kernel_init+0x1b/0x200 [86.862482] ret_from_fork+0x47/0x70 [86.862495] ret_from_fork_asm+0x1a/0x30 [86.862509] -> #3 (&mm->mmap_lock){++++}-{3:3}: [86.862531] down_read_killable+0x46/0x1e0 [86.862546] lock_mm_and_find_vma+0xa2/0x280 [86.862561] do_user_addr_fault+0x266/0x8e0 [86.862578] exc_page_fault+0x8a/0x2f0 [86.862593] asm_exc_page_fault+0x27/0x30 [86.862607] filldir64+0xeb/0x180 [86.862620] kernfs_fop_readdir+0x118/0x480 [86.862635] iterate_dir+0xcf/0x2b0 [86.862648] __x64_sys_getdents64+0x84/0x140 [86.862661] x64_sys_call+0x1058/0x2660 [86.862675] do_syscall_64+0x91/0xe90 [86.862689] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.862703] -> #2 (&root->kernfs_rwsem){++++}-{3:3}: [86.862725] down_write+0x3e/0xf0 [86.862738] kernfs_add_one+0x30/0x3c0 [86.862751] kernfs_create_dir_ns+0x53/0xb0 [86.862765] internal_create_group+0x134/0x4c0 [86.862779] sysfs_create_group+0x13/0x20 [86.862792] topology_add_dev+0x1d/0x30 [86.862806] cpuhp_invoke_callback+0x4b5/0x850 [86.862822] cpuhp_issue_call+0xbf/0x1f0 [86.862836] __cpuhp_setup_state_cpuslocked+0x111/0x320 [86.862852] __cpuhp_setup_state+0xb0/0x220 [86.862866] topology_sysfs_init+0x30/0x50 [86.862879] do_one_initcall+0x60/0x3f0 [86.862893] kernel_init_freeable+0x3cd/0x680 [86.862908] kernel_init+0x1b/0x200 [86.862921] ret_from_fork+0x47/0x70 [86.862934] ret_from_fork_asm+0x1a/0x30 [86.862947] -> #1 (cpuhp_state_mutex){+.+.}-{3:3}: [86.862969] __mutex_lock+0xaa/0xed0 [86.862982] mutex_lock_nested+0x1b/0x30 [86.862995] __cpuhp_setup_state_cpuslocked+0x67/0x320 [86.863012] __cpuhp_setup_state+0xb0/0x220 [86.863026] page_alloc_init_cpuhp+0x2d/0x60 [86.863041] mm_core_init+0x22/0x2d0 [86.863054] start_kernel+0x576/0xbd0 [86.863068] x86_64_start_reservations+0x18/0x30 [86.863084] x86_64_start_kernel+0xbf/0x110 [86.863098] common_startup_64+0x13e/0x141 [86.863114] -> #0 (cpu_hotplug_lock){++++}-{0:0}: [86.863135] __lock_acquire+0x1635/0x2810 [86.863152] lock_acquire+0xc4/0x2f0 [86.863166] cpus_read_lock+0x41/0x100 [86.863180] stop_machine+0x1c/0x50 [86.863194] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.863987] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.864735] __vma_bind+0x55/0x70 [i915] [86.865510] fence_work+0x26/0xa0 [i915] [86.866248] fence_notify+0xa1/0x140 [i915] [86.866983] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.867719] i915_sw_fence_commit+0x39/0x60 [i915] [86.868453] i915_vma_pin_ww+0x462/0x1360 [i915] [86.869228] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.870001] initial_plane_vma+0x307/0x840 [i915] [86.870774] intel_initial_plane_config+0x33f/0x670 [i915] [86.871546] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.872330] i915_driver_probe+0x7fa/0xe80 [i915] [86.873057] i915_pci_probe+0xe6/0x220 [i915] [86.873782] local_pci_probe+0x47/0xb0 [86.873802] pci_device_probe+0xf3/0x260 [86.873817] really_probe+0xf1/0x3c0 [86.873833] __driver_probe_device+0x8c/0x180 [86.873848] driver_probe_device+0x24/0xd0 [86.873862] __driver_attach+0x10f/0x220 [86.873876] bus_for_each_dev+0x7f/0xe0 [86.873892] driver_attach+0x1e/0x30 [86.873904] bus_add_driver+0x151/0x290 [86.873917] driver_register+0x5e/0x130 [86.873931] __pci_register_driver+0x7d/0x90 [86.873945] i915_pci_register_driver+0x23/0x30 [i915] [86.874678] i915_init+0x37/0x120 [i915] [86.875347] do_one_initcall+0x60/0x3f0 [86.875369] do_init_module+0x97/0x2a0 [86.875385] load_module+0x2c54/0x2d80 [86.875398] init_module_from_file+0x96/0xe0 [86.875413] idempotent_init_module+0x117/0x330 [86.875426] __x64_sys_finit_module+0x77/0x100 [86.875440] x64_sys_call+0x24de/0x2660 [86.875454] do_syscall_64+0x91/0xe90 [86.875470] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.875486] other info that might help us debug this: [86.875502] Chain exists of: cpu_hotplug_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex [86.875539] Possible unsafe locking scenario: [86.875552] CPU0 CPU1 [86.875563] ---- ---- [86.875573] lock(reservation_ww_class_mutex); [86.875588] lock(reservation_ww_class_acquire); [86.875606] lock(reservation_ww_class_mutex); [86.875624] rlock(cpu_hotplug_lock); [86.875637] *** DEADLOCK *** [86.875650] 3 locks held by i915_module_loa/1432: [86.875663] #0: ffff888101f5c1b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x104/0x220 [86.875699] #1: ffffc90002e0b4a0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.876512] #2: ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.877305] stack backtrace: [86.877326] CPU: 0 UID: 0 PID: 1432 Comm: i915_module_loa Tainted: G U 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 PREEMPT(voluntary) [86.877334] Tainted: [U]=USER [86.877336] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0079.2020.0420.1316 04/20/2020 [86.877339] Call Trace: [86.877344] <TASK> [86.877353] dump_stack_lvl+0x91/0xf0 [86.877364] dump_stack+0x10/0x20 [86.877369] print_circular_bug+0x285/0x360 [86.877379] check_noncircular+0x135/0x150 [86.877390] __lock_acquire+0x1635/0x2810 [86.877403] lock_acquire+0xc4/0x2f0 [86.877408] ? stop_machine+0x1c/0x50 [86.877422] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878173] cpus_read_lock+0x41/0x100 [86.878182] ? stop_machine+0x1c/0x50 [86.878191] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878916] stop_machine+0x1c/0x50 [86.878927] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.879652] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.880375] __vma_bind+0x55/0x70 [i915] [86.881133] fence_work+0x26/0xa0 [i915] [86.881851] fence_notify+0xa1/0x140 [i915] [86.882566] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.883286] i915_sw_fence_commit+0x39/0x60 [i915] [86.884003] i915_vma_pin_ww+0x462/0x1360 [i915] [86.884756] ? i915_vma_pin.constprop.0+0x6c/0x1d0 [i915] [86.885513] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.886281] initial_plane_vma+0x307/0x840 [i915] [86.887049] intel_initial_plane_config+0x33f/0x670 [i915] [86.887819] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.888587] i915_driver_probe+0x7fa/0xe80 [i915] [86.889293] ? mutex_unlock+0x12/0x20 [86.889301] ? drm_privacy_screen_get+0x171/0x190 [86.889308] ? acpi_dev_found+0x66/0x80 [86.889321] i915_pci_probe+0xe6/0x220 [i915] [86.890038] local_pci_probe+0x47/0xb0 [86.890049] pci_device_probe+0xf3/0x260 [86.890058] really_probe+0xf1/0x3c0 [86.890067] __driver_probe_device+0x8c/0x180 [86.890072] driver_probe_device+0x24/0xd0 [86.890078] __driver_attach+0x10f/0x220 [86.890083] ? __pfx___driver_attach+0x10/0x10 [86.890088] bus_for_each_dev+0x7f/0xe0 [86.890097] driver_attach+0x1e/0x30 [86.890101] bus_add_driver+0x151/0x290 [86.890107] driver_register+0x5e/0x130 [86.890113] __pci_register_driver+0x7d/0x90 [86.890119] i915_pci_register_driver+0x23/0x30 [i915] [86.890833] i915_init+0x37/0x120 [i915] [86.891482] ? __pfx_i915_init+0x10/0x10 [i915] [86.892135] do_one_initcall+0x60/0x3f0 [86.892145] ? __kmalloc_cache_noprof+0x33f/0x470 [86.892157] do_init_module+0x97/0x2a0 [86.892164] load_module+0x2c54/0x2d80 [86.892168] ? __kernel_read+0x15c/0x300 [86.892185] ? kernel_read_file+0x2b1/0x320 [86.892195] init_module_from_file+0x96/0xe0 [86.892199] ? init_module_from_file+0x96/0xe0 [86.892211] idempotent_init_module+0x117/0x330 [86.892224] __x64_sys_finit_module+0x77/0x100 [86.892230] x64_sys_call+0x24de/0x2660 [86.892236] do_syscall_64+0x91/0xe90 [86.892243] ? irqentry_exit+0x77/0xb0 [86.892249] ? sysvec_apic_timer_interrupt+0x57/0xc0 [86.892256] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.892261] RIP: 0033:0x7303e1b2725d [86.892271] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48 [86.892276] RSP: 002b:00007ffddd1fdb38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [86.892281] RAX: ffffffffffffffda RBX: 00005d771d88fd90 RCX: 00007303e1b2725d [86.892285] RDX: 0000000000000000 RSI: 00005d771d893aa0 RDI: 000000000000000c [86.892287] RBP: 00007ffddd1fdbf0 R08: 0000000000000040 R09: 00007ffddd1fdb80 [86.892289] R10: 00007303e1c03b20 R11: 0000000000000246 R12: 00005d771d893aa0 [86.892292] R13: 0000000000000000 R14: 00005d771d88f0d0 R15: 00005d771d895710 [86.892304] </TASK> Call asynchronous variant of dma_fence_work_commit() in that case. v3: Provide more verbose in-line comment (Andi), - mention target environments in commit message. Fixes: 7d1c261 ("drm/i915: Take reservation lock around i915_vma_pin.") Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14985 Cc: Andi Shyti <[email protected]> Signed-off-by: Janusz Krzysztofik <[email protected]> Reviewed-by: Sebastian Brzezinka <[email protected]> Reviewed-by: Krzysztof Karas <[email protected]> Acked-by: Andi Shyti <[email protected]> Signed-off-by: Andi Shyti <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 648ef13) Signed-off-by: Rodrigo Vivi <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Nov 10, 2025
Michael Chan says: ==================== bnxt_en: Bug fixes Patches 1, 3, and 4 are bug fixes related to the FW log tracing driver coredump feature recently added in 6.13. Patch #1 adds the necessary call to shutdown the FW logging DMA during PCI shutdown. Patch #3 fixes a possible null pointer derefernce when using early versions of the FW with this feature. Patch #4 adds the coredump header information unconditionally to make it more robust. Patch #2 fixes a possible memory leak during PTP shutdown. Patch #5 eliminates a dmesg warning when doing devlink reload. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Nov 11, 2025
Add VMX exit handlers for SEAMCALL and TDCALL to inject a #UD if a non-TD guest attempts to execute SEAMCALL or TDCALL. Neither SEAMCALL nor TDCALL is gated by any software enablement other than VMXON, and so will generate a VM-Exit instead of e.g. a native #UD when executed from the guest kernel. Note! No unprivileged DoS of the L1 kernel is possible as TDCALL and SEAMCALL #GP at CPL > 0, and the CPL check is performed prior to the VMX non-root (VM-Exit) check, i.e. userspace can't crash the VM. And for a nested guest, KVM forwards unknown exits to L1, i.e. an L2 kernel can crash itself, but not L1. Note #2! The Intel® Trust Domain CPU Architectural Extensions spec's pseudocode shows the CPL > 0 check for SEAMCALL coming _after_ the VM-Exit, but that appears to be a documentation bug (likely because the CPL > 0 check was incorrectly bundled with other lower-priority #GP checks). Testing on SPR and EMR shows that the CPL > 0 check is performed before the VMX non-root check, i.e. SEAMCALL #GPs when executed in usermode. Note #3! The aforementioned Trust Domain spec uses confusing pseudocode that says that SEAMCALL will #UD if executed "inSEAM", but "inSEAM" specifically means in SEAM Root Mode, i.e. in the TDX-Module. The long- form description explicitly states that SEAMCALL generates an exit when executed in "SEAM VMX non-root operation". But that's a moot point as the TDX-Module injects #UD if the guest attempts to execute SEAMCALL, as documented in the "Unconditionally Blocked Instructions" section of the TDX-Module base specification. Cc: [email protected] Cc: Kai Huang <[email protected]> Cc: Xiaoyao Li <[email protected]> Cc: Rick Edgecombe <[email protected]> Cc: Dan Williams <[email protected]> Cc: Binbin Wu <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Nov 19, 2025
…/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.18, take #3 - Only adjust the ID registers when no irqchip has been created once per VM run, instead of doing it once per vcpu, as this otherwise triggers a pretty bad conbsistency check failure in the sysreg code. - Make sure the per-vcpu Fine Grain Traps are computed before we load the system registers on the HW, as we otherwise start running without anything set until the first preemption of the vcpu.
kdave
pushed a commit
that referenced
this pull request
Nov 28, 2025
For some reason, of_find_node_with_property() is creating a spinlock recursion issue along with fwnode_count_parents(), and this issue is making all MediaTek boards unbootable. As of kernel v6.18-rc6, there are only three users of this function, one of which is this driver. Migrate away from of_find_node_with_property() by adding a local scpsys_get_legacy_regmap_node() function, which acts similarly to of_find_node_with_property(), and calling the former in place of the latter. This resolves the following spinlock recursion issue: [ 1.773979] BUG: spinlock recursion on CPU#2, kworker/u24:1/60 [ 1.790485] lock: devtree_lock+0x0/0x40, .magic: dead4ead, .owner: kworker/u24:1/60, .owner_cpu: 2 [ 1.791644] CPU: 2 UID: 0 PID: 60 Comm: kworker/u24:1 Tainted: G W 6.18.0-rc6 #3 PREEMPT [ 1.791649] Tainted: [W]=WARN [ 1.791650] Hardware name: MediaTek Genio-510 EVK (DT) [ 1.791653] Workqueue: events_unbound deferred_probe_work_func [ 1.791658] Call trace: [ 1.791659] show_stack+0x18/0x30 (C) [ 1.791664] dump_stack_lvl+0x68/0x94 [ 1.791668] dump_stack+0x18/0x24 [ 1.791672] spin_dump+0x78/0x88 [ 1.791678] do_raw_spin_lock+0x110/0x140 [ 1.791684] _raw_spin_lock_irqsave+0x58/0x6c [ 1.791690] of_get_parent+0x28/0x74 [ 1.791694] of_fwnode_get_parent+0x38/0x7c [ 1.791700] fwnode_count_parents+0x34/0xf0 [ 1.791705] fwnode_full_name_string+0x28/0x120 [ 1.791710] device_node_string+0x3e4/0x50c [ 1.791715] pointer+0x294/0x430 [ 1.791718] vsnprintf+0x21c/0x5bc [ 1.791722] vprintk_store+0x108/0x47c [ 1.791728] vprintk_emit+0xc4/0x350 [ 1.791732] vprintk_default+0x34/0x40 [ 1.791736] vprintk+0x24/0x30 [ 1.791740] _printk+0x60/0x8c [ 1.791744] of_node_release+0x154/0x194 [ 1.791749] kobject_put+0xa0/0x120 [ 1.791753] of_node_put+0x18/0x28 [ 1.791756] of_find_node_with_property+0x74/0x100 [ 1.791761] scpsys_probe+0x338/0x5e0 [ 1.791765] platform_probe+0x5c/0xa4 [ 1.791770] really_probe+0xbc/0x2ac [ 1.791774] __driver_probe_device+0x78/0x118 [ 1.791779] driver_probe_device+0x3c/0x170 [ 1.791783] __device_attach_driver+0xb8/0x150 [ 1.791788] bus_for_each_drv+0x88/0xe8 [ 1.791792] __device_attach+0x9c/0x1a0 [ 1.791796] device_initial_probe+0x14/0x20 [ 1.791801] bus_probe_device+0xa0/0xa4 [ 1.791805] deferred_probe_work_func+0x88/0xd0 [ 1.791809] process_one_work+0x1e8/0x448 [ 1.791813] worker_thread+0x1ac/0x340 [ 1.791816] kthread+0x138/0x220 [ 1.791821] ret_from_fork+0x10/0x20 Fixes: c29345f ("pmdomain: mediatek: Refactor bus protection regmaps retrieval") Signed-off-by: AngeloGioacchino Del Regno <[email protected]> Tested-by: Louis-Alexis Eyraud <[email protected]> Tested-by: Macpaul Lin <[email protected]> Reviewed-by: Macpaul Lin <[email protected]> Signed-off-by: Ulf Hansson <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Nov 28, 2025
The U-Blox EVK-M101 enumerates as 1546:0506 [1] with four FTDI interfaces: - EVK-M101 current sensors - EVK-M101 I2C - EVK-M101 UART - EVK-M101 port D Only the third USB interface is a UART. This change lets ftdi_sio probe the VID/PID and registers only interface #3 as a TTY, leaving the rest available for other drivers. [1] usb 5-1.3: new high-speed USB device number 11 using xhci_hcd usb 5-1.3: New USB device found, idVendor=1546, idProduct=0506, bcdDevice= 8.00 usb 5-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0 usb 5-1.3: Product: EVK-M101 usb 5-1.3: Manufacturer: u-blox AG Datasheet: https://content.u-blox.com/sites/default/files/documents/EVK-M10_UserGuide_UBX-21003949.pdf Signed-off-by: Oleksandr Suvorov <[email protected]> Link: https://lore.kernel.org/[email protected]/ Cc: [email protected] Signed-off-by: Johan Hovold <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Dec 19, 2025
We sometimes observe use-after-free when dereferencing a neighbour [1]. The problem seems to be that the driver stores a pointer to the neighbour, but without holding a reference on it. A reference is only taken when the neighbour is used by a nexthop. Fix by simplifying the reference counting scheme. Always take a reference when storing a neighbour pointer in a neighbour entry. Avoid taking a referencing when the neighbour is used by a nexthop as the neighbour entry associated with the nexthop already holds a reference. Tested by running the test that uncovered the problem over 300 times. Without this patch the problem was reproduced after a handful of iterations. [1] BUG: KASAN: slab-use-after-free in mlxsw_sp_neigh_entry_update+0x2d4/0x310 Read of size 8 at addr ffff88817f8e3420 by task ip/3929 CPU: 3 UID: 0 PID: 3929 Comm: ip Not tainted 6.18.0-rc4-virtme-g36b21a067510 #3 PREEMPT(full) Hardware name: Nvidia SN5600/VMOD0013, BIOS 5.13 05/31/2023 Call Trace: <TASK> dump_stack_lvl+0x6f/0xa0 print_address_description.constprop.0+0x6e/0x300 print_report+0xfc/0x1fb kasan_report+0xe4/0x110 mlxsw_sp_neigh_entry_update+0x2d4/0x310 mlxsw_sp_router_rif_gone_sync+0x35f/0x510 mlxsw_sp_rif_destroy+0x1ea/0x730 mlxsw_sp_inetaddr_port_vlan_event+0xa1/0x1b0 __mlxsw_sp_inetaddr_lag_event+0xcc/0x130 __mlxsw_sp_inetaddr_event+0xf5/0x3c0 mlxsw_sp_router_netdevice_event+0x1015/0x1580 notifier_call_chain+0xcc/0x150 call_netdevice_notifiers_info+0x7e/0x100 __netdev_upper_dev_unlink+0x10b/0x210 netdev_upper_dev_unlink+0x79/0xa0 vrf_del_slave+0x18/0x50 do_set_master+0x146/0x7d0 do_setlink.isra.0+0x9a0/0x2880 rtnl_newlink+0x637/0xb20 rtnetlink_rcv_msg+0x6fe/0xb90 netlink_rcv_skb+0x123/0x380 netlink_unicast+0x4a3/0x770 netlink_sendmsg+0x75b/0xc90 __sock_sendmsg+0xbe/0x160 ____sys_sendmsg+0x5b2/0x7d0 ___sys_sendmsg+0xfd/0x180 __sys_sendmsg+0x124/0x1c0 do_syscall_64+0xbb/0xfd0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 [...] Allocated by task 109: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x7b/0x90 __kmalloc_noprof+0x2c1/0x790 neigh_alloc+0x6af/0x8f0 ___neigh_create+0x63/0xe90 mlxsw_sp_nexthop_neigh_init+0x430/0x7e0 mlxsw_sp_nexthop_type_init+0x212/0x960 mlxsw_sp_nexthop6_group_info_init.constprop.0+0x81f/0x1280 mlxsw_sp_nexthop6_group_get+0x392/0x6a0 mlxsw_sp_fib6_entry_create+0x46a/0xfd0 mlxsw_sp_router_fib6_replace+0x1ed/0x5f0 mlxsw_sp_router_fib6_event_work+0x10a/0x2a0 process_one_work+0xd57/0x1390 worker_thread+0x4d6/0xd40 kthread+0x355/0x5b0 ret_from_fork+0x1d4/0x270 ret_from_fork_asm+0x11/0x20 Freed by task 154: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x43/0x70 kmem_cache_free_bulk.part.0+0x1eb/0x5e0 kvfree_rcu_bulk+0x1f2/0x260 kfree_rcu_work+0x130/0x1b0 process_one_work+0xd57/0x1390 worker_thread+0x4d6/0xd40 kthread+0x355/0x5b0 ret_from_fork+0x1d4/0x270 ret_from_fork_asm+0x11/0x20 Last potentially related work creation: kasan_save_stack+0x30/0x50 kasan_record_aux_stack+0x8c/0xa0 kvfree_call_rcu+0x93/0x5b0 mlxsw_sp_router_neigh_event_work+0x67d/0x860 process_one_work+0xd57/0x1390 worker_thread+0x4d6/0xd40 kthread+0x355/0x5b0 ret_from_fork+0x1d4/0x270 ret_from_fork_asm+0x11/0x20 Fixes: 6cf3c97 ("mlxsw: spectrum_router: Add private neigh table") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/92d75e21d95d163a41b5cea67a15cd33f547cba6.1764695650.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Dec 19, 2025
Petr Machata says: ==================== selftests: forwarding: vxlan_bridge_1q_mc_ul: Fix flakiness The net/forwarding/vxlan_bridge_1q_mc_ul selftest runs an overlay traffic, forwarded over a multicast-routed VXLAN underlay. In order to determine whether packets reach their intended destination, it uses a TC match. For convenience, it uses a flower match, which however does not allow matching on the encapsulated packet. So various service traffic ends up being indistinguishable from the test packets, and ends up confusing the test. To alleviate the problem, the test uses sleep to allow the necessary service traffic to run and clear the channel, before running the test traffic. This worked for a while, but lately we have nevertheless seen flakiness of the test in the CI. In this patchset, first generalize tc_rule_stats_get() to support u32 in patch #1, then in patch #2 convert the test to use u32 to allow parsing deeper into the packet, and in #3 drop the now-unnecessary sleep. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Dec 19, 2025
Fix a loop scenario of ethx:egress->ethx:egress
Example setup to reproduce:
tc qdisc add dev ethx root handle 1: drr
tc filter add dev ethx parent 1: protocol ip prio 1 matchall \
action mirred egress redirect dev ethx
Now ping out of ethx and you get a deadlock:
[ 116.892898][ T307] ============================================
[ 116.893182][ T307] WARNING: possible recursive locking detected
[ 116.893418][ T307] 6.18.0-rc6-01205-ge05021a829b8-dirty #204 Not tainted
[ 116.893682][ T307] --------------------------------------------
[ 116.893926][ T307] ping/307 is trying to acquire lock:
[ 116.894133][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.894517][ T307]
[ 116.894517][ T307] but task is already holding lock:
[ 116.894836][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.895252][ T307]
[ 116.895252][ T307] other info that might help us debug this:
[ 116.895608][ T307] Possible unsafe locking scenario:
[ 116.895608][ T307]
[ 116.895901][ T307] CPU0
[ 116.896057][ T307] ----
[ 116.896200][ T307] lock(&sch->root_lock_key);
[ 116.896392][ T307] lock(&sch->root_lock_key);
[ 116.896605][ T307]
[ 116.896605][ T307] *** DEADLOCK ***
[ 116.896605][ T307]
[ 116.896864][ T307] May be due to missing lock nesting notation
[ 116.896864][ T307]
[ 116.897123][ T307] 6 locks held by ping/307:
[ 116.897302][ T307] #0: ffff88800b4b0250 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0xb20/0x2cf0
[ 116.897808][ T307] #1: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_output+0xa9/0x600
[ 116.898138][ T307] #2: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x2c6/0x1ee0
[ 116.898459][ T307] #3: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[ 116.898782][ T307] #4: ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.899132][ T307] #5: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[ 116.899442][ T307]
[ 116.899442][ T307] stack backtrace:
[ 116.899667][ T307] CPU: 2 UID: 0 PID: 307 Comm: ping Not tainted 6.18.0-rc6-01205-ge05021a829b8-dirty #204 PREEMPT(voluntary)
[ 116.899672][ T307] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 116.899675][ T307] Call Trace:
[ 116.899678][ T307] <TASK>
[ 116.899680][ T307] dump_stack_lvl+0x6f/0xb0
[ 116.899688][ T307] print_deadlock_bug.cold+0xc0/0xdc
[ 116.899695][ T307] __lock_acquire+0x11f7/0x1be0
[ 116.899704][ T307] lock_acquire+0x162/0x300
[ 116.899707][ T307] ? __dev_queue_xmit+0x2210/0x3b50
[ 116.899713][ T307] ? srso_alias_return_thunk+0x5/0xfbef5
[ 116.899717][ T307] ? stack_trace_save+0x93/0xd0
[ 116.899723][ T307] _raw_spin_lock+0x30/0x40
[ 116.899728][ T307] ? __dev_queue_xmit+0x2210/0x3b50
[ 116.899731][ T307] __dev_queue_xmit+0x2210/0x3b50
Fixes: 178ca30 ("Revert "net/sched: Fix mirred deadlock on device recursion"")
Tested-by: Victor Nogueira <[email protected]>
Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kdave
pushed a commit
that referenced
this pull request
Dec 20, 2025
Reject attempts to disable KVM_MEM_GUEST_MEMFD on a memslot that was initially created with a guest_memfd binding, as KVM doesn't support toggling KVM_MEM_GUEST_MEMFD on existing memslots. KVM prevents enabling KVM_MEM_GUEST_MEMFD, but doesn't prevent clearing the flag. Failure to reject the new memslot results in a use-after-free due to KVM not unbinding from the guest_memfd instance. Unbinding on a FLAGS_ONLY change is easy enough, and can/will be done as a hardening measure (in anticipation of KVM supporting dirty logging on guest_memfd at some point), but fixing the use-after-free would only address the immediate symptom. ================================================================== BUG: KASAN: slab-use-after-free in kvm_gmem_release+0x362/0x400 [kvm] Write of size 8 at addr ffff8881111ae908 by task repro/745 CPU: 7 UID: 1000 PID: 745 Comm: repro Not tainted 6.18.0-rc6-115d5de2eef3-next-kasan #3 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x51/0x60 print_report+0xcb/0x5c0 kasan_report+0xb4/0xe0 kvm_gmem_release+0x362/0x400 [kvm] __fput+0x2fa/0x9d0 task_work_run+0x12c/0x200 do_exit+0x6ae/0x2100 do_group_exit+0xa8/0x230 __x64_sys_exit_group+0x3a/0x50 x64_sys_call+0x737/0x740 do_syscall_64+0x5b/0x900 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f581f2eac31 </TASK> Allocated by task 745 on cpu 6 at 9.746971s: kasan_save_stack+0x20/0x40 kasan_save_track+0x13/0x50 __kasan_kmalloc+0x77/0x90 kvm_set_memory_region.part.0+0x652/0x1110 [kvm] kvm_vm_ioctl+0x14b0/0x3290 [kvm] __x64_sys_ioctl+0x129/0x1a0 do_syscall_64+0x5b/0x900 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Freed by task 745 on cpu 6 at 9.747467s: kasan_save_stack+0x20/0x40 kasan_save_track+0x13/0x50 __kasan_save_free_info+0x37/0x50 __kasan_slab_free+0x3b/0x60 kfree+0xf5/0x440 kvm_set_memslot+0x3c2/0x1160 [kvm] kvm_set_memory_region.part.0+0x86a/0x1110 [kvm] kvm_vm_ioctl+0x14b0/0x3290 [kvm] __x64_sys_ioctl+0x129/0x1a0 do_syscall_64+0x5b/0x900 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Reported-by: Alexander Potapenko <[email protected]> Fixes: a7800aa ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Cc: [email protected] Link: https://patch.msgid.link/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
http://lkml.kernel.org/g/[email protected]
The remaining qgroup fixes patches, based on the Chris' for-linus-4.12