@tmineno tmineno commented Feb 8, 2015

Fix ad9361_rfpll_recalc_rate and ad9361_rfpll_set_rate.
After this fix, these methods return or set the correct value even when
the user enables SDM Bypass or SDM PD (0x232[7:6] or 0x272[7:6]).

This request is the same as the No-OS one.
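
A minimal sketch of the idea, not the driver's actual code: when the sigma-delta modulator (SDM) is bypassed or powered down, the PLL effectively runs in integer-N mode, so the fractional word should not contribute to the computed rate. The function name, the modulus constant, and the divider encoding below are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed fractional modulus, for illustration only. */
#define TOY_RFPLL_MODULUS 8388593ULL

/*
 * Compute the LO rate from the reference rate and the PLL words.
 * If the SDM is bypassed or powered down, the fractional word is ignored.
 */
uint64_t toy_rfpll_rate(uint64_t ref_hz, uint32_t integer, uint32_t fract,
                        uint32_t vco_div, bool sdm_off)
{
    uint64_t rate;

    if (sdm_off)
        fract = 0; /* integer-N mode: no fractional contribution */

    rate = ref_hz * fract / TOY_RFPLL_MODULUS;
    rate += ref_hz * integer;

    /* Assumed encoding: output = VCO / 2^(vco_div + 1). */
    return rate >> (vco_div + 1);
}
```

With a 40 MHz reference, an integer word of 100, and `vco_div = 0`, the sketch yields exactly 2 GHz when the SDM is off, whatever the fractional word holds.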

@mhennerich
Contributor

Hi,

I pushed a similar patch: 68317ed

Thanks,
Michael

@mhennerich mhennerich closed this Feb 9, 2015
lclausen-adi pushed a commit that referenced this pull request Mar 31, 2015
Otherwise rcu_irq_{enter,exit}() do not happen and we get dumps like:

====================
[  188.275021] ===============================
[  188.309351] [ INFO: suspicious RCU usage. ]
[  188.343737] 3.18.0-rc3-00068-g20f3963-dirty #54 Not tainted
[  188.394786] -------------------------------
[  188.429170] include/linux/rcupdate.h:883 rcu_read_lock() used illegally while idle!
[  188.505235]
other info that might help us debug this:

[  188.554230]
RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 0
[  188.637587] RCU used illegally from extended quiescent state!
[  188.690684] 3 locks held by swapper/7/0:
[  188.721932]  #0:  (&x->wait#11){......}, at: [<0000000000495de8>] complete+0x8/0x60
[  188.797994]  #1:  (&p->pi_lock){-.-.-.}, at: [<000000000048510c>] try_to_wake_up+0xc/0x400
[  188.881343]  #2:  (rcu_read_lock){......}, at: [<000000000048a910>] select_task_rq_fair+0x90/0xb40
[  188.973043]
stack backtrace:
[  188.993879] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 3.18.0-rc3-00068-g20f3963-dirty #54
[  189.076187] Call Trace:
[  189.089719]  [0000000000499360] lockdep_rcu_suspicious+0xe0/0x100
[  189.147035]  [000000000048a99c] select_task_rq_fair+0x11c/0xb40
[  189.202253]  [00000000004852d8] try_to_wake_up+0x1d8/0x400
[  189.252258]  [000000000048554c] default_wake_function+0xc/0x20
[  189.306435]  [0000000000495554] __wake_up_common+0x34/0x80
[  189.356448]  [00000000004955b4] __wake_up_locked+0x14/0x40
[  189.406456]  [0000000000495e08] complete+0x28/0x60
[  189.448142]  [0000000000636e28] blk_end_sync_rq+0x8/0x20
[  189.496057]  [0000000000639898] __blk_mq_end_request+0x18/0x60
[  189.550249]  [00000000006ee014] scsi_end_request+0x94/0x180
[  189.601286]  [00000000006ee334] scsi_io_completion+0x1d4/0x600
[  189.655463]  [00000000006e51c4] scsi_finish_command+0xc4/0xe0
[  189.708598]  [00000000006ed958] scsi_softirq_done+0x118/0x140
[  189.761735]  [00000000006398ec] __blk_mq_complete_request_remote+0xc/0x20
[  189.827383]  [00000000004c75d0] generic_smp_call_function_single_interrupt+0x150/0x1c0
[  189.906581]  [000000000043e514] smp_call_function_single_client+0x14/0x40
====================

Based almost entirely upon a patch by Paul E. McKenney.

Reported-by: Meelis Roos <[email protected]>
Tested-by: Meelis Roos <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Mar 31, 2015
The cx23885 driver still used sg++ instead of sg = sg_next(sg). This worked with
vb1 since that filled in the sglist manually, page-by-page, but it fails with vb2
which uses core scatterlist code that can combine contiguous scatterlist entries
into one larger entry.
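
The general hazard can be illustrated with a toy user-space model. This is only a sketch: `struct toy_sg` and the helpers are hypothetical stand-ins, not the kernel's `struct scatterlist` API. It shows that a list walker must not assume a flat array of equally sized entries: entries may cover more than one page, and a slot may be a link to another array rather than data.

```c
#include <stddef.h>

/* Toy stand-in for a scatterlist entry; NOT the kernel's struct scatterlist. */
struct toy_sg {
    size_t length;        /* bytes described by a data entry */
    struct toy_sg *chain; /* non-NULL: this slot links to another array */
    int is_last;          /* marks the final data entry */
};

/* Advance to the next data entry, following a chain link if present. */
struct toy_sg *toy_sg_next(struct toy_sg *sg)
{
    if (sg->is_last)
        return NULL;
    sg++;
    if (sg->chain)
        sg = sg->chain;
    return sg;
}

/* Sum the lengths of all data entries by walking with toy_sg_next(). */
size_t toy_sg_total(struct toy_sg *sg)
{
    size_t total = 0;

    for (; sg; sg = toy_sg_next(sg))
        total += sg->length;
    return total;
}
```

A plain `sg++` in this model would treat the link slot as a zero-length data entry and then run off the end of the first array, which is the kind of mistake the accessor is meant to prevent.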

This bug led to the following crash as reported by Mariusz:

[20712.990258] BUG: Bad page state in process vb2-cx23885[0]  pfn:2ca34
[20712.990265] page:ffffea00009c3b60 count:-1 mapcount:0 mapping:          (null) index:0x0
[20712.990266] flags: 0x4000000000000000()
[20712.990268] page dumped because: nonzero _count
[20712.990269] Modules linked in: tun binfmt_misc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_mark xt_REDIRECT xt_limit xt_conntrack xt_nat xt_tcpudp iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables sit ip_tunnel nvidia(PO) stb6100 stv090x cx88_dvb videobuf_dvb cx88_vp3054_i2c tuner kvm_amd kvm cx8802 k10temp cx8800 cx88xx btcx_risc videobuf_dma_sg videobuf_core usb_storage ds2490 usbhid ftdi_sio cx23885 tveeprom cx2341x videobuf2_dvb videobuf2_core videobuf2_dma_sg videobuf2_memops asus_atk0110 snd_emu10k1 snd_hwdep snd_util_mem snd_ac97_codec ac97_bus snd_rawmidi snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd w1_therm wire ipv6
[20712.990301] CPU: 2 PID: 26942 Comm: vb2-cx23885[0] Tainted: P    B   W  O   3.18.0-rc5-00001-gb3652d1 #2
[20712.990303] Hardware name: System manufacturer System Product Name/M4A785TD-V EVO, BIOS 2105    07/23/2010
[20712.990305]  ffffffff81765734 ffff880137683a78 ffffffff815b6b32 0000000000000006
[20712.990307]  ffffea00009c3b60 ffff880137683aa8 ffffffff8108ec27 ffffffff81765712
[20712.990309]  ffffffff8189c840 0000000000000246 ffffea00009c3b60 ffff880137683b78
[20712.990312] Call Trace:
[20712.990317]  [<ffffffff815b6b32>] dump_stack+0x46/0x58
[20712.990321]  [<ffffffff8108ec27>] bad_page+0xe9/0x107
[20712.990323]  [<ffffffff810912ca>] get_page_from_freelist+0x3b2/0x505
[20712.990326]  [<ffffffff8109150a>] __alloc_pages_nodemask+0xed/0x65f
[20712.990330]  [<ffffffff81047a52>] ? ttwu_do_activate.constprop.78+0x57/0x5c
[20712.990332]  [<ffffffff81049ff3>] ? try_to_wake_up+0x21b/0x22d
[20712.990336]  [<ffffffff810070f4>] dma_generic_alloc_coherent+0x6e/0xf5
[20712.990339]  [<ffffffff810261a9>] gart_alloc_coherent+0x105/0x114
[20712.990341]  [<ffffffff81025963>] ? flush_gart+0x39/0x3d
[20712.990343]  [<ffffffff810260a4>] ? gart_map_sg+0x3a0/0x3a0
[20712.990349]  [<ffffffffa0141a1e>] cx23885_risc_databuffer+0xa7/0x133 [cx23885]
[20712.990354]  [<ffffffffa0142764>] cx23885_buf_prepare+0x121/0x134 [cx23885]
[20712.990359]  [<ffffffffa0144210>] buffer_prepare+0x14/0x16 [cx23885]
[20712.990363]  [<ffffffffa011f101>] __buf_prepare+0x190/0x279 [videobuf2_core]
[20712.990366]  [<ffffffffa011d906>] ? vb2_queue_or_prepare_buf+0xb8/0xc0 [videobuf2_core]
[20712.990369]  [<ffffffffa011f34b>] vb2_internal_qbuf+0x51/0x1e5 [videobuf2_core]
[20712.990372]  [<ffffffffa0120537>] vb2_thread+0x199/0x1f6 [videobuf2_core]
[20712.990376]  [<ffffffffa012039e>] ? vb2_fop_write+0xdf/0xdf [videobuf2_core]
[20712.990379]  [<ffffffff81043e61>] kthread+0xdf/0xe7
[20712.990381]  [<ffffffff81043d82>] ? kthread_create_on_node+0x16d/0x16d
[20712.990384]  [<ffffffff815bd46c>] ret_from_fork+0x7c/0xb0
[20712.990386]  [<ffffffff81043d82>] ? kthread_create_on_node+0x16d/0x16d

Signed-off-by: Hans Verkuil <[email protected]>
Reported-by: Mariusz Bialonczyk <[email protected]>
Tested-by: Mariusz Bialonczyk <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
When unloading the module 'g_hid.ko', the urb request is dequeued and the
completion routine is executed. If there is no urb packet, the urb request
is never added to the endpoint queue and the completion routine pointer in
the urb request is NULL.

Calling through this NULL function pointer causes the Oops reported
below.

Add the code to check if the urb request is in the endpoint queue
or not. If the urb request is not in the endpoint queue, a negative
error code will be returned.
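
A rough user-space sketch of that check, under stated assumptions: the `toy_` types, the singly linked queue, and the choice of `-EINVAL` and `-ECONNRESET` are illustrative and not taken from the atmel_usba_udc driver.

```c
#include <errno.h>
#include <stddef.h>

struct toy_req {
    struct toy_req *next;
    void (*complete)(struct toy_req *req);
    int status;
};

struct toy_ep {
    struct toy_req *queue; /* singly linked list of queued requests */
};

/*
 * Remove req from the endpoint queue and complete it.
 * Returns -EINVAL, without touching req->complete, if req was never queued.
 */
int toy_ep_dequeue(struct toy_ep *ep, struct toy_req *req)
{
    struct toy_req **pp;

    /* Search the queue for the request before doing anything with it. */
    for (pp = &ep->queue; *pp; pp = &(*pp)->next) {
        if (*pp == req)
            break;
    }
    if (!*pp)
        return -EINVAL;

    *pp = req->next;          /* unlink */
    req->next = NULL;
    req->status = -ECONNRESET;
    if (req->complete)
        req->complete(req);
    return 0;
}
```

The point of the design is that the completion callback is only reached for requests that were actually found on the queue.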

Here is the Oops log:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = dedf0000
[00000000] *pgd=3ede5831, *pte=00000000, *ppte=00000000
Internal error: Oops: 80000007 [#1] ARM
Modules linked in: g_hid(-) usb_f_hid libcomposite
CPU: 0 PID: 923 Comm: rmmod Not tainted 3.18.0+ #2
Hardware name: Atmel SAMA5 (Device Tree)
task: df6b1100 ti: dedf6000 task.ti: dedf6000
PC is at 0x0
LR is at usb_gadget_giveback_request+0xc/0x10
pc : [<00000000>]    lr : [<c02ace88>]    psr: 60000093
sp : dedf7eb0  ip : df572634  fp : 00000000
r10: 00000000  r9 : df52e210  r8 : 60000013
r7 : df6a9858  r6 : df52e210  r5 : df6a9858  r4 : df572600
r3 : 00000000  r2 : ffffff98  r1 : df572600  r0 : df6a9868
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c53c7d  Table: 3edf0059  DAC: 00000015
Process rmmod (pid: 923, stack limit = 0xdedf6230)
Stack: (0xdedf7eb0 to 0xdedf8000)
7ea0:                                     00000000 c02adbbc df572580 deced608
7ec0: df572600 df6a9868 df572634 c02aed3c df577c00 c01b8608 00000000 df6be27c
7ee0: 00200200 00100100 bf0162f4 c000e544 dedf6000 00000000 00000000 bf010c00
7f00: bf0162cc bf00159c 00000000 df572980 df52e218 00000001 df5729b8 bf0031d0
[..]
[<c02ace88>] (usb_gadget_giveback_request) from [<c02adbbc>] (request_complete+0x64/0x88)
[<c02adbbc>] (request_complete) from [<c02aed3c>] (usba_ep_dequeue+0x70/0x128)
[<c02aed3c>] (usba_ep_dequeue) from [<bf010c00>] (hidg_unbind+0x50/0x7c [usb_f_hid])
[<bf010c00>] (hidg_unbind [usb_f_hid]) from [<bf00159c>] (remove_config.isra.6+0x98/0x9c [libcomposite])
[<bf00159c>] (remove_config.isra.6 [libcomposite]) from [<bf0031d0>] (__composite_unbind+0x34/0x98 [libcomposite])
[<bf0031d0>] (__composite_unbind [libcomposite]) from [<c02acee0>] (usb_gadget_remove_driver+0x50/0x78)
[<c02acee0>] (usb_gadget_remove_driver) from [<c02ad570>] (usb_gadget_unregister_driver+0x64/0x94)
[<c02ad570>] (usb_gadget_unregister_driver) from [<bf0160c0>] (hidg_cleanup+0x10/0x34 [g_hid])
[<bf0160c0>] (hidg_cleanup [g_hid]) from [<c0056748>] (SyS_delete_module+0x118/0x19c)
[<c0056748>] (SyS_delete_module) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
Code: bad PC value

Signed-off-by: Songjun Wu <[email protected]>
[[email protected]: reworked the commit message]
Signed-off-by: Nicolas Ferre <[email protected]>
Fixes: 914a3f3 ("USB: add atmel_usba_udc driver")
Cc: <[email protected]> # 2.6.x-ish
Signed-off-by: Felipe Balbi <[email protected]>
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
This patch is to fix two deadlock cases.
Deadlock 1:
CPU #1
 pinctrl_register-> pinctrl_get ->
 create_pinctrl
 (Holding lock pinctrl_maps_mutex)
 -> get_pinctrl_dev_from_devname
 (Trying to acquire lock pinctrldev_list_mutex)
CPU #0
 pinctrl_unregister
 (Holding lock pinctrldev_list_mutex)
 -> pinctrl_put -> pinctrl_free ->
 pinctrl_dt_free_maps -> pinctrl_unregister_map
 (Trying to acquire lock pinctrl_maps_mutex)

Put simply, with A = pinctrl_maps_mutex and B = pinctrldev_list_mutex:
CPU#1 is holding lock A and trying to acquire lock B,
CPU#0 is holding lock B and trying to acquire lock A.

Deadlock 2:
CPU #3
 pinctrl_register-> pinctrl_get ->
 create_pinctrl
 (Holding lock pinctrl_maps_mutex)
 -> get_pinctrl_dev_from_devname
 (Trying to acquire lock pinctrldev_list_mutex)
CPU #2
 pinctrl_unregister
 (Holding lock pctldev->mutex)
 -> pinctrl_put -> pinctrl_free ->
 pinctrl_dt_free_maps -> pinctrl_unregister_map
 (Trying to acquire lock pinctrl_maps_mutex)
CPU #0
 tegra_gpio_request
 (Holding lock pinctrldev_list_mutex)
 -> pinctrl_get_device_gpio_range
 (Trying to acquire lock pctldev->mutex)

Put simply, with A = pinctrl_maps_mutex, B = pctldev->mutex and
D = pinctrldev_list_mutex:
CPU#3 is holding lock A and trying to acquire lock D,
CPU#2 is holding lock B and trying to acquire lock A,
CPU#0 is holding lock D and trying to acquire lock B.
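
Both cases are circular waits. As an illustrative sketch (the `toy_` names are hypothetical and this is not how the kernel detects deadlocks), the hold/want relationships above can be checked for a cycle by following "who holds the lock I want" until the chain either ends or returns to its start:

```c
#include <stdbool.h>
#include <stddef.h>

/* One CPU (or task): the lock it holds and the lock it is waiting for. */
struct toy_waiter {
    char holds;
    char wants;
};

/*
 * Return true if waiter 'start' is part of a circular wait.
 * Assumes each lock is held by at most one waiter.
 */
bool toy_in_deadlock(const struct toy_waiter *w, size_t n, size_t start)
{
    size_t cur = start;
    size_t steps, i;

    for (steps = 0; steps < n; steps++) {
        /* Find who holds the lock that 'cur' wants. */
        for (i = 0; i < n; i++) {
            if (w[i].holds == w[cur].wants)
                break;
        }
        if (i == n)
            return false; /* wanted lock is free: no deadlock */
        if (i == start)
            return true;  /* chain came back to the start */
        cur = i;
    }
    return false;
}
```

Feeding it `{A,B},{B,A}` for Deadlock 1 or `{A,D},{B,A},{D,B}` for Deadlock 2 reports a cycle; a consistent lock order such as `{A,B},{B,C}` does not.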

Cc: Stable <[email protected]>
Signed-off-by: Jim Lin <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
It is possible for ata_sff_flush_pio_task() to set ap->hsm_task_state to
HSM_ST_IDLE in between the time __ata_sff_port_intr() checks for HSM_ST_IDLE
and before it calls ata_sff_hsm_move() causing ata_sff_hsm_move() to BUG().

This problem is hard to reproduce making this patch hard to verify, but this
fix will prevent the race.

I have not been able to reproduce the problem, but here is a crash dump from
a 2.6.32 kernel.

On examining the ata port's state, its hsm_task_state field has a value of HSM_ST_IDLE:

crash> struct ata_port.hsm_task_state ffff881c1121c000
  hsm_task_state = 0

Normally, this should not be possible as ata_sff_hsm_move() was called from ata_sff_host_intr(),
which checks hsm_task_state and won't call ata_sff_hsm_move() if it has a HSM_ST_IDLE value.

PID: 11053  TASK: ffff8816e846cae0  CPU: 0   COMMAND: "sshd"
 #0 [ffff88008ba03960] machine_kexec at ffffffff81038f3b
 #1 [ffff88008ba039c0] crash_kexec at ffffffff810c5d92
 #2 [ffff88008ba03a90] oops_end at ffffffff8152b510
 #3 [ffff88008ba03ac0] die at ffffffff81010e0b
 #4 [ffff88008ba03af0] do_trap at ffffffff8152ad74
 #5 [ffff88008ba03b50] do_invalid_op at ffffffff8100cf95
 #6 [ffff88008ba03bf0] invalid_op at ffffffff8100bf9b
    [exception RIP: ata_sff_hsm_move+317]
    RIP: ffffffff813a77ad  RSP: ffff88008ba03ca0  RFLAGS: 00010097
    RAX: 0000000000000000  RBX: ffff881c1121dc60  RCX: 0000000000000000
    RDX: ffff881c1121dd10  RSI: ffff881c1121dc60  RDI: ffff881c1121c000
    RBP: ffff88008ba03d00   R8: 0000000000000000   R9: 000000000000002e
    R10: 000000000001003f  R11: 000000000000009b  R12: ffff881c1121c000
    R13: 0000000000000000  R14: 0000000000000050  R15: ffff881c1121dd78
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88008ba03d08] ata_sff_host_intr at ffffffff813a7fbd
 #8 [ffff88008ba03d38] ata_sff_interrupt at ffffffff813a821e
 #9 [ffff88008ba03d78] handle_IRQ_event at ffffffff810e6ec0
--- <IRQ stack> ---
    [exception RIP: pipe_poll+48]
    RIP: ffffffff81192780  RSP: ffff880f26d459b8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff880f26d459c8  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffff881a0539fa80
    RBP: ffffffff8100bb8e   R8: ffff8803b23324a0   R9: 0000000000000000
    R10: ffff880f26d45dd0  R11: 0000000000000008  R12: ffffffff8109b646
    R13: ffff880f26d45948  R14: 0000000000000246  R15: 0000000000000246
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
    RIP: 00007f26017435c3  RSP: 00007fffe020c420  RFLAGS: 00000206
    RAX: 0000000000000017  RBX: ffffffff8100b072  RCX: 00007fffe020c45c
    RDX: 00007f2604a3f120  RSI: 00007f2604a3f140  RDI: 000000000000000d
    RBP: 0000000000000000   R8: 00007fffe020e570   R9: 0101010101010101
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fffe020e5f0
    R13: 00007fffe020e5f4  R14: 00007f26045f373c  R15: 00007fffe020e5e0
    ORIG_RAX: 0000000000000017  CS: 0033  SS: 002b

Somewhere between the ata_sff_hsm_move() check and the ata_sff_host_intr() check, the value changed.
On examining the other cpus to see what else was running, another cpu was running the error handler
routines:

PID: 326    TASK: ffff881c11014aa0  CPU: 1   COMMAND: "scsi_eh_1"
 #0 [ffff88008ba27e90] crash_nmi_callback at ffffffff8102fee6
 #1 [ffff88008ba27ea0] notifier_call_chain at ffffffff8152d515
 #2 [ffff88008ba27ee0] atomic_notifier_call_chain at ffffffff8152d57a
 #3 [ffff88008ba27ef0] notify_die at ffffffff810a154e
 #4 [ffff88008ba27f20] do_nmi at ffffffff8152b1db
 #5 [ffff88008ba27f50] nmi at ffffffff8152aaa0
    [exception RIP: _spin_lock_irqsave+47]
    RIP: ffffffff8152a1ff  RSP: ffff881c11a73aa0  RFLAGS: 00000006
    RAX: 0000000000000001  RBX: ffff881c1121deb8  RCX: 0000000000000000
    RDX: 0000000000000246  RSI: 0000000000000020  RDI: ffff881c122612d8
    RBP: ffff881c11a73aa0   R8: ffff881c17083800   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff881c1121c000
    R13: 000000000000001f  R14: ffff881c1121dd50  R15: ffff881c1121dc60
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #6 [ffff881c11a73aa0] _spin_lock_irqsave at ffffffff8152a1ff
 #7 [ffff881c11a73aa8] ata_exec_internal_sg at ffffffff81396fb5
 #8 [ffff881c11a73b58] ata_exec_internal at ffffffff81397109
 #9 [ffff881c11a73bd8] atapi_eh_request_sense at ffffffff813a34eb

Before it tried to acquire a spinlock, ata_exec_internal_sg() called ata_sff_flush_pio_task().
This function will set ap->hsm_task_state to HSM_ST_IDLE, and has no locking around setting this
value. ata_sff_flush_pio_task() can then race with the interrupt handler and potentially set
HSM_ST_IDLE at a fatal moment, which will trigger a kernel BUG.
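
The shape of the fix can be sketched in user space as follows. This is a simplified model under assumptions: the `toy_` types and the C11 `atomic_flag` spinlock are stand-ins, not libata code. The essential point is that the writer and the check-then-act reader take the same lock, so the state cannot change between the check and the state-machine step.

```c
#include <stdatomic.h>

enum toy_hsm_state { TOY_HSM_IDLE, TOY_HSM_ACTIVE };

struct toy_port {
    atomic_flag lock;             /* toy spinlock */
    enum toy_hsm_state hsm_state; /* protected by lock */
    int moves;                    /* how many times the state machine ran */
};

static void toy_lock(struct toy_port *ap)
{
    while (atomic_flag_test_and_set_explicit(&ap->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_port *ap)
{
    atomic_flag_clear_explicit(&ap->lock, memory_order_release);
}

/* Error-handler side: reset the state, but only while holding the lock. */
void toy_flush_pio_task(struct toy_port *ap)
{
    toy_lock(ap);
    ap->hsm_state = TOY_HSM_IDLE;
    toy_unlock(ap);
}

/* Interrupt side: check and act on the state under the same lock. */
int toy_port_intr(struct toy_port *ap)
{
    int handled = 0;

    toy_lock(ap);
    if (ap->hsm_state != TOY_HSM_IDLE) {
        ap->moves++; /* stands in for the state-machine step */
        handled = 1;
    }
    toy_unlock(ap);
    return handled;
}
```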

v2: Fixup comment in ata_sff_flush_pio_task()

tj: Further updated comment.  Use ap->lock instead of shost lock and
    use the [un]lock_irq variant instead of the irqsave/restore one.

Signed-off-by: David Milburn <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
resp_rsup_opcodes() may get called from atomic context and would need to
use GFP_ATOMIC for allocations:

[ 1237.913419] BUG: sleeping function called from invalid context at mm/slub.c:1262
[ 1237.914865] in_atomic(): 1, irqs_disabled(): 0, pid: 7556, name: trinity-c311
[ 1237.916142] 3 locks held by trinity-c311/7556:
[ 1237.916981] #0: (sb_writers#5){.+.+.+}, at: do_readv_writev (include/linux/fs.h:2346 fs/read_write.c:844)
[ 1237.919713] #1: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:297)
[ 1237.922626] Mutex: counter: -1 owner: trinity-c311
[ 1237.924044] #2: (s_active#51){.+.+.+}, at: kernfs_fop_write (fs/kernfs/file.c:297)
[ 1237.925960] Preemption disabled blk_execute_rq_nowait (block/blk-exec.c:95)
[ 1237.927416]
[ 1237.927680] CPU: 24 PID: 7556 Comm: trinity-c311 Not tainted 3.19.0-rc4-next-20150116-sasha-00054-g4ad498c-dirty #1744
[ 1237.929603]  ffff8804fc9d8000 ffff8804d9bc3548 ffffffff9d439fb2 0000000000000000
[ 1237.931097]  0000000000000000 ffff8804d9bc3588 ffffffff9a18389a ffff8804d9bc3598
[ 1237.932466]  ffffffff9a1b1715 ffffffffa15935d8 ffffffff9e6f8cb1 00000000000004ee
[ 1237.933984] Call Trace:
[ 1237.934434] dump_stack (lib/dump_stack.c:52)
[ 1237.935323] ___might_sleep (kernel/sched/core.c:7339)
[ 1237.936259] ? mark_held_locks (kernel/locking/lockdep.c:2549)
[ 1237.937293] __might_sleep (kernel/sched/core.c:7305)
[ 1237.938272] __kmalloc (mm/slub.c:1262 mm/slub.c:2419 mm/slub.c:2491 mm/slub.c:3291)
[ 1237.939137] ? resp_rsup_opcodes (include/linux/slab.h:435 drivers/scsi/scsi_debug.c:1689)
[ 1237.940173] resp_rsup_opcodes (include/linux/slab.h:435 drivers/scsi/scsi_debug.c:1689)
[ 1237.941211] ? add_host_store (drivers/scsi/scsi_debug.c:1584)
[ 1237.942261] scsi_debug_queuecommand (drivers/scsi/scsi_debug.c:5276)
[ 1237.943404] ? blk_rq_map_sg (block/blk-merge.c:254)
[ 1237.944398] ? scsi_init_sgtable (drivers/scsi/scsi_lib.c:1095)
[ 1237.945402] sdebug_queuecommand_lock_or_not (drivers/scsi/scsi_debug.c:5300)
[ 1237.946735] scsi_dispatch_cmd (drivers/scsi/scsi_lib.c:1706)
[ 1237.947720] scsi_queue_rq (drivers/scsi/scsi_lib.c:1996)
[ 1237.948687] __blk_mq_run_hw_queue (block/blk-mq.c:816)
[ 1237.949796] blk_mq_run_hw_queue (block/blk-mq.c:896)
[ 1237.950903] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:154 kernel/locking/spinlock.c:183)
[ 1237.951862] blk_mq_insert_request (block/blk-mq.c:1037)
[ 1237.952876] blk_execute_rq_nowait (block/blk-exec.c:95)
[ 1237.953981] ? lockdep_init_map (kernel/locking/lockdep.c:3034)
[ 1237.954967] blk_execute_rq (block/blk-exec.c:131)
[ 1237.955929] ? blk_rq_bio_prep (block/blk-core.c:2835)
[ 1237.956913] scsi_execute (drivers/scsi/scsi_lib.c:252)
[ 1237.957821] scsi_execute_req_flags (drivers/scsi/scsi_lib.c:281)
[ 1237.958968] scsi_report_opcode (drivers/scsi/scsi.c:956)
[ 1237.960009] sd_revalidate_disk (drivers/scsi/sd.c:2707 drivers/scsi/sd.c:2792)
[ 1237.961139] revalidate_disk (fs/block_dev.c:1081)
[ 1237.962223] sd_rescan (drivers/scsi/sd.c:1532)
[ 1237.963142] scsi_rescan_device (drivers/scsi/scsi_scan.c:1579)
[ 1237.964165] store_rescan_field (drivers/scsi/scsi_sysfs.c:672)
[ 1237.965254] dev_attr_store (drivers/base/core.c:138)
[ 1237.966319] sysfs_kf_write (fs/sysfs/file.c:131)
[ 1237.967289] kernfs_fop_write (fs/kernfs/file.c:311)
[ 1237.968274] do_readv_writev (fs/read_write.c:722 fs/read_write.c:854)
[ 1237.969295] ? __acct_update_integrals (kernel/tsacct.c:145)
[ 1237.970452] ? kernfs_fop_open (fs/kernfs/file.c:271)
[ 1237.971505] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:154 kernel/locking/spinlock.c:183)
[ 1237.972512] ? context_tracking_user_exit (include/linux/vtime.h:89 include/linux/jump_label.h:114 include/trace/events/context_tracking.h:47 kernel/context_tracking.c:140)
[ 1237.973668] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2578 kernel/locking/lockdep.c:2625)
[ 1237.974882] ? trace_hardirqs_on (kernel/locking/lockdep.c:2633)
[ 1237.975850] vfs_writev (fs/read_write.c:893)
[ 1237.976691] SyS_writev (fs/read_write.c:926 fs/read_write.c:917)
[ 1237.977538] system_call_fastpath (arch/x86/kernel/entry_64.S:423)

Signed-off-by: Sasha Levin <[email protected]>
Acked-by: Douglas Gilbert <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
Commit e61734c ("cgroup: remove cgroup->name") added two extra
newlines to memcg oom kill log messages.  This makes dmesg hard to read
and parse.  The issue affects 3.15+.

Example:

  Task in /t                          <<< extra #1
   killed as a result of limit of /t
                                      <<< extra #2
  memory: usage 102400kB, limit 102400kB, failcnt 274712

Remove the extra newlines from memcg oom kill messages, so the messages
look like:

  Task in /t killed as a result of limit of /t
  memory: usage 102400kB, limit 102400kB, failcnt 240649
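
As a small illustration of the intended output format (a user-space sketch with a hypothetical helper, not the way the kernel patch is implemented), the two lines can be produced from a single format string so that no stray newline can appear between the pieces:

```c
#include <stdio.h>

/*
 * Build the two-line report in one buffer so that no stray newline can
 * end up between the pieces. Returns the snprintf() result.
 */
int toy_format_oom(char *buf, size_t len, const char *task_cg,
                   const char *limit_cg, unsigned long usage_kb,
                   unsigned long limit_kb, unsigned long failcnt)
{
    return snprintf(buf, len,
                    "Task in %s killed as a result of limit of %s\n"
                    "memory: usage %lukB, limit %lukB, failcnt %lu\n",
                    task_cg, limit_cg, usage_kb, limit_kb, failcnt);
}
```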

Fixes: e61734c ("cgroup: remove cgroup->name")
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
Ben Hutchings says:

====================
Fixes for sh_eth #2

I'm continuing review and testing of Ethernet support on the R-Car H2
chip.  This series fixes more of the issues I've found, but it won't be
the last set.

These are not tested on any of the other supported chips.
====================

Signed-off-by: David S. Miller <[email protected]>
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
Nicolas Dichtel says:

====================
netns: audit netdevice creation with IFLA_NET_NS_[PID|FD]

When one of these attributes is set, the netdevice is created in the netns
pointed to by IFLA_NET_NS_[PID|FD] (see the call to rtnl_create_link() in
rtnl_newlink()). Let's call this netns the dest_net. After this creation, if the
newlink handler exists, it is called with a netns argument that points to the
netns where the netlink message has been received (called src_net in the code)
which is the link netns.
Hence, with one of these attributes, it's possible to create an x-netns
netdevice.

Here is the result of my code review:
- all ip tunnels (sit, ipip, ip6_tunnels, gre[tap][v6], ip_vti[6]) do not
  really allow this feature to be used: the netdevice is created in the dest_net
  and the src_net is completely ignored in the newlink handler.
- VLAN properly handles this x-netns creation.
- bridge ignores src_net, which seems fine (NETIF_F_NETNS_LOCAL is set).
- CAIF subsystem is not clear to me (I don't know how it works), but it seems
  to wrongly use src_net. Patch #1 tries to fix this, but it was done only by
  code review (and only compile-tested), so please review it carefully. I may
  have missed something.
- HSR subsystem uses src_net to parse IFLA_HSR_SLAVE[1|2], but the netdevice has
  the flag NETIF_F_NETNS_LOCAL, so the question is: does this netdevice really
  support x-netns? If not, the newlink handler should use the dest_net instead
  of src_net; I can provide the patch.
- ieee802154 also uses src_net and does not have NETIF_F_NETNS_LOCAL. Same
  question: does this netdevice really support x-netns?
- bonding ignores src_net and flag NETIF_F_NETNS_LOCAL is set, ie x-netns is not
  supported. Fine.
- CAN does not support rtnl/newlink, ok.
- ipvlan uses src_net and does not have NETIF_F_NETNS_LOCAL. After looking at
  the code, it seems that this driver supports x-netns. Am I right?
- macvlan/macvtap uses src_net and seems to have x-netns support.
- team ignores src_net and has the flag NETIF_F_NETNS_LOCAL, ie x-netns is not
  supported. Ok.
- veth uses src_net and has x-netns support ;-) Ok.
- VXLAN didn't properly handle this. The link netns (vxlan->net) is the src_net
  and not dest_net (see patch #2). Note that it was already possible to create an
  x-netns vxlan before the commit f01ec1c ("vxlan: add x-netns support")
  but the netdevice remains broken.

To summarize:
 - CAIF patch must be carefully reviewed
 - for HSR, ieee802154, ipvlan: is x-netns supported?
====================

Signed-off-by: David S. Miller <[email protected]>
lclausen-adi pushed a commit that referenced this pull request May 4, 2015
Currently DC ZVA is used to zero out memory, which causes an alignment
fault for the following reason:
"If the memory region being zeroed is any type of Device memory, these
instructions give an alignment fault which is prioritized in the same way
as other alignment faults that are determined by the memory type."
(from the ARM reference manual).

This patch is based on the following links:
https://git.linaro.org/people/zhichang.yuan/cortex_string.git/blobdiff/de7ac2e7e8e1a742a6e4f5304621b7fec00b8c83..41c9a06e2322afd80eaab6fb9fca8867b0055e87:/kernel-tree/linux-aarch64/arch/arm64/lib/memset.S

https://git.linaro.org/people/zhichang.yuan/cortex_string.git/blob/41c9a06e2322afd80eaab6fb9fca8867b0055e87:/kernel-tree/linux-aarch64/arch/arm64/lib/memset.S

thread conversation:
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-December/217997.html

Additional note:
memset calls dc zva to zero the memory, but the vring0 memory is treated
as not being part of system memory (even though it is part of DDR). Vring0 is
defined in the dts as <0x0 0x3ed00000 0x800000>. This vring0 memory
is not mapped by the Linux kernel, so we can use dma_declare_coherent_memory
to declare it for DMA operations.

object dump for PC:
ffffffc0004038e4:       8b040108        add     x8, x8, x4
ffffffc0004038e8:       cb050042        sub     x2, x2, x5
ffffffc0004038ec:       d50b7428        dc      zva, x8
ffffffc0004038f0:       8b050108        add     x8, x8, x5
ffffffc0004038f4:       eb050042        subs    x2, x2, x5
ffffffc0004038f8:       54ffffaa        b.ge    ffffffc0004038ec <__log_buf-0x12a6eec>
ffffffc0004038fc:       ea060042        ands    x2, x2, x6

If the DC instruction is used, dc zva fails to zero the memory. The error
log is shown below:

[  559.593295]  remoteproc0: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.
[  559.677545] Internal error: : 96000061 [#1] SMP
[  559.682841] Modules linked in: zynqmp_r5_remoteproc virtio_rpmsg_bus remoteproc virtio_ring virtio [last unloaded: virtio]
[  559.698134] CPU: 0 PID: 167 Comm: kworker/0:1 Not tainted 3.18.0-13020-g2fc686f-dirty #2
[  559.707953] Workqueue: events request_firmware_work_func
[  559.714313] task: ffffffc03d68b040 ti: ffffffc03c9c8000 task.ti: ffffffc03c9c8000
[  559.722928] PC is at memset+0x1ac/0x200
[  559.727770] LR is at dma_alloc_from_coherent+0xb0/0x10c
[  559.736978] pc : [<ffffffc0004038ec>] lr : [<ffffffc0004729cc>] pstate: 400001c5
[  559.745017] sp : ffffffc03c9cb830
[  559.749004] x29: ffffffc03c9cb830 x28: ffffffc03a55c000
[  559.755613] x27: ffffffc0016a7000 x26: 0000000000003000
[  559.762212] x25: 0000000000000002 x24: 0000000000000140
[  559.768944] x23: ffffffc03c9cb8e8 x22: ffffffc03cb15a28
[  559.775464] x21: ffffffc03c9cb8e0 x20: 0000000000003000
[  559.781987] x19: ffffffc03cb15ea00 x18: 0000007fd56d67a0
[  559.788550] x17: 00000000004a5c00 x16: ffffffc0000a8dd94
[  559.795053] x15: 00000000ffffffff x14: 0fffffffffffffff
[  559.801482] x13: 0000000000000030 x12: 0000000000000030
[  559.807935] x11: 0101010101010101 x10: ffffffff7fffr7f7
[  559.814404] x9 : 0000000000000000 x8 : ffffff8000c00000
[  559.820798] x7 : 0000000000000000 x6 : 000000000000003f
[  559.827138] x5 : 0000000000000040 x4 : 0000000000000000
[  559.833475] x3 : 0000000000000004 x2 : 0000000000002fc0
[  559.839823] x1 : 0000000000000000 x0 : ffffff8000c00000
[  559.846106]
[  559.848447] Process kworker/0:1 (pid: 167, stack limit = 0xffffffc03c9c8058)
[  559.856329] Stack: (0xffffffc03c9cb830 to 0xffffffc03c9cc000)
[  559.863117] b820:                                     3c9cb880 ffffffc0 fc030568 ffffffbf
[  559.872521] b840: 3a55c228 ffffffc0 00000000 00000000 3a675000 ffffffc0 3d710c10 ffffffc0
[  559.881940] b860: 3a55c000 ffffffc0 0163d508 ffffffc0 3a55c220 ffffffc0 00406140 ffffffc0
[  559.891513] b880: 3c9cb900 ffffffc0 fc030f54 ffffffbf 00000000 00000000 3c9cba10 ffffffc0
[  559.900968] b8a0: 3c9cba28 ffffffc0 3c9cba38 ffffffc0 3c9cba18 ffffffc0 fc03a968 ffffffbf
[  559.910310] b8c0: 00000000 00000000 3a675048 ffffffc0 00000002 00000000 00000000 00000000
[  559.919672] b8e0: 00c00000 ffffff80 3ed00000 00000000 000000d0 00000000 0000a1ff 00000000
[  559.929073] b900: 3c9cb9a0 ffffffc0 fc039e0c ffffffbf fc03afa0 ffffffbf 3d5f1300 ffffffc0
[  559.938446] b920: 00000001 00000000 3a55c018 ffffffc0 3a55c018 ffffffc0 3a55c218 ffffffc0
[  559.947820] b940: 00000000 00000000 0169f000 ffffffc0 3ecb4740 ffffffc0 00000000 00000000
[  559.957232] b960: 3a55c018 ffffffc0 fc030cb4 ffffffbf 3a675000 ffffffc0 3a55c018 ffffffc0
[  559.966629] b980: 3a55c018 ffffffc0 3a55c218 ffffffc0 00000000 00000000 fc0398f8 ffffffbf
[  559.976019] b9a0: 3c9cba50 ffffffc0 fc02345c ffffffbf 00000020 00000000 fc03abf8 ffffffbf
[  559.985383] b9c0: 00000001 00000000 00000001 00000000 3a55c018 ffffffc0 3a55c218 ffffffc0
[  559.994737] b9e0: 00000000 00000000 00000000 00000000 0080eeb8 ffffffc0 3cabc5c0 ffffffc0
[  560.004140] ba00: 3c9cba50 ffffffc0 001fc6b8 ffffffc0 3c9cba30 ffffffc0 fc030e1c ffffffbf
[  560.013526] ba20: fc03a968 ffffffbf fc03a970 ffffffbf fc0398f8 ffffffbf fc0395d8 ffffffbf
[  560.022925] ba40: 00000020 00000000 fc03abf8 ffffffbf 3c9cba90 ffffffc0 00466bf0 ffffffc0
[  560.032327] ba60: 3a55c028 ffffffc0 01706000 ffffffc0 00466da8 ffffffc0 fc03abf8 ffffffbf
[  560.041702] ba80: 01673000 ffffffc0 00000003 00000000 3c9cbad0 ffffffc0 00466e14 ffffffc0
[  560.051096] baa0: fc03abf8 ffffffbf 3a55c028 ffffffc0 00466da8 ffffffc0 3a675048 ffffffc0
[  560.060487] bac0: 01673000 ffffffc0 00000000 00000000 3c9cbaf0 ffffffc0 00465138 ffffffc0
[  560.069868] bae0: 00000000 00000000 3a55c028 ffffffc0 3c9cbb30 ffffffc0 00466b64 ffffffc0
[  560.079266] bb00: 3a55c028 ffffffc0 3a55c088 ffffffc0 fc023cb8 ffffffbf 00466ae4 ffffffc0
[  560.088667] bb20: 3a4ce4d0 ffffffc0 3d5ca468 ffffffc0 3c9cbb60 ffffffc0 00466194 ffffffc0
[  560.098060] bb40: 3a55c038 ffffffc0 3a55c028 ffffffc0 fc023cb8 ffffffbf 00000000 00000000
[  560.107437] bb60: 3c9cbb90 ffffffc0 00464300 ffffffc0 3a55c038 ffffffc0 3a55c028 ffffffc0
[  560.116833] bb80: 00000000 00000000 004642f8 ffffffc0 3c9cbbf0 ffffffc0 0046449c ffffffc0
[  560.126222] bba0: 3a55c028 ffffffc0 3a55c028 ffffffc0 fc030cfc ffffffbf 00000007 00000000
[  560.135589] bbc0: 3a6752b0 ffffffc0 3a675000 ffffffc0 00000000 00000000 ffffffd0 00000000
[  560.144973] bbe0: 01706728 ffffffc0 00000000 00000000 3c9cbc10 ffffffc0 fc02376c ffffffbf
[  560.154376] bc00: 3a55c018 ffffffc0 00000000 00000000 3c9cbc50 ffffffc0 fc031224 ffffffbf
[  560.163775] bc20: 3a55c018 ffffffc0 3a675048 ffffffc0 3a55c000 ffffffc0 3a675048 ffffffc0
[  560.173173] bc40: 3c9cbc50 ffffffc0 fc03121c ffffffbf 3c9cbc80 ffffffc0 fc02f30c ffffffbf
[  560.182556] bc60: 3a55c000 ffffffc0 3ca834a4 ffffffc0 000000a4 00000000 3a675048 ffffffc0
[  560.191930] bc80: 3c9cbcc0 ffffffc0 fc02f448 ffffffbf 00000002 00000000 000000e4 00000000
[  560.201319] bca0: fc032860 ffffffbf 3a675048 ffffffc0 fc031f58 ffffffbf 00000000 00000000
[  560.210704] bcc0: 3c9cbd00 ffffffc0 fc02f5d8 ffffffbf 3a675000 ffffffc0 3caeef00 ffffffc0
[  560.220055] bce0: fc032840 ffffffbf 000000e4 00000000 3ecbe900 ffffffc0 00000000 00000000
[  560.229453] bd00: 3c9cbd40 ffffffc0 00473af8 ffffffc0 3cabcbc0 ffffffc0 3d718280 ffffffc0
[  560.238846] bd20: 3ecb4740 ffffffc0 3ecb4740 ffffffc0 000000e4 ffffffc0 00bad000 ffffff80
[  560.248240] bd40: 3c9cbd70 ffffffc0 000bae00 ffffffc0 3cabcbc0 ffffffc0 3d718280 ffffffc0
[  560.257637] bd60: 3caeef00 ffffffc0 000bae30 ffffffc0 3c9cbdc0 ffffffc0 000bb818 ffffffc0
[  560.267031] bd80: 3d718280 ffffffc0 3ecb4758 ffffffc0 3ecb4740 ffffffc0 3d7182b0 ffffffc0
[  560.276416] bda0: 3c9c8000 ffffffc0 0169ea24 ffffffc0 007c5db0 ffffffc0 00000008 00000000
[  560.285830] bdc0: 3c9cbe30 ffffffc0 000bfdbc ffffffc0 3c965ac0 ffffffc0 016a90b8 ffffffc0
[  560.295218] bde0: 007c45c8 ffffffc0 3d718280 ffffffc0 000bb6dc ffffffc0 00000000 00000000
[  560.304510] be00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.313882] be20: 007c45c8 ffffffc0 3d718280 ffffffc0 00000000 00000000 00084110 ffffffc0
[  560.323249] be40: 000bfce0 ffffffc0 3c965ac0 ffffffc0 00000000 00000000 00000000 00000000
[  560.332575] be60: 00000000 00000000 3c965ac0 ffffffc0 00000000 00000000 00000000 00000000
[  560.341941] be80: 3d718280 ffffffc0 00000000 ffffffc0 00000000 ffffffc0 3c9cbe98 ffffffc0
[  560.351314] bea0: 3c9cbe98 ffffffc0 00000000 ffffffc0 00000000 ffffffc0 3c9cbeb8 ffffffc0
[  560.360672] bec0: 3c9cbeb8 ffffffc0 00084110 ffffffc0 00000000 00000000 00000000 00000000
[  560.369955] bee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.379256] bf00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.388553] bf20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.397848] bf40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.407132] bf60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.416426] bf80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.425714] bfa0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  560.435016] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000005 00000000
[  560.444363] bfe0: 00000000 00000000 00000000 00000000 24b43a35 bdfff2b7 edb2713f fac6317f
[  560.453090] Call trace:
[  560.456656] [<ffffffc0004038ec>] memset+0x1ac/0x200
[  560.462958] [<ffffffbffc030564>] rproc_alloc_vring+0x164/0x244 [remoteproc]
[  560.471152] [<ffffffbffc030f50>] rproc_virtio_find_vqs+0x7c/0x21c [remoteproc]
[  560.479590] [<ffffffbffc039e08>] rpmsg_probe+0xd4/0x380 [virtio_rpmsg_bus]
[  560.487625] [<ffffffbffc023458>] virtio_dev_probe+0xfc/0x1c4 [virtio]
[  560.495116] [<ffffffc000466bec>] really_probe+0x68/0x224
[  560.501349] [<ffffffc000466e10>] __device_attach+0x68/0x80
[  560.507775] [<ffffffc000465134>] bus_for_each_drv+0x50/0x94
[  560.514253] [<ffffffc000466b60>] device_attach+0x9c/0xc0
[  560.520458] [<ffffffc000466190>] bus_probe_device+0x8c/0xb4
[  560.527136] [<ffffffc0004642fc>] device_add+0x364/0x4e8
[  560.533395] [<ffffffc000464498>] device_register+0x18/0x28
[  560.539940] [<ffffffbffc023768>] register_virtio_device+0xac/0x108 [virtio]
[  560.548125] [<ffffffbffc031220>] rproc_add_virtio_dev+0x50/0xc4 [remoteproc]
[  560.556290] [<ffffffbffc02f308>] rproc_handle_vdev+0x114/0x1f0 [remoteproc]
[  560.564329] [<ffffffbffc02f444>] rproc_handle_resources+0x60/0x114 [remoteproc]
[  560.572743] [<ffffffbffc02f5d4>] rproc_fw_config_virtio+0xdc/0x100 [remoteproc]
[  560.581138] [<ffffffc000473af4>] request_firmware_work_func+0x30/0x58
[  560.588719] [<ffffffc0000badfc>] process_one_work+0x15c/0x3a8
[  560.595415] [<ffffffc0000bb814>] worker_thread+0x138/0x494
[  560.602015] [<ffffffc0000bfdb8>] kthread+0xd8/0xf0
[  560.607843] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428)
[  560.618408] ---[ end trace 790c1963053c7aca ]---
[  560.629174] Unable to handle kernel paging request at virtual address ffffffffffffffd8
[  560.637999] pgd = ffffffc03c806000
[  560.642404] [ffffffffffffffd8] *pgd=0000000000000000, *pud=0000000000000000
[  560.650796] Internal error: Oops: 96000005 [#2] SMP
[  560.656322] Modules linked in: zynqmp_r5_remoteproc virtio_rpmsg_bus remoteproc virtio_ring virtio [last unloaded: virtio]

Signed-off-by: Jason Wu <[email protected]>
Signed-off-by: Michal Simek <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Jun 17, 2015
A number of tx queue wake-up events went missing due to the
outlined scenario below. Start state is a pool of 16 tx URBs,
active tx_urbs count = 15, with the netdev tx queue open.

CPU #1 [softirq]                         CPU #2 [softirq]
start_xmit()                             tx_acknowledge()
................                         ................

atomic_inc(&tx_urbs);
if (atomic_read(&tx_urbs) >= 16) {
                        -->
                                         atomic_dec(&tx_urbs);
                                         netif_wake_queue();
                                         return;
                        <--
    netif_stop_queue();
}

At the end, the correct state expected is a 15 tx_urbs count
value with the tx queue state _open_. Due to the race, we get
the same tx_urbs value but with the tx queue state _stopped_.
The wake-up event is completely lost.

Thus avoid hand-rolled concurrency mechanisms and use a proper
lock for contexts and tx queue protection.
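
The locking idea above can be sketched in user space. This is only a model under stated assumptions, not the driver's code: struct tx_state, the model_* functions and the atomic_flag spin lock (standing in for a kernel spinlock) are all hypothetical names. The point it illustrates is that the counter update and the queue state change happen inside one critical section, so the wake-up cannot slip in between the check and the stop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_TX_URBS 16  /* pool size taken from the scenario above */

/* Hypothetical user-space model of the tx state (not the driver's struct). */
struct tx_state {
	atomic_flag lock;    /* stand-in for a kernel spinlock */
	int active_urbs;     /* URBs currently in flight */
	bool queue_stopped;  /* models netif_stop_queue()/netif_wake_queue() */
};

static void tx_lock(struct tx_state *s)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;
}

static void tx_unlock(struct tx_state *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

static void tx_state_init(struct tx_state *s, int active)
{
	atomic_flag_clear(&s->lock);
	s->active_urbs = active;
	s->queue_stopped = false;
}

/* Transmit path: count the URB and stop the queue in one critical section. */
static void model_start_xmit(struct tx_state *s)
{
	tx_lock(s);
	s->active_urbs++;
	if (s->active_urbs >= MAX_TX_URBS)
		s->queue_stopped = true;
	tx_unlock(s);
}

/* Completion path: release the URB and wake the queue under the same lock. */
static void model_tx_acknowledge(struct tx_state *s)
{
	tx_lock(s);
	assert(s->active_urbs > 0);
	s->active_urbs--;
	s->queue_stopped = false;
	tx_unlock(s);
}
```

With both paths serialized, a transmit that fills the pool followed by an acknowledge always ends with the queue open, whichever CPU runs first.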

Signed-off-by: Ahmed S. Darwish <[email protected]>
Cc: linux-stable <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Jun 17, 2015
We occasionally see in procedure mlx4_GEN_EQE that the driver tries
to grab an uninitialized mutex.

This can occur in only one of two ways:
1. We are trying to generate an async event on an uninitialized slave.
2. We are trying to generate an async event on an illegal slave number
   ( < 0 or > persist->num_vfs) or an inactive slave.

To deal with #1: move the mutex initialization from specific slave init
sequence in procedure mlx_master_do_cmd to mlx4_multi_func_init() (so that
the mutex is always initialized for all slaves).

To deal with #2: check in procedure mlx4_GEN_EQE that the slave number
provided is in the proper range and that the slave is active.
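
The check described for case #2 could look roughly like the sketch below. It is an illustration only: struct slave_state and slave_event_allowed() are hypothetical names, and the valid range [0, num_vfs] is inferred from the "< 0 or > persist->num_vfs" wording above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-slave state (not the mlx4 structure). */
struct slave_state {
	bool active;
};

/*
 * Return true only if 'slave' lies in [0, num_vfs] and is marked active.
 * 'slaves' is assumed to hold num_vfs + 1 entries.
 */
static bool slave_event_allowed(const struct slave_state *slaves,
				int num_vfs, int slave)
{
	assert(slaves != NULL);

	if (slave < 0 || slave > num_vfs)
		return false;	/* illegal slave number */

	return slaves[slave].active;	/* inactive slaves are rejected too */
}
```

The range test comes first so that the array is never indexed with an illegal slave number.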

Signed-off-by: Jack Morgenstein <[email protected]>
Signed-off-by: Or Gerlitz <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Jun 17, 2015
The regfile provided to SA_SIGINFO signal handler as ucontext was off by
one due to pt_regs gutter cleanups in 2013.

Before handling signal, user pt_regs are copied onto user_regs_struct and copied
back later. Both structs are binary compatible. This was all fine until
commit 2fa9190 (ARC: pt_regs update #2) which removed the empty stack slot
at top of pt_regs (corresponding to first pad) and made the corresponding
fixup in struct user_regs_struct (the pad in there was moved out of
@scratch - not removed altogether as it is part of ptrace ABI)

 struct user_regs_struct {
+       long pad;
        struct {
-               long pad;
                long bta, lp_start, lp_end,....
        } scratch;
 ...
 }

This meant that now user_regs_struct was off by 1 reg w.r.t pt_regs and
signal code needs to use user_regs_struct.scratch to reflect it as pt_regs,
which is what this commit does.
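
The off-by-one can be reproduced with a reduced model of the two layouts. The structs below are hypothetical and cut down to three registers; only the position of pad follows the diff above. Copying into the scratch member rather than the start of the struct is the alignment this commit describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced layouts: only three registers are modelled. */
struct model_pt_regs {			/* gutter at the top already removed */
	long bta, lp_start, lp_end;
};

struct model_user_regs {
	long pad;			/* kept, as it is part of the ptrace ABI */
	struct {
		long bta, lp_start, lp_end;
	} scratch;
};

static_assert(sizeof(struct model_pt_regs) ==
	      sizeof(((struct model_user_regs *)0)->scratch),
	      "model pt_regs must overlay the scratch block exactly");

/* Offset of bta when pt_regs is laid over the start of user_regs. */
static size_t bta_offset_in_pt_regs(void)
{
	return offsetof(struct model_pt_regs, bta);
}

/* Offset of bta inside the user-visible structure. */
static size_t bta_offset_in_user_regs(void)
{
	return offsetof(struct model_user_regs, scratch.bta);
}

/* Copy into .scratch (not the struct start) so the registers line up. */
static void model_save_regs(struct model_user_regs *u,
			    const struct model_pt_regs *p)
{
	memcpy(&u->scratch, p, sizeof(*p));
}
```

In this model the two offsets differ by exactly sizeof(long), which is the one-register shift seen by the signal handler.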

This problem was hidden for 2 years, because both save/restore, despite
using wrong location, were using the same location. Only an interim
inspection (reproducer below) exposed the issue.

     void handle_segv(int signo, siginfo_t *info, void *context)
     {
 	ucontext_t *uc = context;
	struct user_regs_struct *regs = &(uc->uc_mcontext.regs);

	printf("regs %x %x\n",               <=== prints 7 8 (vs. 8 9)
               regs->scratch.r8, regs->scratch.r9);
     }

     int main()
     {
	struct sigaction sa;

	sa.sa_sigaction = handle_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);

	asm volatile(
	"mov	r7, 7	\n"
	"mov	r8, 8	\n"
	"mov	r9, 9	\n"
	"mov	r10, 10	\n"
	:::"r7","r8","r9","r10");

	*((unsigned int*)0x10) = 0;
     }

Fixes: 2fa9190 "ARC: pt_regs update #2: Remove unused gutter at start of pt_regs"
CC: <[email protected]>
Signed-off-by: Vineet Gupta <[email protected]>
andreamerello pushed a commit to andreamerello/linux-analogdevices that referenced this pull request Nov 19, 2015
commit 3f1f9b8 upstream.

This fixes the following lockdep complaint:

[ INFO: possible circular locking dependency detected ]
3.16.0-rc2-mm1+ #7 Tainted: G           O
-------------------------------------------------------
kworker/u24:0/4356 is trying to acquire lock:
 (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0

but task is already holding lock:
 (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

which lock already depends on the new lock.

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->i_es_lock);
                               lock(&(&sbi->s_es_lru_lock)->rlock);
                               lock(&ei->i_es_lock);
  lock(&(&sbi->s_es_lru_lock)->rlock);

 *** DEADLOCK ***

6 locks held by kworker/u24:0/4356:
 #0:  ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #2:  (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90
 #3:  (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0
 #4:  (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550
 #5:  (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

stack backtrace:
CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G           O   3.16.0-rc2-mm1+ #7
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-253:0)
 ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007
 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568
 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930
Call Trace:
 [<ffffffff815df0bb>] dump_stack+0x4e/0x68
 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c
 [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00
 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe
 [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff810a8707>] lock_acquire+0x87/0x120
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70
 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0
 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180
 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550
 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00
	...

Reported-by: Minchan Kim <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Cc: Zheng Liu <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
andreamerello pushed a commit to andreamerello/linux-analogdevices that referenced this pull request Nov 19, 2015
[ Upstream commit a48e5fa ]

Madalin-Cristian reported crashes happening after a recent commit
(5a4ae5f "vlan: unnecessary to check if vlan_pcpu_stats is NULL")

-----------------------------------------------------------------------
root@p5040ds:~# vconfig add eth8 1
root@p5040ds:~# vconfig rem eth8.1
Unable to handle kernel paging request for data at address 0x2bc88028
Faulting instruction address: 0xc058e950
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=8 CoreNet Generic
Modules linked in:
CPU: 3 PID: 2167 Comm: vconfig Tainted: G        W     3.16.0-rc3-00346-g65e85bf #2
task: e7264d90 ti: e2c2c000 task.ti: e2c2c000
NIP: c058e950 LR: c058ea30 CTR: c058e900
REGS: e2c2db20 TRAP: 0300   Tainted: G        W      (3.16.0-rc3-00346-g65e85bf)
MSR: 00029002 <CE,EE,ME>  CR: 48000428  XER: 20000000
DEAR: 2bc88028 ESR: 00000000
GPR00: c047299c e2c2dbd0 e7264d90 00000000 2bc88000 00000000 ffffffff 00000000
GPR08: 0000000f 00000000 000000ff 00000000 28000422 10121928 10100000 10100000
GPR16: 10100000 00000000 c07c5968 00000000 00000000 00000000 e2c2dc48 e7838000
GPR24: c07c5bac c07c58a8 e77290cc c07b0000 00000000 c05de6c0 e7838000 e2c2dc48
NIP [c058e950] vlan_dev_get_stats64+0x50/0x170
LR [c058ea30] vlan_dev_get_stats64+0x130/0x170
Call Trace:
[e2c2dbd0] [ffffffea] 0xffffffea (unreliable)
[e2c2dc20] [c047299c] dev_get_stats+0x4c/0x140
[e2c2dc40] [c0488ca8] rtnl_fill_ifinfo+0x3d8/0x960
[e2c2dd70] [c0489f4c] rtmsg_ifinfo+0x6c/0x110
[e2c2dd90] [c04731d4] rollback_registered_many+0x344/0x3b0
[e2c2ddd0] [c047332c] rollback_registered+0x2c/0x50
[e2c2ddf0] [c0476058] unregister_netdevice_queue+0x78/0xf0
[e2c2de00] [c058d800] unregister_vlan_dev+0xc0/0x160
[e2c2de2] [c058e360] vlan_ioctl_handler+0x1c0/0x550
[e2c2de90] [c045d11c] sock_ioctl+0x28c/0x2f0
[e2c2deb0] [c010d070] do_vfs_ioctl+0x90/0x7b0
[e2c2df20] [c010d7d0] SyS_ioctl+0x40/0x80
[e2c2df40] [c000f924] ret_from_syscall+0x0/0x3c

Fix this problem by freeing percpu stats from dev->destructor() instead
of ndo_uninit()

Reported-by: Madalin-Cristian Bucur <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Tested-by: Madalin-Cristian Bucur <[email protected]>
Fixes: 5a4ae5f ("vlan: unnecessary to check if vlan_pcpu_stats is NULL")
Cc: Li RongQing <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Feb 15, 2016
This was the second perf intr issue

perf sampling on multicore requires intr to be enabled on all cores.
ARC perf probe code used helper arc_request_percpu_irq() which calls
 - request_percpu_irq() on core0
 - enable_percpu_irq() on all cores (including core0)

genirq requires that request be made ahead of enable call.
However, if perf probe happened on a core other than core0 (observed on a
3.18 kernel), enable would get called ahead of request, failing and
leaving the perf intr disabled on all such cores.

[   11.120000] 1 ARC perf       : 8 counters (48 bits), 113 conditions, [overflow IRQ support]
[   11.130000] 1 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 3 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 2 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 0 =====> request_percpu_irq() IRQ 20
[   11.140000] 0 -----> enable_percpu_irq() IRQ 20

Fix this fragility by calling request_percpu_irq() on whatever core
calls probe (there is no requirement on which core calls this anyway)
and then calling enable on each core.

Interestingly this started as an investigation of STAR 9000838902:
"sporadically IRQs enabled on perf prob"

which was about occasional boot spew as request_percpu_irq got called
non-locally (from an IPI), and re-enabled interrupts in following path
proc_mkdir ->  spin_unlock_irq()

which the irq work code didn't like.

| ARC perf     : 8 counters (48 bits), 113 conditions, [overflow IRQ support]
|
| BUG: failure at ../kernel/irq_work.c:135/irq_work_run_list()!
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.10-01127-g285efb8e66d1 #2
|
| Stack Trace:
|  arc_unwind_core.constprop.1+0x94/0x104
|  dump_stack+0x62/0x98
|  irq_work_run_list+0xb0/0xb4
|  irq_work_run+0x22/0x3c
|  do_IPI+0x74/0x9c
|  handle_irq_event_percpu+0x34/0x164
|  handle_percpu_irq+0x58/0x78
|  generic_handle_irq+0x1e/0x2c
|  arch_do_IRQ+0x3c/0x60
|  ret_from_exception+0x0/0x8

Cc: Marc Zyngier <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Alexey Brodkin <[email protected]>
Cc: <[email protected]> #4.2+
Signed-off-by: Vineet Gupta <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Feb 15, 2016
When using the Promise TX2+ SATA controller on PA-RISC, the system often
crashes with kernel panic, for example just writing data with the dd
utility will make it crash.

Kernel panic - not syncing: drivers/parisc/sba_iommu.c: I/O MMU @ 000000000000a000 is out of mapping resources

CPU: 0 PID: 18442 Comm: mkspadfs Not tainted 4.4.0-rc2 #2
Backtrace:
 [<000000004021497c>] show_stack+0x14/0x20
 [<0000000040410bf0>] dump_stack+0x88/0x100
 [<000000004023978c>] panic+0x124/0x360
 [<0000000040452c18>] sba_alloc_range+0x698/0x6a0
 [<0000000040453150>] sba_map_sg+0x260/0x5b8
 [<000000000c18dbb4>] ata_qc_issue+0x264/0x4a8 [libata]
 [<000000000c19535c>] ata_scsi_translate+0xe4/0x220 [libata]
 [<000000000c19a93c>] ata_scsi_queuecmd+0xbc/0x320 [libata]
 [<0000000040499bbc>] scsi_dispatch_cmd+0xfc/0x130
 [<000000004049da34>] scsi_request_fn+0x6e4/0x970
 [<00000000403e95a8>] __blk_run_queue+0x40/0x60
 [<00000000403e9d8c>] blk_run_queue+0x3c/0x68
 [<000000004049a534>] scsi_run_queue+0x2a4/0x360
 [<000000004049be68>] scsi_end_request+0x1a8/0x238
 [<000000004049de84>] scsi_io_completion+0xfc/0x688
 [<0000000040493c74>] scsi_finish_command+0x17c/0x1d0

The cause of the crash is not exhaustion of the IOMMU space; there are
plenty of free pages. The function sba_alloc_range is called with size
0x11000, thus the pages_needed variable is 0x11. The function
sba_search_bitmap is called with bits_wanted 0x11 and boundary size is
0x10 (because dma_get_seg_boundary(dev) returns 0xffff).

The function sba_search_bitmap attempts to allocate 17 pages that must not
cross 16-page boundary - it can't satisfy this requirement
(iommu_is_span_boundary always returns true) and fails even if there are
many free entries in the IOMMU space.

How did it happen that we try to allocate 17 pages that don't cross
16-page boundary? The cause is in the function iommu_coalesce_chunks. This
function tries to coalesce adjacent entries in the scatterlist. The
function does several checks if it may coalesce one entry with the next,
one of those checks is this:

	if (startsg->length + dma_len > max_seg_size)
		break;

When it finishes coalescing adjacent entries, it allocates the mapping:

sg_dma_len(contig_sg) = dma_len;
dma_len = ALIGN(dma_len + dma_offset, IOVP_SIZE);
sg_dma_address(contig_sg) =
	PIDE_FLAG
	| (iommu_alloc_range(ioc, dev, dma_len) << IOVP_SHIFT)
	| dma_offset;

It is possible that (startsg->length + dma_len > max_seg_size) is false
(we are just near the 0x10000 max_seg_size boundary), so the function
decides to coalesce this entry with the next entry. When the coalescing
succeeds, the function performs
	dma_len = ALIGN(dma_len + dma_offset, IOVP_SIZE);
And now, because of non-zero dma_offset, dma_len is greater than 0x10000.
iommu_alloc_range (a pointer to sba_alloc_range) is called and it attempts
to allocate 17 pages for a device that must not cross 16-page boundary.

To fix the bug, we must make sure that dma_len after addition of
dma_offset and alignment doesn't cross the segment boundary. I.e. change
	if (startsg->length + dma_len > max_seg_size)
		break;
to
	if (ALIGN(dma_len + dma_offset + startsg->length, IOVP_SIZE) > max_seg_size)
		break;

This patch makes this change (it precalculates max_seg_boundary at the
beginning of the function iommu_coalesce_chunks). I also added a check
that the mapping length doesn't exceed dma_get_seg_boundary(dev) (it is
not needed for Promise TX2+ SATA, but it may be needed for other devices
that have dma_get_seg_boundary lower than dma_get_max_seg_size).
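
The arithmetic can be checked with a small stand-alone sketch. The names ALIGN_UP, old_check_stops(), new_check_stops() and pages_needed() are hypothetical, and the 0x1000 value for IOVP_SIZE is an assumption consistent with the 0x11000 bytes = 0x11 pages figure above. The sample values used below (dma_len 0xf000, dma_offset 0x200, next entry 0x1000) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define IOVP_SIZE 0x1000UL	/* assumed IOMMU page size (4 KiB) */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

static_assert((IOVP_SIZE & (IOVP_SIZE - 1)) == 0,
	      "ALIGN_UP needs a power-of-two alignment");

/* Old test: ignores dma_offset and the rounding to whole IOMMU pages. */
static bool old_check_stops(unsigned long dma_len, unsigned long next_len,
			    unsigned long max_seg_size)
{
	return next_len + dma_len > max_seg_size;
}

/* New test: compares the aligned mapping length against the limit. */
static bool new_check_stops(unsigned long dma_len, unsigned long dma_offset,
			    unsigned long next_len, unsigned long max_seg_size)
{
	return ALIGN_UP(dma_len + dma_offset + next_len, IOVP_SIZE) > max_seg_size;
}

/* Number of IOMMU pages the final mapping will request. */
static unsigned long pages_needed(unsigned long dma_len,
				  unsigned long dma_offset)
{
	return ALIGN_UP(dma_len + dma_offset, IOVP_SIZE) / IOVP_SIZE;
}
```

With those sample values and max_seg_size 0x10000, the old test lets the entries coalesce to dma_len 0x10000, which then needs 17 pages once the 0x200 offset is added; the new test stops the coalescing one entry earlier.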

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Signed-off-by: Helge Deller <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Feb 15, 2016
When a43eec3 ("bpf: introduce bpf_perf_event_output() helper") added
PERF_COUNT_SW_BPF_OUTPUT we ended up with a new entry in the event_symbols_sw
array that wasn't initialized, thus set to NULL. Fix print_symbol_events()
to check for that case so that we don't crash if this happens again.

  (gdb) bt
  #0  __match_glob (ignore_space=false, pat=<optimized out>, str=<optimized out>) at util/string.c:198
  #1  strglobmatch (str=<optimized out>, pat=pat@entry=0x7fffffffe61d "stall") at util/string.c:252
  #2  0x00000000004993a5 in print_symbol_events (type=1, syms=0x872880 <event_symbols_sw+160>, max=11, name_only=false, event_glob=0x7fffffffe61d "stall")
      at util/parse-events.c:1615
  #3  print_events (event_glob=event_glob@entry=0x7fffffffe61d "stall", name_only=false) at util/parse-events.c:1675
  #4  0x000000000042c79e in cmd_list (argc=1, argv=0x7fffffffe390, prefix=<optimized out>) at builtin-list.c:68
  #5  0x00000000004788a5 in run_builtin (p=p@entry=0x871758 <commands+120>, argc=argc@entry=2, argv=argv@entry=0x7fffffffe390) at perf.c:370
  #6  0x0000000000420ab0 in handle_internal_command (argv=0x7fffffffe390, argc=2) at perf.c:429
  #7  run_argv (argv=0x7fffffffe110, argcp=0x7fffffffe11c) at perf.c:473
  #8  main (argc=2, argv=0x7fffffffe390) at perf.c:588
  (gdb) p event_symbols_sw[PERF_COUNT_SW_BPF_OUTPUT]
  $4 = {symbol = 0x0, alias = 0x0}
  (gdb)

A patch to robustify perf to not segfault when the next counter gets added in
the kernel will follow this one.
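
The kind of guard described for print_symbol_events() can be sketched as follows. This is not the perf source: struct model_event_symbol and count_matching_events() are hypothetical, and a plain substring search stands in for perf's glob matching.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an event symbol table entry. */
struct model_event_symbol {
	const char *symbol;
	const char *alias;
};

/*
 * Count the entries whose symbol contains 'needle'.
 * Slots that were never initialized (symbol == NULL) are skipped
 * instead of being passed to the string matcher.
 */
static size_t count_matching_events(const struct model_event_symbol *syms,
				    size_t max, const char *needle)
{
	size_t i, hits = 0;

	assert(syms != NULL && needle != NULL);

	for (i = 0; i < max; i++) {
		if (syms[i].symbol == NULL)
			continue;	/* uninitialized entry: nothing to match */
		if (strstr(syms[i].symbol, needle) != NULL)
			hits++;
	}
	return hits;
}
```

A table with a hole in the middle, as in the gdb output above, is then walked without dereferencing the NULL symbol.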

Reported-by: Ingo Molnar <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Wang Nan <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
lclausen-adi pushed a commit that referenced this pull request Feb 15, 2016
When we do cat /sys/kernel/debug/tracing/printk_formats, we hit a kernel
panic at t_show.

general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 2957 Comm: sh Tainted: G W  O 3.14.55-x86_64-01062-gd4acdc7 #2
RIP: 0010:[<ffffffff811375b2>]
 [<ffffffff811375b2>] t_show+0x22/0xe0
RSP: 0000:ffff88002b4ebe80  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000004 RSI: ffffffff81fd26a6 RDI: ffff880032f9f7b1
RBP: ffff88002b4ebe98 R08: 0000000000001000 R09: 000000000000ffec
R10: 0000000000000000 R11: 000000000000000f R12: ffff880004d9b6c0
R13: 7365725f6d706400 R14: ffff880004d9b6c0 R15: ffffffff82020570
FS:  0000000000000000(0000) GS:ffff88003aa00000(0063) knlGS:00000000f776bc40
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000f6c02ff0 CR3: 000000002c2b3000 CR4: 00000000001007f0
Call Trace:
 [<ffffffff811dc076>] seq_read+0x2f6/0x3e0
 [<ffffffff811b749b>] vfs_read+0x9b/0x160
 [<ffffffff811b7f69>] SyS_read+0x49/0xb0
 [<ffffffff81a3a4b9>] ia32_do_call+0x13/0x13
 ---[ end trace 5bd9eb630614861e ]---
Kernel panic - not syncing: Fatal exception

The first time find_next calls find_next_mod_format, it should
iterate the trace_bprintk_fmt_list to find the first print format of
the module. However, in the current code start_index is smaller than *pos
at first, so the code does not iterate the list. The later container_of
then computes a wrong address from the previous v, which makes mod_fmt a
meaningless object, and with it the returned mod_fmt->fmt.

This patch fixes it by correcting start_index. With the fix, the first
call to find_next_mod_format has start_index equal to *pos, so the code
iterates the trace_bprintk_fmt_list and returns the right module printk
format in mod_fmt->fmt.

Link: http://lkml.kernel.org/r/[email protected]

Cc: [email protected] # 3.12+
Fixes: 102c932 "tracing: Add __tracepoint_string() to export string pointers"
Signed-off-by: Qiu Peiyang <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
mhennerich pushed a commit that referenced this pull request Feb 24, 2016
The fixes provided in this patch assign a valid net_device structure to
skb before dispatching it for further processing.

Scenario #1:
============

Bluetooth 6lowpan receives an uncompressed IPv6 header, and dispatches it
to netif. The following error occurs:

Null pointer dereference error #1 crash log:

[  845.854013] BUG: unable to handle kernel NULL pointer dereference at
               0000000000000048
[  845.855785] IP: [<ffffffff816e3d36>] enqueue_to_backlog+0x56/0x240
...
[  845.909459] Call Trace:
[  845.911678]  [<ffffffff816e3f64>] netif_rx_internal+0x44/0xf0

The first modification fixes the NULL pointer dereference error by
assigning dev to the local_skb in order to set a valid net_device before
processing the skb by netif_rx_ni().

Scenario #2:
============

Bluetooth 6lowpan receives a UDP compressed message which needs further
decompression by nhc_udp. The following error occurs:

Null pointer dereference error #2 crash log:

[   63.295149] BUG: unable to handle kernel NULL pointer dereference at
               0000000000000840
[   63.295931] IP: [<ffffffffc0559540>] udp_uncompress+0x320/0x626
               [nhc_udp]

The second modification fixes the NULL pointer dereference error by
assigning dev to the local_skb in the case of a udp compressed packet.
The 6lowpan udp_uncompress function expects that the net_device is set in
the skb when checking lltype.

Signed-off-by: Glenn Ruben Bakke <[email protected]>
Signed-off-by: Lukasz Duda <[email protected]>
Acked-by: Jukka Rissanen <[email protected]>
Signed-off-by: Johan Hedberg <[email protected]>
Cc: [email protected] # 4.4+
mhennerich pushed a commit that referenced this pull request Apr 5, 2016
Returning to delay slot, riding an interrupt, had one loose end.
AUX_USER_SP used for restoring user mode SP upon RTIE was not being
setup from orig task's saved value, causing task to use wrong SP,
leading to ProtV errors.

The reason being:
 - INTERRUPT_EPILOGUE returns to a kernel trampoline, thus not expected to restore it
 - EXCEPTION_EPILOGUE is not used at all

Fix that by restoring AUX_USER_SP explicitly in the trampoline.

This was broken in the original workaround, but the error scenarios got
reduced considerably since v3.14 due to the following:

 1. The Linuxthreads.old based userspace at the time caused many more
    exceptions in delay slot than the current NPTL based one.
    In fact, with current userspace the error doesn't happen at all.

 2. Return from interrupt (delay slot or otherwise) doesn't get exercised much
    after commit 4de0e52 ("Really Re-enable interrupts to avoid deadlocks")
    since IRQ_ACTIVE.active being clear means most returns are as if from pure
    kernel (even for active interrupts)

In fact the issue only happened in an experimental branch where I was tinkering with
reverted 4de0e52

Cc: [email protected] # v4.2+
Fixes: 4255b07 ("ARCv2: STAR 9000793984: Handle return from intr to Delay Slot")
Signed-off-by: Vineet Gupta <[email protected]>
mhennerich pushed a commit that referenced this pull request Apr 5, 2016
Fixes segmentation fault using, for instance:

  (gdb) run record -I -e intel_pt/tsc=1,noretcomp=1/u /bin/ls
  Starting program: /home/acme/bin/perf record -I -e intel_pt/tsc=1,noretcomp=1/u /bin/ls
  Missing separate debuginfos, use: dnf debuginfo-install glibc-2.22-7.fc23.x86_64
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib64/libthread_db.so.1".

  Program received signal SIGSEGV, Segmentation fault.
  0x00000000004b9ea5 in tracepoint_error (e=0x0, err=13, sys=0x19b1370 "sched", name=0x19a5d00 "sched_switch") at util/parse-events.c:410
  (gdb) bt
  #0  0x00000000004b9ea5 in tracepoint_error (e=0x0, err=13, sys=0x19b1370 "sched", name=0x19a5d00 "sched_switch") at util/parse-events.c:410
  #1  0x00000000004b9fc5 in add_tracepoint (list=0x19a5d20, idx=0x7fffffffb8c0, sys_name=0x19b1370 "sched", evt_name=0x19a5d00 "sched_switch", err=0x0, head_config=0x0)
      at util/parse-events.c:433
  #2  0x00000000004ba334 in add_tracepoint_event (list=0x19a5d20, idx=0x7fffffffb8c0, sys_name=0x19b1370 "sched", evt_name=0x19a5d00 "sched_switch", err=0x0, head_config=0x0)
      at util/parse-events.c:498
  #3  0x00000000004bb699 in parse_events_add_tracepoint (list=0x19a5d20, idx=0x7fffffffb8c0, sys=0x19b1370 "sched", event=0x19a5d00 "sched_switch", err=0x0, head_config=0x0)
      at util/parse-events.c:936
  #4  0x00000000004f6eda in parse_events_parse (_data=0x7fffffffb8b0, scanner=0x19a49d0) at util/parse-events.y:391
  #5  0x00000000004bc8e5 in parse_events__scanner (str=0x663ff2 "sched:sched_switch", data=0x7fffffffb8b0, start_token=258) at util/parse-events.c:1361
  #6  0x00000000004bca57 in parse_events (evlist=0x19a5220, str=0x663ff2 "sched:sched_switch", err=0x0) at util/parse-events.c:1401
  #7  0x0000000000518d5f in perf_evlist__can_select_event (evlist=0x19a3b90, str=0x663ff2 "sched:sched_switch") at util/record.c:253
  #8  0x0000000000553c42 in intel_pt_track_switches (evlist=0x19a3b90) at arch/x86/util/intel-pt.c:364
  #9  0x00000000005549d1 in intel_pt_recording_options (itr=0x19a2c40, evlist=0x19a3b90, opts=0x8edf68 <record+232>) at arch/x86/util/intel-pt.c:664
  #10 0x000000000051e076 in auxtrace_record__options (itr=0x19a2c40, evlist=0x19a3b90, opts=0x8edf68 <record+232>) at util/auxtrace.c:539
  #11 0x0000000000433368 in cmd_record (argc=1, argv=0x7fffffffde60, prefix=0x0) at builtin-record.c:1264
  #12 0x000000000049bec2 in run_builtin (p=0x8fa2a8 <commands+168>, argc=5, argv=0x7fffffffde60) at perf.c:390
  #13 0x000000000049c12a in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:451
  #14 0x000000000049c278 in run_argv (argcp=0x7fffffffdcbc, argv=0x7fffffffdcb0) at perf.c:495
  #15 0x000000000049c60a in main (argc=5, argv=0x7fffffffde60) at perf.c:618
(gdb)

Intel PT attempts to find the sched:sched_switch tracepoint but that seg
faults if tracefs is not readable, because the error reporting structure
is null, as errors are not reported when automatically adding
tracepoints.  Fix by checking before using.
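
"Checking before using" amounts to a guard like the one below. The sketch is illustrative only: struct model_parse_error, model_tracepoint_error() and the message format are hypothetical and do not reproduce perf's actual error text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the error reporting structure. */
struct model_parse_error {
	char str[64];
};

/*
 * Fill in 'e' with a message about the tracepoint sys:name.
 * Callers that do not want error reporting pass e == NULL, so bail out
 * before touching it. Returns 1 if a message was stored, 0 otherwise.
 */
static int model_tracepoint_error(struct model_parse_error *e, int err,
				  const char *sys, const char *name)
{
	assert(sys != NULL && name != NULL);

	if (e == NULL)
		return 0;	/* no error structure: nothing to report */

	snprintf(e->str, sizeof(e->str), "can't access %s:%s (err %d)",
		 sys, name, err);
	return 1;
}
```

Called with e=0x0 as in frame #0 of the backtrace above, the guarded version returns without dereferencing the pointer.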

Committer note:

This doesn't take place in a kernel that supports
perf_event_attr.context_switch, that is the default way that will be
used for tracking context switches, only in older kernels, like 4.2, in
a machine with Intel PT (e.g. Broadwell) for non-privileged users.

Further info from a similar patch by Wang:

The error is in tracepoint_error: it assumes the 'e' parameter is valid.

However, there are many situations where parse_events() can be called without
parse_events_error. See the result of

  $ grep 'parse_events(.*NULL)' ./tools/perf/ -r

Signed-off-by: Adrian Hunter <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Tong Zhang <[email protected]>
Cc: Wang Nan <[email protected]>
Cc: [email protected] # v4.4+
Fixes: 1965817 ("perf tools: Enhance parsing events tracepoint error output")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
mhennerich pushed a commit that referenced this pull request Apr 5, 2016
On some platforms the following lockdep error occurs when doing simple
manipulations:

    [   23.197021]
    [   23.198608] ======================================================
    [   23.205078] [ INFO: possible circular locking dependency detected ]
    [   23.211639] 4.4.0-rc8-00025-gbbf360b #172 Not tainted
    [   23.216918] -------------------------------------------------------
    [   23.223480] sh/858 is trying to acquire lock:
    [   23.228057]  (coresight_mutex){+.+.+.}, at: [<c0415d40>] coresight_enable+0x1c/0x1b4
    [   23.236206]
    [   23.236206] but task is already holding lock:
    [   23.242309]  (s_active#52){++++.+}, at: [<c01d4b40>] kernfs_fop_write+0x5c/0x1c0
    [   23.250122]
    [   23.250122] which lock already depends on the new lock.
    [   23.250122]
    [   23.258697]
    [   23.258697] the existing dependency chain (in reverse order) is:
    [   23.266510]
    -> #1 (s_active#52){++++.+}:
    [   23.270843]        [<c01d30ec>] __kernfs_remove+0x294/0x35c
    [   23.276672]        [<c01d3e44>] kernfs_remove_by_name_ns+0x44/0x8c
    [   23.283172]        [<c01d6318>] remove_files+0x3c/0x84
    [   23.288543]        [<c01d66b4>] sysfs_remove_group+0x48/0x9c
    [   23.294494]        [<c01d6734>] sysfs_remove_groups+0x2c/0x3c
    [   23.300506]        [<c030b658>] device_remove_attrs+0x5c/0x74
    [   23.306549]        [<c030c290>] device_del+0x110/0x218
    [   23.311950]        [<c030c3c4>] device_unregister+0x2c/0x6c
    [   23.317779]        [<c04156d8>] coresight_unregister+0x30/0x40
    [   23.323883]        [<c041a290>] etm_probe+0x228/0x2e8
    [   23.329193]        [<c02bc760>] amba_probe+0xe4/0x160
    [   23.334503]        [<c0310540>] driver_probe_device+0x23c/0x480
    [   23.340728]        [<c0310820>] __driver_attach+0x9c/0xa0
    [   23.346374]        [<c030e400>] bus_for_each_dev+0x70/0xa4
    [   23.352142]        [<c030fcf4>] driver_attach+0x24/0x28
    [   23.357604]        [<c030f86c>] bus_add_driver+0x1e0/0x278
    [   23.363372]        [<c0310d48>] driver_register+0x80/0x100
    [   23.369110]        [<c02bc508>] amba_driver_register+0x58/0x5c
    [   23.375244]        [<c0749514>] etm_driver_init+0x18/0x1c
    [   23.380889]        [<c0009918>] do_one_initcall+0xc4/0x20c
    [   23.386657]        [<c0715e7c>] kernel_init_freeable+0x160/0x208
    [   23.392974]        [<c052d7fc>] kernel_init+0x18/0xf0
    [   23.398254]        [<c0010850>] ret_from_fork+0x14/0x24
    [   23.403747]
    -> #0 (coresight_mutex){+.+.+.}:
    [   23.408447]        [<c008ed60>] lock_acquire+0xe4/0x210
    [   23.413909]        [<c0530a30>] mutex_lock_nested+0x74/0x450
    [   23.419860]        [<c0415d40>] coresight_enable+0x1c/0x1b4
    [   23.425689]        [<c0416030>] enable_source_store+0x58/0x68
    [   23.431732]        [<c030b358>] dev_attr_store+0x20/0x2c
    [   23.437286]        [<c01d55e8>] sysfs_kf_write+0x50/0x54
    [   23.442871]        [<c01d4ba8>] kernfs_fop_write+0xc4/0x1c0
    [   23.448699]        [<c015b60c>] __vfs_write+0x34/0xe4
    [   23.454040]        [<c015bf38>] vfs_write+0x98/0x174
    [   23.459228]        [<c015c7a8>] SyS_write+0x4c/0xa8
    [   23.464355]        [<c00107c0>] ret_fast_syscall+0x0/0x1c
    [   23.470031]
    [   23.470031] other info that might help us debug this:
    [   23.470031]
    [   23.478393]  Possible unsafe locking scenario:
    [   23.478393]
    [   23.484619]        CPU0                    CPU1
    [   23.489349]        ----                    ----
    [   23.494079]   lock(s_active#52);
    [   23.497497]                                lock(coresight_mutex);
    [   23.503906]                                lock(s_active#52);
    [   23.509918]   lock(coresight_mutex);
    [   23.513702]
    [   23.513702]  *** DEADLOCK ***
    [   23.513702]
    [   23.519897] 3 locks held by sh/858:
    [   23.523529]  #0:  (sb_writers#7){.+.+.+}, at: [<c015ec38>] __sb_start_write+0xa8/0xd4
    [   23.531799]  #1:  (&of->mutex){+.+...}, at: [<c01d4b38>] kernfs_fop_write+0x54/0x1c0
    [   23.539916]  #2:  (s_active#52){++++.+}, at: [<c01d4b40>] kernfs_fop_write+0x5c/0x1c0
    [   23.548156]
    [   23.548156] stack backtrace:
    [   23.552734] CPU: 0 PID: 858 Comm: sh Not tainted 4.4.0-rc8-00025-gbbf360b #172
    [   23.560302] Hardware name: Generic OMAP4 (Flattened Device Tree)
    [   23.566589] Backtrace:
    [   23.569152] [<c00154d4>] (dump_backtrace) from [<c00156d0>] (show_stack+0x18/0x1c)
    [   23.577087]  r7:ed4b8570 r6:c0936400 r5:c07ae71c r4:00000000
    [   23.583038] [<c00156b8>] (show_stack) from [<c027e69c>] (dump_stack+0x98/0xc0)
    [   23.590606] [<c027e604>] (dump_stack) from [<c008a750>] (print_circular_bug+0x21c/0x33c)
    [   23.599090]  r5:c0939d60 r4:c0936400
    [   23.602874] [<c008a534>] (print_circular_bug) from [<c008e370>] (__lock_acquire+0x1c98/0x1d88)
    [   23.611877]  r10:00000003 r9:c0fd7a5c r8:ed4b8550 r7:ed4b8570 r6:ed4b8000 r5:c0ff69e4
    [   23.620117]  r4:c0936400 r3:ed4b8550
    [   23.623901] [<c008c6d8>] (__lock_acquire) from [<c008ed60>] (lock_acquire+0xe4/0x210)
    [   23.632080]  r10:00000000 r9:00000000 r8:60000013 r7:c07cb7b4 r6:00000001 r5:00000000
    [   23.640350]  r4:00000000
    [   23.643005] [<c008ec7c>] (lock_acquire) from [<c0530a30>] (mutex_lock_nested+0x74/0x450)
    [   23.651458]  r10:ecc0bf80 r9:edbe7dcc r8:ed4b8000 r7:c0fd7a5c r6:c0415d40 r5:00000000
    [   23.659729]  r4:c07cb780
    [   23.662384] [<c05309bc>] (mutex_lock_nested) from [<c0415d40>] (coresight_enable+0x1c/0x1b4)
    [   23.671234]  r10:ecc0bf80 r9:edbe7dcc r8:ed733c00 r7:00000000 r6:ed733c00 r5:00000002
    [   23.679473]  r4:ed762140
    [   23.682128] [<c0415d24>] (coresight_enable) from [<c0416030>] (enable_source_store+0x58/0x68)
    [   23.691070]  r7:00000000 r6:ed733c00 r5:00000002 r4:ed762160
    [   23.697052] [<c0415fd8>] (enable_source_store) from [<c030b358>] (dev_attr_store+0x20/0x2c)
    [   23.705780]  r5:edbe7dc0 r4:c0415fd8
    [   23.709533] [<c030b338>] (dev_attr_store) from [<c01d55e8>] (sysfs_kf_write+0x50/0x54)
    [   23.717834]  r5:edbe7dc0 r4:c030b338
    [   23.721618] [<c01d5598>] (sysfs_kf_write) from [<c01d4ba8>] (kernfs_fop_write+0xc4/0x1c0)
    [   23.730163]  r7:00000000 r6:00000000 r5:00000002 r4:edbe7dc0
    [   23.736145] [<c01d4ae4>] (kernfs_fop_write) from [<c015b60c>] (__vfs_write+0x34/0xe4)
    [   23.744323]  r10:00000000 r9:ecc0a000 r8:c0010964 r7:ecc0bf80 r6:00000002 r5:c01d4ae4
    [   23.752593]  r4:ee385a40
    [   23.755249] [<c015b5d8>] (__vfs_write) from [<c015bf38>] (vfs_write+0x98/0x174)
    [   23.762908]  r9:ecc0a000 r8:c0010964 r7:ecc0bf80 r6:000ab0d8 r5:00000002 r4:ee385a40
    [   23.771057] [<c015bea0>] (vfs_write) from [<c015c7a8>] (SyS_write+0x4c/0xa8)
    [   23.778442]  r8:c0010964 r7:00000002 r6:000ab0d8 r5:ee385a40 r4:ee385a40
    [   23.785522] [<c015c75c>] (SyS_write) from [<c00107c0>] (ret_fast_syscall+0x0/0x1c)
    [   23.793457]  r7:00000004 r6:00000001 r5:000ab0d8 r4:00000002
    [   23.799652] coresight-etb10 54162000.etb: ETB enabled
    [   23.805084] coresight-funnel 54164000.funnel: FUNNEL inport 0 enabled
    [   23.811859] coresight-replicator 44000000.ocp:replicator: REPLICATOR enabled
    [   23.819335] coresight-funnel 54158000.funnel: FUNNEL inport 0 enabled
    [   23.826110] coresight-etm3x 5414c000.ptm: ETM tracing enabled

The locking in coresight_unregister() is not required as the only customers of
the function are drivers themselves when an initialisation failure has been
encountered.

Reported-by: Rabin Vincent <[email protected]>
Signed-off-by: Mathieu Poirier <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
mhennerich pushed a commit that referenced this pull request Apr 5, 2016
The starting node for a klist iteration is often passed in from
somewhere way above the klist infrastructure, meaning there's no
guarantee the node is still on the list.  We've seen this in SCSI where
we use bus_find_device() to iterate through a list of devices.  In the
face of heavy hotplug activity, the last device returned by
bus_find_device() can be removed before the next call.  This leads to

Dec  3 13:22:02 localhost kernel: WARNING: CPU: 2 PID: 28073 at include/linux/kref.h:47 klist_iter_init_node+0x3d/0x50()
Dec  3 13:22:02 localhost kernel: Modules linked in: scsi_debug x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel joydev iTCO_wdt dcdbas ipmi_devintf acpi_power_meter iTCO_vendor_support ipmi_si imsghandler pcspkr wmi acpi_cpufreq tpm_tis tpm shpchp lpc_ich mfd_core nfsd nfs_acl lockd grace sunrpc tg3 ptp pps_core
Dec  3 13:22:02 localhost kernel: CPU: 2 PID: 28073 Comm: cat Not tainted 4.4.0-rc1+ #2
Dec  3 13:22:02 localhost kernel: Hardware name: Dell Inc. PowerEdge R320/08VT7V, BIOS 2.0.22 11/19/2013
Dec  3 13:22:02 localhost kernel: ffffffff81a20e77 ffff880613acfd18 ffffffff81321eef 0000000000000000
Dec  3 13:22:02 localhost kernel: ffff880613acfd50 ffffffff8107ca52 ffff88061176b198 0000000000000000
Dec  3 13:22:02 localhost kernel: ffffffff814542b0 ffff880610cfb100 ffff88061176b198 ffff880613acfd60
Dec  3 13:22:02 localhost kernel: Call Trace:
Dec  3 13:22:02 localhost kernel: [<ffffffff81321eef>] dump_stack+0x44/0x55
Dec  3 13:22:02 localhost kernel: [<ffffffff8107ca52>] warn_slowpath_common+0x82/0xc0
Dec  3 13:22:02 localhost kernel: [<ffffffff814542b0>] ? proc_scsi_show+0x20/0x20
Dec  3 13:22:02 localhost kernel: [<ffffffff8107cb4a>] warn_slowpath_null+0x1a/0x20
Dec  3 13:22:02 localhost kernel: [<ffffffff8167225d>] klist_iter_init_node+0x3d/0x50
Dec  3 13:22:02 localhost kernel: [<ffffffff81421d41>] bus_find_device+0x51/0xb0
Dec  3 13:22:02 localhost kernel: [<ffffffff814545ad>] scsi_seq_next+0x2d/0x40
[...]

And an eventual crash. It can actually occur in any hotplug system
which has a device finder and a starting device.

We can fix this globally by making sure the starting node for
klist_iter_init_node() is actually a member of the list before using it
(and by starting from the beginning if it isn't).
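
A minimal sketch of that fallback, assuming a plain singly linked list rather
than the kernel's struct klist. The names struct node, struct list,
list_contains() and iter_start() are hypothetical, and the linear membership
scan is chosen for clarity, not because the kernel does it this way.

```c
#include <stddef.h>

/* Hypothetical singly linked list, not the kernel's struct klist. */
struct node {
	struct node *next;
};

struct list {
	struct node *head;
};

/* Return 1 if n is currently linked into l, 0 otherwise. */
static int list_contains(const struct list *l, const struct node *n)
{
	const struct node *cur;

	for (cur = l->head; cur; cur = cur->next)
		if (cur == n)
			return 1;
	return 0;
}

/*
 * Pick the node an iteration should start from: the requested node if
 * it is still a member of the list, otherwise the head of the list.
 */
static struct node *iter_start(const struct list *l, struct node *start)
{
	if (start && list_contains(l, start))
		return start;
	return l->head;
}
```

The membership test only compares pointers and never dereferences the
candidate start node, so a stale pointer is harmless in this model.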

Reported-by: Ewan D. Milne <[email protected]>
Tested-by: Ewan D. Milne <[email protected]>
Cc: [email protected]
Signed-off-by: James Bottomley <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
mhennerich pushed a commit that referenced this pull request Apr 5, 2016
Commit 4b4b451 ("arm/arm64: KVM: Rework the arch timer to use
level-triggered semantics") brought the virtual architected timer
closer to the VGIC. There is one occasion where we don't properly
check for the VGIC actually having been initialized before, but
instead go on to check the active state of some IRQ number.
If userland hasn't instantiated a virtual GIC, we end up with a
kernel NULL pointer dereference:
=========
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ffffffc9745c5000
[00000000] *pgd=00000009f631e003, *pud=00000009f631e003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#2] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 2144 Comm: kvm_simplest-ar Tainted: G      D 4.5.0-rc2+ #1300
Hardware name: ARM Juno development board (r1) (DT)
task: ffffffc976da8000 ti: ffffffc976e28000 task.ti: ffffffc976e28000
PC is at vgic_bitmap_get_irq_val+0x78/0x90
LR is at kvm_vgic_map_is_active+0xac/0xc8
pc : [<ffffffc0000b7e28>] lr : [<ffffffc0000b972c>] pstate: 20000145
....
=========

Fix this by bailing out early of kvm_timer_flush_hwstate() if we don't
have a VGIC at all.

Reported-by: Cosmin Gorgovan <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Andre Przywara <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Cc: <[email protected]> # 4.4.x
mhennerich pushed a commit that referenced this pull request Apr 5, 2016
…l/git/vgupta/arc

Pull ARC fixes from Vineet Gupta:
 "I've been sitting on some of these fixes for a while.

   - Corner case of returning to delay slot from interrupt
   - Changing default interrupt priority level
   - Kconfig'ize support for super pages
   - Other minor fixes"

* tag 'arc-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: mm: Introduce explicit super page size support
  ARCv2: intc: Allow interruption by lowest priority interrupt
  ARCv2: Check for LL-SC livelock only if LLSC is enabled
  ARC: shrink cpuinfo by not saving full timer BCR
  ARCv2: clocksource: Rename GRTC -> GFRC ...
  ARCv2: STAR 9000950267: Handle return from intr to Delay Slot #2
mhennerich pushed a commit that referenced this pull request Apr 5, 2016
Ilya reported following lockdep splat:

kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0

To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child has been queued
into accept queue.

We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.
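
The two requirements above can be modelled roughly as follows. This is a
single-threaded sketch with hypothetical names (struct listener, queue_add(),
deliver_child()); it is not the TCP code, and the plain integer counters only
stand in for real reference counting and locking.

```c
/* Hypothetical model of a listening socket. */
struct listener {
	int refcnt;	/* references keeping the listener alive */
	int listening;	/* 0 once the listener is being dismantled */
	int queued;	/* children sitting in the accept queue */
	int notified;	/* how often the data-ready callback ran */
};

static void listener_hold(struct listener *l) { l->refcnt++; }
static void listener_put(struct listener *l) { l->refcnt--; }

/* Returns 1 if the child was queued, 0 if the listener rejected it. */
static int queue_add(struct listener *l)
{
	if (!l->listening)
		return 0;
	l->queued++;
	return 1;
}

/*
 * Queue a child and notify the listener. The caller's own reference is
 * taken first, because the reference carried by the child may vanish
 * as soon as the child is on the accept queue.
 */
static int deliver_child(struct listener *l)
{
	int queued;

	listener_hold(l);
	queued = queue_add(l);
	if (queued)
		l->notified++;	/* stands in for sk->sk_data_ready() */
	listener_put(l);
	return queued;
}
```

Because queue_add() reports its result, the caller knows whether it still
owns the child and must clean it up itself.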

Reported-by: Ilya Dryomov <[email protected]>
Fixes: ebb516a ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
github-actions bot pushed a commit that referenced this pull request Nov 14, 2025
As Jiaming Zhang and syzbot reported, there is potential deadlock in
f2fs as below:

Chain exists of:
  &sbi->cp_rwsem --> fs_reclaim --> sb_internal#2

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(sb_internal#2);
                               lock(fs_reclaim);
                               lock(sb_internal#2);
  rlock(&sbi->cp_rwsem);

 *** DEADLOCK ***

3 locks held by kswapd0/73:
 #0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:7015 [inline]
 #0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0x951/0x2800 mm/vmscan.c:7389
 #1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_trylock_shared fs/super.c:562 [inline]
 #1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_cache_scan+0x91/0x4b0 fs/super.c:197
 #2: ffff888011840610 (sb_internal#2){.+.+}-{0:0}, at: f2fs_evict_inode+0x8d9/0x1b60 fs/f2fs/inode.c:890

stack backtrace:
CPU: 0 UID: 0 PID: 73 Comm: kswapd0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_circular_bug+0x2ee/0x310 kernel/locking/lockdep.c:2043
 check_noncircular+0x134/0x160 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain+0xb9b/0x2140 kernel/locking/lockdep.c:3908
 __lock_acquire+0xab9/0xd20 kernel/locking/lockdep.c:5237
 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5868
 down_read+0x46/0x2e0 kernel/locking/rwsem.c:1537
 f2fs_down_read fs/f2fs/f2fs.h:2278 [inline]
 f2fs_lock_op fs/f2fs/f2fs.h:2357 [inline]
 f2fs_do_truncate_blocks+0x21c/0x10c0 fs/f2fs/file.c:791
 f2fs_truncate_blocks+0x10a/0x300 fs/f2fs/file.c:867
 f2fs_truncate+0x489/0x7c0 fs/f2fs/file.c:925
 f2fs_evict_inode+0x9f2/0x1b60 fs/f2fs/inode.c:897
 evict+0x504/0x9c0 fs/inode.c:810
 f2fs_evict_inode+0x1dc/0x1b60 fs/f2fs/inode.c:853
 evict+0x504/0x9c0 fs/inode.c:810
 dispose_list fs/inode.c:852 [inline]
 prune_icache_sb+0x21b/0x2c0 fs/inode.c:1000
 super_cache_scan+0x39b/0x4b0 fs/super.c:224
 do_shrink_slab+0x6ef/0x1110 mm/shrinker.c:437
 shrink_slab_memcg mm/shrinker.c:550 [inline]
 shrink_slab+0x7ef/0x10d0 mm/shrinker.c:628
 shrink_one+0x28a/0x7c0 mm/vmscan.c:4955
 shrink_many mm/vmscan.c:5016 [inline]
 lru_gen_shrink_node mm/vmscan.c:5094 [inline]
 shrink_node+0x315d/0x3780 mm/vmscan.c:6081
 kswapd_shrink_node mm/vmscan.c:6941 [inline]
 balance_pgdat mm/vmscan.c:7124 [inline]
 kswapd+0x147c/0x2800 mm/vmscan.c:7389
 kthread+0x70e/0x8a0 kernel/kthread.c:463
 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

The root cause is deadlock among four locks as below:

kswapd
- fs_reclaim				--- Lock A
 - shrink_one
  - evict
   - f2fs_evict_inode
    - sb_start_intwrite			--- Lock B

- iput
 - evict
  - f2fs_evict_inode
   - sb_start_intwrite			--- Lock B
   - f2fs_truncate
    - f2fs_truncate_blocks
     - f2fs_do_truncate_blocks
      - f2fs_lock_op			--- Lock C

ioctl
- f2fs_ioc_commit_atomic_write
 - f2fs_lock_op				--- Lock C
  - __f2fs_commit_atomic_write
   - __replace_atomic_write_block
    - f2fs_get_dnode_of_data
     - __get_node_folio
      - f2fs_check_nid_range
       - f2fs_handle_error
        - f2fs_record_errors
         - f2fs_down_write		--- Lock D

open
- do_open
 - do_truncate
  - security_inode_need_killpriv
   - f2fs_getxattr
    - lookup_all_xattrs
     - f2fs_handle_error
      - f2fs_record_errors
       - f2fs_down_write		--- Lock D
        - f2fs_commit_super
         - read_mapping_folio
          - filemap_alloc_folio_noprof
           - prepare_alloc_pages
            - fs_reclaim_acquire	--- Lock A

In order to avoid such deadlock, we need to avoid grabbing sb_lock in
f2fs_handle_error(), so, let's use asynchronous method instead:
- remove f2fs_handle_error() implementation
- rename f2fs_handle_error_async() to f2fs_handle_error()
- spread f2fs_handle_error()
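
As a rough, single-threaded illustration of the asynchronous approach: the
error path only marks the error as pending, and a later worker records it
under whatever lock is needed. All names here (struct sb_info,
handle_error_async(), flush_errors()) are hypothetical and the model ignores
real concurrency; it is not the f2fs implementation.

```c
#include <stdint.h>

/* Hypothetical per-superblock error state. */
struct sb_info {
	uint32_t pending_errors;	/* set from the error path, no sb lock */
	uint32_t recorded_errors;	/* written only by the worker */
	int flush_scheduled;		/* 1 when the worker has work to do */
};

/* Error path: note the error and schedule the worker. bit must be < 32. */
static void handle_error_async(struct sb_info *sbi, unsigned int bit)
{
	sbi->pending_errors |= (uint32_t)1 << bit;
	sbi->flush_scheduled = 1;
}

/*
 * Worker: runs later, outside the caller's locking context, and is the
 * only place that would take the superblock lock. Returns 1 if it
 * recorded anything.
 */
static int flush_errors(struct sb_info *sbi)
{
	if (!sbi->flush_scheduled)
		return 0;

	sbi->recorded_errors |= sbi->pending_errors;
	sbi->pending_errors = 0;
	sbi->flush_scheduled = 0;
	return 1;
}
```

The point of the split is that the error path no longer acquires Lock D while
other locks are held, which removes one edge from the dependency cycle.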

Fixes: 95fa90c ("f2fs: support recording errors into superblock")
Cc: [email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-f2fs-devel/[email protected]
Reported-by: Jiaming Zhang <[email protected]>
Closes: https://lore.kernel.org/lkml/CANypQFa-Gy9sD-N35o3PC+FystOWkNuN8pv6S75HLT0ga-Tzgw@mail.gmail.com
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
github-actions bot pushed a commit that referenced this pull request Nov 14, 2025
…ernel/git/ath/ath

Jeff Johnson says:
==================
ath.git patches for v6.19 (#2)

Just one 2-patch series for this PR.

Once pulled into wireless-next, ath-next will fast-forward, and that
will provide the baseline for merging ath12k-ng into ath-next.
==================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
github-actions bot pushed a commit that referenced this pull request Nov 18, 2025
Leon Hwang says:

====================
In the discussion thread
"[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1],
it was pointed out that missing calls to bpf_obj_free_fields() could
lead to memory leaks.

A selftest was added to confirm that this is indeed a real issue - the
refcount of BPF_KPTR_REF field is not decremented when
bpf_obj_free_fields() is missing after copy_map_value[,_long]().

Further inspection of copy_map_value[,_long]() call sites revealed two
locations affected by this issue:

1. pcpu_copy_value()
2. htab_map_update_elem() when used with BPF_F_LOCK

A similar case happens when updating local storage maps with BPF_F_LOCK.

This series fixes the cases where BPF_F_LOCK is not involved by
properly calling bpf_obj_free_fields() after copy_map_value[,_long](),
and adds a selftest to verify the fix.

The remaining cases involving BPF_F_LOCK will be addressed in a
separate patch set after the series
"bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"
is applied.
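
The leak and its fix can be modelled in plain C. This sketch assumes that the
copy helper leaves special fields of the destination untouched, so a
referenced pointer already stored there survives the copy and must be
released afterwards. struct obj, struct map_value and the helpers are
hypothetical stand-ins, not BPF code.

```c
#include <stddef.h>

/* Hypothetical refcounted object referenced from a map value. */
struct obj {
	int refcnt;
};

/* Hypothetical map value: plain data plus one "special" kptr field. */
struct map_value {
	long data;
	struct obj *kptr;
};

static void obj_put(struct obj *o)
{
	if (o)
		o->refcnt--;
}

/* Copy plain data only; the special field in dst is left as it was. */
static void copy_value(struct map_value *dst, const struct map_value *src)
{
	dst->data = src->data;
}

/* Release whatever the special field still references. */
static void free_fields(struct map_value *v)
{
	obj_put(v->kptr);
	v->kptr = NULL;
}

/* Fixed update path: copy first, then drop the stale reference. */
static void update_value(struct map_value *dst, const struct map_value *src)
{
	copy_value(dst, src);
	free_fields(dst);
}
```

Without the free_fields() call the old reference would stay counted forever,
which is the symptom the selftest checks for.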

Changes:
v5 -> v6:
* Update the test name to include "refcounted_kptr".
* Update some local variables' name in the test (per Alexei).
* v5: https://lore.kernel.org/bpf/[email protected]/

v4 -> v5:
* Use a local variable to store the this_cpu_ptr()/per_cpu_ptr() result,
  and reuse it between copy_map_value[,_long]() and
  bpf_obj_free_fields() in patch #1 (per Andrii).
* Drop patch #2 and #3, because the combination of BPF_F_LOCK with other
  special fields (except for BPF_SPIN_LOCK) will be disallowed on the
  UAPI side in the future (per Alexei).
* v4: https://lore.kernel.org/bpf/[email protected]/

v3 -> v4:
* Target bpf-next tree.
* Address comments from Amery:
  * Drop 'bpf_obj_free_fields()' in the path of updating local storage
    maps without BPF_F_LOCK.
  * Drop the corresponding self test.
  * Respin the other test of local storage maps using syscall BPF
    programs.
* v3: https://lore.kernel.org/bpf/[email protected]/

v2 -> v3:
* Free special fields when update local storage maps without BPF_F_LOCK.
* Add test to verify decrementing refcount when update cgroup local
  storage maps without BPF_F_LOCK.
* Address review from AI bot:
  * Slow path with BPF_F_LOCK (around line 642-646) in
    'bpf_local_storage.c'.
* v2: https://lore.kernel.org/bpf/[email protected]/

v1 -> v2:
* Add test to verify decrementing refcount when update cgroup local
  storage maps with BPF_F_LOCK.
* Address review from AI bot:
  * Fast path without bucket lock (around line 610) in
    'bpf_local_storage.c'.
* v1: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
github-actions bot pushed a commit that referenced this pull request Nov 18, 2025
According to the APM Volume #2, Section 15.17, Table 15-10 (24593, Rev.
3.42, March 2024), when "GIF==0", a "Debug exception or trap, due to
breakpoint register match" should be "Ignored and discarded".

KVM lacks any handling of this. Even when vGIF is enabled and vGIF==0,
the CPU does not ignore #DBs and relies on the VMM to do so.

Handling this is possible, but the complexity is unjustified given the
rarity of using HW breakpoints when GIF==0 (e.g. near VMRUN). KVM would
need to intercept the #DB, temporarily disable the breakpoint,
single-step over the instruction (probably reusing NMI single-stepping),
and re-enable the breakpoint.

Instead, document this as an erratum.

Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
github-actions bot pushed a commit that referenced this pull request Nov 29, 2025
Marc Kleine-Budde <[email protected]> says:

Similarly to how CAN FD reuses the bittiming logic of Classical CAN, CAN XL
also reuses the entirety of CAN FD features, and, on top of that, adds new
features which are specific to CAN XL.

A so-called 'mixed-mode' is intended to have (XL-tolerant) CAN FD nodes and
CAN XL nodes on one CAN segment, where the FD-controllers can talk CC/FD
and the XL-controllers can talk CC/FD/XL. This mixed-mode utilizes the
known error-signalling (ES) for sending CC/FD/XL frames. For CAN FD and CAN
XL the transceiver delay compensation (TDC) is supported to use common CAN
and CAN-SIG transceivers.

The CANXL-only mode disables the error-signalling in the CAN XL controller.
This mode does not allow CC/FD frames to be sent but additionally offers a
CAN XL transceiver mode switching (TMS) to send CAN XL frames with up to
20Mbit/s data rate. The TMS utilizes a PWM configuration which is added to
the netlink interface.

Configured with CAN_CTRLMODE_FD and CAN_CTRLMODE_XL this leads to:

FD=0 XL=0 CC-only mode         (ES=1)
FD=1 XL=0 FD/CC mixed-mode     (ES=1)
FD=1 XL=1 XL/FD/CC mixed-mode  (ES=1)
FD=0 XL=1 XL-only mode         (ES=0, TMS optional)
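
The table above can be expressed as a small lookup. The enum, struct and
function names below are hypothetical and exist only for this sketch; the
four FD/XL combinations and their ES/TMS values are taken from the table.

```c
/* Hypothetical names; only the FD/XL combinations come from the table. */
enum can_mode {
	CAN_MODE_CC_ONLY,
	CAN_MODE_FD_CC_MIXED,
	CAN_MODE_XL_FD_CC_MIXED,
	CAN_MODE_XL_ONLY,
};

struct can_mode_info {
	enum can_mode mode;
	int error_signalling;	/* ES: 1 = enabled, 0 = disabled */
	int tms_allowed;	/* TMS is optional in XL-only mode only */
};

/* Map the two control mode flags to the resulting operating mode. */
static struct can_mode_info can_mode_from_flags(int fd, int xl)
{
	struct can_mode_info info = { CAN_MODE_CC_ONLY, 1, 0 };

	if (fd && xl) {
		info.mode = CAN_MODE_XL_FD_CC_MIXED;
	} else if (fd) {
		info.mode = CAN_MODE_FD_CC_MIXED;
	} else if (xl) {
		info.mode = CAN_MODE_XL_ONLY;
		info.error_signalling = 0;
		info.tms_allowed = 1;
	}
	return info;
}
```

Error signalling is disabled in exactly one row, which is why the XL-only
case is the only branch that changes the defaults.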

Patch #1 prints defined ctrlmode strings capitalized to increase the
readability and to be in line with the 'ip' tool (iproute2).

Patch #2 is a small clean-up which makes can_calc_bittiming() use
NL_SET_ERR_MSG() instead of netdev_err().

Patch #3 adds a check in can_dev_dropped_skb() to drop CAN FD frames
when CAN FD is turned off.

Patch #4 adds CAN_CTRLMODE_RESTRICTED. Note that contrary to the other
CAN_CTRL_MODE_XL_* that are introduced in the later patches, this control
mode is not specific to CAN XL. The nuance is that because this restricted
mode was only added in ISO 11898-1:2024, it is made mandatory for CAN XL
devices but optional for other protocols. This is why this patch is added
as a preparation before introducing the core CAN XL logic.

Patch #5 adds all the CAN XL features which are inherited from CAN FD: the
nominal bittiming, the data bittiming and the TDC.

Patch #6 adds a new CAN_CTRLMODE_XL_TMS control mode which is specific to
CAN XL to enable the transceiver mode switching (TMS) in XL-only mode.

Patch #7 adds a check in can_dev_dropped_skb() to drop CAN CC/FD frames
when the CAN XL controller is in CAN XL-only mode. The introduced
can_dev_in_xl_only_mode() function also determines the error-signalling
configuration for the CAN XL controllers.

Patch #8 to #11 add the PWM logic for the CAN XL TMS mode.

Patch #12 to #14 add different default sample-points for standard CAN and
CAN SIG transceivers (with TDC) and CAN XL transceivers using PWM in the
CAN XL TMS mode.

Patch #15 adds a dummy_can driver for netlink testing and debugging.

Patch #16 checks the CAN frame type (CC/FD/XL) when writing those frames to
the CAN_RAW socket and rejects them if it's not supported by the CAN interface.

Patch #17 increases the resolution when printing the bitrate error and
rounds the value up to 0.01% in the case the resolution would still provide
values which would lead to 0.00%.
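
One way to picture the rounding described for patch #17 is the helper below.
It is an assumption about the arithmetic, not the driver code: the function
name and the choice of hundredths of a percent as the unit are made up for
this sketch.

```c
/*
 * Return the deviation of `actual` from `wanted` in hundredths of a
 * percent, so 100 means 1.00%. A non-zero deviation that would print
 * as 0.00% is rounded up to 0.01%.
 */
static unsigned int bitrate_error_centipercent(unsigned int wanted,
					       unsigned int actual)
{
	unsigned long long diff;
	unsigned int err;

	if (!wanted)
		return 0;	/* nothing to compare against */

	diff = wanted > actual ? wanted - actual : actual - wanted;
	err = (unsigned int)(diff * 10000ULL / wanted);
	if (diff && !err)
		err = 1;	/* never report a real error as 0.00% */
	return err;
}
```

With a wanted bitrate of 500000 bit/s, an actual bitrate of 499999 bit/s is
reported as 0.01% rather than 0.00%, while an exact match still yields zero.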

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 2, 2025
It's possible that the auxiliary proxy device we add when setting up the
GPIO controller exposing shared pins, will get matched and probed
immediately. The gpio-proxy-driver will then retrieve the shared
descriptor structure. That will cause a recursive mutex locking and
a deadlock because we're already holding the gpio_shared_lock in
gpio_device_setup_shared() and try to take it again in
devm_gpiod_shared_get() like this:

[    4.298346] gpiolib_shared: GPIO 130 owned by f100000.pinctrl is shared by multiple consumers
[    4.307157] gpiolib_shared: Setting up a shared GPIO entry for speaker@0,3
[    4.314604]
[    4.316146] ============================================
[    4.321600] WARNING: possible recursive locking detected
[    4.327054] 6.18.0-rc7-next-20251125-g3f300d0674f6-dirty #3887 Not tainted
[    4.334115] --------------------------------------------
[    4.339566] kworker/u32:3/71 is trying to acquire lock:
[    4.344931] ffffda019ba71850 (gpio_shared_lock){+.+.}-{4:4}, at: devm_gpiod_shared_get+0x34/0x2e0
[    4.354057]
[    4.354057] but task is already holding lock:
[    4.360041] ffffda019ba71850 (gpio_shared_lock){+.+.}-{4:4}, at: gpio_device_setup_shared+0x30/0x268
[    4.369421]
[    4.369421] other info that might help us debug this:
[    4.376126]  Possible unsafe locking scenario:
[    4.376126]
[    4.382198]        CPU0
[    4.384711]        ----
[    4.387223]   lock(gpio_shared_lock);
[    4.390992]   lock(gpio_shared_lock);
[    4.394761]
[    4.394761]  *** DEADLOCK ***
[    4.394761]
[    4.400832]  May be due to missing lock nesting notation
[    4.400832]
[    4.407802] 5 locks held by kworker/u32:3/71:
[    4.412279]  #0: ffff000080020948 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x194/0x64c
[    4.422650]  #1: ffff800080963d60 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x1bc/0x64c
[    4.432117]  #2: ffff00008165c8f8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x3c/0x198
[    4.440700]  #3: ffffda019ba71850 (gpio_shared_lock){+.+.}-{4:4}, at: gpio_device_setup_shared+0x30/0x268
[    4.450523]  #4: ffff0000810fe918 (&dev->mutex){....}-{4:4}, at: __device_attach+0x3c/0x198
[    4.459103]
[    4.459103] stack backtrace:
[    4.463581] CPU: 6 UID: 0 PID: 71 Comm: kworker/u32:3 Not tainted 6.18.0-rc7-next-20251125-g3f300d0674f6-dirty #3887 PREEMPT
[    4.463589] Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
[    4.463593] Workqueue: events_unbound deferred_probe_work_func
[    4.463602] Call trace:
[    4.463604]  show_stack+0x18/0x24 (C)
[    4.463617]  dump_stack_lvl+0x70/0x98
[    4.463627]  dump_stack+0x18/0x24
[    4.463636]  print_deadlock_bug+0x224/0x238
[    4.463643]  __lock_acquire+0xe4c/0x15f0
[    4.463648]  lock_acquire+0x1cc/0x344
[    4.463653]  __mutex_lock+0xb8/0x840
[    4.463661]  mutex_lock_nested+0x24/0x30
[    4.463667]  devm_gpiod_shared_get+0x34/0x2e0
[    4.463674]  gpio_shared_proxy_probe+0x18/0x138
[    4.463682]  auxiliary_bus_probe+0x40/0x78
[    4.463688]  really_probe+0xbc/0x2c0
[    4.463694]  __driver_probe_device+0x78/0x120
[    4.463701]  driver_probe_device+0x3c/0x160
[    4.463708]  __device_attach_driver+0xb8/0x140
[    4.463716]  bus_for_each_drv+0x88/0xe8
[    4.463723]  __device_attach+0xa0/0x198
[    4.463729]  device_initial_probe+0x14/0x20
[    4.463737]  bus_probe_device+0xb4/0xc0
[    4.463743]  device_add+0x578/0x76c
[    4.463747]  __auxiliary_device_add+0x40/0xac
[    4.463752]  gpio_device_setup_shared+0x1f8/0x268
[    4.463758]  gpiochip_add_data_with_key+0xdac/0x10ac
[    4.463763]  devm_gpiochip_add_data_with_key+0x30/0x80
[    4.463768]  msm_pinctrl_probe+0x4b0/0x5e0
[    4.463779]  sm8250_pinctrl_probe+0x18/0x40
[    4.463784]  platform_probe+0x5c/0xa4
[    4.463793]  really_probe+0xbc/0x2c0
[    4.463800]  __driver_probe_device+0x78/0x120
[    4.463807]  driver_probe_device+0x3c/0x160
[    4.463814]  __device_attach_driver+0xb8/0x140
[    4.463821]  bus_for_each_drv+0x88/0xe8
[    4.463827]  __device_attach+0xa0/0x198
[    4.463834]  device_initial_probe+0x14/0x20
[    4.463841]  bus_probe_device+0xb4/0xc0
[    4.463847]  deferred_probe_work_func+0x90/0xcc
[    4.463854]  process_one_work+0x214/0x64c
[    4.463860]  worker_thread+0x1bc/0x360
[    4.463866]  kthread+0x14c/0x220
[    4.463871]  ret_from_fork+0x10/0x20
[   77.265041] random: crng init done

Fortunately, at the time of creating the auxiliary device, we already
know the correct entry so let's store it as the device's platform data.
We don't need to hold gpio_shared_lock in devm_gpiod_shared_get() as
we're not removing the entry or traversing the list anymore but we still
need to protect it from concurrent modification of its fields so add a
more fine-grained mutex.

Fixes: a060b8c ("gpiolib: implement low-level, shared GPIO support")
Reported-by: Dmitry Baryshkov <[email protected]>
Closes: https://lore.kernel.org/all/fimuvblfy2cmn7o4wzcxjzrux5mwhvlvyxfsgeqs6ore2xg75i@ax46d3sfmdux/
Reviewed-by: Dmitry Baryshkov <[email protected]>
Tested-by: Dmitry Baryshkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bartosz Golaszewski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 3, 2025
Testing in two circumstances:

1. back to back optical SFP+ connection between two LS1028A-QDS ports
   with the SCH-26908 riser card
2. T1042 with on-board AQR115 PHY using "OCSGMII", as per
   https://lore.kernel.org/lkml/aIuEvaSCIQdJWcZx@FUE-ALEWI-WINX/

strongly suggests that enabling in-band auto-negotiation is actually
possible when the lane baud rate is 3.125 Gbps.

It was previously thought that this would not be the case, because it
was only tested on 2500base-x links with on-board Aquantia PHYs, where
it was noticed that MII_LPA is always reported as zero, and it was
thought that this is because of the PCS.

Test case #1 above shows it is not, and the configured MII_ADVERTISE on
system A ends up in the MII_LPA on system B, when in 2500base-x mode
(IF_MODE=0).

Test case #2, which uses "SGMII" auto-negotiation (IF_MODE=3) for the
3.125 Gbps lane, is actually a misconfiguration, but it is what led to
the discovery.

There is actually an old bug in the Lynx PCS driver - it expects all
register values to contain their default out-of-reset values, as if the
PCS were initialized by the Reset Configuration Word (RCW) settings.
There are 2 cases in which this is problematic:
- if the bootloader (or previous kexec-enabled Linux) wrote a different
  IF_MODE value
- if dynamically changing the SerDes protocol from 1000base-x to
  2500base-x, e.g. by replacing the optical SFP module.

Specifically in test case #2, an accidental alignment between the
bootloader configuring the PCS to expect SGMII in-band code words, and
the AQR115 PHY actually transmitting SGMII in-band code words when
operating in the "OCSGMII" system interface protocol, led to the PCS
transmitting replicated symbols at 3.125 Gbps baud rate. This could only
have happened if the PCS saw and reacted to the SGMII code words in the
first place.

Since test #2 is invalid from a protocol perspective (there seems to be
no standard way of negotiating the data rate of 2500 Mbps with SGMII,
and the lower data rates should remain 10/100/1000), in-band auto-negotiation
for 2500base-x effectively means Clause 37 (i.e. IF_MODE=0).

Make 2500base-x be treated like 1000base-x in this regard, by removing
all prior limitations and calling lynx_pcs_config_giga().

This adds a new feature, LINK_INBAND_ENABLE, and at the same time fixes
the Lynx PCS's long-standing problem that the registers (specifically
IF_MODE, but others could be misconfigured as well) are not written by
the driver to the known valid values for 2500base-x.

Co-developed-by: Alexander Wilhelm <[email protected]>
Signed-off-by: Alexander Wilhelm <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 4, 2025
…ockdep

While developing IPPROTO_SMBDIRECT support for the code
under fs/smb/common/smbdirect [1], I noticed false positives like this:

[T79] ======================================================
[T79] WARNING: possible circular locking dependency detected
[T79] 6.18.0-rc4-metze-kasan-lockdep.01+ #1 Tainted: G           OE
[T79] ------------------------------------------------------
[T79] kworker/2:0/79 is trying to acquire lock:
[T79] ffff88801f968278 (sk_lock-AF_INET){+.+.}-{0:0},
                        at: sock_set_reuseaddr+0x14/0x70
[T79]
        but task is already holding lock:
[T79] ffffffffc10f7230 (lock#9){+.+.}-{4:4},
                        at: rdma_listen+0x3d2/0x740 [rdma_cm]
[T79]
        which lock already depends on the new lock.

[T79]
        the existing dependency chain (in reverse order) is:
[T79]
        -> #1 (lock#9){+.+.}-{4:4}:
[T79]        __lock_acquire+0x535/0xc30
[T79]        lock_acquire.part.0+0xb3/0x240
[T79]        lock_acquire+0x60/0x140
[T79]        __mutex_lock+0x1af/0x1c10
[T79]        mutex_lock_nested+0x1b/0x30
[T79]        cma_get_port+0xba/0x7d0 [rdma_cm]
[T79]        rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm]
[T79]        cma_bind_addr+0x107/0x320 [rdma_cm]
[T79]        rdma_resolve_addr+0xa3/0x830 [rdma_cm]
[T79]        destroy_lease_table+0x12b/0x420 [ksmbd]
[T79]        ksmbd_NTtimeToUnix+0x3e/0x80 [ksmbd]
[T79]        ndr_encode_posix_acl+0x6e9/0xab0 [ksmbd]
[T79]        ndr_encode_v4_ntacl+0x53/0x870 [ksmbd]
[T79]        __sys_connect_file+0x131/0x1c0
[T79]        __sys_connect+0x111/0x140
[T79]        __x64_sys_connect+0x72/0xc0
[T79]        x64_sys_call+0xe7d/0x26a0
[T79]        do_syscall_64+0x93/0xff0
[T79]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[T79]
        -> #0 (sk_lock-AF_INET){+.+.}-{0:0}:
[T79]        check_prev_add+0xf3/0xcd0
[T79]        validate_chain+0x466/0x590
[T79]        __lock_acquire+0x535/0xc30
[T79]        lock_acquire.part.0+0xb3/0x240
[T79]        lock_acquire+0x60/0x140
[T79]        lock_sock_nested+0x3b/0xf0
[T79]        sock_set_reuseaddr+0x14/0x70
[T79]        siw_create_listen+0x145/0x1540 [siw]
[T79]        iw_cm_listen+0x313/0x5b0 [iw_cm]
[T79]        cma_iw_listen+0x271/0x3c0 [rdma_cm]
[T79]        rdma_listen+0x3b1/0x740 [rdma_cm]
[T79]        cma_listen_on_dev+0x46a/0x750 [rdma_cm]
[T79]        rdma_listen+0x4b0/0x740 [rdma_cm]
[T79]        ksmbd_rdma_init+0x12b/0x270 [ksmbd]
[T79]        ksmbd_conn_transport_init+0x26/0x70 [ksmbd]
[T79]        server_ctrl_handle_work+0x1e5/0x280 [ksmbd]
[T79]        process_one_work+0x86c/0x1930
[T79]        worker_thread+0x6f0/0x11f0
[T79]        kthread+0x3ec/0x8b0
[T79]        ret_from_fork+0x314/0x400
[T79]        ret_from_fork_asm+0x1a/0x30
[T79]
        other info that might help us debug this:

[T79]  Possible unsafe locking scenario:

[T79]        CPU0                    CPU1
[T79]        ----                    ----
[T79]   lock(lock#9);
[T79]                                lock(sk_lock-AF_INET);
[T79]                                lock(lock#9);
[T79]   lock(sk_lock-AF_INET);
[T79]
         *** DEADLOCK ***

[T79] 5 locks held by kworker/2:0/79:
[T79] #0: ffff88800120b158 ((wq_completion)events_long){+.+.}-{0:0},
                           at: process_one_work+0xfca/0x1930
[T79] #1: ffffc9000474fd00 ((work_completion)(&ctrl->ctrl_work))
                           {+.+.}-{0:0},
                           at: process_one_work+0x804/0x1930
[T79] #2: ffffffffc11307d0 (ctrl_lock){+.+.}-{4:4},
                           at: server_ctrl_handle_work+0x21/0x280 [ksmbd]
[T79] #3: ffffffffc11347b0 (init_lock){+.+.}-{4:4},
                           at: ksmbd_conn_transport_init+0x18/0x70 [ksmbd]
[T79] #4: ffffffffc10f7230 (lock#9){+.+.}-{4:4},
                            at: rdma_listen+0x3d2/0x740 [rdma_cm]
[T79]
        stack backtrace:
[T79] CPU: 2 UID: 0 PID: 79 Comm: kworker/2:0 Kdump: loaded
      Tainted: G           OE
      6.18.0-rc4-metze-kasan-lockdep.01+ #1 PREEMPT(voluntary)
[T79] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[T79] Hardware name: innotek GmbH VirtualBox/VirtualBox,
      BIOS VirtualBox 12/01/2006
[T79] Workqueue: events_long server_ctrl_handle_work [ksmbd]
...
[T79]  print_circular_bug+0xfd/0x130
[T79]  check_noncircular+0x150/0x170
[T79]  check_prev_add+0xf3/0xcd0
[T79]  validate_chain+0x466/0x590
[T79]  __lock_acquire+0x535/0xc30
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  lock_acquire.part.0+0xb3/0x240
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? __kasan_check_write+0x14/0x30
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? apparmor_socket_post_create+0x180/0x700
[T79]  lock_acquire+0x60/0x140
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  lock_sock_nested+0x3b/0xf0
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  sock_set_reuseaddr+0x14/0x70
[T79]  siw_create_listen+0x145/0x1540 [siw]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? local_clock_noinstr+0xe/0xd0
[T79]  ? __pfx_siw_create_listen+0x10/0x10 [siw]
[T79]  ? trace_preempt_on+0x4c/0x130
[T79]  ? __raw_spin_unlock_irqrestore+0x4a/0x90
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? preempt_count_sub+0x52/0x80
[T79]  iw_cm_listen+0x313/0x5b0 [iw_cm]
[T79]  cma_iw_listen+0x271/0x3c0 [rdma_cm]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  rdma_listen+0x3b1/0x740 [rdma_cm]
[T79]  ? _raw_spin_unlock+0x2c/0x60
[T79]  ? __pfx_rdma_listen+0x10/0x10 [rdma_cm]
[T79]  ? rdma_restrack_add+0x12c/0x630 [ib_core]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  cma_listen_on_dev+0x46a/0x750 [rdma_cm]
[T79]  rdma_listen+0x4b0/0x740 [rdma_cm]
[T79]  ? __pfx_rdma_listen+0x10/0x10 [rdma_cm]
[T79]  ? cma_get_port+0x30d/0x7d0 [rdma_cm]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm]
[T79]  ksmbd_rdma_init+0x12b/0x270 [ksmbd]
[T79]  ? __pfx_ksmbd_rdma_init+0x10/0x10 [ksmbd]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? register_netdevice_notifier+0x1dc/0x240
[T79]  ksmbd_conn_transport_init+0x26/0x70 [ksmbd]
[T79]  server_ctrl_handle_work+0x1e5/0x280 [ksmbd]
[T79]  process_one_work+0x86c/0x1930
[T79]  ? __pfx_process_one_work+0x10/0x10
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? assign_work+0x16f/0x280
[T79]  worker_thread+0x6f0/0x11f0

I was not able to reproduce this, as I was testing with various
runs switching between siw and rxe as well as IPPROTO_SMBDIRECT sockets,
while the above stack used siw with the non-IPPROTO_SMBDIRECT
patches [1].

Even if this patch doesn't solve the above, I think it's
a good idea to reclassify the sockets used by siw.
I am also sending patches to reclassify the rxe sockets, and
my IPPROTO_SMBDIRECT socket patches [1] will do the same;
this should minimize potential false positives.

[1]
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect

Cc: Bernard Metzler <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Stefan Metzmacher <[email protected]>
Link: https://patch.msgid.link/[email protected]
Acked-by: Bernard Metzler <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 4, 2025
…ockdep

While developing IPPROTO_SMBDIRECT support for the code
under fs/smb/common/smbdirect [1], I noticed false positives like this:

[+0,003927] ============================================
[+0,000532] WARNING: possible recursive locking detected
[+0,000611] 6.18.0-rc5-metze-kasan-lockdep.02+ #1 Tainted: G           OE
[+0,000835] --------------------------------------------
[+0,000729] ksmbd:r5445/3609 is trying to acquire lock:
[+0,000709] ffff88800b9570f8 (k-sk_lock-AF_INET){+.+.}-{0:0},
                              at: inet_shutdown+0x52/0x360
[+0,000831]
            but task is already holding lock:
[+0,000684] ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0},
                           at: smbdirect_sk_close+0x122/0x790 [smbdirect]
[+0,000928]
            other info that might help us debug this:
[+0,005552]  Possible unsafe locking scenario:

[+0,000723]        CPU0
[+0,000359]        ----
[+0,000377]   lock(k-sk_lock-AF_INET);
[+0,000478]   lock(k-sk_lock-AF_INET);
[+0,000498]
             *** DEADLOCK ***

[+0,001012]  May be due to missing lock nesting notation

[+0,000831] 3 locks held by ksmbd:r5445/3609:
[+0,000484]  #0: ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0},
                           at: smbdirect_sk_close+0x122/0x790 [smbdirect]
[+0,001000]  #1: ffff888020a40458 (&id_priv->handler_mutex){+.+.}-{4:4},
                           at: rdma_lock_handler+0x17/0x30 [rdma_cm]
[+0,000982]  #2: ffff888020a40350 (&id_priv->qp_mutex){+.+.}-{4:4},
                           at: rdma_destroy_qp+0x5d/0x1f0 [rdma_cm]
[+0,000934]
            stack backtrace:
[+0,000589] CPU: 0 UID: 0 PID: 3609 Comm: ksmbd:r5445 Kdump: loaded
             Tainted: G           OE
             6.18.0-rc5-metze-kasan-lockdep.02+ #1 PREEMPT(voluntary)
[+0,000023] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[+0,000004] Hardware name: innotek GmbH VirtualBox/VirtualBox,
            BIOS VirtualBox 12/01/2006
...
[+0,000010] print_deadlock_bug+0x245/0x330
[+0,000014] validate_chain+0x32a/0x590
[+0,000012] __lock_acquire+0x535/0xc30
[+0,000013] lock_acquire.part.0+0xb3/0x240
[+0,000017] ? inet_shutdown+0x52/0x360
[+0,000013] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000007] ? mark_held_locks+0x46/0x90
[+0,000012] lock_acquire+0x60/0x140
[+0,000006] ? inet_shutdown+0x52/0x360
[+0,000028] lock_sock_nested+0x3b/0xf0
[+0,000009] ? inet_shutdown+0x52/0x360
[+0,000008] inet_shutdown+0x52/0x360
[+0,000010] kernel_sock_shutdown+0x5b/0x90
[+0,000011] rxe_qp_do_cleanup+0x4ef/0x810 [rdma_rxe]
[+0,000043] ? __pfx_rxe_qp_do_cleanup+0x10/0x10 [rdma_rxe]
[+0,000030] execute_in_process_context+0x2b/0x170
[+0,000013] rxe_qp_cleanup+0x1c/0x30 [rdma_rxe]
[+0,000021] __rxe_cleanup+0x1cf/0x2e0 [rdma_rxe]
[+0,000036] ? __pfx___rxe_cleanup+0x10/0x10 [rdma_rxe]
[+0,000020] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? __kasan_check_read+0x11/0x20
[+0,000012] rxe_destroy_qp+0xe1/0x230 [rdma_rxe]
[+0,000035] ib_destroy_qp_user+0x217/0x450 [ib_core]
[+0,000074] rdma_destroy_qp+0x83/0x1f0 [rdma_cm]
[+0,000034] smbdirect_connection_destroy_qp+0x98/0x2e0 [smbdirect]
[+0,000017] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd]
[+0,000044] smbdirect_connection_destroy+0x698/0xed0 [smbdirect]
[+0,000023] ? __pfx_smbdirect_connection_destroy+0x10/0x10 [smbdirect]
[+0,000033] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd]
[+0,000031] smbdirect_connection_destroy_sync+0x42b/0x9f0 [smbdirect]
[+0,000029] ? mark_held_locks+0x46/0x90
[+0,000012] ? __pfx_smbdirect_connection_destroy_sync+0x10/0x10 [smbdirect]
[+0,000019] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000007] ? trace_hardirqs_on+0x64/0x70
[+0,000029] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000010] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? __smbdirect_connection_schedule_disconnect+0x339/0x4b0
[+0,000021] smbdirect_sk_destroy+0xb0/0x680 [smbdirect]
[+0,000024] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? trace_hardirqs_on+0x64/0x70
[+0,000006] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000005] ? __local_bh_enable_ip+0xba/0x150
[+0,000011] sk_common_release+0x66/0x340
[+0,000010] smbdirect_sk_close+0x12a/0x790 [smbdirect]
[+0,000023] ? ip_mc_drop_socket+0x1e/0x240
[+0,000013] inet_release+0x10a/0x240
[+0,000011] smbdirect_sock_release+0x502/0xe80 [smbdirect]
[+0,000015] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000024] sock_release+0x91/0x1c0
[+0,000010] smb_direct_free_transport+0x31/0x50 [ksmbd]
[+0,000025] ksmbd_conn_free+0x1d0/0x240 [ksmbd]
[+0,000040] smb_direct_disconnect+0xb2/0x120 [ksmbd]
[+0,000023] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000018] ksmbd_conn_handler_loop+0x94e/0xf10 [ksmbd]
...

I'll also add reclassification to the smbdirect socket code [1],
but I think it's better to have it in both directions
(below and above the RDMA layer).

[1]
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect

Cc: Zhu Yanjun <[email protected]>
Reviewed-by: Zhu Yanjun <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Stefan Metzmacher <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 6, 2025
When perf stat is interrupted by a signal in repeat mode, the signal is
passed to the child process, but the repeat doesn't terminate:
```
$ perf stat -v --null --repeat 10 sleep 1
Control descriptor is not initialized
[ perf stat: executing run #1 ... ]
[ perf stat: executing run #2 ... ]
^Csleep: Interrupt
[ perf stat: executing run #3 ... ]
[ perf stat: executing run #4 ... ]
[ perf stat: executing run #5 ... ]
[ perf stat: executing run #6 ... ]
[ perf stat: executing run #7 ... ]
[ perf stat: executing run #8 ... ]
[ perf stat: executing run #9 ... ]
[ perf stat: executing run #10 ... ]

 Performance counter stats for 'sleep 1' (10 runs):

            0.9500 +- 0.0512 seconds time elapsed  ( +-  5.39% )

0.01user 0.02system 0:09.53elapsed 0%CPU (0avgtext+0avgdata 18940maxresident)k
29944inputs+0outputs (0major+2629minor)pagefaults 0swaps
```

Terminate the repeated run and give a reasonable exit value:
```
$ perf stat -v --null --repeat 10 sleep 1
Control descriptor is not initialized
[ perf stat: executing run #1 ... ]
[ perf stat: executing run #2 ... ]
[ perf stat: executing run #3 ... ]
^Csleep: Interrupt

 Performance counter stats for 'sleep 1' (10 runs):

             0.680 +- 0.321 seconds time elapsed  ( +- 47.16% )

Command exited with non-zero status 130
0.00user 0.01system 0:02.05elapsed 0%CPU (0avgtext+0avgdata 70688maxresident)k
0inputs+0outputs (0major+5002minor)pagefaults 0swaps
```

Note, this also changes the exit value for non-repeat runs when
interrupted by a signal.
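
The exit status 130 in the second transcript is consistent with the common shell convention of 128 plus the signal number (SIGINT is 2). The following is only a minimal sketch of a repeat loop that stops once a run is killed by a signal; `status_to_exit_code()` and `run_repeated()` are hypothetical names for illustration, not perf's functions.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a waitpid() status to a shell-style exit code (hypothetical helper). */
static int status_to_exit_code(int status)
{
	if (WIFSIGNALED(status))
		return 128 + WTERMSIG(status);	/* e.g. SIGINT (2) -> 130 */
	if (WIFEXITED(status))
		return WEXITSTATUS(status);
	return 1;
}

/*
 * Run argv up to 'repeat' times and stop early if a run was killed by a
 * signal. Returns the exit code of the last run, or -1 if fork()/waitpid()
 * failed. The number of completed runs is stored in *runs_done.
 */
static int run_repeated(char *const argv[], int repeat, int *runs_done)
{
	int code = 0;

	assert(repeat > 0);
	*runs_done = 0;
	for (int i = 0; i < repeat; i++) {
		int status;
		pid_t pid = fork();

		if (pid < 0)
			return -1;
		if (pid == 0) {
			execvp(argv[0], argv);
			_exit(127);	/* exec failed */
		}
		if (waitpid(pid, &status, 0) < 0)
			return -1;
		(*runs_done)++;
		code = status_to_exit_code(status);
		if (WIFSIGNALED(status))
			break;		/* do not start the remaining runs */
	}
	return code;
}
```

Calling `run_repeated()` with a command that is killed by SIGINT on its first run would leave `*runs_done` at 1 and return 130, instead of executing the remaining runs.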

Reported-by: Ingo Molnar <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Thomas Richter <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 6, 2025
As Jiaming Zhang and syzbot reported, there is a potential deadlock in
f2fs, as below:

Chain exists of:
  &sbi->cp_rwsem --> fs_reclaim --> sb_internal#2

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(sb_internal#2);
                               lock(fs_reclaim);
                               lock(sb_internal#2);
  rlock(&sbi->cp_rwsem);

 *** DEADLOCK ***

3 locks held by kswapd0/73:
 #0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:7015 [inline]
 #0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0x951/0x2800 mm/vmscan.c:7389
 #1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_trylock_shared fs/super.c:562 [inline]
 #1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_cache_scan+0x91/0x4b0 fs/super.c:197
 #2: ffff888011840610 (sb_internal#2){.+.+}-{0:0}, at: f2fs_evict_inode+0x8d9/0x1b60 fs/f2fs/inode.c:890

stack backtrace:
CPU: 0 UID: 0 PID: 73 Comm: kswapd0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_circular_bug+0x2ee/0x310 kernel/locking/lockdep.c:2043
 check_noncircular+0x134/0x160 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain+0xb9b/0x2140 kernel/locking/lockdep.c:3908
 __lock_acquire+0xab9/0xd20 kernel/locking/lockdep.c:5237
 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5868
 down_read+0x46/0x2e0 kernel/locking/rwsem.c:1537
 f2fs_down_read fs/f2fs/f2fs.h:2278 [inline]
 f2fs_lock_op fs/f2fs/f2fs.h:2357 [inline]
 f2fs_do_truncate_blocks+0x21c/0x10c0 fs/f2fs/file.c:791
 f2fs_truncate_blocks+0x10a/0x300 fs/f2fs/file.c:867
 f2fs_truncate+0x489/0x7c0 fs/f2fs/file.c:925
 f2fs_evict_inode+0x9f2/0x1b60 fs/f2fs/inode.c:897
 evict+0x504/0x9c0 fs/inode.c:810
 f2fs_evict_inode+0x1dc/0x1b60 fs/f2fs/inode.c:853
 evict+0x504/0x9c0 fs/inode.c:810
 dispose_list fs/inode.c:852 [inline]
 prune_icache_sb+0x21b/0x2c0 fs/inode.c:1000
 super_cache_scan+0x39b/0x4b0 fs/super.c:224
 do_shrink_slab+0x6ef/0x1110 mm/shrinker.c:437
 shrink_slab_memcg mm/shrinker.c:550 [inline]
 shrink_slab+0x7ef/0x10d0 mm/shrinker.c:628
 shrink_one+0x28a/0x7c0 mm/vmscan.c:4955
 shrink_many mm/vmscan.c:5016 [inline]
 lru_gen_shrink_node mm/vmscan.c:5094 [inline]
 shrink_node+0x315d/0x3780 mm/vmscan.c:6081
 kswapd_shrink_node mm/vmscan.c:6941 [inline]
 balance_pgdat mm/vmscan.c:7124 [inline]
 kswapd+0x147c/0x2800 mm/vmscan.c:7389
 kthread+0x70e/0x8a0 kernel/kthread.c:463
 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

The root cause is a deadlock among four locks, as below:

kswapd
- fs_reclaim				--- Lock A
 - shrink_one
  - evict
   - f2fs_evict_inode
    - sb_start_intwrite			--- Lock B

- iput
 - evict
  - f2fs_evict_inode
   - sb_start_intwrite			--- Lock B
   - f2fs_truncate
    - f2fs_truncate_blocks
     - f2fs_do_truncate_blocks
      - f2fs_lock_op			--- Lock C

ioctl
- f2fs_ioc_commit_atomic_write
 - f2fs_lock_op				--- Lock C
  - __f2fs_commit_atomic_write
   - __replace_atomic_write_block
    - f2fs_get_dnode_of_data
     - __get_node_folio
      - f2fs_check_nid_range
       - f2fs_handle_error
        - f2fs_record_errors
         - f2fs_down_write		--- Lock D

open
- do_open
 - do_truncate
  - security_inode_need_killpriv
   - f2fs_getxattr
    - lookup_all_xattrs
     - f2fs_handle_error
      - f2fs_record_errors
       - f2fs_down_write		--- Lock D
        - f2fs_commit_super
         - read_mapping_folio
          - filemap_alloc_folio_noprof
           - prepare_alloc_pages
            - fs_reclaim_acquire	--- Lock A

To avoid this deadlock, we need to avoid grabbing sb_lock in
f2fs_handle_error(), so let's use the asynchronous method instead:
- remove f2fs_handle_error() implementation
- rename f2fs_handle_error_async() to f2fs_handle_error()
- spread f2fs_handle_error()
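
The four call chains above form the ordering A -> B -> C -> D -> A. As an illustration only (this is a toy model, not how lockdep is implemented), such a cycle can be found by recording "held -> wanted" pairs and searching for a path back to the held lock. All names below are hypothetical; LOCK_A to LOCK_D stand for the locks labelled A to D above.

```c
#include <assert.h>
#include <stdbool.h>

enum { LOCK_A, LOCK_B, LOCK_C, LOCK_D, NR_LOCKS };

/* edge[x][y] is true when some path takes lock y while holding lock x. */
static bool edge[NR_LOCKS][NR_LOCKS];

static void record_order(int held, int wanted)
{
	assert(held >= 0 && held < NR_LOCKS);
	assert(wanted >= 0 && wanted < NR_LOCKS);
	edge[held][wanted] = true;
}

/* Depth-first search: can 'target' be reached from 'from'? */
static bool reaches(int from, int target, bool *visited)
{
	if (from == target)
		return true;
	visited[from] = true;
	for (int next = 0; next < NR_LOCKS; next++)
		if (edge[from][next] && !visited[next] &&
		    reaches(next, target, visited))
			return true;
	return false;
}

/* A cycle exists if any recorded edge x -> y has a path leading back to x. */
static bool has_lock_cycle(void)
{
	for (int x = 0; x < NR_LOCKS; x++)
		for (int y = 0; y < NR_LOCKS; y++) {
			bool visited[NR_LOCKS] = { false };

			if (edge[x][y] && reaches(y, x, visited))
				return true;
		}
	return false;
}
```

With only A -> B, B -> C and C -> D recorded, no cycle is reported; recording D -> A (the f2fs_down_write path that ends in fs_reclaim_acquire) closes the loop. Dropping the D edge from the synchronous error path is what breaks the cycle in this model.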

Fixes: 95fa90c ("f2fs: support recording errors into superblock")
Cc: [email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-f2fs-devel/[email protected]
Reported-by: Jiaming Zhang <[email protected]>
Closes: https://lore.kernel.org/lkml/CANypQFa-Gy9sD-N35o3PC+FystOWkNuN8pv6S75HLT0ga-Tzgw@mail.gmail.com
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 9, 2025
Patch series "mm/hugetlb: fixes for PMD table sharing (incl.  using
mmu_gather)".

One functional fix, one performance regression fix, and two related
comment fixes.

I cleaned up my prototype I recently shared [1] for the performance fix,
deferring most of the cleanups I had in the prototype to a later point. 
While doing that I identified the other things.

The goal of this patch set is to be backported to stable trees "fairly"
easily.  At least patch #1 and #4.

Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing
Patch #2 + #3 are simple comment fixes that patch #4 interacts with.
Patch #4 is a fix for the reported performance regression due to excessive
IPI broadcasts during fork()+exit().

The last patch is all about TLB flushes, IPIs and mmu_gather.  Read:
complicated

I added as many comments and as much description as I possibly could, and
I am hoping for review from Jann.

There are plenty of cleanups in the future to be had + one reasonable
optimization on x86.  But that's all out of scope for this series.


This patch (of 4):

We switched from (wrongly) using the page count to an independent shared
count.  Now, shared page tables have a refcount of 1 (excluding
speculative references) and instead use ptdesc->pt_share_count to identify
sharing.

We didn't convert hugetlb_pmd_shared(), so right now, we would never
detect a shared PMD table as such, because sharing/unsharing no longer
touches the refcount of a PMD table.

Page migration, like mbind() or migrate_pages(), would allow for migrating
folios mapped into such shared PMD tables, even though the folios are not
exclusive.  In smaps we would account them as "private" although they are
"shared", and we would be wrongly setting PM_MMAP_EXCLUSIVE in the
pagemap interface.

Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
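
To illustrate why a refcount-based check stops working once sharing is tracked separately, here is a small user-space model. It is a sketch under assumptions: `struct toy_ptdesc` and the `toy_*` helpers are hypothetical stand-ins, not the kernel's `struct ptdesc` or `ptdesc_pmd_is_shared()`, and it assumes "shared" simply means a non-zero share count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a PMD table descriptor; not the kernel's struct ptdesc. */
struct toy_ptdesc {
	int refcount;		/* stays at 1, sharing no longer touches it */
	int pt_share_count;	/* number of additional sharers */
};

static void toy_share(struct toy_ptdesc *pt)
{
	pt->pt_share_count++;
}

static void toy_unshare(struct toy_ptdesc *pt)
{
	assert(pt->pt_share_count > 0);
	pt->pt_share_count--;
}

/* Old-style check: looks at the refcount, so it never sees sharing. */
static bool toy_shared_by_refcount(const struct toy_ptdesc *pt)
{
	return pt->refcount > 1;
}

/* New-style check: consults the dedicated share count. */
static bool toy_shared_by_share_count(const struct toy_ptdesc *pt)
{
	return pt->pt_share_count > 0;
}
```

After one `toy_share()` call on a table with `refcount == 1`, the refcount-based check still reports "not shared" while the share-count check reports "shared", which mirrors the misdetection described above.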

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/ [1]
Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: David Hildenbrand (Red Hat) <[email protected]>
Tested-by: Laurence Oberman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Cc: Liu Shixin <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prakash Sangappa <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Uschakow, Stanislav" <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 11, 2025
Jakub reported an MPTCP deadlock at fallback time:

 WARNING: possible recursive locking detected
 6.18.0-rc7-virtme #1 Not tainted
 --------------------------------------------
 mptcp_connect/20858 is trying to acquire lock:
 ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_try_fallback+0xd8/0x280

 but task is already holding lock:
 ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&msk->fallback_lock);
   lock(&msk->fallback_lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 3 locks held by mptcp_connect/20858:
  #0: ff1100001da18290 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x114/0x1bc0
  #1: ff1100001db40fd0 (k-sk_lock-AF_INET#2){+.+.}-{0:0}, at: __mptcp_retrans+0x2cb/0xaa0
  #2: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0

 stack backtrace:
 CPU: 0 UID: 0 PID: 20858 Comm: mptcp_connect Not tainted 6.18.0-rc7-virtme #1 PREEMPT(full)
 Hardware name: Bochs, BIOS Bochs 01/01/2011
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6f/0xa0
  print_deadlock_bug.cold+0xc0/0xcd
  validate_chain+0x2ff/0x5f0
  __lock_acquire+0x34c/0x740
  lock_acquire.part.0+0xbc/0x260
  _raw_spin_lock_bh+0x38/0x50
  __mptcp_try_fallback+0xd8/0x280
  mptcp_sendmsg_frag+0x16c2/0x3050
  __mptcp_retrans+0x421/0xaa0
  mptcp_release_cb+0x5aa/0xa70
  release_sock+0xab/0x1d0
  mptcp_sendmsg+0xd5b/0x1bc0
  sock_write_iter+0x281/0x4d0
  new_sync_write+0x3c5/0x6f0
  vfs_write+0x65e/0xbb0
  ksys_write+0x17e/0x200
  do_syscall_64+0xbb/0xfd0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 RIP: 0033:0x7fa5627cbc5e
 Code: 4d 89 d8 e8 14 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
 RSP: 002b:00007fff1fe14700 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fa5627cbc5e
 RDX: 0000000000001f9c RSI: 00007fff1fe16984 RDI: 0000000000000005
 RBP: 00007fff1fe14710 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff1fe16920
 R13: 0000000000002000 R14: 0000000000001f9c R15: 0000000000001f9c

The packet scheduler could attempt a reinjection after receiving an
MP_FAIL and before the infinite map has been transmitted, causing a
deadlock, since MPTCP needs to do the reinjection atomically with
respect to fallback.

Address the issue by explicitly avoiding the reinjection in the critical
scenario. Note that this is the only fallback critical section that
could potentially send packets and hit the double-lock.

Reported-by: Jakub Kicinski <[email protected]>
Closes: https://netdev-ctrl.bots.linux.dev/logs/vmksft/mptcp-dbg/results/412720/1-mptcp-join-sh/stderr
Fixes: f8a1d9b ("mptcp: make fallback action and fallback decision atomic")
Cc: [email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/20251205-net-mptcp-misc-fixes-6-19-rc1-v1-4-9e4781a6c1b8@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 11, 2025
github-actions bot pushed a commit that referenced this pull request Dec 13, 2025
Petr Machata says:

====================
selftests: forwarding: vxlan_bridge_1q_mc_ul: Fix flakiness

The net/forwarding/vxlan_bridge_1q_mc_ul selftest runs overlay traffic,
forwarded over a multicast-routed VXLAN underlay. In order to determine
whether packets reach their intended destination, it uses a TC match. For
convenience, it uses a flower match, which however does not allow matching
on the encapsulated packet. So various service traffic ends up being
indistinguishable from the test packets, and ends up confusing the test. To
alleviate the problem, the test uses sleep to allow the necessary service
traffic to run and clear the channel, before running the test traffic. This
worked for a while, but lately we have nevertheless seen flakiness of the
test in the CI.

In this patchset, first generalize tc_rule_stats_get() to support u32 in
patch #1, then in patch #2 convert the test to use u32 to allow parsing
deeper into the packet, and in #3 drop the now-unnecessary sleep.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 13, 2025
The IPv4 code path in __ip_vs_get_out_rt() calls dst_link_failure()
without ensuring skb->dev is set, leading to a NULL pointer dereference
in fib_compute_spec_dst() when ipv4_link_failure() attempts to send
ICMP destination unreachable messages.

The issue emerged after commit ed0de45 ("ipv4: recompile ip options
in ipv4_link_failure") started calling __ip_options_compile() from
ipv4_link_failure(). This code path eventually calls fib_compute_spec_dst()
which dereferences skb->dev. An attempt was made to fix the NULL skb->dev
dereference in commit 0113d9c ("ipv4: fix null-deref in
ipv4_link_failure"), but it only addressed the immediate dev_net(skb->dev)
dereference by using a fallback device. The fix was incomplete because
fib_compute_spec_dst() later in the call chain still accesses skb->dev
directly, which remains NULL when IPVS calls dst_link_failure().

The crash occurs when:
1. IPVS processes a packet in NAT mode with a misconfigured destination
2. Route lookup fails in __ip_vs_get_out_rt() before establishing a route
3. The error path calls dst_link_failure(skb) with skb->dev == NULL
4. ipv4_link_failure() → ipv4_send_dest_unreach() →
   __ip_options_compile() → fib_compute_spec_dst()
5. fib_compute_spec_dst() dereferences NULL skb->dev

Apply the same fix used for IPv6 in commit 326bf17 ("ipvs: fix
ipv6 route unreach panic"): set skb->dev from skb_dst(skb)->dev before
calling dst_link_failure().
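
The failure and the fix can be modelled with a small Python sketch. This is
an illustration only, not kernel code: the classes are hypothetical
stand-ins for struct sk_buff, struct dst_entry and struct net_device, and
the call chain is collapsed into two functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetDevice:
    name: str


@dataclass
class Dst:
    dev: NetDevice


@dataclass
class Skb:
    dev: Optional[NetDevice]
    dst: Optional[Dst]


def fib_compute_spec_dst(skb: Skb) -> str:
    # Stands in for the kernel function that dereferences skb->dev.
    if skb.dev is None:
        raise RuntimeError("NULL skb->dev dereference")
    return skb.dev.name


def dst_link_failure(skb: Skb) -> str:
    # Simplified: the real path goes through ipv4_link_failure() and
    # __ip_options_compile() before reaching fib_compute_spec_dst().
    return fib_compute_spec_dst(skb)


def report_failure_fixed(skb: Skb) -> str:
    # Models the fix described above: take the device from the attached
    # dst before reporting the link failure.
    if skb.dev is None and skb.dst is not None:
        skb.dev = skb.dst.dev
    return dst_link_failure(skb)
```

Calling dst_link_failure() on an Skb whose dev is None raises in this
model, while report_failure_fixed() fills in the device first and
completes.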

KASAN: null-ptr-deref in range [0x0000000000000328-0x000000000000032f]
CPU: 1 PID: 12732 Comm: syz.1.3469 Not tainted 6.6.114 #2
RIP: 0010:__in_dev_get_rcu include/linux/inetdevice.h:233
RIP: 0010:fib_compute_spec_dst+0x17a/0x9f0 net/ipv4/fib_frontend.c:285
Call Trace:
  <TASK>
  spec_dst_fill net/ipv4/ip_options.c:232
  spec_dst_fill net/ipv4/ip_options.c:229
  __ip_options_compile+0x13a1/0x17d0 net/ipv4/ip_options.c:330
  ipv4_send_dest_unreach net/ipv4/route.c:1252
  ipv4_link_failure+0x702/0xb80 net/ipv4/route.c:1265
  dst_link_failure include/net/dst.h:437
  __ip_vs_get_out_rt+0x15fd/0x19e0 net/netfilter/ipvs/ip_vs_xmit.c:412
  ip_vs_nat_xmit+0x1d8/0xc80 net/netfilter/ipvs/ip_vs_xmit.c:764

Fixes: ed0de45 ("ipv4: recompile ip options in ipv4_link_failure")
Signed-off-by: Slavin Liu <[email protected]>
Acked-by: Julian Anastasov <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 13, 2025
Patch series "mm/hugetlb: fixes for PMD table sharing (incl.  using
mmu_gather)".

One functional fix, one performance regression fix, and two related
comment fixes.

I cleaned up my prototype I recently shared [1] for the performance fix,
deferring most of the cleanups I had in the prototype to a later point. 
While doing that I identified the other things.

The goal of this patch set is to be backported to stable trees "fairly"
easily.  At least patch #1 and #4.

Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing
Patch #2 + #3 are simple comment fixes that patch #4 interacts with.
Patch #4 is a fix for the reported performance regression due to excessive
IPI broadcasts during fork()+exit().

The last patch is all about TLB flushes, IPIs and mmu_gather.  Read:
complicated.

I added as many comments and as much description as I possibly could, and
I am hoping for review from Jann.

There are plenty of cleanups in the future to be had + one reasonable
optimization on x86.  But that's all out of scope for this series.


This patch (of 4):

We switched from (wrongly) using the page count to an independent shared
count.  Now, shared page tables have a refcount of 1 (excluding
speculative references) and instead use ptdesc->pt_share_count to identify
sharing.

We didn't convert hugetlb_pmd_shared(), so right now, we would never
detect a shared PMD table as such, because sharing/unsharing no longer
touches the refcount of a PMD table.

Page migration, like mbind() or migrate_pages(), would allow for migrating
folios mapped into such shared PMD tables, even though the folios are not
exclusive.  In smaps we would account them as "private" although they are
"shared", and we would be wrongly setting PM_MMAP_EXCLUSIVE in the
pagemap interface.

Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
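
A toy Python model of the two checks shows why the old one can no longer
fire. The class and function names are hypothetical; the only facts taken
from the description above are that shared tables keep a refcount of 1 and
that sharing is tracked in a separate share count.

```python
from dataclasses import dataclass


@dataclass
class PmdTable:
    refcount: int = 1        # stays at 1 for shared tables
    pt_share_count: int = 0  # number of additional sharers


def share(table: PmdTable) -> None:
    # Sharing no longer touches the refcount, only the share count.
    table.pt_share_count += 1


def pmd_shared_old(table: PmdTable) -> bool:
    # Pre-fix style check based on the page count: never true any more.
    return table.refcount > 1


def pmd_shared_new(table: PmdTable) -> bool:
    # Post-fix style check based on the dedicated share count.
    return table.pt_share_count > 0
```

After share() is called on a table, pmd_shared_old() still reports it as
unshared, whereas pmd_shared_new() detects the sharing.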

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/ [1]
Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: David Hildenbrand (Red Hat) <[email protected]>
Tested-by: Laurence Oberman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Cc: Liu Shixin <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prakash Sangappa <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Uschakow, Stanislav" <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 17, 2025
After a rename exchange (either with the rename exchange operation or
regular renames in multiple non-atomic steps) of two inodes where at least
one of them is a directory, we can end up with a log tree that contains
only one of the inodes. After a power failure that can result in an attempt
to delete the other inode when it should not be deleted, because it was not
deleted before the power failure. In some cases that delete attempt fails
when the target inode is a directory that contains a subvolume inside it,
since the log replay code is not prepared to deal with directory entries
that point to root items (only inode items).

1) We have directories "dir1" (inode A) and "dir2" (inode B) under the
   same parent directory;

2) We have a file (inode C) under directory "dir1" (inode A);

3) We have a subvolume inside directory "dir2" (inode B);

4) All these inodes were persisted in a past transaction and we are
   currently at transaction N;

5) We rename the file (inode C), so at btrfs_log_new_name() we update
   inode C's last_unlink_trans to N;

6) We get a rename exchange for "dir1" (inode A) and "dir2" (inode B),
   so after the exchange "dir1" is inode B and "dir2" is inode A.
   During the rename exchange we call btrfs_log_new_name() for inodes
   A and B, but because they are directories, we don't update their
   last_unlink_trans to N;

7) An fsync against the file (inode C) is done, and because its inode
   has a last_unlink_trans with a value of N we log its parent directory
   (inode A) (through btrfs_log_all_parents(), called from
   btrfs_log_inode_parent()).

8) So we end up with inode B not logged, which now has the old name
   of inode A. At copy_inode_items_to_log(), when logging inode A, we
   did not check if we had any conflicting inode to log because inode
   A has a generation lower than the current transaction (created in
   a past transaction);

9) After a power failure, when replaying the log tree, since we find that
   inode A has a new name that conflicts with the name of inode B in the
   fs tree, we attempt to delete inode B... this is wrong since that
   directory was never deleted before the power failure, and because there
   is a subvolume inside that directory, attempting to delete it will fail
   since replay_dir_deletes() and btrfs_unlink_inode() are not prepared
   to deal with dir items that point to roots instead of inodes.

   When that happens the mount fails and we get a stack trace like the
   following:

   [87.2314] BTRFS info (device dm-0): start tree-log replay
   [87.2318] BTRFS critical (device dm-0): failed to delete reference to subvol, root 5 inode 256 parent 259
   [87.2332] ------------[ cut here ]------------
   [87.2338] BTRFS: Transaction aborted (error -2)
   [87.2346] WARNING: CPU: 1 PID: 638968 at fs/btrfs/inode.c:4345 __btrfs_unlink_inode+0x416/0x440 [btrfs]
   [87.2368] Modules linked in: btrfs loop dm_thin_pool (...)
   [87.2470] CPU: 1 UID: 0 PID: 638968 Comm: mount Tainted: G        W           6.18.0-rc7-btrfs-next-218+ #2 PREEMPT(full)
   [87.2489] Tainted: [W]=WARN
   [87.2494] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [87.2514] RIP: 0010:__btrfs_unlink_inode+0x416/0x440 [btrfs]
   [87.2538] Code: c0 89 04 24 (...)
   [87.2568] RSP: 0018:ffffc0e741f4b9b8 EFLAGS: 00010286
   [87.2574] RAX: 0000000000000000 RBX: ffff9d3ec8a6cf60 RCX: 0000000000000000
   [87.2582] RDX: 0000000000000002 RSI: ffffffff84ab45a1 RDI: 00000000ffffffff
   [87.2591] RBP: ffff9d3ec8a6ef20 R08: 0000000000000000 R09: ffffc0e741f4b840
   [87.2599] R10: ffff9d45dc1fffa8 R11: 0000000000000003 R12: ffff9d3ee26d77e0
   [87.2608] R13: ffffc0e741f4ba98 R14: ffff9d4458040800 R15: ffff9d44b6b7ca10
   [87.2618] FS:  00007f7b9603a840(0000) GS:ffff9d4658982000(0000) knlGS:0000000000000000
   [87.2629] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [87.2637] CR2: 00007ffc9ec33b98 CR3: 000000011273e003 CR4: 0000000000370ef0
   [87.2648] Call Trace:
   [87.2651]  <TASK>
   [87.2654]  btrfs_unlink_inode+0x15/0x40 [btrfs]
   [87.2661]  unlink_inode_for_log_replay+0x27/0xf0 [btrfs]
   [87.2669]  check_item_in_log+0x1ea/0x2c0 [btrfs]
   [87.2676]  replay_dir_deletes+0x16b/0x380 [btrfs]
   [87.2684]  fixup_inode_link_count+0x34b/0x370 [btrfs]
   [87.2696]  fixup_inode_link_counts+0x41/0x160 [btrfs]
   [87.2703]  btrfs_recover_log_trees+0x1ff/0x7c0 [btrfs]
   [87.2711]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [87.2719]  open_ctree+0x10bb/0x15f0 [btrfs]
   [87.2726]  btrfs_get_tree.cold+0xb/0x16c [btrfs]
   [87.2734]  ? fscontext_read+0x15c/0x180
   [87.2740]  ? rw_verify_area+0x50/0x180
   [87.2746]  vfs_get_tree+0x25/0xd0
   [87.2750]  vfs_cmd_create+0x59/0xe0
   [87.2755]  __do_sys_fsconfig+0x4f6/0x6b0
   [87.2760]  do_syscall_64+0x50/0x1220
   [87.2764]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [87.2770] RIP: 0033:0x7f7b9625f4aa
   [87.2775] Code: 73 01 c3 48 (...)
   [87.2803] RSP: 002b:00007ffc9ec35b08 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [87.2817] RAX: ffffffffffffffda RBX: 0000558bfa91ac20 RCX: 00007f7b9625f4aa
   [87.2829] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [87.2842] RBP: 0000558bfa91b120 R08: 0000000000000000 R09: 0000000000000000
   [87.2854] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [87.2864] R13: 00007f7b963f1580 R14: 00007f7b963f326c R15: 00007f7b963d8a23
   [87.2877]  </TASK>
   [87.2882] ---[ end trace 0000000000000000 ]---
   [87.2891] BTRFS: error (device dm-0 state A) in __btrfs_unlink_inode:4345: errno=-2 No such entry
   [87.2904] BTRFS: error (device dm-0 state EAO) in do_abort_log_replay:191: errno=-2 No such entry
   [87.2915] BTRFS critical (device dm-0 state EAO): log tree (for root 5) leaf currently being processed (slot 7 key (258 12 257)):
   [87.2929] BTRFS info (device dm-0 state EAO): leaf 30736384 gen 10 total ptrs 7 free space 15712 owner 18446744073709551610
   [87.2929] BTRFS info (device dm-0 state EAO): refs 3 lock_owner 0 current 638968
   [87.2929]      item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
   [87.2929]              inode generation 9 transid 10 size 0 nbytes 0
   [87.2929]              block group 0 mode 40755 links 1 uid 0 gid 0
   [87.2929]              rdev 0 sequence 7 flags 0x0
   [87.2929]              atime 1765464494.678070921
   [87.2929]              ctime 1765464494.686606513
   [87.2929]              mtime 1765464494.686606513
   [87.2929]              otime 1765464494.678070921
   [87.2929]      item 1 key (257 INODE_REF 256) itemoff 16109 itemsize 14
   [87.2929]              index 4 name_len 4
   [87.2929]      item 2 key (257 DIR_LOG_INDEX 2) itemoff 16101 itemsize 8
   [87.2929]              dir log end 2
   [87.2929]      item 3 key (257 DIR_LOG_INDEX 3) itemoff 16093 itemsize 8
   [87.2929]              dir log end 18446744073709551615
   [87.2930]      item 4 key (257 DIR_INDEX 3) itemoff 16060 itemsize 33
   [87.2930]              location key (258 1 0) type 1
   [87.2930]              transid 10 data_len 0 name_len 3
   [87.2930]      item 5 key (258 INODE_ITEM 0) itemoff 15900 itemsize 160
   [87.2930]              inode generation 9 transid 10 size 0 nbytes 0
   [87.2930]              block group 0 mode 100644 links 1 uid 0 gid 0
   [87.2930]              rdev 0 sequence 2 flags 0x0
   [87.2930]              atime 1765464494.678456467
   [87.2930]              ctime 1765464494.686606513
   [87.2930]              mtime 1765464494.678456467
   [87.2930]              otime 1765464494.678456467
   [87.2930]      item 6 key (258 INODE_REF 257) itemoff 15887 itemsize 13
   [87.2930]              index 3 name_len 3
   [87.2930] BTRFS critical (device dm-0 state EAO): log replay failed in unlink_inode_for_log_replay:1045 for root 5, stage 3, with error -2: failed to unlink inode 256 parent dir 259 name subvol root 5
   [87.2963] BTRFS: error (device dm-0 state EAO) in btrfs_recover_log_trees:7743: errno=-2 No such entry
   [87.2981] BTRFS: error (device dm-0 state EAO) in btrfs_replay_log:2083: errno=-2 No such entry (Failed to recover log tr

So fix this by changing copy_inode_items_to_log() to always detect if
there are conflicting inodes for the ref/extref of the inode being logged
even if the inode was created in a past transaction.
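
The effect of the change can be sketched with a small Python model. It is
only an illustration under stated assumptions: the function name, the
dictionary standing in for the committed fs tree, and the exact form of
the generation check are hypothetical, not the btrfs implementation.

```python
def inodes_to_log(inode, new_name, committed_names, generation, transid,
                  always_check):
    """Return the set of inodes that end up in the log tree.

    committed_names maps a name to the inode that owned it in the last
    committed transaction.
    """
    logged = {inode}
    # Assumed pre-fix condition: conflicts are only searched for when the
    # inode was created in the current transaction.
    if always_check or generation == transid:
        owner = committed_names.get(new_name)
        if owner is not None and owner != inode:
            logged.add(owner)  # the conflicting inode is logged as well
    return logged


# Scenario from the steps above: inode A (created in a past transaction)
# now has the name "dir2", which inode B owned in the committed fs tree.
committed = {"dir1": "A", "dir2": "B"}
before_fix = inodes_to_log("A", "dir2", committed, generation=9,
                           transid=10, always_check=False)
after_fix = inodes_to_log("A", "dir2", committed, generation=9,
                          transid=10, always_check=True)
```

In this model before_fix contains only inode A, which is the situation
that leads log replay to try to delete inode B, while after_fix contains
both A and B.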

A test case for fstests will follow soon.

CC: [email protected] # 6.1+
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 17, 2025
In ath12k_mac_op_link_sta_statistics(), the atomic context scope
introduced by dp_lock also covers the firmware stats request. Since that
request could block, the issue below is hit:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:575
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 6866, name: iw
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
3 locks held by iw/6866:
 #0:[...]
 #1:[...]
 #2: ffff9748f43230c8 (&dp->dp_lock){+.-.}-{3:3}, at:
ath12k_mac_op_link_sta_statistics+0xc6/0x380 [ath12k]
Preemption disabled at:
[<ffffffffc0349656>] ath12k_mac_op_link_sta_statistics+0xc6/0x380 [ath12k]
Call Trace:
 <TASK>
 show_stack
 dump_stack_lvl
 dump_stack
 __might_resched.cold
 __might_sleep
 __mutex_lock
 mutex_lock_nested
 ath12k_mac_get_fw_stats
 ath12k_mac_op_link_sta_statistics
 </TASK>

Since the firmware stats request doesn't require protection from dp_lock,
move it outside to fix this issue.

While moving, also refine that code hunk so that function parameters are
populated only when really necessary.

Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.1.c5-00302-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.115823.3

Signed-off-by: Baochen Qiang <[email protected]>
Reviewed-by: Vasanthakumar Thiagarajan <[email protected]>
Link: https://patch.msgid.link/20251119-ath12k-ng-sleep-in-atomic-v1-1-5d1a726597db@oss.qualcomm.com
Signed-off-by: Jeff Johnson <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 18, 2025
When a page is freed it coalesces with a buddy into a higher order page
for as long as possible.  When the buddy page's migrate type differs, it
is expected to be updated to match that of the page being freed.

However, only the first pageblock of the buddy page is updated, while the
rest of the pageblocks are left unchanged.

That causes warnings in later expand() and other code paths (like below),
since it introduces an inconsistency between the migrate type of the list
containing the page and the migrate types of the pageblocks the page owns.
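
A toy Python model makes the inconsistency concrete. The names and the
pageblock order are assumptions chosen for illustration; the list simply
holds one migrate type per pageblock.

```python
PAGEBLOCK_ORDER = 9  # assumed value for illustration


def pageblocks_in(order: int) -> int:
    """Number of pageblocks covered by a free page of the given order."""
    if order <= PAGEBLOCK_ORDER:
        return 1
    return 1 << (order - PAGEBLOCK_ORDER)


def set_migratetype_first_only(types, start_block, order, migratetype):
    # Models the bug: only the first pageblock of the buddy is updated.
    types[start_block] = migratetype


def set_migratetype_all(types, start_block, order, migratetype):
    # Models the fix: every pageblock the buddy spans is updated.
    for block in range(start_block, start_block + pageblocks_in(order)):
        types[block] = migratetype
```

For a buddy of order PAGEBLOCK_ORDER + 1 starting at block 0, the first
helper leaves the second pageblock with its old type, while the second
helper updates both, so the per-pageblock types agree with the free list
the merged page is placed on.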

[  308.986589] ------------[ cut here ]------------
[  308.987227] page type is 0, passed migratetype is 1 (nr=256)
[  308.987275] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:812 expand+0x23c/0x270
[  308.987293] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E)
[  308.987439] Unloaded tainted modules: hmac_s390(E):2
[  308.987650] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G            E       6.18.0-gcc-bpf-debug #431 PREEMPT
[  308.987657] Tainted: [E]=UNSIGNED_MODULE
[  308.987661] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
[  308.987666] Krnl PSW : 0404f00180000000 00000349976fa600 (expand+0x240/0x270)
[  308.987676]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[  308.987682] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88
[  308.987688]            0000000000000005 0000034980000005 000002be803ac000 0000023efe6c8300
[  308.987692]            0000000000000008 0000034998d57290 000002be00000100 0000023e00000008
[  308.987696]            0000000000000000 0000000000000000 00000349976fa5fc 000002c99b1eb6f0
[  308.987708] Krnl Code: 00000349976fa5f0: c020008a02f2	larl	%r2,000003499883abd4
                          00000349976fa5f6: c0e5ffe3f4b5	brasl	%r14,0000034997378f60
                         #00000349976fa5fc: af000000		mc	0,0
                         >00000349976fa600: a7f4ff4c		brc	15,00000349976fa498
                          00000349976fa604: b9040026		lgr	%r2,%r6
                          00000349976fa608: c0300088317f	larl	%r3,0000034998800906
                          00000349976fa60e: c0e5fffdb6e1	brasl	%r14,00000349976b13d0
                          00000349976fa614: af000000		mc	0,0
[  308.987734] Call Trace:
[  308.987738]  [<00000349976fa600>] expand+0x240/0x270
[  308.987744] ([<00000349976fa5fc>] expand+0x23c/0x270)
[  308.987749]  [<00000349976ff95e>] rmqueue_bulk+0x71e/0x940
[  308.987754]  [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0
[  308.987759]  [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40
[  308.987763]  [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0
[  308.987768]  [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400
[  308.987774]  [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220
[  308.987781]  [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0
[  308.987786]  [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0
[  308.987791]  [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240
[  308.987799]  [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210
[  308.987804]  [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500
[  308.987809]  [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0
[  308.987813]  [<000003499734d70e>] do_exception+0x1de/0x540
[  308.987822]  [<0000034998387390>] __do_pgm_check+0x130/0x220
[  308.987830]  [<000003499839a934>] pgm_check_handler+0x114/0x160
[  308.987838] 3 locks held by mempig_verify/5224:
[  308.987842]  #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0
[  308.987859]  #1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40
[  308.987871]  #2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940
[  308.987886] Last Breaking-Event-Address:
[  308.987890]  [<0000034997379096>] __warn_printk+0x136/0x140
[  308.987897] irq event stamp: 52330356
[  308.987901] hardirqs last  enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220
[  308.987907] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0
[  308.987913] softirqs last  enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530
[  308.987922] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140
[  308.987929] ---[ end trace 0000000000000000 ]---
[  308.987936] ------------[ cut here ]------------
[  308.987940] page type is 0, passed migratetype is 1 (nr=256)
[  308.987951] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:860 __del_page_from_free_list+0x1be/0x1e0
[  308.987960] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E)
[  308.988070] Unloaded tainted modules: hmac_s390(E):2
[  308.988087] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G        W   E       6.18.0-gcc-bpf-debug #431 PREEMPT
[  308.988095] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[  308.988100] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
[  308.988105] Krnl PSW : 0404f00180000000 00000349976f9e32 (__del_page_from_free_list+0x1c2/0x1e0)
[  308.988118]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[  308.988127] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88
[  308.988133]            0000000000000005 0000034980000005 0000034998d57290 0000023efe6c8300
[  308.988139]            0000000000000001 0000000000000008 000002be00000100 000002be803ac000
[  308.988144]            0000000000000000 0000000000000001 00000349976f9e2e 000002c99b1eb728
[  308.988153] Krnl Code: 00000349976f9e22: c020008a06d9	larl	%r2,000003499883abd4
                          00000349976f9e28: c0e5ffe3f89c	brasl	%r14,0000034997378f60
                         #00000349976f9e2e: af000000		mc	0,0
                         >00000349976f9e32: a7f4ff4e		brc	15,00000349976f9cce
                          00000349976f9e36: b904002b		lgr	%r2,%r11
                          00000349976f9e3a: c030008a06e7	larl	%r3,000003499883ac08
                          00000349976f9e40: c0e5fffdbac8	brasl	%r14,00000349976b13d0
                          00000349976f9e46: af000000		mc	0,0
[  308.988184] Call Trace:
[  308.988188]  [<00000349976f9e32>] __del_page_from_free_list+0x1c2/0x1e0
[  308.988195] ([<00000349976f9e2e>] __del_page_from_free_list+0x1be/0x1e0)
[  308.988202]  [<00000349976ff946>] rmqueue_bulk+0x706/0x940
[  308.988208]  [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0
[  308.988214]  [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40
[  308.988221]  [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0
[  308.988227]  [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400
[  308.988233]  [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220
[  308.988240]  [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0
[  308.988247]  [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0
[  308.988253]  [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240
[  308.988260]  [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210
[  308.988267]  [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500
[  308.988273]  [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0
[  308.988279]  [<000003499734d70e>] do_exception+0x1de/0x540
[  308.988286]  [<0000034998387390>] __do_pgm_check+0x130/0x220
[  308.988293]  [<000003499839a934>] pgm_check_handler+0x114/0x160
[  308.988300] 3 locks held by mempig_verify/5224:
[  308.988305]  #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0
[  308.988322]  #1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40
[  308.988334]  #2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940
[  308.988346] Last Breaking-Event-Address:
[  308.988350]  [<0000034997379096>] __warn_printk+0x136/0x140
[  308.988356] irq event stamp: 52330356
[  308.988360] hardirqs last  enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220
[  308.988366] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0
[  308.988373] softirqs last  enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530
[  308.988380] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140
[  308.988388] ---[ end trace 0000000000000000 ]---

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: e6cf9e1 ("mm: page_alloc: fix up block types when merging compatible blocks")
Signed-off-by: Alexander Gordeev <[email protected]>
Reported-by: Marc Hartmayer <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Marc Hartmayer <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 19, 2025
When a page is freed it coalesces with a buddy into a higher order page
for as long as that is possible.  When the buddy page's migrate type
differs, it is expected to be updated to match that of the page being
freed.

However, only the first pageblock of the buddy page is updated, while the
rest of the pageblocks are left unchanged.

That causes warnings in later expand() and other code paths (like below),
because it introduces an inconsistency between the migration type of the
free list containing the page and the migration types of the pageblocks
that the page owns.

[  308.986589] ------------[ cut here ]------------
[  308.987227] page type is 0, passed migratetype is 1 (nr=256)
[  308.987275] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:812 expand+0x23c/0x270
[  308.987293] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E)
[  308.987439] Unloaded tainted modules: hmac_s390(E):2
[  308.987650] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G            E       6.18.0-gcc-bpf-debug #431 PREEMPT
[  308.987657] Tainted: [E]=UNSIGNED_MODULE
[  308.987661] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
[  308.987666] Krnl PSW : 0404f00180000000 00000349976fa600 (expand+0x240/0x270)
[  308.987676]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[  308.987682] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88
[  308.987688]            0000000000000005 0000034980000005 000002be803ac000 0000023efe6c8300
[  308.987692]            0000000000000008 0000034998d57290 000002be00000100 0000023e00000008
[  308.987696]            0000000000000000 0000000000000000 00000349976fa5fc 000002c99b1eb6f0
[  308.987708] Krnl Code: 00000349976fa5f0: c020008a02f2	larl	%r2,000003499883abd4
                          00000349976fa5f6: c0e5ffe3f4b5	brasl	%r14,0000034997378f60
                         #00000349976fa5fc: af000000		mc	0,0
                         >00000349976fa600: a7f4ff4c		brc	15,00000349976fa498
                          00000349976fa604: b9040026		lgr	%r2,%r6
                          00000349976fa608: c0300088317f	larl	%r3,0000034998800906
                          00000349976fa60e: c0e5fffdb6e1	brasl	%r14,00000349976b13d0
                          00000349976fa614: af000000		mc	0,0
[  308.987734] Call Trace:
[  308.987738]  [<00000349976fa600>] expand+0x240/0x270
[  308.987744] ([<00000349976fa5fc>] expand+0x23c/0x270)
[  308.987749]  [<00000349976ff95e>] rmqueue_bulk+0x71e/0x940
[  308.987754]  [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0
[  308.987759]  [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40
[  308.987763]  [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0
[  308.987768]  [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400
[  308.987774]  [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220
[  308.987781]  [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0
[  308.987786]  [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0
[  308.987791]  [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240
[  308.987799]  [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210
[  308.987804]  [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500
[  308.987809]  [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0
[  308.987813]  [<000003499734d70e>] do_exception+0x1de/0x540
[  308.987822]  [<0000034998387390>] __do_pgm_check+0x130/0x220
[  308.987830]  [<000003499839a934>] pgm_check_handler+0x114/0x160
[  308.987838] 3 locks held by mempig_verify/5224:
[  308.987842]  #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0
[  308.987859]  #1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40
[  308.987871]  #2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940
[  308.987886] Last Breaking-Event-Address:
[  308.987890]  [<0000034997379096>] __warn_printk+0x136/0x140
[  308.987897] irq event stamp: 52330356
[  308.987901] hardirqs last  enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220
[  308.987907] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0
[  308.987913] softirqs last  enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530
[  308.987922] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140
[  308.987929] ---[ end trace 0000000000000000 ]---
[  308.987936] ------------[ cut here ]------------
[  308.987940] page type is 0, passed migratetype is 1 (nr=256)
[  308.987951] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:860 __del_page_from_free_list+0x1be/0x1e0
[  308.987960] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E)
[  308.988070] Unloaded tainted modules: hmac_s390(E):2
[  308.988087] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G        W   E       6.18.0-gcc-bpf-debug #431 PREEMPT
[  308.988095] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[  308.988100] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
[  308.988105] Krnl PSW : 0404f00180000000 00000349976f9e32 (__del_page_from_free_list+0x1c2/0x1e0)
[  308.988118]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[  308.988127] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88
[  308.988133]            0000000000000005 0000034980000005 0000034998d57290 0000023efe6c8300
[  308.988139]            0000000000000001 0000000000000008 000002be00000100 000002be803ac000
[  308.988144]            0000000000000000 0000000000000001 00000349976f9e2e 000002c99b1eb728
[  308.988153] Krnl Code: 00000349976f9e22: c020008a06d9	larl	%r2,000003499883abd4
                          00000349976f9e28: c0e5ffe3f89c	brasl	%r14,0000034997378f60
                         #00000349976f9e2e: af000000		mc	0,0
                         >00000349976f9e32: a7f4ff4e		brc	15,00000349976f9cce
                          00000349976f9e36: b904002b		lgr	%r2,%r11
                          00000349976f9e3a: c030008a06e7	larl	%r3,000003499883ac08
                          00000349976f9e40: c0e5fffdbac8	brasl	%r14,00000349976b13d0
                          00000349976f9e46: af000000		mc	0,0
[  308.988184] Call Trace:
[  308.988188]  [<00000349976f9e32>] __del_page_from_free_list+0x1c2/0x1e0
[  308.988195] ([<00000349976f9e2e>] __del_page_from_free_list+0x1be/0x1e0)
[  308.988202]  [<00000349976ff946>] rmqueue_bulk+0x706/0x940
[  308.988208]  [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0
[  308.988214]  [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40
[  308.988221]  [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0
[  308.988227]  [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400
[  308.988233]  [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220
[  308.988240]  [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0
[  308.988247]  [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0
[  308.988253]  [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240
[  308.988260]  [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210
[  308.988267]  [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500
[  308.988273]  [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0
[  308.988279]  [<000003499734d70e>] do_exception+0x1de/0x540
[  308.988286]  [<0000034998387390>] __do_pgm_check+0x130/0x220
[  308.988293]  [<000003499839a934>] pgm_check_handler+0x114/0x160
[  308.988300] 3 locks held by mempig_verify/5224:
[  308.988305]  #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0
[  308.988322]  #1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40
[  308.988334]  #2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940
[  308.988346] Last Breaking-Event-Address:
[  308.988350]  [<0000034997379096>] __warn_printk+0x136/0x140
[  308.988356] irq event stamp: 52330356
[  308.988360] hardirqs last  enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220
[  308.988366] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0
[  308.988373] softirqs last  enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530
[  308.988380] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140
[  308.988388] ---[ end trace 0000000000000000 ]---
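
The inconsistency behind these warnings can be modeled in userspace.  The
following is a minimal sketch, assuming a simplified layout with one
migratetype entry per pageblock.  PAGEBLOCK_ORDER, NR_PAGEBLOCKS and all
function names are hypothetical and chosen for illustration; this is not
the code in mm/page_alloc.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one migratetype entry per pageblock. */
#define PAGEBLOCK_ORDER 4	/* assumed: 16 pages per pageblock */
#define NR_PAGEBLOCKS   8

enum { MT_UNMOVABLE = 0, MT_MOVABLE = 1 };

static int pageblock_mt[NR_PAGEBLOCKS];

/* Number of pageblocks covered by a free page of the given order. */
static size_t pageblocks_in_order(unsigned int order)
{
	if (order <= PAGEBLOCK_ORDER)
		return 1;
	return (size_t)1 << (order - PAGEBLOCK_ORDER);
}

/* Behavior described as buggy: only the first pageblock is retagged. */
static void set_buddy_migratetype_first_only(size_t first_block,
					     unsigned int order, int mt)
{
	(void)order;
	pageblock_mt[first_block] = mt;
}

/* Intended behavior: every pageblock of the buddy is retagged. */
static void set_buddy_migratetype_all(size_t first_block,
				      unsigned int order, int mt)
{
	size_t n = pageblocks_in_order(order);

	assert(first_block + n <= NR_PAGEBLOCKS);
	for (size_t i = 0; i < n; i++)
		pageblock_mt[first_block + i] = mt;
}

/* Returns 1 if all pageblocks in the range carry the expected type. */
static int range_is_consistent(size_t first_block, unsigned int order, int mt)
{
	size_t n = pageblocks_in_order(order);

	for (size_t i = 0; i < n; i++)
		if (pageblock_mt[first_block + i] != mt)
			return 0;
	return 1;
}
```

In this model, retagging a buddy of order PAGEBLOCK_ORDER + 2 with the
first-only helper leaves three of its four pageblocks with the old type,
which is the kind of mismatch a later consistency check would report.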

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: e6cf9e1 ("mm: page_alloc: fix up block types when merging compatible blocks")
Signed-off-by: Alexander Gordeev <[email protected]>
Reported-by: Marc Hartmayer <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Marc Hartmayer <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 19, 2025
When running the Rust maple tree kunit tests with lockdep, you may trigger
a warning that looks like this:

	lib/maple_tree.c:780 suspicious rcu_dereference_check() usage!

	other info that might help us debug this:

	rcu_scheduler_active = 2, debug_locks = 1
	no locks held by kunit_try_catch/344.

	stack backtrace:
	CPU: 3 UID: 0 PID: 344 Comm: kunit_try_catch Tainted: G                 N  6.19.0-rc1+ #2 NONE
	Tainted: [N]=TEST
	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
	Call Trace:
	 <TASK>
	 dump_stack_lvl+0x71/0x90
	 lockdep_rcu_suspicious+0x150/0x190
	 mas_start+0x104/0x150
	 mas_find+0x179/0x240
	 _RINvNtCs5QSdWC790r4_4core3ptr13drop_in_placeINtNtCs1cdwasc6FUb_6kernel10maple_tree9MapleTreeINtNtNtBL_5alloc4kbox3BoxlNtNtB1x_9allocator7KmallocEEECsgxAQYCfdR72_25doctests_kernel_generated+0xaf/0x130
	 rust_doctest_kernel_maple_tree_rs_0+0x600/0x6b0
	 ? lock_release+0xeb/0x2a0
	 ? kunit_try_catch_run+0x210/0x210
	 kunit_try_run_case+0x74/0x160
	 ? kunit_try_catch_run+0x210/0x210
	 kunit_generic_run_threadfn_adapter+0x12/0x30
	 kthread+0x21c/0x230
	 ? __do_trace_sched_kthread_stop_ret+0x40/0x40
	 ret_from_fork+0x16c/0x270
	 ? __do_trace_sched_kthread_stop_ret+0x40/0x40
	 ret_from_fork_asm+0x11/0x20
	 </TASK>

This is because the destructor of maple tree calls mas_find() without
taking rcu_read_lock() or the spinlock.  Doing that is actually ok in this
case since the destructor has exclusive access to the entire maple tree,
but it triggers a lockdep warning.  To fix that, take the rcu read lock.

In the future, it's possible that memory reclaim could gain a feature
where it reallocates entries in maple trees even if no user code is
touching them.  If that feature is added, then this use of rcu read lock
would become load-bearing, so I did not make it conditional on lockdep.

We have to repeatedly take and release rcu because the destructor of T
might perform operations that sleep.
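
The take-and-release pattern described above can be sketched as a small
userspace model.  This is C rather than the Rust of the actual change, and
every name in it (model_read_lock(), model_find(), model_drop(),
model_destroy()) is a hypothetical stand-in, not a kernel or maple tree
API.  The model only checks two rules: lookups happen inside a read-side
section, and value destructors run outside of it.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are kernel APIs. */
static int read_lock_depth;	/* models read-side section nesting */
static int violations;		/* counts rule breaks in the model */

static void model_read_lock(void)   { read_lock_depth++; }
static void model_read_unlock(void) { read_lock_depth--; }

struct model_tree {
	int *entries;
	size_t len;
	size_t next;
};

/* A lookup must run inside a read-side section. */
static int *model_find(struct model_tree *t)
{
	if (read_lock_depth == 0)
		violations++;
	return t->next < t->len ? &t->entries[t->next++] : NULL;
}

/* The value destructor may sleep, so it must run outside the section. */
static void model_drop(int *entry)
{
	if (read_lock_depth != 0)
		violations++;
	*entry = 0;
}

/* Take and release the read lock around each lookup, drop in between. */
static size_t model_destroy(struct model_tree *t)
{
	size_t dropped = 0;

	for (;;) {
		int *entry;

		model_read_lock();
		entry = model_find(t);
		model_read_unlock();
		if (!entry)
			break;
		model_drop(entry);
		dropped++;
	}
	return dropped;
}
```

Holding the read lock across the whole loop would be simpler, but in this
model it would make model_drop() run inside the read-side section, which
is the situation the commit avoids.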

Link: https://lkml.kernel.org/r/[email protected]
Fixes: da939ef ("rust: maple_tree: add MapleTree")
Signed-off-by: Alice Ryhl <[email protected]>
Reported-by: Andreas Hindborg <[email protected]>
Closes: https://rust-for-linux.zulipchat.com/#narrow/channel/x/topic/x/near/564215108
Reviewed-by: Gary Guo <[email protected]>
Reviewed-by: Daniel Almeida <[email protected]>
Cc: Andrew Ballance <[email protected]>
Cc: Björn Roy Baron <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Trevor Gross <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 19, 2025
A race condition was found in sg_proc_debug_helper(). It was observed on
a system using an IBM LTO-9 SAS Tape Drive (ULTRIUM-TD9) and monitoring
/proc/scsi/sg/debug every second. A very large elapsed time would
sometimes appear. This is caused by two race conditions.

We reproduced the issue with an IBM ULTRIUM-HH9 tape drive on an x86_64
architecture. A patched kernel was built, and the race condition could
not be observed anymore after the application of this patch. A
reproducer C program utilising the scsi_debug module was also built by
Changhui Zhong and can be viewed here:

https://github.com/MichaelRabek/linux-tests/blob/master/drivers/scsi/sg/sg_race_trigger.c

The first race happens between the reading of hp->duration in
sg_proc_debug_helper() and request completion in sg_rq_end_io().  The
hp->duration member variable may hold either of two types of
information:

 #1 - The start time of the request. This value is present while
      the request is not yet finished.

 #2 - The total execution time of the request (end_time - start_time).

If sg_proc_debug_helper() executes *after* the value of hp->duration was
changed from #1 to #2, but *before* srp->done is set to 1 in
sg_rq_end_io(), a fresh timestamp is taken in the else branch, and the
elapsed time (value type #2) is subtracted from a timestamp, which
cannot yield a valid elapsed time (which is a type #2 value as well).

To fix this issue, the value of hp->duration must change under the
protection of the sfp->rq_list_lock in sg_rq_end_io().  Since
sg_proc_debug_helper() takes this read lock, the change to srp->done and
srp->header.duration will happen atomically from the perspective of
sg_proc_debug_helper() and the race condition is thus eliminated.
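
The first race can be illustrated with a single-threaded model that
exposes the intermediate state.  This is a sketch under assumptions: the
struct and the model_*() functions are hypothetical, and the reader logic
only mirrors the behavior described above (trust the done flag to decide
which kind of value duration holds).

```c
#include <assert.h>
#include <stdint.h>

struct model_req {
	int done;		/* set to 1 once the request has completed */
	uint32_t duration;	/* type #1 (start time) or type #2 (elapsed) */
};

/* Reader: uses `done` to decide how to interpret `duration`. */
static uint32_t model_elapsed(const struct model_req *r, uint32_t now_ms)
{
	if (r->done)
		return r->duration;
	return now_ms > r->duration ? now_ms - r->duration : 0;
}

/* Completion, step 1: the start time is replaced by the elapsed time. */
static void model_complete_step1(struct model_req *r, uint32_t now_ms)
{
	r->duration = now_ms - r->duration;
}

/* Completion, step 2: the request is marked as done. */
static void model_complete_step2(struct model_req *r)
{
	r->done = 1;
}

/*
 * With the fix, both steps happen under one lock that the reader also
 * takes, so the reader can only see the state before or after this call.
 */
static void model_complete_atomic(struct model_req *r, uint32_t now_ms)
{
	model_complete_step1(r, now_ms);
	model_complete_step2(r);
}
```

For a request started at 1000000 ms and completed 50 ms later, a reader
running between the two steps computes 1000050 - 50 = 1000000 ms instead
of 50 ms, which matches the "very large elapsed time" symptom.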

The second race condition happens between sg_proc_debug_helper() and
sg_new_write(). Even though hp->duration is set to the current time
stamp in sg_add_request() under the write lock's protection, it gets
overwritten by a call to get_sg_io_hdr(), which calls copy_from_user()
to copy struct sg_io_hdr from userspace into kernel space. hp->duration
is set to the start time again in sg_common_write(). If
sg_proc_debug_helper() is called between these two calls, an arbitrary
value set by userspace (usually zero) is used to compute the elapsed
time.

To fix this issue, hp->duration must be set to the current timestamp
again after get_sg_io_hdr() returns successfully. A small race window
still exists between get_sg_io_hdr() and setting hp->duration, but this
window is only a few instructions wide and does not result in observable
issues in practice, as confirmed by testing.
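
The second race follows the same arithmetic.  A minimal sketch, assuming
a header with only two fields; struct model_hdr and the model_*()
helpers are hypothetical names, and the plain struct copy stands in for
the copy from userspace.

```c
#include <assert.h>
#include <stdint.h>

struct model_hdr {
	uint32_t duration;	/* start time while the request is in flight */
	uint32_t timeout;
};

/* The request is added and stamped with the current time. */
static void model_add_request(struct model_hdr *hp, uint32_t now_ms)
{
	hp->duration = now_ms;
}

/* Copying the whole header from the caller overwrites `duration`. */
static void model_get_hdr(struct model_hdr *hp, const struct model_hdr *user)
{
	*hp = *user;
}

/* Fixed ordering: restore the start time right after the copy. */
static void model_get_hdr_fixed(struct model_hdr *hp,
				const struct model_hdr *user, uint32_t now_ms)
{
	model_get_hdr(hp, user);
	hp->duration = now_ms;
}

/* Reader for an in-flight request. */
static uint32_t model_inflight_elapsed(const struct model_hdr *hp,
				       uint32_t now_ms)
{
	return now_ms > hp->duration ? now_ms - hp->duration : 0;
}
```

With a caller-supplied duration of zero, a reader at 5001 ms sees an
elapsed time of 5001 ms for a request that started at 5000 ms.  With the
fixed ordering it sees 1 ms.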

Additionally, we fix the format specifier from %d to %u for printing
unsigned int values in sg_proc_debug_helper().

Signed-off-by: Michal Rábek <[email protected]>
Suggested-by: Tomas Henzl <[email protected]>
Tested-by: Changhui Zhong <[email protected]>
Reviewed-by: Ewan D. Milne <[email protected]>
Reviewed-by: John Meneghini <[email protected]>
Reviewed-by: Tomas Henzl <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 19, 2025
Patch series "mm/hugetlb: fixes for PMD table sharing (incl.  using
mmu_gather)", v2.

One functional fix, one performance regression fix, and two related
comment fixes.

The goal of this patch set is to be backported to stable trees "fairly"
easily, at least patches #1 and #4.

Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing
Patch #2 + #3 are simple comment fixes that patch #4 interacts with.
Patch #4 is a fix for the reported performance regression due to excessive
IPI broadcasts during fork()+exit().

The last patch is all about TLB flushes, IPIs and mmu_gather.
Read: complicated


This patch (of 4):

We switched from (wrongly) using the page count to an independent shared
count.  Now, shared page tables have a refcount of 1 (excluding
speculative references) and instead use ptdesc->pt_share_count to identify
sharing.

We didn't convert hugetlb_pmd_shared(), so right now, we would never
detect a shared PMD table as such, because sharing/unsharing no longer
touches the refcount of a PMD table.

Page migration, like mbind() or migrate_pages() would allow for migrating
folios mapped into such shared PMD tables, even though the folios are not
exclusive.  In smaps we would account them as "private" although they are
"shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
pagemap interface.

Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
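
The difference between the two checks can be sketched with a toy model.
The struct below is hypothetical and only borrows the field name
pt_share_count from the text above.  The two predicates are assumptions
about the shape of the old and new checks, not copies of the kernel
helpers.

```c
#include <assert.h>
#include <stdbool.h>

struct model_ptdesc {
	int refcount;		/* stays at 1 for shared tables in this model */
	int pt_share_count;	/* incremented for each additional sharer */
};

/* Sharing no longer touches the refcount, only the share count. */
static void model_share(struct model_ptdesc *pt)
{
	pt->pt_share_count++;
}

/* Assumed shape of the stale check: infer sharing from the refcount. */
static bool shared_by_refcount(const struct model_ptdesc *pt)
{
	return pt->refcount > 1;
}

/* Assumed shape of the fixed check: consult the share count. */
static bool shared_by_share_count(const struct model_ptdesc *pt)
{
	return pt->pt_share_count > 0;
}
```

After one call to model_share(), the refcount-based check still reports
"not shared" while the share-count check reports "shared", which is the
mismatch that led to the wrong migration, smaps and pagemap results.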

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: David Hildenbrand (Red Hat) <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Tested-by: Lance Yang <[email protected]>
Reviewed-by: Harry Yoo <[email protected]>
Tested-by: Laurence Oberman <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: Oscar Salvador <[email protected]>
Cc: Liu Shixin <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prakash Sangappa <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Uschakow, Stanislav" <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 19, 2025
Patch series "kallsyms: Prevent invalid access when showing module
buildid", v3.

We have seen nested crashes in __sprint_symbol(), see below.  They seem to
be caused by an invalid pointer to "buildid".  This patchset cleans up
kallsyms code related to module buildid and fixes this invalid access when
printing backtraces.

I made an audit of __sprint_symbol() and found several situations
when the buildid might be wrong:

  + bpf_address_lookup() does not set @modbuildid

  + ftrace_mod_address_lookup() does not set @modbuildid

  + __sprint_symbol() does not take rcu_read_lock and
    the related struct module might get removed before
    mod->build_id is printed.

This patchset solves these problems:

  + 1st, 2nd patches are preparatory
  + 3rd, 4th, 6th patches fix the above problems
  + 5th patch cleans up a suspicious initialization code.

This is the backtrace we have seen. But it is not really important.
The problems fixed by the patchset are obvious:

  crash64> bt [62/2029]
  PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs"
  #0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3
  #1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a
  #2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61
  #3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964
  #4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8
  #5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a
  #6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4
  #7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33
  #8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9
  ...


This patch (of 7):

The function kallsyms_lookup_buildid() initializes the given @namebuf by
clearing the first and the last byte.  It is not clear why.

The 1st byte makes sense because some callers ignore the return code and
expect that the buffer contains a valid string, for example:

  - function_stat_show()
    - kallsyms_lookup()
      - kallsyms_lookup_buildid()

The initialization of the last byte does not make much sense because it
can later be overwritten.  Fortunately, it seems that all called functions
behave correctly:

  - kallsyms_expand_symbol() explicitly adds the trailing '\0'
    at the end of the function.

  - All *__address_lookup() functions either use the safe strscpy()
    or they do not touch the buffer at all.

Document the reason for clearing the first byte.  And remove the useless
initialization of the last byte.
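
The reasoning about the two bytes can be shown with a small userspace
model.  model_lookup() and the single known address 0x1000 are
hypothetical, and snprintf() stands in for a bounded copy that always
terminates the string; this is a sketch, not the kallsyms code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup: returns 1 on a match and 0 otherwise. */
static int model_lookup(unsigned long addr, char *namebuf, size_t size)
{
	assert(size > 0);

	/*
	 * Clear the first byte so that callers which ignore the return
	 * value still find a valid (empty) string in the buffer.
	 */
	namebuf[0] = '\0';

	if (addr != 0x1000)	/* assumed: one known symbol in this model */
		return 0;

	/*
	 * A bounded copy that always NUL-terminates makes pre-clearing
	 * the last byte unnecessary.
	 */
	snprintf(namebuf, size, "%s", "model_symbol");
	return 1;
}
```

A failed lookup leaves an empty string even if the buffer held garbage
before the call, and a successful lookup into an 8-byte buffer yields
the truncated but terminated string "model_s".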

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Reviewed-by: Aaron Tomlin <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Luis Chamberalin <[email protected]>
Cc: Marc Rutland <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Petr Pavlu <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Daniel Gomez <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 20, 2025
Fix a loop scenario of ethx:egress->ethx:egress

Example setup to reproduce:
tc qdisc add dev ethx root handle 1: drr
tc filter add dev ethx parent 1: protocol ip prio 1 matchall \
         action mirred egress redirect dev ethx

Now ping out of ethx and you get a deadlock:

[  116.892898][  T307] ============================================
[  116.893182][  T307] WARNING: possible recursive locking detected
[  116.893418][  T307] 6.18.0-rc6-01205-ge05021a829b8-dirty #204 Not tainted
[  116.893682][  T307] --------------------------------------------
[  116.893926][  T307] ping/307 is trying to acquire lock:
[  116.894133][  T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[  116.894517][  T307]
[  116.894517][  T307] but task is already holding lock:
[  116.894836][  T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[  116.895252][  T307]
[  116.895252][  T307] other info that might help us debug this:
[  116.895608][  T307]  Possible unsafe locking scenario:
[  116.895608][  T307]
[  116.895901][  T307]        CPU0
[  116.896057][  T307]        ----
[  116.896200][  T307]   lock(&sch->root_lock_key);
[  116.896392][  T307]   lock(&sch->root_lock_key);
[  116.896605][  T307]
[  116.896605][  T307]  *** DEADLOCK ***
[  116.896605][  T307]
[  116.896864][  T307]  May be due to missing lock nesting notation
[  116.896864][  T307]
[  116.897123][  T307] 6 locks held by ping/307:
[  116.897302][  T307]  #0: ffff88800b4b0250 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0xb20/0x2cf0
[  116.897808][  T307]  #1: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_output+0xa9/0x600
[  116.898138][  T307]  #2: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x2c6/0x1ee0
[  116.898459][  T307]  #3: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[  116.898782][  T307]  #4: ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[  116.899132][  T307]  #5: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[  116.899442][  T307]
[  116.899442][  T307] stack backtrace:
[  116.899667][  T307] CPU: 2 UID: 0 PID: 307 Comm: ping Not tainted 6.18.0-rc6-01205-ge05021a829b8-dirty #204 PREEMPT(voluntary)
[  116.899672][  T307] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  116.899675][  T307] Call Trace:
[  116.899678][  T307]  <TASK>
[  116.899680][  T307]  dump_stack_lvl+0x6f/0xb0
[  116.899688][  T307]  print_deadlock_bug.cold+0xc0/0xdc
[  116.899695][  T307]  __lock_acquire+0x11f7/0x1be0
[  116.899704][  T307]  lock_acquire+0x162/0x300
[  116.899707][  T307]  ? __dev_queue_xmit+0x2210/0x3b50
[  116.899713][  T307]  ? srso_alias_return_thunk+0x5/0xfbef5
[  116.899717][  T307]  ? stack_trace_save+0x93/0xd0
[  116.899723][  T307]  _raw_spin_lock+0x30/0x40
[  116.899728][  T307]  ? __dev_queue_xmit+0x2210/0x3b50
[  116.899731][  T307]  __dev_queue_xmit+0x2210/0x3b50

Fixes: 178ca30 ("Revert "net/sched: Fix mirred deadlock on device recursion"")
Tested-by: Victor Nogueira <[email protected]>
Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 20, 2025
After a rename exchange of two inodes (done either with the rename exchange
operation or with regular renames in multiple non-atomic steps), where at
least one of them is a directory, we can end up with a log tree that contains
only one of the inodes. After a power failure, that can result in an attempt
to delete the other inode when it should not be deleted, because it was not
deleted before the power failure. In some cases that delete attempt fails when
the target inode is a directory that contains a subvolume inside it, since
the log replay code is not prepared to deal with directory entries that
point to root items (only inode items).

1) We have directories "dir1" (inode A) and "dir2" (inode B) under the
   same parent directory;

2) We have a file (inode C) under directory "dir1" (inode A);

3) We have a subvolume inside directory "dir2" (inode B);

4) All these inodes were persisted in a past transaction and we are
   currently at transaction N;

5) We rename the file (inode C), so at btrfs_log_new_name() we update
   inode C's last_unlink_trans to N;

6) We get a rename exchange for "dir1" (inode A) and "dir2" (inode B),
   so after the exchange "dir1" is inode B and "dir2" is inode A.
   During the rename exchange we call btrfs_log_new_name() for inodes
   A and B, but because they are directories, we don't update their
   last_unlink_trans to N;

7) An fsync against the file (inode C) is done, and because its inode
   has a last_unlink_trans with a value of N we log its parent directory
   (inode A) (through btrfs_log_all_parents(), called from
   btrfs_log_inode_parent()).

8) So we end up with inode B not logged, which now has the old name
   of inode A. At copy_inode_items_to_log(), when logging inode A, we
   did not check if we had any conflicting inode to log because inode
   A has a generation lower than the current transaction (created in
   a past transaction);

9) After a power failure, when replaying the log tree, since we find that
   inode A has a new name that conflicts with the name of inode B in the
   fs tree, we attempt to delete inode B... this is wrong since that
   directory was never deleted before the power failure, and because there
   is a subvolume inside that directory, attempting to delete it will fail
   since replay_dir_deletes() and btrfs_unlink_inode() are not prepared
   to deal with dir items that point to roots instead of inodes.

   When that happens the mount fails and we get a stack trace like the
   following:

   [87.2314] BTRFS info (device dm-0): start tree-log replay
   [87.2318] BTRFS critical (device dm-0): failed to delete reference to subvol, root 5 inode 256 parent 259
   [87.2332] ------------[ cut here ]------------
   [87.2338] BTRFS: Transaction aborted (error -2)
   [87.2346] WARNING: CPU: 1 PID: 638968 at fs/btrfs/inode.c:4345 __btrfs_unlink_inode+0x416/0x440 [btrfs]
   [87.2368] Modules linked in: btrfs loop dm_thin_pool (...)
   [87.2470] CPU: 1 UID: 0 PID: 638968 Comm: mount Tainted: G        W           6.18.0-rc7-btrfs-next-218+ #2 PREEMPT(full)
   [87.2489] Tainted: [W]=WARN
   [87.2494] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [87.2514] RIP: 0010:__btrfs_unlink_inode+0x416/0x440 [btrfs]
   [87.2538] Code: c0 89 04 24 (...)
   [87.2568] RSP: 0018:ffffc0e741f4b9b8 EFLAGS: 00010286
   [87.2574] RAX: 0000000000000000 RBX: ffff9d3ec8a6cf60 RCX: 0000000000000000
   [87.2582] RDX: 0000000000000002 RSI: ffffffff84ab45a1 RDI: 00000000ffffffff
   [87.2591] RBP: ffff9d3ec8a6ef20 R08: 0000000000000000 R09: ffffc0e741f4b840
   [87.2599] R10: ffff9d45dc1fffa8 R11: 0000000000000003 R12: ffff9d3ee26d77e0
   [87.2608] R13: ffffc0e741f4ba98 R14: ffff9d4458040800 R15: ffff9d44b6b7ca10
   [87.2618] FS:  00007f7b9603a840(0000) GS:ffff9d4658982000(0000) knlGS:0000000000000000
   [87.2629] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [87.2637] CR2: 00007ffc9ec33b98 CR3: 000000011273e003 CR4: 0000000000370ef0
   [87.2648] Call Trace:
   [87.2651]  <TASK>
   [87.2654]  btrfs_unlink_inode+0x15/0x40 [btrfs]
   [87.2661]  unlink_inode_for_log_replay+0x27/0xf0 [btrfs]
   [87.2669]  check_item_in_log+0x1ea/0x2c0 [btrfs]
   [87.2676]  replay_dir_deletes+0x16b/0x380 [btrfs]
   [87.2684]  fixup_inode_link_count+0x34b/0x370 [btrfs]
   [87.2696]  fixup_inode_link_counts+0x41/0x160 [btrfs]
   [87.2703]  btrfs_recover_log_trees+0x1ff/0x7c0 [btrfs]
   [87.2711]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [87.2719]  open_ctree+0x10bb/0x15f0 [btrfs]
   [87.2726]  btrfs_get_tree.cold+0xb/0x16c [btrfs]
   [87.2734]  ? fscontext_read+0x15c/0x180
   [87.2740]  ? rw_verify_area+0x50/0x180
   [87.2746]  vfs_get_tree+0x25/0xd0
   [87.2750]  vfs_cmd_create+0x59/0xe0
   [87.2755]  __do_sys_fsconfig+0x4f6/0x6b0
   [87.2760]  do_syscall_64+0x50/0x1220
   [87.2764]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [87.2770] RIP: 0033:0x7f7b9625f4aa
   [87.2775] Code: 73 01 c3 48 (...)
   [87.2803] RSP: 002b:00007ffc9ec35b08 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [87.2817] RAX: ffffffffffffffda RBX: 0000558bfa91ac20 RCX: 00007f7b9625f4aa
   [87.2829] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [87.2842] RBP: 0000558bfa91b120 R08: 0000000000000000 R09: 0000000000000000
   [87.2854] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [87.2864] R13: 00007f7b963f1580 R14: 00007f7b963f326c R15: 00007f7b963d8a23
   [87.2877]  </TASK>
   [87.2882] ---[ end trace 0000000000000000 ]---
   [87.2891] BTRFS: error (device dm-0 state A) in __btrfs_unlink_inode:4345: errno=-2 No such entry
   [87.2904] BTRFS: error (device dm-0 state EAO) in do_abort_log_replay:191: errno=-2 No such entry
   [87.2915] BTRFS critical (device dm-0 state EAO): log tree (for root 5) leaf currently being processed (slot 7 key (258 12 257)):
   [87.2929] BTRFS info (device dm-0 state EAO): leaf 30736384 gen 10 total ptrs 7 free space 15712 owner 18446744073709551610
   [87.2929] BTRFS info (device dm-0 state EAO): refs 3 lock_owner 0 current 638968
   [87.2929]      item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
   [87.2929]              inode generation 9 transid 10 size 0 nbytes 0
   [87.2929]              block group 0 mode 40755 links 1 uid 0 gid 0
   [87.2929]              rdev 0 sequence 7 flags 0x0
   [87.2929]              atime 1765464494.678070921
   [87.2929]              ctime 1765464494.686606513
   [87.2929]              mtime 1765464494.686606513
   [87.2929]              otime 1765464494.678070921
   [87.2929]      item 1 key (257 INODE_REF 256) itemoff 16109 itemsize 14
   [87.2929]              index 4 name_len 4
   [87.2929]      item 2 key (257 DIR_LOG_INDEX 2) itemoff 16101 itemsize 8
   [87.2929]              dir log end 2
   [87.2929]      item 3 key (257 DIR_LOG_INDEX 3) itemoff 16093 itemsize 8
   [87.2929]              dir log end 18446744073709551615
   [87.2930]      item 4 key (257 DIR_INDEX 3) itemoff 16060 itemsize 33
   [87.2930]              location key (258 1 0) type 1
   [87.2930]              transid 10 data_len 0 name_len 3
   [87.2930]      item 5 key (258 INODE_ITEM 0) itemoff 15900 itemsize 160
   [87.2930]              inode generation 9 transid 10 size 0 nbytes 0
   [87.2930]              block group 0 mode 100644 links 1 uid 0 gid 0
   [87.2930]              rdev 0 sequence 2 flags 0x0
   [87.2930]              atime 1765464494.678456467
   [87.2930]              ctime 1765464494.686606513
   [87.2930]              mtime 1765464494.678456467
   [87.2930]              otime 1765464494.678456467
   [87.2930]      item 6 key (258 INODE_REF 257) itemoff 15887 itemsize 13
   [87.2930]              index 3 name_len 3
   [87.2930] BTRFS critical (device dm-0 state EAO): log replay failed in unlink_inode_for_log_replay:1045 for root 5, stage 3, with error -2: failed to unlink inode 256 parent dir 259 name subvol root 5
   [87.2963] BTRFS: error (device dm-0 state EAO) in btrfs_recover_log_trees:7743: errno=-2 No such entry
   [87.2981] BTRFS: error (device dm-0 state EAO) in btrfs_replay_log:2083: errno=-2 No such entry (Failed to recover log tr

So fix this by changing copy_inode_items_to_log() to always detect if
there are conflicting inodes for the ref/extref of the inode being logged
even if the inode was created in a past transaction.
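
The change in behaviour can be modelled as a predicate. This is a simplified
sketch based only on the description above, not the actual code in
copy_inode_items_to_log(); the toy_* function names are hypothetical. The
values 9 and 10 in the usage note mirror the "inode generation 9 transid 10"
lines in the leaf dump.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old rule as described above: the conflict check is skipped for an inode
 * created in a past transaction (generation lower than the current one).
 */
static bool toy_check_conflicts_old(bool is_ref_key, uint64_t inode_generation,
                                    uint64_t transid)
{
    return is_ref_key && inode_generation == transid;
}

/* New rule: check every ref/extref item, regardless of the inode generation. */
static bool toy_check_conflicts_new(bool is_ref_key, uint64_t inode_generation,
                                    uint64_t transid)
{
    (void)inode_generation; /* no longer part of the decision */
    (void)transid;
    return is_ref_key;
}
```

With generation 9 and transaction 10, the old predicate skips the check for
the renamed directory and the new one performs it.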

A test case for fstests will follow soon.

CC: [email protected] # 6.1+
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
github-actions bot pushed a commit that referenced this pull request Dec 20, 2025
…ked_inode()

In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree()
while holding a path with a read locked leaf from a subvolume tree, and
btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can
trigger reclaim.

This can create a circular lock dependency which lockdep warns about with
the following splat:

   [27386.164433] ======================================================
   [27386.164574] WARNING: possible circular locking dependency detected
   [27386.164583] 6.18.0+ #4 Tainted: G     U
   [27386.164591] ------------------------------------------------------
   [27386.164599] kswapd0/117 is trying to acquire lock:
   [27386.164606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [27386.164625]
                  but task is already holding lock:
   [27386.164633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
   [27386.164646]
                  which lock already depends on the new lock.

   [27386.164657]
                  the existing dependency chain (in reverse order) is:
   [27386.164667]
                  -> #2 (fs_reclaim){+.+.}-{0:0}:
   [27386.164677]        fs_reclaim_acquire+0x9d/0xd0
   [27386.164685]        __kmalloc_cache_noprof+0x59/0x750
   [27386.164694]        btrfs_init_file_extent_tree+0x90/0x100
   [27386.164702]        btrfs_read_locked_inode+0xc3/0x6b0
   [27386.164710]        btrfs_iget+0xbb/0xf0
   [27386.164716]        btrfs_lookup_dentry+0x3c5/0x8e0
   [27386.164724]        btrfs_lookup+0x12/0x30
   [27386.164731]        lookup_open.isra.0+0x1aa/0x6a0
   [27386.164739]        path_openat+0x5f7/0xc60
   [27386.164746]        do_filp_open+0xd6/0x180
   [27386.164753]        do_sys_openat2+0x8b/0xe0
   [27386.164760]        __x64_sys_openat+0x54/0xa0
   [27386.164768]        do_syscall_64+0x97/0x3e0
   [27386.164776]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [27386.164784]
                  -> #1 (btrfs-tree-00){++++}-{3:3}:
   [27386.164794]        lock_release+0x127/0x2a0
   [27386.164801]        up_read+0x1b/0x30
   [27386.164808]        btrfs_search_slot+0x8e0/0xff0
   [27386.164817]        btrfs_lookup_inode+0x52/0xd0
   [27386.164825]        __btrfs_update_delayed_inode+0x73/0x520
   [27386.164833]        btrfs_commit_inode_delayed_inode+0x11a/0x120
   [27386.164842]        btrfs_log_inode+0x608/0x1aa0
   [27386.164849]        btrfs_log_inode_parent+0x249/0xf80
   [27386.164857]        btrfs_log_dentry_safe+0x3e/0x60
   [27386.164865]        btrfs_sync_file+0x431/0x690
   [27386.164872]        do_fsync+0x39/0x80
   [27386.164879]        __x64_sys_fsync+0x13/0x20
   [27386.164887]        do_syscall_64+0x97/0x3e0
   [27386.164894]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [27386.164903]
                  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
   [27386.164913]        __lock_acquire+0x15e9/0x2820
   [27386.164920]        lock_acquire+0xc9/0x2d0
   [27386.164927]        __mutex_lock+0xcc/0x10a0
   [27386.164934]        __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [27386.164944]        btrfs_evict_inode+0x20b/0x4b0
   [27386.164952]        evict+0x15a/0x2f0
   [27386.164958]        prune_icache_sb+0x91/0xd0
   [27386.164966]        super_cache_scan+0x150/0x1d0
   [27386.164974]        do_shrink_slab+0x155/0x6f0
   [27386.164981]        shrink_slab+0x48e/0x890
   [27386.164988]        shrink_one+0x11a/0x1f0
   [27386.164995]        shrink_node+0xbfd/0x1320
   [27386.165002]        balance_pgdat+0x67f/0xc60
   [27386.165321]        kswapd+0x1dc/0x3e0
   [27386.165643]        kthread+0xff/0x240
   [27386.165965]        ret_from_fork+0x223/0x280
   [27386.166287]        ret_from_fork_asm+0x1a/0x30
   [27386.166616]
                  other info that might help us debug this:

   [27386.167561] Chain exists of:
                    &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim

   [27386.168503]  Possible unsafe locking scenario:

   [27386.169110]        CPU0                    CPU1
   [27386.169411]        ----                    ----
   [27386.169707]   lock(fs_reclaim);
   [27386.169998]                                lock(btrfs-tree-00);
   [27386.170291]                                lock(fs_reclaim);
   [27386.170581]   lock(&delayed_node->mutex);
   [27386.170874]
                   *** DEADLOCK ***

   [27386.171716] 2 locks held by kswapd0/117:
   [27386.171999]  #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
   [27386.172294]  #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}-{3:3}, at: super_cache_scan+0x37/0x1d0
   [27386.172596]
                  stack backtrace:
   [27386.173183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G     U 6.18.0+ #4 PREEMPT(lazy)
   [27386.173185] Tainted: [U]=USER
   [27386.173186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
   [27386.173187] Call Trace:
   [27386.173187]  <TASK>
   [27386.173189]  dump_stack_lvl+0x6e/0xa0
   [27386.173192]  print_circular_bug.cold+0x17a/0x1c0
   [27386.173194]  check_noncircular+0x175/0x190
   [27386.173197]  __lock_acquire+0x15e9/0x2820
   [27386.173200]  lock_acquire+0xc9/0x2d0
   [27386.173201]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [27386.173204]  __mutex_lock+0xcc/0x10a0
   [27386.173206]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [27386.173208]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [27386.173211]  ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [27386.173213]  __btrfs_release_delayed_node.part.0+0x39/0x2f0
   [27386.173215]  btrfs_evict_inode+0x20b/0x4b0
   [27386.173217]  ? lock_acquire+0xc9/0x2d0
   [27386.173220]  evict+0x15a/0x2f0
   [27386.173222]  prune_icache_sb+0x91/0xd0
   [27386.173224]  super_cache_scan+0x150/0x1d0
   [27386.173226]  do_shrink_slab+0x155/0x6f0
   [27386.173228]  shrink_slab+0x48e/0x890
   [27386.173229]  ? shrink_slab+0x2d2/0x890
   [27386.173231]  shrink_one+0x11a/0x1f0
   [27386.173234]  shrink_node+0xbfd/0x1320
   [27386.173236]  ? shrink_node+0xa2d/0x1320
   [27386.173236]  ? shrink_node+0xbd3/0x1320
   [27386.173239]  ? balance_pgdat+0x67f/0xc60
   [27386.173239]  balance_pgdat+0x67f/0xc60
   [27386.173241]  ? finish_task_switch.isra.0+0xc4/0x2a0
   [27386.173246]  kswapd+0x1dc/0x3e0
   [27386.173247]  ? __pfx_autoremove_wake_function+0x10/0x10
   [27386.173249]  ? __pfx_kswapd+0x10/0x10
   [27386.173250]  kthread+0xff/0x240
   [27386.173251]  ? __pfx_kthread+0x10/0x10
   [27386.173253]  ret_from_fork+0x223/0x280
   [27386.173255]  ? __pfx_kthread+0x10/0x10
   [27386.173257]  ret_from_fork_asm+0x1a/0x30
   [27386.173260]  </TASK>
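
The chain in the splat can be read as a small dependency graph:
&delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim, with kswapd holding
fs_reclaim and asking for &delayed_node->mutex. The sketch below is a toy
cycle check over that graph. It only illustrates the idea and is not how
lockdep is implemented; the enum and function names are hypothetical.

```c
#include <stdbool.h>

enum { DELAYED_NODE_MUTEX, BTRFS_TREE, FS_RECLAIM, NR_LOCKS };

/* edge[a][b] == true means "b was acquired while a was held". */
static bool toy_reaches(bool edge[NR_LOCKS][NR_LOCKS], int from, int to,
                        bool visited[NR_LOCKS])
{
    if (from == to)
        return true;
    visited[from] = true;
    for (int next = 0; next < NR_LOCKS; next++)
        if (edge[from][next] && !visited[next] &&
            toy_reaches(edge, next, to, visited))
            return true;
    return false;
}

/* Taking "want" while holding "held" closes a cycle if want already leads to held. */
static bool toy_would_deadlock(bool edge[NR_LOCKS][NR_LOCKS], int held, int want)
{
    bool visited[NR_LOCKS] = { false };

    return toy_reaches(edge, want, held, visited);
}
```

Recording the two edges from the chain and then asking for
DELAYED_NODE_MUTEX while holding FS_RECLAIM reports a cycle, which matches
the scenario printed above.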

This is because:

1) The fsync task is holding an inode's delayed node mutex (for a
   directory) while calling __btrfs_update_delayed_inode() and that needs
   to do a search on the subvolume's btree (therefore read lock some
   extent buffers);

2) The lookup task, at btrfs_lookup(), triggered reclaim with the
   GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while
   holding a read lock on a subvolume leaf;

3) The reclaim triggered kswapd which is doing inode eviction for the
   directory inode the fsync task is using as an argument to
   btrfs_commit_inode_delayed_inode() - but in that call chain we are
   trying to read lock the same leaf that the lookup task is holding
   while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL
   allocation.

Fix this by calling btrfs_init_file_extent_tree() in
btrfs_read_locked_inode() only after we no longer need the path and have
released it.
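
The ordering the fix relies on can be sketched in userspace as follows. The
toy_* types and functions are hypothetical stand-ins, not the btrfs
implementation, and malloc() stands in for the GFP_KERNEL allocation. The
assertion encodes the rule that the allocation must not run while a leaf is
still read locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a btree path and an inode. */
struct toy_path  { bool leaf_read_locked; };
struct toy_inode { void *file_extent_tree; };

/* Models an allocation that may enter reclaim, so no leaf lock may be held. */
static int toy_init_file_extent_tree(struct toy_inode *inode,
                                     const struct toy_path *path)
{
    assert(!path->leaf_read_locked);
    inode->file_extent_tree = malloc(64);
    return inode->file_extent_tree ? 0 : -1;
}

static int toy_read_locked_inode(struct toy_inode *inode, struct toy_path *path)
{
    path->leaf_read_locked = true;  /* the search found the inode item */
    /* ... copy what is needed out of the leaf ... */
    path->leaf_read_locked = false; /* release the path first */
    return toy_init_file_extent_tree(inode, path); /* then allocate */
}
```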

Reported-by: Thomas Hellström <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: 8679d26 ("btrfs: initialize inode::file_extent_tree after i_mode has been set")
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>