-
Notifications
You must be signed in to change notification settings - Fork 10
Test March 26 #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Now that scrub can fix data errors with the help of parity for raid56 profile, repair during read is able to as well. Although the mirror num in raid56 senario has different meanings, i.e. 0 or 1: read data directly > 1: do recover with parity, it could be fit into how we repair bad block during read. The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the device and position on it. Cc: David Sterba <[email protected]> Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Chris Mason <[email protected]>
In raid56 senario, after trying parity recovery, we didn't set mirror_num for btrfs_bio with failed mirror_num, hence end_bio_extent_readpage() will report a random mirror_num in dmesg log. Cc: David Sterba <[email protected]> Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Chris Mason <[email protected]>
Commit 20a7db8 ("btrfs: add dummy callback for readpage_io_failed and drop checks") made a cleanup around readpage_io_failed_hook, and it was supposed to keep the original sematics, but it also unexpectedly disabled repair during read for dup, raid1 and raid10. This fixes the problem by letting data's inode call the generic readpage_io_failed callback by returning -EAGAIN from its readpage_io_failed_hook in order to notify end_bio_extent_readpage to do the rest. We don't call it directly because the generic one takes an offset from end_bio_extent_readpage() to calculate the index in the checksum array and inode's readpage_io_failed_hook doesn't offer that offset. Cc: David Sterba <[email protected]> Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Chris Mason <[email protected]>
Otherwise, we may later skip this page when repairing bad copy from good copy. Cc: David Sterba <[email protected]> Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Chris Mason <[email protected]>
masoncl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed-by: Chris Mason [email protected]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Individual commit comment
masoncl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed-by: Chris Mason [email protected]
masoncl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on single commit
Commit 4f350c6 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls in L1) null pointer deference and also L0 calltrace when EPT=0 on both L0 and L1. In L1: BUG: unable to handle kernel paging request at ffffffffc015bf8f IP: vmx_vcpu_run+0x202/0x510 [kvm_intel] PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161 Oops: 0003 [#1] PREEMPT SMP CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6 RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel] Call Trace: WARNING: kernel stack frame pointer at ffffb86f4988bc18 in qemu-system-x86:1798 has bad value 0000000000000002 In L0: -----------[ cut here ]------------ WARNING: CPU: 6 PID: 4460 at /home/kernel/linux/arch/x86/kvm//vmx.c:9845 vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel] CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G OE 4.14.0-rc7+ #25 RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel] Call Trace: paging64_page_fault+0x500/0xde0 [kvm] ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm] ? nonpaging_page_fault+0x3b0/0x3b0 [kvm] ? __asan_storeN+0x12/0x20 ? paging64_gva_to_gpa+0xb0/0x120 [kvm] ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm] ? lock_acquire+0x2c0/0x2c0 ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel] ? vmx_get_segment+0x2a6/0x310 [kvm_intel] ? sched_clock+0x1f/0x30 ? check_chain_key+0x137/0x1e0 ? __lock_acquire+0x83c/0x2420 ? kvm_multiple_exception+0xf2/0x220 [kvm] ? debug_check_no_locks_freed+0x240/0x240 ? debug_smp_processor_id+0x17/0x20 ? __lock_is_held+0x9e/0x100 kvm_mmu_page_fault+0x90/0x180 [kvm] kvm_handle_page_fault+0x15c/0x310 [kvm] ? __lock_is_held+0x9e/0x100 handle_exception+0x3c7/0x4d0 [kvm_intel] vmx_handle_exit+0x103/0x1010 [kvm_intel] ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm] The commit avoids to load host state of vmcs12 as vmcs01's guest state since vmcs12 is not modified (except for the VM-instruction error field) if the checking of vmcs control area fails. However, the mmu context is switched to nested mmu in prepare_vmcs02() and it will not be reloaded since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME fails. This patch fixes it by reloading mmu context when nested VMLAUNCH/VMRESUME fails. Reviewed-by: Jim Mattson <[email protected]> Reviewed-by: Krish Sadhukhan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Jim Mattson <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
Commit 9d643f6 ("KVM: x86: avoid large stack allocations in em_fxrstor") optimize the stack size, but introduced a guest memory access which might sleep while in atomic. Fix it by introducing, again, a second fxregs_state. Try to avoid large stacks by using noinline. Add some helpful comments. Reported by syzbot: in_atomic(): 1, irqs_disabled(): 0, pid: 2909, name: syzkaller879109 2 locks held by syzkaller879109/2909: #0: (&vcpu->mutex){+.+.}, at: [<ffffffff8106222c>] vcpu_load+0x1c/0x70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:154 #1: (&kvm->srcu){....}, at: [<ffffffff810dd162>] vcpu_enter_guest arch/x86/kvm/x86.c:6983 [inline] #1: (&kvm->srcu){....}, at: [<ffffffff810dd162>] vcpu_run arch/x86/kvm/x86.c:7061 [inline] #1: (&kvm->srcu){....}, at: [<ffffffff810dd162>] kvm_arch_vcpu_ioctl_run+0x1bc2/0x58b0 arch/x86/kvm/x86.c:7222 CPU: 1 PID: 2909 Comm: syzkaller879109 Not tainted 4.13.0-rc4-next-20170811 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 ___might_sleep+0x2b2/0x470 kernel/sched/core.c:6014 __might_sleep+0x95/0x190 kernel/sched/core.c:5967 __might_fault+0xab/0x1d0 mm/memory.c:4383 __copy_from_user include/linux/uaccess.h:71 [inline] __kvm_read_guest_page+0x58/0xa0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1771 kvm_vcpu_read_guest_page+0x44/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1791 kvm_read_guest_virt_helper+0x76/0x140 arch/x86/kvm/x86.c:4407 kvm_read_guest_virt_system+0x3c/0x50 arch/x86/kvm/x86.c:4466 segmented_read_std+0x10c/0x180 arch/x86/kvm/emulate.c:819 em_fxrstor+0x27b/0x410 arch/x86/kvm/emulate.c:4022 x86_emulate_insn+0x55d/0x3c50 arch/x86/kvm/emulate.c:5471 x86_emulate_instruction+0x411/0x1ca0 arch/x86/kvm/x86.c:5698 kvm_mmu_page_fault+0x18b/0x2c0 arch/x86/kvm/mmu.c:4854 handle_ept_violation+0x1fc/0x5e0 arch/x86/kvm/vmx.c:6400 vmx_handle_exit+0x281/0x1ab0 arch/x86/kvm/vmx.c:8718 vcpu_enter_guest arch/x86/kvm/x86.c:6999 [inline] vcpu_run arch/x86/kvm/x86.c:7061 [inline] kvm_arch_vcpu_ioctl_run+0x1cee/0x58b0 arch/x86/kvm/x86.c:7222 kvm_vcpu_ioctl+0x64c/0x1010 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2591 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x437fc9 RSP: 002b:00007ffc7b4d5ab8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000437fc9 RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005 RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000020ae8000 R10: 0000000000009120 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000004 R14: 0000000000000004 R15: 0000000020077000 Fixes: 9d643f6 ("KVM: x86: avoid large stack allocations in em_fxrstor") Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
dma-debug reports the following warning: WARNING: CPU: 3 PID: 298 at kernel-4.4/lib/dma-debug.c:604 debug _dma_assert_idle+0x1a8/0x230() DMA-API: cpu touching an active dma mapped cacheline [cln=0x00000882300] CPU: 3 PID: 298 Comm: vold Tainted: G W O 4.4.22+ #1 Hardware name: MT6739 (DT) Call trace: debug_dma_assert_idle+0x1a8/0x230 wp_page_copy.isra.96+0x118/0x520 do_wp_page+0x4fc/0x534 handle_mm_fault+0xd4c/0x1310 do_page_fault+0x1c8/0x394 do_mem_abort+0x50/0xec I found that debug_dma_alloc_coherent() and debug_dma_free_coherent() assume that dma_alloc_coherent() always returns a linear address. However it's possible that dma_alloc_coherent() returns a non-linear address. In this case, page_to_pfn(virt_to_page(virt)) will return an incorrect pfn. If the pfn is valid and mapped as a COW page, we will hit the warning when doing wp_page_copy(). Fix this by calculating pfn for linear and non-linear addresses. [[email protected]: v4] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Reviewed-by: Robin Murphy <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Commit 21cc643 ("drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM") tried to fixup the check_flush_dependency warning for hitting i915_gem_userptr_mn_invalidate_range_start from within the shrinker, but I failed to notice userptr has 2 similarly named workqueues. I marked up i915-userptr-acquire as WQ_MEM_RECLAIM whereas we only wait upon i915-userptr-release from inside the reclaim paths. [62530.869510] workqueue: PF_MEMALLOC task 7983(gem_shrink) is flushing !WQ_MEM_RECLAIM i915-userptr-release: (null) [62530.869515] ------------[ cut here ]------------ [62530.869519] WARNING: CPU: 1 PID: 7983 at kernel/workqueue.c:2434 check_flush_dependency+0x7f/0x110 [62530.869519] Modules linked in: pegasus mii ip6table_filter ip6_tables bnep iptable_filter snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic binfmt_misc nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm irqbypass snd_seq_midi snd_seq_midi_event snd_rawmidi crct10dif_pclmul crc32_pclmul 8250_dw ghash_clmulni_intel snd_seq pcbc snd_seq_device snd_timer btusb aesni_intel btrtl btbcm aes_x86_64 iwlwifi btintel crypto_simd glue_helper cryptd bluetooth snd intel_cstate input_leds idma64 intel_rapl_perf ecdh_generic serio_raw soundcore cfg80211 wmi_bmof virt_dma intel_lpss_pci intel_lpss acpi_als kfifo_buf industrialio winbond_cir soc_button_array rc_core spidev tpm_crb intel_hid acpi_pad mac_hid sparse_keymap [62530.869546] parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid i915 i2c_algo_bit prime_numbers drm_kms_helper syscopyarea e1000e sysfillrect sysimgblt fb_sys_fops ahci ptp pps_core libahci drm wmi video i2c_hid hid [62530.869557] CPU: 1 PID: 7983 Comm: gem_shrink Tainted: G U W L 4.14.0-rc8-drm-tip-ww45-commit-1342299+ #1 [62530.869558] Hardware name: Intel Corporation CoffeeLake Client Platform/CoffeeLake H DDR4 RVP, BIOS CNLSFWR1.R00.X098.A00.1707301945 07/30/2017 [62530.869559] task: ffffa1049dbeec80 task.stack: ffffae7d05c44000 [62530.869560] RIP: 0010:check_flush_dependency+0x7f/0x110 [62530.869561] RSP: 0018:ffffae7d05c473a0 EFLAGS: 00010286 [62530.869562] RAX: 000000000000006e RBX: ffffa1049540f400 RCX: ffffffffa3e55788 [62530.869562] RDX: 0000000000000000 RSI: 0000000000000092 RDI: 0000000000000202 [62530.869563] RBP: ffffae7d05c473c0 R08: 000000000000006e R09: 000000000038bb0e [62530.869563] R10: 0000000000000000 R11: 000000000000006e R12: ffffa1049dbeec80 [62530.869564] R13: 0000000000000000 R14: 0000000000000000 R15: ffffae7d05c473e0 [62530.869565] FS: 00007f621b129880(0000) GS:ffffa1050b240000(0000) knlGS:0000000000000000 [62530.869566] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [62530.869566] CR2: 00007f6214400000 CR3: 0000000353a17003 CR4: 00000000003606e0 [62530.869567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [62530.869567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [62530.869568] Call Trace: [62530.869570] flush_workqueue+0x115/0x3d0 [62530.869573] ? wake_up_process+0x15/0x20 [62530.869596] i915_gem_userptr_mn_invalidate_range_start+0x12f/0x160 [i915] [62530.869614] ? i915_gem_userptr_mn_invalidate_range_start+0x12f/0x160 [i915] [62530.869616] __mmu_notifier_invalidate_range_start+0x55/0x80 [62530.869618] try_to_unmap_one+0x791/0x8b0 [62530.869620] ? call_rwsem_down_read_failed+0x18/0x30 [62530.869622] rmap_walk_anon+0x10b/0x260 [62530.869624] rmap_walk+0x48/0x60 [62530.869625] try_to_unmap+0x93/0xf0 [62530.869626] ? page_remove_rmap+0x2a0/0x2a0 [62530.869627] ? page_not_mapped+0x20/0x20 [62530.869629] ? page_get_anon_vma+0x90/0x90 [62530.869630] ? invalid_mkclean_vma+0x20/0x20 [62530.869631] migrate_pages+0x946/0xaa0 [62530.869633] ? __ClearPageMovable+0x10/0x10 [62530.869635] ? isolate_freepages_block+0x3c0/0x3c0 [62530.869636] compact_zone+0x22f/0x970 [62530.869638] compact_zone_order+0xa3/0xd0 [62530.869640] try_to_compact_pages+0x1a5/0x2a0 [62530.869641] ? try_to_compact_pages+0x1a5/0x2a0 [62530.869643] __alloc_pages_direct_compact+0x50/0x110 [62530.869644] __alloc_pages_slowpath+0x4da/0xf30 [62530.869646] __alloc_pages_nodemask+0x262/0x280 [62530.869648] alloc_pages_vma+0x165/0x1e0 [62530.869649] shmem_alloc_hugepage+0xd0/0x130 [62530.869651] ? __radix_tree_insert+0x45/0x230 [62530.869652] ? __vm_enough_memory+0x29/0x130 [62530.869654] shmem_alloc_and_acct_page+0x10d/0x1e0 [62530.869655] shmem_getpage_gfp+0x426/0xc00 [62530.869657] shmem_fault+0xa0/0x1e0 [62530.869659] ? file_update_time+0x60/0x110 [62530.869660] __do_fault+0x1e/0xc0 [62530.869661] __handle_mm_fault+0xa35/0x1170 [62530.869662] handle_mm_fault+0xcc/0x1c0 [62530.869664] __do_page_fault+0x262/0x4f0 [62530.869666] do_page_fault+0x2e/0xe0 [62530.869667] page_fault+0x22/0x30 [62530.869668] RIP: 0033:0x404335 [62530.869669] RSP: 002b:00007fff7829e420 EFLAGS: 00010216 [62530.869670] RAX: 00007f6210400000 RBX: 0000000000000004 RCX: 0000000000b80000 [62530.869670] RDX: 0000000000002e01 RSI: 0000000000008000 RDI: 0000000000000004 [62530.869671] RBP: 0000000000000019 R08: 0000000000000002 R09: 0000000000000000 [62530.869671] R10: 0000000000000559 R11: 0000000000000246 R12: 0000000008000000 [62530.869672] R13: 00000000004042f0 R14: 0000000000000004 R15: 000000000000007e [62530.869673] Code: 00 8b b0 18 05 00 00 48 8d 8b b0 00 00 00 48 8d 90 c0 06 00 00 4d 89 f0 48 c7 c7 40 c0 c8 a3 c6 05 68 c5 e8 00 01 e8 c2 68 04 00 <0f> ff 4d 85 ed 74 18 49 8b 45 20 48 8b 70 08 8b 86 00 01 00 00 [62530.869691] ---[ end trace 01e01ad0ff5781f8 ]--- Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103739 Fixes: 21cc643 ("drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM") Signed-off-by: Chris Wilson <[email protected]> Cc: Michał Winiarski <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Reviewed-by: Matthew Auld <[email protected]> (cherry picked from commit 41729bf) Signed-off-by: Jani Nikula <[email protected]>
Use mutex_lock_nested to provide lockdep the parent child lock ordering of the tree. This fixes the lockdep Warning [ 305.275177] ============================================ [ 305.275178] WARNING: possible recursive locking detected [ 305.275179] 4.14.0-rc7+ #320 Not tainted [ 305.275180] -------------------------------------------- [ 305.275181] apparmor_parser/1339 is trying to acquire lock: [ 305.275182] (&ns->lock){+.+.}, at: [<ffffffff970544dd>] __aa_create_ns+0x6d/0x1e0 [ 305.275187] but task is already holding lock: [ 305.275187] (&ns->lock){+.+.}, at: [<ffffffff97054b5d>] aa_prepare_ns+0x3d/0xd0 [ 305.275190] other info that might help us debug this: [ 305.275191] Possible unsafe locking scenario: [ 305.275192] CPU0 [ 305.275193] ---- [ 305.275193] lock(&ns->lock); [ 305.275194] lock(&ns->lock); [ 305.275195] *** DEADLOCK *** [ 305.275196] May be due to missing lock nesting notation [ 305.275198] 2 locks held by apparmor_parser/1339: [ 305.275198] #0: (sb_writers#10){.+.+}, at: [<ffffffff96e9c6b7>] vfs_write+0x1a7/0x1d0 [ 305.275202] #1: (&ns->lock){+.+.}, at: [<ffffffff97054b5d>] aa_prepare_ns+0x3d/0xd0 [ 305.275205] stack backtrace: [ 305.275207] CPU: 1 PID: 1339 Comm: apparmor_parser Not tainted 4.14.0-rc7+ #320 [ 305.275208] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 [ 305.275209] Call Trace: [ 305.275212] dump_stack+0x85/0xcb [ 305.275214] __lock_acquire+0x141c/0x1460 [ 305.275216] ? __aa_create_ns+0x6d/0x1e0 [ 305.275218] ? ___slab_alloc+0x183/0x540 [ 305.275219] ? ___slab_alloc+0x183/0x540 [ 305.275221] lock_acquire+0xed/0x1e0 [ 305.275223] ? lock_acquire+0xed/0x1e0 [ 305.275224] ? __aa_create_ns+0x6d/0x1e0 [ 305.275227] __mutex_lock+0x89/0x920 [ 305.275228] ? __aa_create_ns+0x6d/0x1e0 [ 305.275230] ? trace_hardirqs_on_caller+0x11f/0x190 [ 305.275231] ? __aa_create_ns+0x6d/0x1e0 [ 305.275233] ? __lockdep_init_map+0x57/0x1d0 [ 305.275234] ? lockdep_init_map+0x9/0x10 [ 305.275236] ? __rwlock_init+0x32/0x60 [ 305.275238] mutex_lock_nested+0x1b/0x20 [ 305.275240] ? mutex_lock_nested+0x1b/0x20 [ 305.275241] __aa_create_ns+0x6d/0x1e0 [ 305.275243] aa_prepare_ns+0xc2/0xd0 [ 305.275245] aa_replace_profiles+0x168/0xf30 [ 305.275247] ? __might_fault+0x85/0x90 [ 305.275250] policy_update+0xb9/0x380 [ 305.275252] profile_load+0x7e/0x90 [ 305.275254] __vfs_write+0x28/0x150 [ 305.275256] ? rcu_read_lock_sched_held+0x72/0x80 [ 305.275257] ? rcu_sync_lockdep_assert+0x2f/0x60 [ 305.275259] ? __sb_start_write+0xdc/0x1c0 [ 305.275261] ? vfs_write+0x1a7/0x1d0 [ 305.275262] vfs_write+0xca/0x1d0 [ 305.275264] ? trace_hardirqs_on_caller+0x11f/0x190 [ 305.275266] SyS_write+0x49/0xa0 [ 305.275268] entry_SYSCALL_64_fastpath+0x23/0xc2 [ 305.275271] RIP: 0033:0x7fa6b22e8c74 [ 305.275272] RSP: 002b:00007ffeaaee6288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 305.275273] RAX: ffffffffffffffda RBX: 00007ffeaaee62a4 RCX: 00007fa6b22e8c74 [ 305.275274] RDX: 0000000000000a51 RSI: 00005566a8198c10 RDI: 0000000000000004 [ 305.275275] RBP: 0000000000000a39 R08: 0000000000000a51 R09: 0000000000000000 [ 305.275276] R10: 0000000000000000 R11: 0000000000000246 R12: 00005566a8198c10 [ 305.275277] R13: 0000000000000004 R14: 00005566a72ecb88 R15: 00005566a72ec3a8 Fixes: 73688d1 ("apparmor: refactor prepare_ns() and make usable from different views") Signed-off-by: John Johansen <[email protected]>
After starting VNC server or running CTS test, kernel will hang and can see below call trace: [961816] INFO: task khugepaged:42 blocked for more than 120 seconds. [968581] Tainted: G OE 4.13.0 #1 [973495] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [980962] khugepaged D 0 42 2 0x00000000 [980967] Call Trace: [980977] __schedule+0x28d/0x890 [980982] schedule+0x36/0x80 [980986] rwsem_down_read_failed+0x139/0x1c0 [980991] ? update_curr+0x100/0x1c0 [981004] call_rwsem_down_read_failed+0x18/0x30 [981007] down_read+0x20/0x40 [981012] khugepaged_scan_mm_slot+0x78/0x1ac0 [981018] ? __switch_to+0x23e/0x4a0 [981022] ? finish_task_switch+0x79/0x240 [981026] khugepaged+0x146/0x480 [981031] ? remove_wait_queue+0x60/0x60 [981035] kthread+0x109/0x140 [981037] ? khugepaged_scan_mm_slot+0x1ac0/0x1ac0 [981039] ? kthread_park+0x60/0x60 [981044] ret_from_fork+0x25/0x30 After checking code and found 'commit b72cf4f ("drm/amdgpu: move taking mmap_sem into get_user_pages v2")' forget to drop one case of up_read. Signed-off-by: Xiangliang.Yu <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
The commit below introduced thp support for ttm allocations, however it didn't take into account the case where dma32 was requested. Some drivers always request dma32, and the bochs driver is one of those. This fixes an oops: [ 30.108507] ------------[ cut here ]------------ [ 30.108920] kernel BUG at ./include/linux/gfp.h:408! [ 30.109356] invalid opcode: 0000 [#1] SMP [ 30.109700] Modules linked in: fuse nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack devlink ip_set nfnetlink ebtable_nat ebtable_broute bridge ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_generic kvm_intel kvm snd_hda_intel snd_hda_codec irqbypass ppdev snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm bochs_drm ttm joydev drm_kms_helper virtio_balloon snd_timer snd parport_pc drm soundcore parport i2c_piix4 nls_utf8 isofs squashfs zstd_decompress xxhash 8021q garp mrp stp llc virtio_net [ 30.115605] virtio_console virtio_scsi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio_ring virtio ata_generic pata_acpi qemu_fw_cfg sunrpc scsi_transport_iscsi loop [ 30.117425] CPU: 0 PID: 1347 Comm: gnome-shell Not tainted 4.15.0-0.rc0.git6.1.fc28.x86_64 #1 [ 30.118141] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 [ 30.118866] task: ffff923a77e03380 task.stack: ffffa78182228000 [ 30.119366] RIP: 0010:__alloc_pages_nodemask+0x35e/0x430 [ 30.119810] RSP: 0000:ffffa7818222bba8 EFLAGS: 00010202 [ 30.120250] RAX: 0000000000000001 RBX: 00000000014382c6 RCX: 0000000000000006 [ 30.120840] RDX: 0000000000000000 RSI: 0000000000000009 RDI: 0000000000000000 [ 30.121443] RBP: ffff923a760d6000 R08: 0000000000000000 R09: 0000000000000006 [ 30.122039] R10: 0000000000000040 R11: 0000000000000300 R12: ffff923a729273c0 [ 30.122629] R13: 0000000000000000 R14: 0000000000000000 R15: ffff923a7483d400 [ 30.123223] FS: 00007fe48da7dac0(0000) GS:ffff923a7cc00000(0000) knlGS:0000000000000000 [ 30.123896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 30.124373] CR2: 00007fe457b73000 CR3: 0000000078313000 CR4: 00000000000006f0 [ 30.124968] Call Trace: [ 30.125186] ttm_pool_populate+0x19b/0x400 [ttm] [ 30.125578] ttm_bo_vm_fault+0x325/0x570 [ttm] [ 30.125964] __do_fault+0x19/0x11e [ 30.126255] __handle_mm_fault+0xcd3/0x1260 [ 30.126609] handle_mm_fault+0x14c/0x310 [ 30.126947] __do_page_fault+0x28c/0x530 [ 30.127282] do_page_fault+0x32/0x270 [ 30.127593] async_page_fault+0x22/0x30 [ 30.127922] RIP: 0033:0x7fe48aae39a8 [ 30.128225] RSP: 002b:00007ffc21c4d928 EFLAGS: 00010206 [ 30.128664] RAX: 00007fe457b73000 RBX: 000055cd4c1041a0 RCX: 00007fe457b73040 [ 30.129259] RDX: 0000000000300000 RSI: 0000000000000000 RDI: 00007fe457b73000 [ 30.129855] RBP: 0000000000000300 R08: 000000000000000c R09: 0000000100000000 [ 30.130457] R10: 0000000000000001 R11: 0000000000000246 R12: 000055cd4c1041a0 [ 30.131054] R13: 000055cd4bdfe990 R14: 000055cd4c104110 R15: 0000000000000400 [ 30.131648] Code: 11 01 00 0f 84 a9 00 00 00 65 ff 0d 6d cc dd 44 e9 0f ff ff ff 40 80 cd 80 e9 99 fe ff ff 48 89 c7 e8 e7 f6 01 00 e9 b7 fe ff ff <0f> 0b 0f ff e9 40 fd ff ff 65 48 8b 04 25 80 d5 00 00 8b 40 4c [ 30.133245] RIP: __alloc_pages_nodemask+0x35e/0x430 RSP: ffffa7818222bba8 [ 30.133836] ---[ end trace d4f1deb60784f40a ]--- v2: handle free path as well. Reported-by: Laura Abbott <[email protected]> Reported-by: Adam Williamson <[email protected]> Fixes: 0284f1e (drm/ttm: add transparent huge page support for cached allocations v2) Reviewed-by: Christian König <[email protected]> Signed-off-by: Dave Airlie <[email protected]>
…l calls Provide a different lockdep key for rxrpc_call::user_mutex when the call is made on a kernel socket, such as by the AFS filesystem. The problem is that lockdep registers a false positive between userspace calling the sendmsg syscall on a user socket where call->user_mutex is held whilst userspace memory is accessed whereas the AFS filesystem may perform operations with mmap_sem held by the caller. In such a case, the following warning is produced. ====================================================== WARNING: possible circular locking dependency detected 4.14.0-fscache+ #243 Tainted: G E ------------------------------------------------------ modpost/16701 is trying to acquire lock: (&vnode->io_lock){+.+.}, at: [<ffffffffa000fc40>] afs_begin_vnode_operation+0x33/0x77 [kafs] but task is already holding lock: (&mm->mmap_sem){++++}, at: [<ffffffff8104376a>] __do_page_fault+0x1ef/0x486 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){++++}: __might_fault+0x61/0x89 _copy_from_iter_full+0x40/0x1fa rxrpc_send_data+0x8dc/0xff3 rxrpc_do_sendmsg+0x62f/0x6a1 rxrpc_sendmsg+0x166/0x1b7 sock_sendmsg+0x2d/0x39 ___sys_sendmsg+0x1ad/0x22b __sys_sendmsg+0x41/0x62 do_syscall_64+0x89/0x1be return_from_SYSCALL_64+0x0/0x75 -> #2 (&call->user_mutex){+.+.}: __mutex_lock+0x86/0x7d2 rxrpc_new_client_call+0x378/0x80e rxrpc_kernel_begin_call+0xf3/0x154 afs_make_call+0x195/0x454 [kafs] afs_vl_get_capabilities+0x193/0x198 [kafs] afs_vl_lookup_vldb+0x5f/0x151 [kafs] afs_create_volume+0x2e/0x2f4 [kafs] afs_mount+0x56a/0x8d7 [kafs] mount_fs+0x6a/0x109 vfs_kern_mount+0x67/0x135 do_mount+0x90b/0xb57 SyS_mount+0x72/0x98 do_syscall_64+0x89/0x1be return_from_SYSCALL_64+0x0/0x75 -> #1 (k-sk_lock-AF_RXRPC){+.+.}: lock_sock_nested+0x74/0x8a rxrpc_kernel_begin_call+0x8a/0x154 afs_make_call+0x195/0x454 [kafs] afs_fs_get_capabilities+0x17a/0x17f [kafs] afs_probe_fileserver+0xf7/0x2f0 [kafs] afs_select_fileserver+0x83f/0x903 [kafs] afs_fetch_status+0x89/0x11d [kafs] afs_iget+0x16f/0x4f8 [kafs] afs_mount+0x6c6/0x8d7 [kafs] mount_fs+0x6a/0x109 vfs_kern_mount+0x67/0x135 do_mount+0x90b/0xb57 SyS_mount+0x72/0x98 do_syscall_64+0x89/0x1be return_from_SYSCALL_64+0x0/0x75 -> #0 (&vnode->io_lock){+.+.}: lock_acquire+0x174/0x19f __mutex_lock+0x86/0x7d2 afs_begin_vnode_operation+0x33/0x77 [kafs] afs_fetch_data+0x80/0x12a [kafs] afs_readpages+0x314/0x405 [kafs] __do_page_cache_readahead+0x203/0x2ba filemap_fault+0x179/0x54d __do_fault+0x17/0x60 __handle_mm_fault+0x6d7/0x95c handle_mm_fault+0x24e/0x2a3 __do_page_fault+0x301/0x486 do_page_fault+0x236/0x259 page_fault+0x22/0x30 __clear_user+0x3d/0x60 padzero+0x1c/0x2b load_elf_binary+0x785/0xdc7 search_binary_handler+0x81/0x1ff do_execveat_common.isra.14+0x600/0x888 do_execve+0x1f/0x21 SyS_execve+0x28/0x2f do_syscall_64+0x89/0x1be return_from_SYSCALL_64+0x0/0x75 other info that might help us debug this: Chain exists of: &vnode->io_lock --> &call->user_mutex --> &mm->mmap_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mm->mmap_sem); lock(&call->user_mutex); lock(&mm->mmap_sem); lock(&vnode->io_lock); *** DEADLOCK *** 1 lock held by modpost/16701: #0: (&mm->mmap_sem){++++}, at: [<ffffffff8104376a>] __do_page_fault+0x1ef/0x486 stack backtrace: CPU: 0 PID: 16701 Comm: modpost Tainted: G E 4.14.0-fscache+ #243 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: dump_stack+0x67/0x8e print_circular_bug+0x341/0x34f check_prev_add+0x11f/0x5d4 ? add_lock_to_list.isra.12+0x8b/0x8b ? add_lock_to_list.isra.12+0x8b/0x8b ? __lock_acquire+0xf77/0x10b4 __lock_acquire+0xf77/0x10b4 lock_acquire+0x174/0x19f ? afs_begin_vnode_operation+0x33/0x77 [kafs] __mutex_lock+0x86/0x7d2 ? afs_begin_vnode_operation+0x33/0x77 [kafs] ? afs_begin_vnode_operation+0x33/0x77 [kafs] ? afs_begin_vnode_operation+0x33/0x77 [kafs] afs_begin_vnode_operation+0x33/0x77 [kafs] afs_fetch_data+0x80/0x12a [kafs] afs_readpages+0x314/0x405 [kafs] __do_page_cache_readahead+0x203/0x2ba ? filemap_fault+0x179/0x54d filemap_fault+0x179/0x54d __do_fault+0x17/0x60 __handle_mm_fault+0x6d7/0x95c handle_mm_fault+0x24e/0x2a3 __do_page_fault+0x301/0x486 do_page_fault+0x236/0x259 page_fault+0x22/0x30 RIP: 0010:__clear_user+0x3d/0x60 RSP: 0018:ffff880071e93da0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 000000000000011c RCX: 000000000000011c RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000060f720 RBP: 000000000060f720 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: ffff8800b5459b68 R12: ffff8800ce150e00 R13: 000000000060f720 R14: 00000000006127a8 R15: 0000000000000000 padzero+0x1c/0x2b load_elf_binary+0x785/0xdc7 search_binary_handler+0x81/0x1ff do_execveat_common.isra.14+0x600/0x888 do_execve+0x1f/0x21 SyS_execve+0x28/0x2f do_syscall_64+0x89/0x1be entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7fdb6009ee07 RSP: 002b:00007fff566d9728 EFLAGS: 00000246 ORIG_RAX: 000000000000003b RAX: ffffffffffffffda RBX: 000055ba57280900 RCX: 00007fdb6009ee07 RDX: 000055ba5727f270 RSI: 000055ba5727cac0 RDI: 000055ba57280900 RBP: 000055ba57280900 R08: 00007fff566d9700 R09: 0000000000000000 R10: 000055ba5727cac0 R11: 0000000000000246 R12: 0000000000000000 R13: 000055ba5727cac0 R14: 000055ba5727f270 R15: 0000000000000000 Signed-off-by: David Howells <[email protected]>
The apparmor_audit_data struct ordering got messed up during a merge conflict, resulting in the signal integer and peer pointer being in a union instead of a struct. For most of the 4.13 and 4.14 life cycle, this was hidden by commit 651e28c ("apparmor: add base infastructure for socket mediation") which fixed the apparmor_audit_data struct when its data was added. When that commit was reverted in -rc7 the signal audit bug was exposed, and unfortunately it never showed up in any of the testing until after 4.14 was released. Shaun Khan, Zephaniah E. Loss-Cutler-Hull filed nearly simultaneous bug reports (with different oopes, the smaller of which is included below). Full credit goes to Tetsuo Handa for jumping on this as well and noticing the audit data struct problem and reporting it. [ 76.178568] BUG: unable to handle kernel paging request at ffffffff0eee3bc0 [ 76.178579] IP: audit_signal_cb+0x6c/0xe0 [ 76.178581] PGD 1a640a067 P4D 1a640a067 PUD 0 [ 76.178586] Oops: 0000 [#1] PREEMPT SMP [ 76.178589] Modules linked in: fuse rfcomm bnep usblp uvcvideo btusb btrtl btbcm btintel bluetooth ecdh_generic ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables intel_rapl joydev wmi_bmof serio_raw iwldvm iwlwifi shpchp kvm_intel kvm irqbypass autofs4 algif_skcipher nls_iso8859_1 nls_cp437 crc32_pclmul ghash_clmulni_intel [ 76.178620] CPU: 0 PID: 10675 Comm: pidgin Not tainted 4.14.0-f1-dirty #135 [ 76.178623] Hardware name: Hewlett-Packard HP EliteBook Folio 9470m/18DF, BIOS 68IBD Ver. F.62 10/22/2015 [ 76.178625] task: ffff9c7a94c31dc0 task.stack: ffffa09b02a4c000 [ 76.178628] RIP: 0010:audit_signal_cb+0x6c/0xe0 [ 76.178631] RSP: 0018:ffffa09b02a4fc08 EFLAGS: 00010292 [ 76.178634] RAX: ffffa09b02a4fd60 RBX: ffff9c7aee0741f8 RCX: 0000000000000000 [ 76.178636] RDX: ffffffffee012290 RSI: 0000000000000006 RDI: ffff9c7a9493d800 [ 76.178638] RBP: ffffa09b02a4fd40 R08: 000000000000004d R09: ffffa09b02a4fc46 [ 76.178641] R10: ffffa09b02a4fcb8 R11: ffff9c7ab44f5072 R12: ffffa09b02a4fd40 [ 76.178643] R13: ffffffff9e447be0 R14: ffff9c7a94c31dc0 R15: 0000000000000001 [ 76.178646] FS: 00007f8b11ba2a80(0000) GS:ffff9c7afea00000(0000) knlGS:0000000000000000 [ 76.178648] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 76.178650] CR2: ffffffff0eee3bc0 CR3: 00000003d5209002 CR4: 00000000001606f0 [ 76.178652] Call Trace: [ 76.178660] common_lsm_audit+0x1da/0x780 [ 76.178665] ? d_absolute_path+0x60/0x90 [ 76.178669] ? aa_check_perms+0xcd/0xe0 [ 76.178672] aa_check_perms+0xcd/0xe0 [ 76.178675] profile_signal_perm.part.0+0x90/0xa0 [ 76.178679] aa_may_signal+0x16e/0x1b0 [ 76.178686] apparmor_task_kill+0x51/0x120 [ 76.178690] security_task_kill+0x44/0x60 [ 76.178695] group_send_sig_info+0x25/0x60 [ 76.178699] kill_pid_info+0x36/0x60 [ 76.178703] SYSC_kill+0xdb/0x180 [ 76.178707] ? preempt_count_sub+0x92/0xd0 [ 76.178712] ? _raw_write_unlock_irq+0x13/0x30 [ 76.178716] ? task_work_run+0x6a/0x90 [ 76.178720] ? exit_to_usermode_loop+0x80/0xa0 [ 76.178723] entry_SYSCALL_64_fastpath+0x13/0x94 [ 76.178727] RIP: 0033:0x7f8b0e58b767 [ 76.178729] RSP: 002b:00007fff19efd4d8 EFLAGS: 00000206 ORIG_RAX: 000000000000003e [ 76.178732] RAX: ffffffffffffffda RBX: 0000557f3e3c2050 RCX: 00007f8b0e58b767 [ 76.178735] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000263b [ 76.178737] RBP: 0000000000000000 R08: 0000557f3e3c2270 R09: 0000000000000001 [ 76.178739] R10: 000000000000022d R11: 0000000000000206 R12: 0000000000000000 [ 76.178741] R13: 0000000000000001 R14: 0000557f3e3c13c0 R15: 0000000000000000 [ 76.178745] Code: 48 8b 55 18 48 89 df 41 b8 20 00 08 01 5b 5d 48 8b 42 10 48 8b 52 30 48 63 48 4c 48 8b 44 c8 48 31 c9 48 8b 70 38 e9 f4 fd 00 00 <48> 8b 14 d5 40 27 e5 9e 48 c7 c6 7d 07 19 9f 48 89 df e8 fd 35 [ 76.178794] RIP: audit_signal_cb+0x6c/0xe0 RSP: ffffa09b02a4fc08 [ 76.178796] CR2: ffffffff0eee3bc0 [ 76.178799] ---[ end trace 514af9529297f1a3 ]--- Fixes: cd1dbf7 ("apparmor: add the ability to mediate signals") Reported-by: Zephaniah E. Loss-Cutler-Hull <[email protected]> Reported-by: Shuah Khan <[email protected]> Suggested-by: Tetsuo Handa <[email protected]> Tested-by: Ivan Kozik <[email protected]> Tested-by: Zephaniah E. Loss-Cutler-Hull <[email protected]> Tested-by: Christian Boltz <[email protected]> Tested-by: Shuah Khan <[email protected]> Cc: [email protected] Signed-off-by: John Johansen <[email protected]>
As both the hotplug event and fbdev configuration run asynchronously, it is possible for them to run concurrently. If configuration fails, we were freeing the fbdev causing a use-after-free in the hotplug event. <7>[ 3069.935211] [drm:intel_fb_initial_config [i915]] Not using firmware configuration <7>[ 3069.935225] [drm:drm_setup_crtcs] looking for cmdline mode on connector 77 <7>[ 3069.935229] [drm:drm_setup_crtcs] looking for preferred mode on connector 77 0 <7>[ 3069.935233] [drm:drm_setup_crtcs] found mode 3200x1800 <7>[ 3069.935236] [drm:drm_setup_crtcs] picking CRTCs for 8192x8192 config <7>[ 3069.935253] [drm:drm_setup_crtcs] desired mode 3200x1800 set on crtc 43 (0,0) <7>[ 3069.935323] [drm:intelfb_create [i915]] no BIOS fb, allocating a new one <4>[ 3069.967737] general protection fault: 0000 [#1] PREEMPT SMP <0>[ 3069.977453] --------------------------------- <4>[ 3069.977457] Modules linked in: i915(+) vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 mei_me mii prime_numbers mei i2c_hid pinctrl_geminilake pinctrl_intel [last unloaded: i915] <4>[ 3069.977492] CPU: 1 PID: 15414 Comm: kworker/1:0 Tainted: G U 4.14.0-CI-CI_DRM_3388+ #1 <4>[ 3069.977497] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017 <4>[ 3069.977508] Workqueue: events output_poll_execute <4>[ 3069.977512] task: ffff880177734e40 task.stack: ffffc90001fe4000 <4>[ 3069.977519] RIP: 0010:__lock_acquire+0x109/0x1b60 <4>[ 3069.977523] RSP: 0018:ffffc90001fe7bb0 EFLAGS: 00010002 <4>[ 3069.977526] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000282 RCX: 0000000000000000 <4>[ 3069.977530] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880170d4efd0 <4>[ 3069.977534] RBP: ffffc90001fe7c70 R08: 0000000000000001 R09: 0000000000000000 <4>[ 3069.977538] R10: 0000000000000000 R11: ffffffff81899609 R12: ffff880170d4efd0 <4>[ 3069.977542] R13: ffff880177734e40 R14: 0000000000000001 R15: 0000000000000000 <4>[ 3069.977547] FS: 0000000000000000(0000) GS:ffff88017fc80000(0000) knlGS:0000000000000000 <4>[ 3069.977551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 3069.977555] CR2: 00007f7e8b7bcf04 CR3: 0000000003e0f000 CR4: 00000000003406e0 <4>[ 3069.977559] Call Trace: <4>[ 3069.977565] ? mark_held_locks+0x64/0x90 <4>[ 3069.977571] ? _raw_spin_unlock_irq+0x24/0x50 <4>[ 3069.977575] ? _raw_spin_unlock_irq+0x24/0x50 <4>[ 3069.977579] ? trace_hardirqs_on_caller+0xde/0x1c0 <4>[ 3069.977583] ? _raw_spin_unlock_irq+0x2f/0x50 <4>[ 3069.977588] ? finish_task_switch+0xa5/0x210 <4>[ 3069.977592] ? lock_acquire+0xaf/0x200 <4>[ 3069.977596] lock_acquire+0xaf/0x200 <4>[ 3069.977600] ? __mutex_lock+0x5e9/0x9b0 <4>[ 3069.977604] _raw_spin_lock+0x2a/0x40 <4>[ 3069.977608] ? __mutex_lock+0x5e9/0x9b0 <4>[ 3069.977612] __mutex_lock+0x5e9/0x9b0 <4>[ 3069.977616] ? drm_fb_helper_hotplug_event.part.19+0x16/0xa0 <4>[ 3069.977621] ? drm_fb_helper_hotplug_event.part.19+0x16/0xa0 <4>[ 3069.977625] drm_fb_helper_hotplug_event.part.19+0x16/0xa0 <4>[ 3069.977630] output_poll_execute+0x8d/0x180 <4>[ 3069.977635] process_one_work+0x22e/0x660 <4>[ 3069.977640] worker_thread+0x48/0x3a0 <4>[ 3069.977644] ? _raw_spin_unlock_irqrestore+0x4c/0x60 <4>[ 3069.977649] kthread+0x102/0x140 <4>[ 3069.977653] ? process_one_work+0x660/0x660 <4>[ 3069.977657] ? kthread_create_on_node+0x40/0x40 <4>[ 3069.977662] ret_from_fork+0x27/0x40 <4>[ 3069.977666] Code: 8d 62 f8 c3 49 81 3c 24 e0 fa 3c 82 41 be 00 00 00 00 45 0f 45 f0 83 fe 01 77 86 89 f0 49 8b 44 c4 08 48 85 c0 0f 84 76 ff ff ff <f0> ff 80 38 01 00 00 8b 1d 62 f9 e8 01 45 8b 85 b8 08 00 00 85 <1>[ 3069.977707] RIP: __lock_acquire+0x109/0x1b60 RSP: ffffc90001fe7bb0 <4>[ 3069.977712] ---[ end trace 4ad012eb3af62df7 ]--- In order to keep the dev_priv->ifbdev alive after failure, we have to avoid the free and leave it empty until we unload the module (which is less than ideal, but a necessary evil for simplicity). Then we can use intel_fbdev_sync() to serialise the hotplug event with the configuration. The serialisation between the two was removed in commit 934458c ("Revert "drm/i915: Fix races on fbdev""), but the use after free is much older, commit 366e39b ("drm/i915: Tear down fbdev if initialization fails") Fixes: 366e39b ("drm/i915: Tear down fbdev if initialization fails") Fixes: 934458c ("Revert "drm/i915: Fix races on fbdev"") Signed-off-by: Chris Wilson <[email protected]> Cc: Lukas Wunner <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: [email protected] Reviewed-by: Lukas Wunner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit ad88d7f) Signed-off-by: Joonas Lahtinen <[email protected]>
Jiri Pirko says: ==================== mlxsw: GRE offloading fixes Petr says: This patchset fixes a couple bugs in offloading GRE tunnels in mlxsw driver. Patch #1 fixes a problem that local routes pointing at a GRE tunnel device are offloaded even if that netdevice is down. Patch #2 detects that as a result of moving a GRE netdevice to a different VRF, two tunnels now have a conflict of local addresses, something that the mlxsw driver can't offload. Patch #3 fixes a FIB abort caused by forming a route pointing at a GRE tunnel that is eligible for offloading but already onloaded. Patch #4 fixes a problem that next hops migrated to a new RIF kept the old RIF reference, which went dangling shortly afterwards. ==================== Signed-off-by: David S. Miller <[email protected]>
When an allocation in the txq_init path fails, the allocated buffers
end-up being freed twice: in the txq_init error path, and in txq_deinit.
This lead to issues as txq_deinit would work on already freed memory
regions:
kernel BUG at mm/slub.c:3915!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
This patch fixes this by removing the txq_init own error path, as the
txq_deinit function is always called on errors. This was introduced by
TSO as way more buffers are allocated.
Fixes: 186cd4d ("net: mvpp2: software tso support")
Signed-off-by: Antoine Tenart <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
syzbot reported crashes [1] and provided a C repro easing bug hunting. When/if packet_do_bind() calls __unregister_prot_hook() and releases po->bind_lock, another thread can run packet_notifier() and process an NETDEV_UP event. This calls register_prot_hook() and hooks again the socket right before first thread is able to grab again po->bind_lock. Fixes this issue by temporarily setting po->num to 0, as suggested by David Miller. [1] dev_remove_pack: ffff8801bf16fa80 not found ------------[ cut here ]------------ kernel BUG at net/core/dev.c:7945! ( BUG_ON(!list_empty(&dev->ptype_all)); ) invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: device syz0 entered promiscuous mode CPU: 0 PID: 3161 Comm: syzkaller404108 Not tainted 4.14.0+ #190 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801cc57a500 task.stack: ffff8801cc588000 RIP: 0010:netdev_run_todo+0x772/0xae0 net/core/dev.c:7945 RSP: 0018:ffff8801cc58f598 EFLAGS: 00010293 RAX: ffff8801cc57a500 RBX: dffffc0000000000 RCX: ffffffff841f75b2 RDX: 0000000000000000 RSI: 1ffff100398b1ede RDI: ffff8801bf1f8810 device syz0 entered promiscuous mode RBP: ffff8801cc58f898 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801bf1f8cd8 R13: ffff8801cc58f870 R14: ffff8801bf1f8780 R15: ffff8801cc58f7f0 FS: 0000000001716880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020b13000 CR3: 0000000005e25000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106 tun_detach drivers/net/tun.c:670 [inline] tun_chr_close+0x49/0x60 drivers/net/tun.c:2845 __fput+0x333/0x7f0 fs/file_table.c:210 ____fput+0x15/0x20 fs/file_table.c:244 task_work_run+0x199/0x270 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x9bb/0x1ae0 kernel/exit.c:865 do_group_exit+0x149/0x400 kernel/exit.c:968 SYSC_exit_group kernel/exit.c:979 [inline] SyS_exit_group+0x1d/0x20 kernel/exit.c:977 entry_SYSCALL_64_fastpath+0x1f/0x96 RIP: 0033:0x44ad19 Fixes: 30f7ea1 ("packet: race condition in packet_bind") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Cc: Francesco Ruggeri <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Remove the second tipc_rcv() call in tipc_udp_recv(). We have just checked that the bearer is not up, and calling tipc_rcv() with a bearer that is not up leads to a TIPC div-by-zero crash in tipc_node_calculate_timer(). The crash is rare in practice, but can happen like this: We're enabling a bearer, but it's not yet up and fully initialized. At the same time we receive a discovery packet, and in tipc_udp_recv() we end up calling tipc_rcv() with the not-yet-initialized bearer, causing later the div-by-zero crash in tipc_node_calculate_timer(). Jon Maloy explains the impact of removing the second tipc_rcv() call: "link setup in the worst case will be delayed until the next arriving discovery messages, 1 sec later, and this is an acceptable delay." As the tipc_rcv() call is removed, just leave the function via the rcu_out label, so that we will kfree_skb(). [ 12.590450] Own node address <1.1.1>, network identity 1 [ 12.668088] divide error: 0000 [#1] SMP [ 12.676952] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.2-dirty #1 [ 12.679225] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 [ 12.682095] task: ffff8c2a761edb80 task.stack: ffffa41cc0cac000 [ 12.684087] RIP: 0010:tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc] [ 12.686486] RSP: 0018:ffff8c2a7fc838a0 EFLAGS: 00010246 [ 12.688451] RAX: 0000000000000000 RBX: ffff8c2a5b382600 RCX: 0000000000000000 [ 12.691197] RDX: 0000000000000000 RSI: ffff8c2a5b382600 RDI: ffff8c2a5b382600 [ 12.693945] RBP: ffff8c2a7fc838b0 R08: 0000000000000001 R09: 0000000000000001 [ 12.696632] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c2a5d8949d8 [ 12.699491] R13: ffffffff95ede400 R14: 0000000000000000 R15: ffff8c2a5d894800 [ 12.702338] FS: 0000000000000000(0000) GS:ffff8c2a7fc80000(0000) knlGS:0000000000000000 [ 12.705099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 12.706776] CR2: 0000000001bb9440 CR3: 00000000bd009001 CR4: 00000000003606e0 [ 12.708847] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 12.711016] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 12.712627] Call Trace: [ 12.713390] <IRQ> [ 12.714011] tipc_node_check_dest+0x2e8/0x350 [tipc] [ 12.715286] tipc_disc_rcv+0x14d/0x1d0 [tipc] [ 12.716370] tipc_rcv+0x8b0/0xd40 [tipc] [ 12.717396] ? minmax_running_min+0x2f/0x60 [ 12.718248] ? dst_alloc+0x4c/0xa0 [ 12.718964] ? tcp_ack+0xaf1/0x10b0 [ 12.719658] ? tipc_udp_is_known_peer+0xa0/0xa0 [tipc] [ 12.720634] tipc_udp_recv+0x71/0x1d0 [tipc] [ 12.721459] ? dst_alloc+0x4c/0xa0 [ 12.722130] udp_queue_rcv_skb+0x264/0x490 [ 12.722924] __udp4_lib_rcv+0x21e/0x990 [ 12.723670] ? ip_route_input_rcu+0x2dd/0xbf0 [ 12.724442] ? tcp_v4_rcv+0x958/0xa40 [ 12.725039] udp_rcv+0x1a/0x20 [ 12.725587] ip_local_deliver_finish+0x97/0x1d0 [ 12.726323] ip_local_deliver+0xaf/0xc0 [ 12.726959] ? ip_route_input_noref+0x19/0x20 [ 12.727689] ip_rcv_finish+0xdd/0x3b0 [ 12.728307] ip_rcv+0x2ac/0x360 [ 12.728839] __netif_receive_skb_core+0x6fb/0xa90 [ 12.729580] ? udp4_gro_receive+0x1a7/0x2c0 [ 12.730274] __netif_receive_skb+0x1d/0x60 [ 12.730953] ? __netif_receive_skb+0x1d/0x60 [ 12.731637] netif_receive_skb_internal+0x37/0xd0 [ 12.732371] napi_gro_receive+0xc7/0xf0 [ 12.732920] receive_buf+0x3c3/0xd40 [ 12.733441] virtnet_poll+0xb1/0x250 [ 12.733944] net_rx_action+0x23e/0x370 [ 12.734476] __do_softirq+0xc5/0x2f8 [ 12.734922] irq_exit+0xfa/0x100 [ 12.735315] do_IRQ+0x4f/0xd0 [ 12.735680] common_interrupt+0xa2/0xa2 [ 12.736126] </IRQ> [ 12.736416] RIP: 0010:native_safe_halt+0x6/0x10 [ 12.736925] RSP: 0018:ffffa41cc0cafe90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff4d [ 12.737756] RAX: 0000000000000000 RBX: ffff8c2a761edb80 RCX: 0000000000000000 [ 12.738504] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 12.739258] RBP: ffffa41cc0cafe90 R08: 0000014b5b9795e5 R09: ffffa41cc12c7e88 [ 12.740118] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002 [ 12.740964] R13: ffff8c2a761edb80 R14: 0000000000000000 R15: 0000000000000000 [ 12.741831] default_idle+0x2a/0x100 [ 12.742323] arch_cpu_idle+0xf/0x20 [ 12.742796] default_idle_call+0x28/0x40 [ 12.743312] do_idle+0x179/0x1f0 [ 12.743761] cpu_startup_entry+0x1d/0x20 [ 12.744291] start_secondary+0x112/0x120 [ 12.744816] secondary_startup_64+0xa5/0xa5 [ 12.745367] Code: b9 f4 01 00 00 48 89 c2 48 c1 ea 02 48 3d d3 07 00 00 48 0f 47 d1 49 8b 0c 24 48 39 d1 76 07 49 89 14 24 48 89 d1 31 d2 48 89 df <48> f7 f1 89 c6 e8 81 6e ff ff 5b 41 5c 5d c3 66 90 66 2e 0f 1f [ 12.747527] RIP: tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc] RSP: ffff8c2a7fc838a0 [ 12.748555] ---[ end trace 1399ab83390650fd ]--- [ 12.749296] Kernel panic - not syncing: Fatal exception in interrupt [ 12.750123] Kernel Offset: 0x13200000 from 0xffffffff82000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 12.751215] Rebooting in 60 seconds.. Fixes: c9b64d4 ("tipc: add replicast peer discovery") Signed-off-by: Tommi Rantala <[email protected]> Cc: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
syzbot reports a bug as below: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:__lock_acquire+0x69/0x2000 kernel/locking/lockdep.c:4942 Call Trace: lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5691 __raw_write_lock include/linux/rwlock_api_smp.h:209 [inline] _raw_write_lock+0x2e/0x40 kernel/locking/spinlock.c:300 __drop_extent_tree+0x3ac/0x660 fs/f2fs/extent_cache.c:1100 f2fs_drop_extent_tree+0x17/0x30 fs/f2fs/extent_cache.c:1116 f2fs_insert_range+0x2d5/0x3c0 fs/f2fs/file.c:1664 f2fs_fallocate+0x4e4/0x6d0 fs/f2fs/file.c:1838 vfs_fallocate+0x54b/0x6b0 fs/open.c:324 ksys_fallocate fs/open.c:347 [inline] __do_sys_fallocate fs/open.c:355 [inline] __se_sys_fallocate fs/open.c:353 [inline] __x64_sys_fallocate+0xbd/0x100 fs/open.c:353 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The root cause is race condition as below: - since it tries to remount rw filesystem, so that do_remount won't call sb_prepare_remount_readonly to block fallocate, there may be race condition in between remount and fallocate. - in f2fs_remount(), default_options() will reset mount option to default one, and then update it based on result of parse_options(), so there is a hole which race condition can happen. Thread A Thread B - f2fs_fill_super - parse_options - clear_opt(READ_EXTENT_CACHE) - f2fs_remount - default_options - set_opt(READ_EXTENT_CACHE) - f2fs_fallocate - f2fs_insert_range - f2fs_drop_extent_tree - __drop_extent_tree - __may_extent_tree - test_opt(READ_EXTENT_CACHE) return true - write_lock(&et->lock) access NULL pointer - parse_options - clear_opt(READ_EXTENT_CACHE) Cc: <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/linux-f2fs-devel/[email protected] Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
Thread #1: [122554.641906][ T92] f2fs_getxattr+0xd4/0x5fc -> waiting for f2fs_down_read(&F2FS_I(inode)->i_xattr_sem); [122554.641927][ T92] __f2fs_get_acl+0x50/0x284 [122554.641948][ T92] f2fs_init_acl+0x84/0x54c [122554.641969][ T92] f2fs_init_inode_metadata+0x460/0x5f0 [122554.641990][ T92] f2fs_add_inline_entry+0x11c/0x350 -> Locked dir->inode_page by f2fs_get_node_page() [122554.642009][ T92] f2fs_do_add_link+0x100/0x1e4 [122554.642025][ T92] f2fs_create+0xf4/0x22c [122554.642047][ T92] vfs_create+0x130/0x1f4 Thread #2: [123996.386358][ T92] __get_node_page+0x8c/0x504 -> waiting for dir->inode_page lock [123996.386383][ T92] read_all_xattrs+0x11c/0x1f4 [123996.386405][ T92] __f2fs_setxattr+0xcc/0x528 [123996.386424][ T92] f2fs_setxattr+0x158/0x1f4 -> f2fs_down_write(&F2FS_I(inode)->i_xattr_sem); [123996.386443][ T92] __f2fs_set_acl+0x328/0x430 [123996.386618][ T92] f2fs_set_acl+0x38/0x50 [123996.386642][ T92] posix_acl_chmod+0xc8/0x1c8 [123996.386669][ T92] f2fs_setattr+0x5e0/0x6bc [123996.386689][ T92] notify_change+0x4d8/0x580 [123996.386717][ T92] chmod_common+0xd8/0x184 [123996.386748][ T92] do_fchmodat+0x60/0x124 [123996.386766][ T92] __arm64_sys_fchmodat+0x28/0x3c Cc: <[email protected]> Fixes: 27161f1 "f2fs: avoid race in between read xattr & write xattr" Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
ni_create_attr_list uses WARN_ON to catch error cases while generating attribute list, which only prints out stack trace and may not be enough. This repalces them with more proper error handling flow. [ 59.666332] BUG: kernel NULL pointer dereference, address: 000000000000000e [ 59.673268] #PF: supervisor read access in kernel mode [ 59.678354] #PF: error_code(0x0000) - not-present page [ 59.682831] PGD 8000000005ff1067 P4D 8000000005ff1067 PUD 7dee067 PMD 0 [ 59.688556] Oops: 0000 [#1] PREEMPT SMP KASAN PTI [ 59.692642] CPU: 0 PID: 198 Comm: poc Tainted: G B W 6.2.0-rc1+ #4 [ 59.698868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 59.708795] RIP: 0010:ni_create_attr_list+0x505/0x860 [ 59.713657] Code: 7e 10 e8 5e d0 d0 ff 45 0f b7 76 10 48 8d 7b 16 e8 00 d1 d0 ff 66 44 89 73 16 4d 8d 75 0e 4c 89 f7 e8 3f d0 d0 ff 4c 8d8 [ 59.731559] RSP: 0018:ffff88800a56f1e0 EFLAGS: 00010282 [ 59.735691] RAX: 0000000000000001 RBX: ffff88800b7b5088 RCX: ffffffffb83079fe [ 59.741792] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffbb7f9fc0 [ 59.748423] RBP: ffff88800a56f3a8 R08: ffff88800b7b50a0 R09: fffffbfff76ff3f9 [ 59.754654] R10: ffffffffbb7f9fc7 R11: fffffbfff76ff3f8 R12: ffff88800b756180 [ 59.761552] R13: 0000000000000000 R14: 000000000000000e R15: 0000000000000050 [ 59.768323] FS: 00007feaa8c96440(0000) GS:ffff88806d400000(0000) knlGS:0000000000000000 [ 59.776027] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 59.781395] CR2: 00007f3a2e0b1000 CR3: 000000000a5bc000 CR4: 00000000000006f0 [ 59.787607] Call Trace: [ 59.790271] <TASK> [ 59.792488] ? __pfx_ni_create_attr_list+0x10/0x10 [ 59.797235] ? kernel_text_address+0xd3/0xe0 [ 59.800856] ? unwind_get_return_address+0x3e/0x60 [ 59.805101] ? __kasan_check_write+0x18/0x20 [ 59.809296] ? preempt_count_sub+0x1c/0xd0 [ 59.813421] ni_ins_attr_ext+0x52c/0x5c0 [ 59.817034] ? __pfx_ni_ins_attr_ext+0x10/0x10 [ 59.821926] ? __vfs_setxattr+0x121/0x170 [ 59.825718] ? __vfs_setxattr_noperm+0x97/0x300 [ 59.829562] ? __vfs_setxattr_locked+0x145/0x170 [ 59.833987] ? vfs_setxattr+0x137/0x2a0 [ 59.836732] ? do_setxattr+0xce/0x150 [ 59.839807] ? setxattr+0x126/0x140 [ 59.842353] ? path_setxattr+0x164/0x180 [ 59.845275] ? __x64_sys_setxattr+0x71/0x90 [ 59.848838] ? do_syscall_64+0x3f/0x90 [ 59.851898] ? entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 59.857046] ? stack_depot_save+0x17/0x20 [ 59.860299] ni_insert_attr+0x1ba/0x420 [ 59.863104] ? __pfx_ni_insert_attr+0x10/0x10 [ 59.867069] ? preempt_count_sub+0x1c/0xd0 [ 59.869897] ? _raw_spin_unlock_irqrestore+0x2b/0x50 [ 59.874088] ? __create_object+0x3ae/0x5d0 [ 59.877865] ni_insert_resident+0xc4/0x1c0 [ 59.881430] ? __pfx_ni_insert_resident+0x10/0x10 [ 59.886355] ? kasan_save_alloc_info+0x1f/0x30 [ 59.891117] ? __kasan_kmalloc+0x8b/0xa0 [ 59.894383] ntfs_set_ea+0x90d/0xbf0 [ 59.897703] ? __pfx_ntfs_set_ea+0x10/0x10 [ 59.901011] ? kernel_text_address+0xd3/0xe0 [ 59.905308] ? __kernel_text_address+0x16/0x50 [ 59.909811] ? unwind_get_return_address+0x3e/0x60 [ 59.914898] ? __pfx_stack_trace_consume_entry+0x10/0x10 [ 59.920250] ? arch_stack_walk+0xa2/0x100 [ 59.924560] ? filter_irq_stacks+0x27/0x80 [ 59.928722] ntfs_setxattr+0x405/0x440 [ 59.932512] ? __pfx_ntfs_setxattr+0x10/0x10 [ 59.936634] ? kvmalloc_node+0x2d/0x120 [ 59.940378] ? kasan_save_stack+0x41/0x60 [ 59.943870] ? kasan_save_stack+0x2a/0x60 [ 59.947719] ? kasan_set_track+0x29/0x40 [ 59.951417] ? kasan_save_alloc_info+0x1f/0x30 [ 59.955733] ? __kasan_kmalloc+0x8b/0xa0 [ 59.959598] ? __kmalloc_node+0x68/0x150 [ 59.963163] ? kvmalloc_node+0x2d/0x120 [ 59.966490] ? vmemdup_user+0x2b/0xa0 [ 59.969060] __vfs_setxattr+0x121/0x170 [ 59.972456] ? __pfx___vfs_setxattr+0x10/0x10 [ 59.976008] __vfs_setxattr_noperm+0x97/0x300 [ 59.981562] __vfs_setxattr_locked+0x145/0x170 [ 59.986100] vfs_setxattr+0x137/0x2a0 [ 59.989964] ? __pfx_vfs_setxattr+0x10/0x10 [ 59.993616] ? __kasan_check_write+0x18/0x20 [ 59.997425] do_setxattr+0xce/0x150 [ 60.000304] setxattr+0x126/0x140 [ 60.002967] ? __pfx_setxattr+0x10/0x10 [ 60.006471] ? __virt_addr_valid+0xcb/0x140 [ 60.010461] ? __call_rcu_common.constprop.0+0x1c7/0x330 [ 60.016037] ? debug_smp_processor_id+0x1b/0x30 [ 60.021008] ? kasan_quarantine_put+0x5b/0x190 [ 60.025545] ? putname+0x84/0xa0 [ 60.027910] ? __kasan_slab_free+0x11e/0x1b0 [ 60.031483] ? putname+0x84/0xa0 [ 60.033986] ? preempt_count_sub+0x1c/0xd0 [ 60.036876] ? __mnt_want_write+0xae/0x100 [ 60.040738] ? mnt_want_write+0x8f/0x150 [ 60.044317] path_setxattr+0x164/0x180 [ 60.048096] ? __pfx_path_setxattr+0x10/0x10 [ 60.052096] ? strncpy_from_user+0x175/0x1c0 [ 60.056482] ? debug_smp_processor_id+0x1b/0x30 [ 60.059848] ? fpregs_assert_state_consistent+0x6b/0x80 [ 60.064557] __x64_sys_setxattr+0x71/0x90 [ 60.068892] do_syscall_64+0x3f/0x90 [ 60.072868] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 60.077523] RIP: 0033:0x7feaa86e4469 [ 60.080915] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 088 [ 60.097353] RSP: 002b:00007ffdbd8311e8 EFLAGS: 00000286 ORIG_RAX: 00000000000000bc [ 60.103386] RAX: ffffffffffffffda RBX: 9461c5e290baac00 RCX: 00007feaa86e4469 [ 60.110322] RDX: 00007ffdbd831fe0 RSI: 00007ffdbd831305 RDI: 00007ffdbd831263 [ 60.116808] RBP: 00007ffdbd836180 R08: 0000000000000001 R09: 00007ffdbd836268 [ 60.123879] R10: 000000000000007d R11: 0000000000000286 R12: 0000000000400500 [ 60.130540] R13: 00007ffdbd836260 R14: 0000000000000000 R15: 0000000000000000 [ 60.136553] </TASK> [ 60.138818] Modules linked in: [ 60.141839] CR2: 000000000000000e [ 60.144831] ---[ end trace 0000000000000000 ]--- [ 60.149058] RIP: 0010:ni_create_attr_list+0x505/0x860 [ 60.153975] Code: 7e 10 e8 5e d0 d0 ff 45 0f b7 76 10 48 8d 7b 16 e8 00 d1 d0 ff 66 44 89 73 16 4d 8d 75 0e 4c 89 f7 e8 3f d0 d0 ff 4c 8d8 [ 60.172443] RSP: 0018:ffff88800a56f1e0 EFLAGS: 00010282 [ 60.176246] RAX: 0000000000000001 RBX: ffff88800b7b5088 RCX: ffffffffb83079fe [ 60.182752] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffbb7f9fc0 [ 60.189949] RBP: ffff88800a56f3a8 R08: ffff88800b7b50a0 R09: fffffbfff76ff3f9 [ 60.196950] R10: ffffffffbb7f9fc7 R11: fffffbfff76ff3f8 R12: ffff88800b756180 [ 60.203671] R13: 0000000000000000 R14: 000000000000000e R15: 0000000000000050 [ 60.209595] FS: 00007feaa8c96440(0000) GS:ffff88806d400000(0000) knlGS:0000000000000000 [ 60.216299] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 60.222276] CR2: 00007f3a2e0b1000 CR3: 000000000a5bc000 CR4: 00000000000006f0 Signed-off-by: Edward Lo <[email protected]> Signed-off-by: Konstantin Komarov <[email protected]>
After commit 45c7e8a ("MIPS: Remove KVM_TE support") we get a NULL pointer dereference when creating a KVM guest: [ 146.243409] Starting KVM with MIPS VZ extensions [ 149.849151] CPU 3 Unable to handle kernel paging request at virtual address 0000000000000300, epc == ffffffffc06356ec, ra == ffffffffc063568c [ 149.849177] Oops[#1]: [ 149.849182] CPU: 3 PID: 2265 Comm: qemu-system-mip Not tainted 6.4.0-rc3+ #1671 [ 149.849188] Hardware name: THTF CX TL630 Series/THTF-LS3A4000-7A1000-ML4A, BIOS KL4.1F.TF.D.166.201225.R 12/25/2020 [ 149.849192] $ 0 : 0000000000000000 000000007400cce0 0000000000400004 ffffffff8119c740 [ 149.849209] $ 4 : 000000007400cce1 000000007400cce1 0000000000000000 0000000000000000 [ 149.849221] $ 8 : 000000240058bb36 ffffffff81421ac0 0000000000000000 0000000000400dc0 [ 149.849233] $12 : 9800000102a07cc8 ffffffff80e40e38 0000000000000001 0000000000400dc0 [ 149.849245] $16 : 0000000000000000 9800000106cd0000 9800000106cd0000 9800000100cce000 [ 149.849257] $20 : ffffffffc0632b28 ffffffffc05b31b0 9800000100ccca00 0000000000400000 [ 149.849269] $24 : 9800000106cd09ce ffffffff802f69d0 [ 149.849281] $28 : 9800000102a04000 9800000102a07cd0 98000001106a8000 ffffffffc063568c [ 149.849293] Hi : 00000335b2111e66 [ 149.849295] Lo : 6668d90061ae0ae9 [ 149.849298] epc : ffffffffc06356ec kvm_vz_vcpu_setup+0xc4/0x328 [kvm] [ 149.849324] ra : ffffffffc063568c kvm_vz_vcpu_setup+0x64/0x328 [kvm] [ 149.849336] Status: 7400cce3 KX SX UX KERNEL EXL IE [ 149.849351] Cause : 1000000 (ExcCode 03) [ 149.849354] BadVA : 0000000000000300 [ 149.849357] PrId : 0014c004 (ICT Loongson-3) [ 149.849360] Modules linked in: kvm nfnetlink_queue nfnetlink_log nfnetlink fuse sha256_generic libsha256 cfg80211 rfkill binfmt_misc vfat fat snd_hda_codec_hdmi input_leds led_class snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_pcm snd_timer snd serio_raw xhci_pci radeon drm_suballoc_helper drm_display_helper xhci_hcd ip_tables x_tables [ 149.849432] Process qemu-system-mip (pid: 2265, threadinfo=00000000ae2982d2, task=0000000038e09ad4, tls=000000ffeba16030) [ 149.849439] Stack : 9800000000000003 9800000100ccca00 9800000100ccc000 ffffffffc062cef4 [ 149.849453] 9800000102a07d18 c89b63a7ab338e00 0000000000000000 ffffffff811a0000 [ 149.849465] 0000000000000000 9800000106cd0000 ffffffff80e59938 98000001106a8920 [ 149.849476] ffffffff80e57f30 ffffffffc062854c ffffffff811a0000 9800000102bf4240 [ 149.849488] ffffffffc05b0000 ffffffff80e3a798 000000ff78000000 000000ff78000010 [ 149.849500] 0000000000000255 98000001021f7de0 98000001023f0078 ffffffff81434000 [ 149.849511] 0000000000000000 0000000000000000 9800000102ae0000 980000025e92ae28 [ 149.849523] 0000000000000000 c89b63a7ab338e00 0000000000000001 ffffffff8119dce0 [ 149.849535] 000000ff78000010 ffffffff804f3d3c 9800000102a07eb0 0000000000000255 [ 149.849546] 0000000000000000 ffffffff8049460c 000000ff78000010 0000000000000255 [ 149.849558] ... [ 149.849565] Call Trace: [ 149.849567] [<ffffffffc06356ec>] kvm_vz_vcpu_setup+0xc4/0x328 [kvm] [ 149.849586] [<ffffffffc062cef4>] kvm_arch_vcpu_create+0x184/0x228 [kvm] [ 149.849605] [<ffffffffc062854c>] kvm_vm_ioctl+0x64c/0xf28 [kvm] [ 149.849623] [<ffffffff805209c0>] sys_ioctl+0xc8/0x118 [ 149.849631] [<ffffffff80219eb0>] syscall_common+0x34/0x58 The root cause is the deletion of kvm_mips_commpage_init() leaves vcpu ->arch.cop0 NULL. So fix it by making cop0 from a pointer to an embedded object. Fixes: 45c7e8a ("MIPS: Remove KVM_TE support") Cc: [email protected] Reported-by: Yu Zhao <[email protected]> Suggested-by: Thomas Bogendoerfer <[email protected]> Reviewed-by: Philippe Mathieu-Daudé <[email protected]> Signed-off-by: Huacai Chen <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
The initial memblock metadata is accessed from kernel image mapping. The regions arrays need to "reallocated" from memblock and accessed through linear mapping to cover more memblock regions. So the resizing should not be allowed until linear mapping is ready. Note that there are memblock allocations when building linear mapping. This patch is similar to 24cc61d ("arm64: memblock: don't permit memblock resizing until linear mapping is up"). In following log, many memblock regions are reserved before create_linear_mapping_page_table(). And then it triggered reallocation of memblock.reserved.regions and memcpy the old array in kernel image mapping to the new array in linear mapping which caused a page fault. [ 0.000000] memblock_reserve: [0x00000000bf01f000-0x00000000bf01ffff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf021000-0x00000000bf021fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf023000-0x00000000bf023fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf025000-0x00000000bf025fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf027000-0x00000000bf027fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf029000-0x00000000bf029fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf02b000-0x00000000bf02bfff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf02d000-0x00000000bf02dfff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf02f000-0x00000000bf02ffff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] memblock_reserve: [0x00000000bf030000-0x00000000bf030fff] early_init_fdt_scan_reserved_mem+0x28c/0x2c6 [ 0.000000] OF: reserved mem: 0x0000000080000000..0x000000008007ffff (512 KiB) map non-reusable mmode_resv0@80000000 [ 0.000000] memblock_reserve: [0x00000000bf000000-0x00000000bf001fed] paging_init+0x19a/0x5ae [ 0.000000] memblock_phys_alloc_range: 4096 bytes align=0x1000 from=0x0000000000000000 max_addr=0x0000000000000000 alloc_pmd_fixmap+0x14/0x1c [ 0.000000] memblock_reserve: [0x000000017ffff000-0x000000017fffffff] memblock_alloc_range_nid+0xb8/0x128 [ 0.000000] memblock: reserved is doubled to 256 at [0x000000017fffd000-0x000000017fffe7ff] [ 0.000000] Unable to handle kernel paging request at virtual address ff600000ffffd000 [ 0.000000] Oops [#1] [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.4.0-rc1-00011-g99a670b2069c #66 [ 0.000000] Hardware name: riscv-virtio,qemu (DT) [ 0.000000] epc : __memcpy+0x60/0xf8 [ 0.000000] ra : memblock_double_array+0x192/0x248 [ 0.000000] epc : ffffffff8081d214 ra : ffffffff80a3dfc0 sp : ffffffff81403bd0 [ 0.000000] gp : ffffffff814fbb38 tp : ffffffff8140dac0 t0 : 0000000001600000 [ 0.000000] t1 : 0000000000000000 t2 : 000000008f001000 s0 : ffffffff81403c60 [ 0.000000] s1 : ffffffff80c0bc98 a0 : ff600000ffffd000 a1 : ffffffff80c0bcd8 [ 0.000000] a2 : 0000000000000c00 a3 : ffffffff80c0c8d8 a4 : 0000000080000000 [ 0.000000] a5 : 0000000000080000 a6 : 0000000000000000 a7 : 0000000080200000 [ 0.000000] s2 : ff600000ffffd000 s3 : 0000000000002000 s4 : 0000000000000c00 [ 0.000000] s5 : ffffffff80c0bc60 s6 : ffffffff80c0bcc8 s7 : 0000000000000000 [ 0.000000] s8 : ffffffff814fd0a8 s9 : 000000017fffe7ff s10: 0000000000000000 [ 0.000000] s11: 0000000000001000 t3 : 0000000000001000 t4 : 0000000000000000 [ 0.000000] t5 : 000000008f003000 t6 : ff600000ffffd000 [ 0.000000] status: 0000000200000100 badaddr: ff600000ffffd000 cause: 000000000000000f [ 0.000000] [<ffffffff8081d214>] __memcpy+0x60/0xf8 [ 0.000000] [<ffffffff80a3e1a2>] memblock_add_range.isra.14+0x12c/0x162 [ 0.000000] [<ffffffff80a3e36a>] memblock_reserve+0x6e/0x8c [ 0.000000] [<ffffffff80a123fc>] memblock_alloc_range_nid+0xb8/0x128 [ 0.000000] [<ffffffff80a1256a>] memblock_phys_alloc_range+0x5e/0x6a [ 0.000000] [<ffffffff80a04732>] alloc_pmd_fixmap+0x14/0x1c [ 0.000000] [<ffffffff80a0475a>] alloc_p4d_fixmap+0xc/0x14 [ 0.000000] [<ffffffff80a04a36>] create_pgd_mapping+0x98/0x17c [ 0.000000] [<ffffffff80a04e9e>] create_linear_mapping_range.constprop.10+0xe4/0x112 [ 0.000000] [<ffffffff80a05bb8>] paging_init+0x3ec/0x5ae [ 0.000000] [<ffffffff80a03354>] setup_arch+0xb2/0x576 [ 0.000000] [<ffffffff80a00726>] start_kernel+0x72/0x57e [ 0.000000] Code: b303 0285 b383 0305 be03 0385 be83 0405 bf03 0485 (b023) 00ef [ 0.000000] ---[ end trace 0000000000000000 ]--- [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task! [ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- Fixes: 671f9a3 ("RISC-V: Setup initial page tables in two stages") Signed-off-by: Woody Zhang <[email protected]> Tested-by: Song Shuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
The CMT device can be used as both a clocksource and a clockevent
provider. The driver tries to be smart and power itself on and off, as
well as enabling and disabling its clock when it's not in operation.
This behavior is slightly altered if the CMT is used as an early
platform device in which case the device is left powered on after probe,
but the clock is still enabled and disabled at runtime.
This has worked for a long time, but recent improvements in PREEMPT_RT
and PROVE_LOCKING have highlighted an issue. As the CMT registers itself
as a clockevent provider, clockevents_register_device(), it needs to use
raw spinlocks internally as this is the context of which the clockevent
framework interacts with the CMT driver. However in the context of
holding a raw spinlock the CMT driver can't really manage its power
state or clock with calls to pm_runtime_*() and clk_*() as these calls
end up in other platform drivers using regular spinlocks to control
power and clocks.
This mix of spinlock contexts trips a lockdep warning.
=============================
[ BUG: Invalid wait context ]
6.17.0-rc3-arm64-renesas-03071-gb3c4f4122b28-dirty #21 Not tainted
-----------------------------
swapper/1/0 is trying to lock:
ffff00000898d180 (&dev->power.lock){-...}-{3:3}, at: __pm_runtime_resume+0x38/0x88
ccree e6601000.crypto: ARM CryptoCell 630P Driver: HW version 0xAF400001/0xDCC63000, Driver version 5.0
other info that might help us debug this:
ccree e6601000.crypto: ARM ccree device initialized
context-{5:5}
2 locks held by swapper/1/0:
#0: ffff80008173c298 (tick_broadcast_lock){-...}-{2:2}, at: __tick_broadcast_oneshot_control+0xa4/0x3a8
#1: ffff0000089a5858 (&ch->lock){....}-{2:2}
usbcore: registered new interface driver usbhid
, at: sh_cmt_start+0x30/0x364
stack backtrace:
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.17.0-rc3-arm64-renesas-03071-gb3c4f4122b28-dirty #21 PREEMPT
Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT)
Call trace:
show_stack+0x14/0x1c (C)
dump_stack_lvl+0x6c/0x90
dump_stack+0x14/0x1c
__lock_acquire+0x904/0x1584
lock_acquire+0x220/0x34c
_raw_spin_lock_irqsave+0x58/0x80
__pm_runtime_resume+0x38/0x88
sh_cmt_start+0x54/0x364
sh_cmt_clock_event_set_oneshot+0x64/0xb8
clockevents_switch_state+0xfc/0x13c
tick_broadcast_set_event+0x30/0xa4
__tick_broadcast_oneshot_control+0x1e0/0x3a8
tick_broadcast_oneshot_control+0x30/0x40
cpuidle_enter_state+0x40c/0x680
cpuidle_enter+0x30/0x40
do_idle+0x1f4/0x26c
cpu_startup_entry+0x34/0x40
secondary_start_kernel+0x11c/0x13c
__secondary_switched+0x74/0x78
For non-PREEMPT_RT builds this is not really an issue, but for
PREEMPT_RT builds where normal spinlocks can sleep this might be an
issue. Be cautious and always leave the power and clock running after
probe.
Signed-off-by: Niklas Söderlund <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Link: https://patch.msgid.link/[email protected]
… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This resuls in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: Filipe Manana <[email protected]>
… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
Commit 1d2da79 ("pinctrl: renesas: rzg2l: Avoid configuring ISEL in gpio_irq_{en,dis}able*()") dropped the configuration of ISEL from struct irq_chip::{irq_enable, irq_disable} APIs and moved it to struct gpio_chip::irq::{child_to_parent_hwirq, child_irq_domain_ops::free} APIs to fix spurious IRQs. After commit 1d2da79 ("pinctrl: renesas: rzg2l: Avoid configuring ISEL in gpio_irq_{en,dis}able*()"), ISEL was no longer configured properly on resume. This is because the pinctrl resume code used struct irq_chip::irq_enable (called from rzg2l_gpio_irq_restore()) to reconfigure the wakeup interrupts. Some drivers (e.g. Ethernet) may also reconfigure non-wakeup interrupts on resume through their own code, eventually calling struct irq_chip::irq_enable. Fix this by adding ISEL configuration back into the struct irq_chip::irq_enable API and on resume path for wakeup interrupts. As struct irq_chip::irq_enable needs now to lock to update the ISEL, convert the struct rzg2l_pinctrl::lock to a raw spinlock and replace the locking API calls with the raw variants. Otherwise the lockdep reports invalid wait context when probing the adv7511 module on RZ/G2L: [ BUG: Invalid wait context ] 6.17.0-rc5-next-20250911-00001-gfcfac22533c9 #18 Not tainted ----------------------------- (udev-worker)/165 is trying to lock: ffff00000e3664a8 (&pctrl->lock){....}-{3:3}, at: rzg2l_gpio_irq_enable+0x38/0x78 other info that might help us debug this: context-{5:5} 3 locks held by (udev-worker)/165: #0: ffff00000e890108 (&dev->mutex){....}-{4:4}, at: __driver_attach+0x90/0x1ac #1: ffff000011c07240 (request_class){+.+.}-{4:4}, at: __setup_irq+0xb4/0x6dc #2: ffff000011c070c8 (lock_class){....}-{2:2}, at: __setup_irq+0xdc/0x6dc stack backtrace: CPU: 1 UID: 0 PID: 165 Comm: (udev-worker) Not tainted 6.17.0-rc5-next-20250911-00001-gfcfac22533c9 #18 PREEMPT Hardware name: Renesas SMARC EVK based on r9a07g044l2 (DT) Call trace: show_stack+0x18/0x24 (C) dump_stack_lvl+0x90/0xd0 dump_stack+0x18/0x24 __lock_acquire+0xa14/0x20b4 lock_acquire+0x1c8/0x354 _raw_spin_lock_irqsave+0x60/0x88 rzg2l_gpio_irq_enable+0x38/0x78 irq_enable+0x40/0x8c __irq_startup+0x78/0xa4 irq_startup+0x108/0x16c __setup_irq+0x3c0/0x6dc request_threaded_irq+0xec/0x1ac devm_request_threaded_irq+0x80/0x134 adv7511_probe+0x928/0x9a4 [adv7511] i2c_device_probe+0x22c/0x3dc really_probe+0xbc/0x2a0 __driver_probe_device+0x78/0x12c driver_probe_device+0x40/0x164 __driver_attach+0x9c/0x1ac bus_for_each_dev+0x74/0xd0 driver_attach+0x24/0x30 bus_add_driver+0xe4/0x208 driver_register+0x60/0x128 i2c_register_driver+0x48/0xd0 adv7511_init+0x5c/0x1000 [adv7511] do_one_initcall+0x64/0x30c do_init_module+0x58/0x23c load_module+0x1bcc/0x1d40 init_module_from_file+0x88/0xc4 idempotent_init_module+0x188/0x27c __arm64_sys_finit_module+0x68/0xac invoke_syscall+0x48/0x110 el0_svc_common.constprop.0+0xc0/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x4c/0x160 el0t_64_sync_handler+0xa0/0xe4 el0t_64_sync+0x198/0x19c Having ISEL configuration back into the struct irq_chip::irq_enable API should be safe with respect to spurious IRQs, as in the probe case IRQs are enabled anyway in struct gpio_chip::irq::child_to_parent_hwirq. No spurious IRQs were detected on suspend/resume, boot, ethernet link insert/remove tests (executed on RZ/G3S). Boot, ethernet link insert/remove tests were also executed successfully on RZ/G2L. Fixes: 1d2da79 ("pinctrl: renesas: rzg2l: Avoid configuring ISEL in gpio_irq_{en,dis}able*(") Cc: [email protected] Signed-off-by: Claudiu Beznea <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Geert Uytterhoeven <[email protected]>
The problem is that scan_fast() allocate memory for ubi->fm and ubi->fm->e[x], but if the following attach process fails in ubi_wl_init or ubi_read_volume_table, the whole attach process will fail without executing ubi_wl_close to free the memory under ubi->fm. Fix this by add a new ubi_free_fastmap function in fastmap.c to free the memory allocated for fm. If SLUB_DEBUG and KUNIT are enabled, the following warning messages will show: ubi0: detaching mtd0 ubi0: mtd0 is detached ubi0: default fastmap pool size: 200 ubi0: default fastmap WL pool size: 100 ubi0: attaching mtd0 ubi0: attached by fastmap ubi0: fastmap pool size: 200 ubi0: fastmap WL pool size: 100 ubi0 error: ubi_wl_init [ubi]: no enough physical eraseblocks (4, need 203) ubi0 error: ubi_attach_mtd_dev [ubi]: failed to attach mtd0, error -28 UBI error: cannot attach mtd0 ================================================================= BUG ubi_wl_entry_slab (Tainted: G B O L ): Objects remaining in ubi_wl_entry_slab on __kmem_cache_shutdown() ----------------------------------------------------------------------------- Slab 0xffff2fd23a40cd00 objects=22 used=1 fp=0xffff2fd1d0334fd8 flags=0x883fffc010200(slab|head|section=34|node=0|zone=1|lastcpupid=0x7fff) CPU: 0 PID: 5884 Comm: insmod Tainted: G B O L 5.10.0 #1 Hardware name: LS1043A RDB Board (DT) Call trace: dump_backtrace+0x0/0x198 show_stack+0x18/0x28 dump_stack+0xe8/0x15c slab_err+0x94/0xc0 __kmem_cache_shutdown+0x1fc/0x39c kmem_cache_destroy+0x48/0x138 ubi_init+0x1d4/0xf34 [ubi] do_one_initcall+0xb4/0x24c do_init_module+0x4c/0x1dc load_module+0x212c/0x2260 __se_sys_finit_module+0xb4/0xd8 __arm64_sys_finit_module+0x18/0x28 el0_svc_common.constprop.0+0x78/0x1a0 do_el0_svc+0x78/0x90 el0_svc+0x20/0x38 el0_sync_handler+0xf0/0x140 normal+0x3d8/0x400 Object 0xffff2fd1d0334e68 @offset=3688 Allocated in ubi_scan_fastmap+0xf04/0xf40 [ubi] age=80 cpu=0 pid=5884 __slab_alloc.isra.21+0x6c/0xb4 kmem_cache_alloc+0x1e4/0x80c ubi_scan_fastmap+0xf04/0xf40 [ubi] ubi_attach+0x1f0/0x3a8 [ubi] ubi_attach_mtd_dev+0x810/0xbc8 [ubi] ubi_init+0x238/0xf34 [ubi] do_one_initcall+0xb4/0x24c do_init_module+0x4c/0x1dc load_module+0x212c/0x2260 __se_sys_finit_module+0xb4/0xd8 __arm64_sys_finit_module+0x18/0x28 el0_svc_common.constprop.0+0x78/0x1a0 do_el0_svc+0x78/0x90 el0_svc+0x20/0x38 el0_sync_handler+0xf0/0x140 normal+0x3d8/0x400 Link: https://bugzilla.kernel.org/show_bug.cgi?id=220744 Signed-off-by: Liyuan Pang <[email protected]> Reviewed-by: Zhihao Cheng <[email protected]> Signed-off-by: Richard Weinberger <[email protected]>
As Jiaming Zhang and syzbot reported, there is potential deadlock in
f2fs as below:
Chain exists of:
&sbi->cp_rwsem --> fs_reclaim --> sb_internal#2
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(sb_internal#2);
lock(fs_reclaim);
lock(sb_internal#2);
rlock(&sbi->cp_rwsem);
*** DEADLOCK ***
3 locks held by kswapd0/73:
#0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:7015 [inline]
#0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0x951/0x2800 mm/vmscan.c:7389
#1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_trylock_shared fs/super.c:562 [inline]
#1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_cache_scan+0x91/0x4b0 fs/super.c:197
#2: ffff888011840610 (sb_internal#2){.+.+}-{0:0}, at: f2fs_evict_inode+0x8d9/0x1b60 fs/f2fs/inode.c:890
stack backtrace:
CPU: 0 UID: 0 PID: 73 Comm: kswapd0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_circular_bug+0x2ee/0x310 kernel/locking/lockdep.c:2043
check_noncircular+0x134/0x160 kernel/locking/lockdep.c:2175
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain+0xb9b/0x2140 kernel/locking/lockdep.c:3908
__lock_acquire+0xab9/0xd20 kernel/locking/lockdep.c:5237
lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5868
down_read+0x46/0x2e0 kernel/locking/rwsem.c:1537
f2fs_down_read fs/f2fs/f2fs.h:2278 [inline]
f2fs_lock_op fs/f2fs/f2fs.h:2357 [inline]
f2fs_do_truncate_blocks+0x21c/0x10c0 fs/f2fs/file.c:791
f2fs_truncate_blocks+0x10a/0x300 fs/f2fs/file.c:867
f2fs_truncate+0x489/0x7c0 fs/f2fs/file.c:925
f2fs_evict_inode+0x9f2/0x1b60 fs/f2fs/inode.c:897
evict+0x504/0x9c0 fs/inode.c:810
f2fs_evict_inode+0x1dc/0x1b60 fs/f2fs/inode.c:853
evict+0x504/0x9c0 fs/inode.c:810
dispose_list fs/inode.c:852 [inline]
prune_icache_sb+0x21b/0x2c0 fs/inode.c:1000
super_cache_scan+0x39b/0x4b0 fs/super.c:224
do_shrink_slab+0x6ef/0x1110 mm/shrinker.c:437
shrink_slab_memcg mm/shrinker.c:550 [inline]
shrink_slab+0x7ef/0x10d0 mm/shrinker.c:628
shrink_one+0x28a/0x7c0 mm/vmscan.c:4955
shrink_many mm/vmscan.c:5016 [inline]
lru_gen_shrink_node mm/vmscan.c:5094 [inline]
shrink_node+0x315d/0x3780 mm/vmscan.c:6081
kswapd_shrink_node mm/vmscan.c:6941 [inline]
balance_pgdat mm/vmscan.c:7124 [inline]
kswapd+0x147c/0x2800 mm/vmscan.c:7389
kthread+0x70e/0x8a0 kernel/kthread.c:463
ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
The root cause is deadlock among four locks as below:
kswapd
- fs_reclaim --- Lock A
- shrink_one
- evict
- f2fs_evict_inode
- sb_start_intwrite --- Lock B
- iput
- evict
- f2fs_evict_inode
- sb_start_intwrite --- Lock B
- f2fs_truncate
- f2fs_truncate_blocks
- f2fs_do_truncate_blocks
- f2fs_lock_op --- Lock C
ioctl
- f2fs_ioc_commit_atomic_write
- f2fs_lock_op --- Lock C
- __f2fs_commit_atomic_write
- __replace_atomic_write_block
- f2fs_get_dnode_of_data
- __get_node_folio
- f2fs_check_nid_range
- f2fs_handle_error
- f2fs_record_errors
- f2fs_down_write --- Lock D
open
- do_open
- do_truncate
- security_inode_need_killpriv
- f2fs_getxattr
- lookup_all_xattrs
- f2fs_handle_error
- f2fs_record_errors
- f2fs_down_write --- Lock D
- f2fs_commit_super
- read_mapping_folio
- filemap_alloc_folio_noprof
- prepare_alloc_pages
- fs_reclaim_acquire --- Lock A
In order to avoid such deadlock, we need to avoid grabbing sb_lock in
f2fs_handle_error(), so, let's use asynchronous method instead:
- remove f2fs_handle_error() implementation
- rename f2fs_handle_error_async() to f2fs_handle_error()
- spread f2fs_handle_error()
Fixes: 95fa90c ("f2fs: support recording errors into superblock")
Cc: [email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-f2fs-devel/[email protected]
Reported-by: Jiaming Zhang <[email protected]>
Closes: https://lore.kernel.org/lkml/CANypQFa-Gy9sD-N35o3PC+FystOWkNuN8pv6S75HLT0ga-Tzgw@mail.gmail.com
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
As syzbot reported: F2FS-fs (loop0): __update_extent_tree_range: extent len is zero, type: 0, extent [0, 0, 0], age [0, 0] ------------[ cut here ]------------ kernel BUG at fs/f2fs/extent_cache.c:678! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5336 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:__update_extent_tree_range+0x13bc/0x1500 fs/f2fs/extent_cache.c:678 Call Trace: <TASK> f2fs_update_read_extent_cache_range+0x192/0x3e0 fs/f2fs/extent_cache.c:1085 f2fs_do_zero_range fs/f2fs/file.c:1657 [inline] f2fs_zero_range+0x10c1/0x1580 fs/f2fs/file.c:1737 f2fs_fallocate+0x583/0x990 fs/f2fs/file.c:2030 vfs_fallocate+0x669/0x7e0 fs/open.c:342 ioctl_preallocate fs/ioctl.c:289 [inline] file_ioctl+0x611/0x780 fs/ioctl.c:-1 do_vfs_ioctl+0xb33/0x1430 fs/ioctl.c:576 __do_sys_ioctl fs/ioctl.c:595 [inline] __se_sys_ioctl+0x82/0x170 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f07bc58eec9 In error path of f2fs_zero_range(), it may add a zero-sized extent into extent cache, it should be avoided. Fixes: 6e96194 ("f2fs: support in batch fzero in dnode page") Cc: [email protected] Reported-by: [email protected] Closes: https://lore.kernel.org/linux-f2fs-devel/[email protected] Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
Bai, Shuangpeng <[email protected]> reported a bug as below: Oops: divide error: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 11441 Comm: syz.0.46 Not tainted 6.17.0 #1 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:f2fs_all_cluster_page_ready+0x106/0x550 fs/f2fs/compress.c:857 Call Trace: <TASK> f2fs_write_cache_pages fs/f2fs/data.c:3078 [inline] __f2fs_write_data_pages fs/f2fs/data.c:3290 [inline] f2fs_write_data_pages+0x1c19/0x3600 fs/f2fs/data.c:3317 do_writepages+0x38e/0x640 mm/page-writeback.c:2634 filemap_fdatawrite_wbc mm/filemap.c:386 [inline] __filemap_fdatawrite_range mm/filemap.c:419 [inline] file_write_and_wait_range+0x2ba/0x3e0 mm/filemap.c:794 f2fs_do_sync_file+0x6e6/0x1b00 fs/f2fs/file.c:294 generic_write_sync include/linux/fs.h:3043 [inline] f2fs_file_write_iter+0x76e/0x2700 fs/f2fs/file.c:5259 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x7e9/0xe00 fs/read_write.c:686 ksys_write+0x19d/0x2d0 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf7/0x470 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The bug was triggered w/ below race condition: fsync setattr ioctl - f2fs_do_sync_file - file_write_and_wait_range - f2fs_write_cache_pages : inode is non-compressed : cc.cluster_size = F2FS_I(inode)->i_cluster_size = 0 - tag_pages_for_writeback - f2fs_setattr - truncate_setsize - f2fs_truncate - f2fs_fileattr_set - f2fs_setflags_common - set_compress_context : F2FS_I(inode)->i_cluster_size = 4 : set_inode_flag(inode, FI_COMPRESSED_FILE) - f2fs_compressed_file : return true - f2fs_all_cluster_page_ready : "pgidx % cc->cluster_size" trigger dividing 0 issue Let's change as below to fix this issue: - introduce a new atomic type variable .writeback in structure f2fs_inode_info to track the number of threads which calling f2fs_write_cache_pages(). - use .i_sem lock to protect .writeback update. - check .writeback before update compression context in f2fs_setflags_common() to avoid race w/ ->writepages. Fixes: 4c8ff70 ("f2fs: support data compression") Cc: [email protected] Reported-by: Bai, Shuangpeng <[email protected]> Tested-by: Bai, Shuangpeng <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected] Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
As Hong Yun reported in mailing list: loop7: detected capacity change from 0 to 131072 ------------[ cut here ]------------ kmem_cache of name 'f2fs_xattr_entry-7:7' already exists WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 kmem_cache_sanity_check mm/slab_common.c:109 [inline] WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 __kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307 CPU: 0 UID: 0 PID: 24426 Comm: syz.7.1370 Not tainted 6.17.0-rc4 #1 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:kmem_cache_sanity_check mm/slab_common.c:109 [inline] RIP: 0010:__kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307 Call Trace: __kmem_cache_create include/linux/slab.h:353 [inline] f2fs_kmem_cache_create fs/f2fs/f2fs.h:2943 [inline] f2fs_init_xattr_caches+0xa5/0xe0 fs/f2fs/xattr.c:843 f2fs_fill_super+0x1645/0x2620 fs/f2fs/super.c:4918 get_tree_bdev_flags+0x1fb/0x260 fs/super.c:1692 vfs_get_tree+0x43/0x140 fs/super.c:1815 do_new_mount+0x201/0x550 fs/namespace.c:3808 do_mount fs/namespace.c:4136 [inline] __do_sys_mount fs/namespace.c:4347 [inline] __se_sys_mount+0x298/0x2f0 fs/namespace.c:4324 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x8e/0x3a0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e The bug can be reproduced w/ below scripts: - mount /dev/vdb /mnt1 - mount /dev/vdc /mnt2 - umount /mnt1 - mounnt /dev/vdb /mnt1 The reason is if we created two slab caches, named f2fs_xattr_entry-7:3 and f2fs_xattr_entry-7:7, and they have the same slab size. Actually, slab system will only create one slab cache core structure which has slab name of "f2fs_xattr_entry-7:3", and two slab caches share the same structure and cache address. So, if we destroy f2fs_xattr_entry-7:3 cache w/ cache address, it will decrease reference count of slab cache, rather than release slab cache entirely, since there is one more user has referenced the cache. Then, if we try to create slab cache w/ name "f2fs_xattr_entry-7:3" again, slab system will find that there is existed cache which has the same name and trigger the warning. Let's changes to use global inline_xattr_slab instead of per-sb slab cache for fixing. Fixes: a999150 ("f2fs: use kmem_cache pool during inline xattr lookups") Cc: [email protected] Reported-by: Hong Yun <[email protected]> Tested-by: Hong Yun <[email protected]> Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
Xfstests generic/335, generic/336 sometimes crash with the following message: F2FS-fs (dm-0): detect filesystem reference count leak during umount, type: 9, count: 1 ------------[ cut here ]------------ kernel BUG at fs/f2fs/super.c:1939! Oops: invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 UID: 0 PID: 609351 Comm: umount Tainted: G W 6.17.0-rc5-xfstests-g9dd1835ecda5 #1 PREEMPT(none) Tainted: [W]=WARN Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:f2fs_put_super+0x3b3/0x3c0 Call Trace: <TASK> generic_shutdown_super+0x7e/0x190 kill_block_super+0x1a/0x40 kill_f2fs_super+0x9d/0x190 deactivate_locked_super+0x30/0xb0 cleanup_mnt+0xba/0x150 task_work_run+0x5c/0xa0 exit_to_user_mode_loop+0xb7/0xc0 do_syscall_64+0x1ae/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> ---[ end trace 0000000000000000 ]--- It appears that sometimes it is possible that f2fs_put_super() is called before all node page reads are completed. Adding a call to f2fs_wait_on_all_pages() for F2FS_RD_NODE fixes the problem. Cc: [email protected] Fixes: 2087258 ("f2fs: fix to drop all dirty meta/node pages during umount()") Signed-off-by: Jan Prusakowski <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
With below scripts, it will trigger panic in f2fs: mkfs.f2fs -f /dev/vdd mount /dev/vdd /mnt/f2fs touch /mnt/f2fs/foo sync echo 111 >> /mnt/f2fs/foo f2fs_io fsync /mnt/f2fs/foo f2fs_io shutdown 2 /mnt/f2fs umount /mnt/f2fs mount -o ro,norecovery /dev/vdd /mnt/f2fs or mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0 F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f F2FS-fs (vdd): Stopped filesystem due to reason: 0 F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1 Filesystem f2fs get_tree() didn't set fc->root, returned 1 ------------[ cut here ]------------ kernel BUG at fs/super.c:1761! Oops: invalid opcode: 0000 [#1] SMP PTI CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ #721 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:vfs_get_tree.cold+0x18/0x1a Call Trace: <TASK> fc_mount+0x13/0xa0 path_mount+0x34e/0xc50 __x64_sys_mount+0x121/0x150 do_syscall_64+0x84/0x800 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa6cc126cfe The root cause is we missed to handle error number returned from f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or ro,disable_roll_forward mount option, result in returning a positive error number to vfs_get_tree(), fix it. Cc: [email protected] Fixes: 6781eab ("f2fs: give -EINVAL for norecovery and rw mount") Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
kbd_led_set() can sleep, and so may not be used as the brightness_set() callback. Otherwise using this led with a trigger leads to system hangs accompanied by: BUG: scheduling while atomic: acpi_fakekeyd/2588/0x00000003 CPU: 4 UID: 0 PID: 2588 Comm: acpi_fakekeyd Not tainted 6.17.9+deb14-amd64 #1 PREEMPT(lazy) Debian 6.17.9-1 Hardware name: ASUSTeK COMPUTER INC. ASUS EXPERTBOOK B9403CVAR/B9403CVAR, BIOS B9403CVAR.311 12/24/2024 Call Trace: <TASK> [...] schedule_timeout+0xbd/0x100 __down_common+0x175/0x290 down_timeout+0x67/0x70 acpi_os_wait_semaphore+0x57/0x90 [...] asus_wmi_evaluate_method3+0x87/0x190 [asus_wmi] led_trigger_event+0x3f/0x60 [...] Fixes: 9fe44fc ("platform/x86: asus-wmi: Simplify the keyboard brightness updating process") Signed-off-by: Anton Khirnov <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Reviewed-by: Denis Benato <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Ilpo Järvinen <[email protected]> Signed-off-by: Ilpo Järvinen <[email protected]>
At the moment - the memory allocation for fwsec-sb is created as-needed and is released after being used. Typically this is at some point well after driver load, which can cause runtime suspend/resume to initially work on driver load but then later fail on a machine that has been running for long enough with sufficiently high enough memory pressure: kworker/7:1: page allocation failure: order:5, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 7 UID: 0 PID: 875159 Comm: kworker/7:1 Not tainted 6.17.8-300.fc43.x86_64 #1 PREEMPT(lazy) Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024 Workqueue: pm pm_runtime_work Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 warn_alloc+0x163/0x190 ? __alloc_pages_direct_compact+0x1b3/0x220 __alloc_pages_slowpath.constprop.0+0x57a/0xb10 __alloc_frozen_pages_noprof+0x334/0x350 __alloc_pages_noprof+0xe/0x20 __dma_direct_alloc_pages.isra.0+0x1eb/0x330 dma_direct_alloc_pages+0x3c/0x190 dma_alloc_pages+0x29/0x130 nvkm_firmware_ctor+0x1ae/0x280 [nouveau] nvkm_falcon_fw_ctor+0x3e/0x60 [nouveau] nvkm_gsp_fwsec+0x10e/0x2c0 [nouveau] ? sysvec_apic_timer_interrupt+0xe/0x90 nvkm_gsp_fwsec_sb+0x27/0x70 [nouveau] tu102_gsp_fini+0x65/0x110 [nouveau] ? ktime_get+0x3c/0xf0 nvkm_subdev_fini+0x67/0xc0 [nouveau] nvkm_device_fini+0x94/0x140 [nouveau] nvkm_udevice_fini+0x50/0x70 [nouveau] nvkm_object_fini+0xb1/0x140 [nouveau] nvkm_object_fini+0x70/0x140 [nouveau] ? __pfx_pci_pm_runtime_suspend+0x10/0x10 nouveau_do_suspend+0xe4/0x170 [nouveau] nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau] pci_pm_runtime_suspend+0x67/0x1a0 ? __pfx_pci_pm_runtime_suspend+0x10/0x10 __rpm_callback+0x45/0x1f0 ? __pfx_pci_pm_runtime_suspend+0x10/0x10 rpm_callback+0x6d/0x80 rpm_suspend+0xe5/0x5e0 ? finish_task_switch.isra.0+0x99/0x2c0 pm_runtime_work+0x98/0xb0 process_one_work+0x18f/0x350 worker_thread+0x25a/0x3a0 ? __pfx_worker_thread+0x10/0x10 kthread+0xf9/0x240 ? __pfx_kthread+0x10/0x10 ? __pfx_kthread+0x10/0x10 ret_from_fork+0xf1/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> The reason this happens is because the fwsec-sb firmware image only supports being booted from a contiguous coherent sysmem allocation. If a system runs into enough memory fragmentation from memory pressure, such as what can happen on systems with low amounts of memory, this can lead to a situation where it later becomes impossible to find space for a large enough contiguous allocation to hold fwsec-sb. This causes us to fail to boot the firmware image, causing the GPU to fail booting and causing the driver to fail. Since this firmware can't use non-contiguous allocations, the best solution to avoid this issue is to simply allocate the memory for fwsec-sb during initial driver-load, and reuse the memory allocation when fwsec-sb needs to be used. We then release the memory allocations on driver unload. Signed-off-by: Lyude Paul <[email protected]> Fixes: 594766c ("drm/nouveau/gsp: move booter handling to GPU-specific code") Cc: <[email protected]> # v6.16+ Reviewed-by: Timur Tabi <[email protected]> Link: https://patch.msgid.link/[email protected]
Since commit a735831 ("drm/nouveau: vendor in drm_encoder_slave API") nouveau appears to be broken for all dispnv04 GPUs (before NV50). Depending on the kernel version, either having no display output and hanging in kernel for a long time, or even oopsing in the cleanup path like: Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac ... nouveau 0000:0a:00.0: drm: 0x14C5: Parsing digital output script table BUG: Unable to handle kernel data access on read at 0x00041520 Faulting instruction address: 0xc0003d0001be0844 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=8 NUMA PowerMac Modules linked in: windfarm_cpufreq_clamp windfarm_smu_sensors windfarm_smu_controls windfarm_pm112 snd_aoa_codec_onyx snd_aoa_fabric_layout snd_aoa windfarm_pid jo apple_mfi_fastcharge rndis_host cdc_ether usbnet mii snd_aoa_i2sbus snd_aoa_soundbus snd_pcm snd_timer snd soundcore rack_meter windfarm_smu_sat windfarm_max6690_s m75_sensor windfarm_core gpu_sched drm_gpuvm drm_exec drm_client_lib drm_ttm_helper ttm drm_display_helper drm_kms_helper drm drm_panel_orientation_quirks syscopyar _sys_fops i2c_algo_bit backlight uio_pdrv_genirq uio uninorth_agp agpgart zram dm_mod dax ipv6 nfsv4 dns_resolver nfs lockd grace sunrpc offb cfbfillrect cfbimgblt ont input_leds sr_mod cdrom sd_mod uas ata_generic hid_apple hid_generic usbhid hid usb_storage pata_macio sata_svw libata firewire_ohci scsi_mod firewire_core ohci ehci_pci ehci_hcd tg3 ohci_hcd libphy usbcore usb_common nls_base led_class CPU: 0 UID: 0 PID: 245 Comm: (udev-worker) Not tainted 6.14.0-09584-g7d06015d936c #7 PREEMPTLAZY Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac NIP: c0003d0001be0844 LR: c0003d0001be0830 CTR: 0000000000000000 REGS: c0000000053f70e0 TRAP: 0300 Not tainted (6.14.0-09584-g7d06015d936c) MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 24222220 XER: 00000000 DAR: 0000000000041520 DSISR: 40000000 IRQMASK: 0 \x0aGPR00: c0003d0001be0830 c0000000053f7380 c0003d0000911900 c000000007bc6800 \x0aGPR04: 0000000000000000 0000000000000000 c000000007bc6e70 0000000000000001 \x0aGPR08: 01f3040000000000 0000000000041520 0000000000000000 c0003d0000813958 \x0aGPR12: c000000000071a48 c000000000e28000 0000000000000020 0000000000000000 \x0aGPR16: 0000000000000000 0000000000f52630 0000000000000000 0000000000000000 \x0aGPR20: 0000000000000000 0000000000000000 0000000000000001 c0003d0000928528 \x0aGPR24: c0003d0000928598 0000000000000000 c000000007025480 c000000007025480 \x0aGPR28: c0000000010b4000 0000000000000000 c000000007bc1800 c000000007bc6800 NIP [c0003d0001be0844] nv_crtc_destroy+0x44/0xd4 [nouveau] LR [c0003d0001be0830] nv_crtc_destroy+0x30/0xd4 [nouveau] Call Trace: [c0000000053f7380] [c0003d0001be0830] nv_crtc_destroy+0x30/0xd4 [nouveau] (unreliable) [c0000000053f73c0] [c0003d00007f7bf4] drm_mode_config_cleanup+0x27c/0x30c [drm] [c0000000053f7490] [c0003d0001bdea50] nouveau_display_create+0x1cc/0x550 [nouveau] [c0000000053f7500] [c0003d0001bcc29c] nouveau_drm_device_init+0x1c8/0x844 [nouveau] [c0000000053f75e0] [c0003d0001bcc9ec] nouveau_drm_probe+0xd4/0x1e0 [nouveau] [c0000000053f7670] [c000000000557d24] local_pci_probe+0x50/0xa8 [c0000000053f76f0] [c000000000557fa8] pci_device_probe+0x22c/0x240 [c0000000053f7760] [c0000000005fff3c] really_probe+0x188/0x31c [c0000000053f77e0] [c000000000600204] __driver_probe_device+0x134/0x13c [c0000000053f7860] [c0000000006002c0] driver_probe_device+0x3c/0xb4 [c0000000053f78a0] [c000000000600534] __driver_attach+0x118/0x128 [c0000000053f78e0] [c0000000005fe038] bus_for_each_dev+0xa8/0xf4 [c0000000053f7950] [c0000000005ff460] driver_attach+0x2c/0x40 [c0000000053f7970] [c0000000005fea68] bus_add_driver+0x130/0x278 [c0000000053f7a00] [c00000000060117c] driver_register+0x9c/0x1a0 [c0000000053f7a80] [c00000000055623c] __pci_register_driver+0x5c/0x70 [c0000000053f7aa0] [c0003d0001c058a0] nouveau_drm_init+0x254/0x278 [nouveau] [c0000000053f7b10] [c00000000000e9bc] do_one_initcall+0x84/0x268 [c0000000053f7bf0] [c0000000001a0ba0] do_init_module+0x70/0x2d8 [c0000000053f7c70] [c0000000001a42bc] init_module_from_file+0xb4/0x108 [c0000000053f7d50] [c0000000001a4504] sys_finit_module+0x1ac/0x478 [c0000000053f7e10] [c000000000023230] system_call_exception+0x1a4/0x20c [c0000000053f7e50] [c00000000000c554] system_call_common+0xf4/0x258 --- interrupt: c00 at 0xfd5f988 NIP: 000000000fd5f988 LR: 000000000ff9b148 CTR: 0000000000000000 REGS: c0000000053f7e80 TRAP: 0c00 Not tainted (6.14.0-09584-g7d06015d936c) MSR: 100000000000d032 <HV,EE,PR,ME,IR,DR,RI> CR: 28222244 XER: 00000000 IRQMASK: 0 \x0aGPR00: 0000000000000161 00000000ffcdc2d0 00000000405db160 0000000000000020 \x0aGPR04: 000000000ffa2c9c 0000000000000000 000000000000001f 0000000000000045 \x0aGPR08: 0000000011a13770 0000000000000000 0000000000000000 0000000000000000 \x0aGPR12: 0000000000000000 0000000010249d8c 0000000000000020 0000000000000000 \x0aGPR16: 0000000000000000 0000000000f52630 0000000000000000 0000000000000000 \x0aGPR20: 0000000000000000 0000000000000000 0000000000000000 0000000011a11a70 \x0aGPR24: 0000000011a13580 0000000011a11950 0000000011a11a70 0000000000020000 \x0aGPR28: 000000000ffa2c9c 0000000000000000 000000000ffafc40 0000000011a11a70 NIP [000000000fd5f988] 0xfd5f988 LR [000000000ff9b148] 0xff9b148 --- interrupt: c00 Code: f821ffc1 418200ac e93f0000 e9290038 e9291468 eba90000 48026c0d e8410018 e93f06aa 3d290001 392982a4 79291f24 <7fdd482a> 2c3e0000 41820030 7fc3f378 ---[ end trace 0000000000000000 ]--- This is caused by the i2c encoder modules vendored into nouveau/ now depending on the equally vendored nouveau_i2c_encoder_destroy function. Trying to auto-load this modules hangs on nouveau initialization until timeout, and nouveau continues without i2c video encoders. Fix by avoiding nouveau dependency by __always_inlining that helper functions into those i2c video encoder modules. Fixes: a735831 ("drm/nouveau: vendor in drm_encoder_slave API") Signed-off-by: René Rebe <[email protected]> Reviewed-by: Lyude Paul <[email protected]> [Lyude: fixed commit reference in description] Signed-off-by: Lyude Paul <[email protected]> Link: https://patch.msgid.link/[email protected]
… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
…ked_inode() In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree() while holding a path with a read locked leaf from a subvolume tree, and btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can trigger reclaim. This can create a circular lock dependency which lockdep warns about with the following splat: [27386.164433] ====================================================== [27386.164574] WARNING: possible circular locking dependency detected [27386.164583] 6.18.0+ #4 Tainted: G U [27386.164591] ------------------------------------------------------ [27386.164599] kswapd0/117 is trying to acquire lock: [27386.164606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.164625] but task is already holding lock: [27386.164633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [27386.164646] which lock already depends on the new lock. [27386.164657] the existing dependency chain (in reverse order) is: [27386.164667] -> #2 (fs_reclaim){+.+.}-{0:0}: [27386.164677] fs_reclaim_acquire+0x9d/0xd0 [27386.164685] __kmalloc_cache_noprof+0x59/0x750 [27386.164694] btrfs_init_file_extent_tree+0x90/0x100 [27386.164702] btrfs_read_locked_inode+0xc3/0x6b0 [27386.164710] btrfs_iget+0xbb/0xf0 [27386.164716] btrfs_lookup_dentry+0x3c5/0x8e0 [27386.164724] btrfs_lookup+0x12/0x30 [27386.164731] lookup_open.isra.0+0x1aa/0x6a0 [27386.164739] path_openat+0x5f7/0xc60 [27386.164746] do_filp_open+0xd6/0x180 [27386.164753] do_sys_openat2+0x8b/0xe0 [27386.164760] __x64_sys_openat+0x54/0xa0 [27386.164768] do_syscall_64+0x97/0x3e0 [27386.164776] entry_SYSCALL_64_after_hwframe+0x76/0x7e [27386.164784] -> #1 (btrfs-tree-00){++++}-{3:3}: [27386.164794] lock_release+0x127/0x2a0 [27386.164801] up_read+0x1b/0x30 [27386.164808] btrfs_search_slot+0x8e0/0xff0 [27386.164817] btrfs_lookup_inode+0x52/0xd0 [27386.164825] __btrfs_update_delayed_inode+0x73/0x520 [27386.164833] btrfs_commit_inode_delayed_inode+0x11a/0x120 [27386.164842] btrfs_log_inode+0x608/0x1aa0 [27386.164849] btrfs_log_inode_parent+0x249/0xf80 [27386.164857] btrfs_log_dentry_safe+0x3e/0x60 [27386.164865] btrfs_sync_file+0x431/0x690 [27386.164872] do_fsync+0x39/0x80 [27386.164879] __x64_sys_fsync+0x13/0x20 [27386.164887] do_syscall_64+0x97/0x3e0 [27386.164894] entry_SYSCALL_64_after_hwframe+0x76/0x7e [27386.164903] -> #0 (&delayed_node->mutex){+.+.}-{3:3}: [27386.164913] __lock_acquire+0x15e9/0x2820 [27386.164920] lock_acquire+0xc9/0x2d0 [27386.164927] __mutex_lock+0xcc/0x10a0 [27386.164934] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.164944] btrfs_evict_inode+0x20b/0x4b0 [27386.164952] evict+0x15a/0x2f0 [27386.164958] prune_icache_sb+0x91/0xd0 [27386.164966] super_cache_scan+0x150/0x1d0 [27386.164974] do_shrink_slab+0x155/0x6f0 [27386.164981] shrink_slab+0x48e/0x890 [27386.164988] shrink_one+0x11a/0x1f0 [27386.164995] shrink_node+0xbfd/0x1320 [27386.165002] balance_pgdat+0x67f/0xc60 [27386.165321] kswapd+0x1dc/0x3e0 [27386.165643] kthread+0xff/0x240 [27386.165965] ret_from_fork+0x223/0x280 [27386.166287] ret_from_fork_asm+0x1a/0x30 [27386.166616] other info that might help us debug this: [27386.167561] Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim [27386.168503] Possible unsafe locking scenario: [27386.169110] CPU0 CPU1 [27386.169411] ---- ---- [27386.169707] lock(fs_reclaim); [27386.169998] lock(btrfs-tree-00); [27386.170291] lock(fs_reclaim); [27386.170581] lock(&delayed_node->mutex); [27386.170874] *** DEADLOCK *** [27386.171716] 2 locks held by kswapd0/117: [27386.171999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [27386.172294] #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0 [27386.172596] stack backtrace: [27386.173183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ #4 PREEMPT(lazy) [27386.173185] Tainted: [U]=USER [27386.173186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [27386.173187] Call Trace: [27386.173187] <TASK> [27386.173189] dump_stack_lvl+0x6e/0xa0 [27386.173192] print_circular_bug.cold+0x17a/0x1c0 [27386.173194] check_noncircular+0x175/0x190 [27386.173197] __lock_acquire+0x15e9/0x2820 [27386.173200] lock_acquire+0xc9/0x2d0 [27386.173201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173204] __mutex_lock+0xcc/0x10a0 [27386.173206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173213] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173215] btrfs_evict_inode+0x20b/0x4b0 [27386.173217] ? lock_acquire+0xc9/0x2d0 [27386.173220] evict+0x15a/0x2f0 [27386.173222] prune_icache_sb+0x91/0xd0 [27386.173224] super_cache_scan+0x150/0x1d0 [27386.173226] do_shrink_slab+0x155/0x6f0 [27386.173228] shrink_slab+0x48e/0x890 [27386.173229] ? shrink_slab+0x2d2/0x890 [27386.173231] shrink_one+0x11a/0x1f0 [27386.173234] shrink_node+0xbfd/0x1320 [27386.173236] ? shrink_node+0xa2d/0x1320 [27386.173236] ? shrink_node+0xbd3/0x1320 [27386.173239] ? balance_pgdat+0x67f/0xc60 [27386.173239] balance_pgdat+0x67f/0xc60 [27386.173241] ? finish_task_switch.isra.0+0xc4/0x2a0 [27386.173246] kswapd+0x1dc/0x3e0 [27386.173247] ? __pfx_autoremove_wake_function+0x10/0x10 [27386.173249] ? __pfx_kswapd+0x10/0x10 [27386.173250] kthread+0xff/0x240 [27386.173251] ? __pfx_kthread+0x10/0x10 [27386.173253] ret_from_fork+0x223/0x280 [27386.173255] ? __pfx_kthread+0x10/0x10 [27386.173257] ret_from_fork_asm+0x1a/0x30 [27386.173260] </TASK> This is because: 1) The fsync task is holding an inode's delayed node mutex (for a directory) while calling __btrfs_update_delayed_inode() and that needs to do a search on the subvolume's btree (therefore read lock some extent buffers); 2) The lookup task, at btrfs_lookup(), triggered reclaim with the GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while holding a read lock on a subvolume leaf; 3) The reclaim triggered kswapd which is doing inode eviction for the directory inode the fsync task is using as an argument to btrfs_commit_inode_delayed_inode() - but in that call chain we are trying to read lock the same leaf that the lookup task is holding while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL allocation. Fix this by calling btrfs_init_file_extent_tree() after we don't need the path anymore and release it in btrfs_read_locked_inode(). Reported-by: Thomas Hellström <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 8679d26 ("btrfs: initialize inode::file_extent_tree after i_mode has been set") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]>
…ked_inode() In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree() while holding a path with a read locked leaf from a subvolume tree, and btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can trigger reclaim. This can create a circular lock dependency which lockdep warns about with the following splat: [27386.164433] ====================================================== [27386.164574] WARNING: possible circular locking dependency detected [27386.164583] 6.18.0+ #4 Tainted: G U [27386.164591] ------------------------------------------------------ [27386.164599] kswapd0/117 is trying to acquire lock: [27386.164606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.164625] but task is already holding lock: [27386.164633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [27386.164646] which lock already depends on the new lock. [27386.164657] the existing dependency chain (in reverse order) is: [27386.164667] -> #2 (fs_reclaim){+.+.}-{0:0}: [27386.164677] fs_reclaim_acquire+0x9d/0xd0 [27386.164685] __kmalloc_cache_noprof+0x59/0x750 [27386.164694] btrfs_init_file_extent_tree+0x90/0x100 [27386.164702] btrfs_read_locked_inode+0xc3/0x6b0 [27386.164710] btrfs_iget+0xbb/0xf0 [27386.164716] btrfs_lookup_dentry+0x3c5/0x8e0 [27386.164724] btrfs_lookup+0x12/0x30 [27386.164731] lookup_open.isra.0+0x1aa/0x6a0 [27386.164739] path_openat+0x5f7/0xc60 [27386.164746] do_filp_open+0xd6/0x180 [27386.164753] do_sys_openat2+0x8b/0xe0 [27386.164760] __x64_sys_openat+0x54/0xa0 [27386.164768] do_syscall_64+0x97/0x3e0 [27386.164776] entry_SYSCALL_64_after_hwframe+0x76/0x7e [27386.164784] -> #1 (btrfs-tree-00){++++}-{3:3}: [27386.164794] lock_release+0x127/0x2a0 [27386.164801] up_read+0x1b/0x30 [27386.164808] btrfs_search_slot+0x8e0/0xff0 [27386.164817] btrfs_lookup_inode+0x52/0xd0 [27386.164825] __btrfs_update_delayed_inode+0x73/0x520 [27386.164833] btrfs_commit_inode_delayed_inode+0x11a/0x120 [27386.164842] btrfs_log_inode+0x608/0x1aa0 [27386.164849] btrfs_log_inode_parent+0x249/0xf80 [27386.164857] btrfs_log_dentry_safe+0x3e/0x60 [27386.164865] btrfs_sync_file+0x431/0x690 [27386.164872] do_fsync+0x39/0x80 [27386.164879] __x64_sys_fsync+0x13/0x20 [27386.164887] do_syscall_64+0x97/0x3e0 [27386.164894] entry_SYSCALL_64_after_hwframe+0x76/0x7e [27386.164903] -> #0 (&delayed_node->mutex){+.+.}-{3:3}: [27386.164913] __lock_acquire+0x15e9/0x2820 [27386.164920] lock_acquire+0xc9/0x2d0 [27386.164927] __mutex_lock+0xcc/0x10a0 [27386.164934] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.164944] btrfs_evict_inode+0x20b/0x4b0 [27386.164952] evict+0x15a/0x2f0 [27386.164958] prune_icache_sb+0x91/0xd0 [27386.164966] super_cache_scan+0x150/0x1d0 [27386.164974] do_shrink_slab+0x155/0x6f0 [27386.164981] shrink_slab+0x48e/0x890 [27386.164988] shrink_one+0x11a/0x1f0 [27386.164995] shrink_node+0xbfd/0x1320 [27386.165002] balance_pgdat+0x67f/0xc60 [27386.165321] kswapd+0x1dc/0x3e0 [27386.165643] kthread+0xff/0x240 [27386.165965] ret_from_fork+0x223/0x280 [27386.166287] ret_from_fork_asm+0x1a/0x30 [27386.166616] other info that might help us debug this: [27386.167561] Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim [27386.168503] Possible unsafe locking scenario: [27386.169110] CPU0 CPU1 [27386.169411] ---- ---- [27386.169707] lock(fs_reclaim); [27386.169998] lock(btrfs-tree-00); [27386.170291] lock(fs_reclaim); [27386.170581] lock(&delayed_node->mutex); [27386.170874] *** DEADLOCK *** [27386.171716] 2 locks held by kswapd0/117: [27386.171999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [27386.172294] #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0 [27386.172596] stack backtrace: [27386.173183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ #4 PREEMPT(lazy) [27386.173185] Tainted: [U]=USER [27386.173186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [27386.173187] Call Trace: [27386.173187] <TASK> [27386.173189] dump_stack_lvl+0x6e/0xa0 [27386.173192] print_circular_bug.cold+0x17a/0x1c0 [27386.173194] check_noncircular+0x175/0x190 [27386.173197] __lock_acquire+0x15e9/0x2820 [27386.173200] lock_acquire+0xc9/0x2d0 [27386.173201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173204] __mutex_lock+0xcc/0x10a0 [27386.173206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173213] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173215] btrfs_evict_inode+0x20b/0x4b0 [27386.173217] ? lock_acquire+0xc9/0x2d0 [27386.173220] evict+0x15a/0x2f0 [27386.173222] prune_icache_sb+0x91/0xd0 [27386.173224] super_cache_scan+0x150/0x1d0 [27386.173226] do_shrink_slab+0x155/0x6f0 [27386.173228] shrink_slab+0x48e/0x890 [27386.173229] ? shrink_slab+0x2d2/0x890 [27386.173231] shrink_one+0x11a/0x1f0 [27386.173234] shrink_node+0xbfd/0x1320 [27386.173236] ? shrink_node+0xa2d/0x1320 [27386.173236] ? shrink_node+0xbd3/0x1320 [27386.173239] ? balance_pgdat+0x67f/0xc60 [27386.173239] balance_pgdat+0x67f/0xc60 [27386.173241] ? finish_task_switch.isra.0+0xc4/0x2a0 [27386.173246] kswapd+0x1dc/0x3e0 [27386.173247] ? __pfx_autoremove_wake_function+0x10/0x10 [27386.173249] ? __pfx_kswapd+0x10/0x10 [27386.173250] kthread+0xff/0x240 [27386.173251] ? __pfx_kthread+0x10/0x10 [27386.173253] ret_from_fork+0x223/0x280 [27386.173255] ? __pfx_kthread+0x10/0x10 [27386.173257] ret_from_fork_asm+0x1a/0x30 [27386.173260] </TASK> This is because: 1) The fsync task is holding an inode's delayed node mutex (for a directory) while calling __btrfs_update_delayed_inode() and that needs to do a search on the subvolume's btree (therefore read lock some extent buffers); 2) The lookup task, at btrfs_lookup(), triggered reclaim with the GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while holding a read lock on a subvolume leaf; 3) The reclaim triggered kswapd which is doing inode eviction for the directory inode the fsync task is using as an argument to btrfs_commit_inode_delayed_inode() - but in that call chain we are trying to read lock the same leaf that the lookup task is holding while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL allocation. Fix this by calling btrfs_init_file_extent_tree() after we don't need the path anymore and release it in btrfs_read_locked_inode(). Reported-by: Thomas Hellström <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 8679d26 ("btrfs: initialize inode::file_extent_tree after i_mode has been set") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
…ked_inode() In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree() while holding a path with a read locked leaf from a subvolume tree, and btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can trigger reclaim. This can create a circular lock dependency which lockdep warns about with the following splat: [6.1433] ====================================================== [6.1574] WARNING: possible circular locking dependency detected [6.1583] 6.18.0+ #4 Tainted: G U [6.1591] ------------------------------------------------------ [6.1599] kswapd0/117 is trying to acquire lock: [6.1606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1625] but task is already holding lock: [6.1633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [6.1646] which lock already depends on the new lock. [6.1657] the existing dependency chain (in reverse order) is: [6.1667] -> #2 (fs_reclaim){+.+.}-{0:0}: [6.1677] fs_reclaim_acquire+0x9d/0xd0 [6.1685] __kmalloc_cache_noprof+0x59/0x750 [6.1694] btrfs_init_file_extent_tree+0x90/0x100 [6.1702] btrfs_read_locked_inode+0xc3/0x6b0 [6.1710] btrfs_iget+0xbb/0xf0 [6.1716] btrfs_lookup_dentry+0x3c5/0x8e0 [6.1724] btrfs_lookup+0x12/0x30 [6.1731] lookup_open.isra.0+0x1aa/0x6a0 [6.1739] path_openat+0x5f7/0xc60 [6.1746] do_filp_open+0xd6/0x180 [6.1753] do_sys_openat2+0x8b/0xe0 [6.1760] __x64_sys_openat+0x54/0xa0 [6.1768] do_syscall_64+0x97/0x3e0 [6.1776] entry_SYSCALL_64_after_hwframe+0x76/0x7e [6.1784] -> #1 (btrfs-tree-00){++++}-{3:3}: [6.1794] lock_release+0x127/0x2a0 [6.1801] up_read+0x1b/0x30 [6.1808] btrfs_search_slot+0x8e0/0xff0 [6.1817] btrfs_lookup_inode+0x52/0xd0 [6.1825] __btrfs_update_delayed_inode+0x73/0x520 [6.1833] btrfs_commit_inode_delayed_inode+0x11a/0x120 [6.1842] btrfs_log_inode+0x608/0x1aa0 [6.1849] btrfs_log_inode_parent+0x249/0xf80 [6.1857] btrfs_log_dentry_safe+0x3e/0x60 [6.1865] btrfs_sync_file+0x431/0x690 [6.1872] do_fsync+0x39/0x80 [6.1879] __x64_sys_fsync+0x13/0x20 [6.1887] do_syscall_64+0x97/0x3e0 [6.1894] entry_SYSCALL_64_after_hwframe+0x76/0x7e [6.1903] -> #0 (&delayed_node->mutex){+.+.}-{3:3}: [6.1913] __lock_acquire+0x15e9/0x2820 [6.1920] lock_acquire+0xc9/0x2d0 [6.1927] __mutex_lock+0xcc/0x10a0 [6.1934] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1944] btrfs_evict_inode+0x20b/0x4b0 [6.1952] evict+0x15a/0x2f0 [6.1958] prune_icache_sb+0x91/0xd0 [6.1966] super_cache_scan+0x150/0x1d0 [6.1974] do_shrink_slab+0x155/0x6f0 [6.1981] shrink_slab+0x48e/0x890 [6.1988] shrink_one+0x11a/0x1f0 [6.1995] shrink_node+0xbfd/0x1320 [6.1002] balance_pgdat+0x67f/0xc60 [6.1321] kswapd+0x1dc/0x3e0 [6.1643] kthread+0xff/0x240 [6.1965] ret_from_fork+0x223/0x280 [6.1287] ret_from_fork_asm+0x1a/0x30 [6.1616] other info that might help us debug this: [6.1561] Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim [6.1503] Possible unsafe locking scenario: [6.1110] CPU0 CPU1 [6.1411] ---- ---- [6.1707] lock(fs_reclaim); [6.1998] lock(btrfs-tree-00); [6.1291] lock(fs_reclaim); [6.1581] lock(&delayed_node->mutex); [6.1874] *** DEADLOCK *** [6.1716] 2 locks held by kswapd0/117: [6.1999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [6.1294] #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0 [6.1596] stack backtrace: [6.1183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ #4 PREEMPT(lazy) [6.1185] Tainted: [U]=USER [6.1186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [6.1187] Call Trace: [6.1187] <TASK> [6.1189] dump_stack_lvl+0x6e/0xa0 [6.1192] print_circular_bug.cold+0x17a/0x1c0 [6.1194] check_noncircular+0x175/0x190 [6.1197] __lock_acquire+0x15e9/0x2820 [6.1200] lock_acquire+0xc9/0x2d0 [6.1201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1204] __mutex_lock+0xcc/0x10a0 [6.1206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1213] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1215] btrfs_evict_inode+0x20b/0x4b0 [6.1217] ? lock_acquire+0xc9/0x2d0 [6.1220] evict+0x15a/0x2f0 [6.1222] prune_icache_sb+0x91/0xd0 [6.1224] super_cache_scan+0x150/0x1d0 [6.1226] do_shrink_slab+0x155/0x6f0 [6.1228] shrink_slab+0x48e/0x890 [6.1229] ? shrink_slab+0x2d2/0x890 [6.1231] shrink_one+0x11a/0x1f0 [6.1234] shrink_node+0xbfd/0x1320 [6.1236] ? shrink_node+0xa2d/0x1320 [6.1236] ? shrink_node+0xbd3/0x1320 [6.1239] ? balance_pgdat+0x67f/0xc60 [6.1239] balance_pgdat+0x67f/0xc60 [6.1241] ? finish_task_switch.isra.0+0xc4/0x2a0 [6.1246] kswapd+0x1dc/0x3e0 [6.1247] ? __pfx_autoremove_wake_function+0x10/0x10 [6.1249] ? __pfx_kswapd+0x10/0x10 [6.1250] kthread+0xff/0x240 [6.1251] ? __pfx_kthread+0x10/0x10 [6.1253] ret_from_fork+0x223/0x280 [6.1255] ? __pfx_kthread+0x10/0x10 [6.1257] ret_from_fork_asm+0x1a/0x30 [6.1260] </TASK> This is because: 1) The fsync task is holding an inode's delayed node mutex (for a directory) while calling __btrfs_update_delayed_inode() and that needs to do a search on the subvolume's btree (therefore read lock some extent buffers); 2) The lookup task, at btrfs_lookup(), triggered reclaim with the GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while holding a read lock on a subvolume leaf; 3) The reclaim triggered kswapd which is doing inode eviction for the directory inode the fsync task is using as an argument to btrfs_commit_inode_delayed_inode() - but in that call chain we are trying to read lock the same leaf that the lookup task is holding while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL allocation. Fix this by calling btrfs_init_file_extent_tree() after we don't need the path anymore and release it in btrfs_read_locked_inode(). Reported-by: Thomas Hellström <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 8679d26 ("btrfs: initialize inode::file_extent_tree after i_mode has been set") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
prp_get_untagged_frame() calls __pskb_copy() to create frame->skb_std but doesn't check if the allocation failed. If __pskb_copy() returns NULL, skb_clone() is called with a NULL pointer, causing a crash: Oops: general protection fault, probably for non-canonical address 0xdffffc000000000f: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f] CPU: 0 UID: 0 PID: 5625 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:skb_clone+0xd7/0x3a0 net/core/skbuff.c:2041 Code: 03 42 80 3c 20 00 74 08 4c 89 f7 e8 23 29 05 f9 49 83 3e 00 0f 85 a0 01 00 00 e8 94 dd 9d f8 48 8d 6b 7e 49 89 ee 49 c1 ee 03 <43> 0f b6 04 26 84 c0 0f 85 d1 01 00 00 44 0f b6 7d 00 41 83 e7 0c RSP: 0018:ffffc9000d00f200 EFLAGS: 00010207 RAX: ffffffff892235a1 RBX: 0000000000000000 RCX: ffff88803372a480 RDX: 0000000000000000 RSI: 0000000000000820 RDI: 0000000000000000 RBP: 000000000000007e R08: ffffffff8f7d0f77 R09: 1ffffffff1efa1ee R10: dffffc0000000000 R11: fffffbfff1efa1ef R12: dffffc0000000000 R13: 0000000000000820 R14: 000000000000000f R15: ffff88805144cc00 FS: 0000555557f6d500(0000) GS:ffff88808d72f000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555581d35808 CR3: 000000005040e000 CR4: 0000000000352ef0 Call Trace: <TASK> hsr_forward_do net/hsr/hsr_forward.c:-1 [inline] hsr_forward_skb+0x1013/0x2860 net/hsr/hsr_forward.c:741 hsr_handle_frame+0x6ce/0xa70 net/hsr/hsr_slave.c:84 __netif_receive_skb_core+0x10b9/0x4380 net/core/dev.c:5966 __netif_receive_skb_one_core net/core/dev.c:6077 [inline] __netif_receive_skb+0x72/0x380 net/core/dev.c:6192 netif_receive_skb_internal net/core/dev.c:6278 [inline] netif_receive_skb+0x1cb/0x790 net/core/dev.c:6337 tun_rx_batched+0x1b9/0x730 drivers/net/tun.c:1485 tun_get_user+0x2b65/0x3e90 drivers/net/tun.c:1953 tun_chr_write_iter+0x113/0x200 drivers/net/tun.c:1999 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x5c9/0xb30 fs/read_write.c:686 ksys_write+0x145/0x250 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f0449f8e1ff Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 f9 92 02 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 4c 93 02 00 48 RSP: 002b:00007ffd7ad94c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007f044a1e5fa0 RCX: 00007f0449f8e1ff RDX: 000000000000003e RSI: 0000200000000500 RDI: 00000000000000c8 RBP: 00007ffd7ad94d20 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000003e R11: 0000000000000293 R12: 0000000000000001 R13: 00007f044a1e5fa0 R14: 00007f044a1e5fa0 R15: 0000000000000003 </TASK> Add a NULL check immediately after __pskb_copy() to handle allocation failures gracefully. Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=2fa344348a579b779e05 Fixes: f266a68 ("net/hsr: Better frame dispatch") Cc: [email protected] Signed-off-by: Shaurya Rane <[email protected]> Reviewed-by: Felix Maurer <[email protected]> Tested-by: Felix Maurer <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
…in ets_qdisc_change [email protected] says: The vulnerability is a race condition between `ets_qdisc_dequeue` and `ets_qdisc_change`. It leads to UAF on `struct Qdisc` object. Attacker requires the capability to create new user and network namespace in order to trigger the bug. See my additional commentary at the end of the analysis. Analysis: static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { ... // (1) this lock is preventing .change handler (`ets_qdisc_change`) //to race with .dequeue handler (`ets_qdisc_dequeue`) sch_tree_lock(sch); for (i = nbands; i < oldbands; i++) { if (i >= q->nstrict && q->classes[i].qdisc->q.qlen) list_del_init(&q->classes[i].alist); qdisc_purge_queue(q->classes[i].qdisc); } WRITE_ONCE(q->nbands, nbands); for (i = nstrict; i < q->nstrict; i++) { if (q->classes[i].qdisc->q.qlen) { // (2) the class is added to the q->active list_add_tail(&q->classes[i].alist, &q->active); q->classes[i].deficit = quanta[i]; } } WRITE_ONCE(q->nstrict, nstrict); memcpy(q->prio2band, priomap, sizeof(priomap)); for (i = 0; i < q->nbands; i++) WRITE_ONCE(q->classes[i].quantum, quanta[i]); for (i = oldbands; i < q->nbands; i++) { q->classes[i].qdisc = queues[i]; if (q->classes[i].qdisc != &noop_qdisc) qdisc_hash_add(q->classes[i].qdisc, true); } // (3) the qdisc is unlocked, now dequeue can be called in parallel // to the rest of .change handler sch_tree_unlock(sch); ets_offload_change(sch); for (i = q->nbands; i < oldbands; i++) { // (4) we're reducing the refcount for our class's qdisc and // freeing it qdisc_put(q->classes[i].qdisc); // (5) If we call .dequeue between (4) and (5), we will have // a strong UAF and we can control RIP q->classes[i].qdisc = NULL; WRITE_ONCE(q->classes[i].quantum, 0); q->classes[i].deficit = 0; gnet_stats_basic_sync_init(&q->classes[i].bstats); memset(&q->classes[i].qstats, 0, sizeof(q->classes[i].qstats)); } return 0; } Comment: This happens because some of the classes have their qdiscs assigned to NULL, but remain in the active list. This commit fixes this issue by always removing the class from the active list before deleting and freeing its associated qdisc Reproducer Steps (trimmed version of what was sent by [email protected]) ``` DEV="${DEV:-lo}" ROOT_HANDLE="${ROOT_HANDLE:-1:}" BAND2_HANDLE="${BAND2_HANDLE:-20:}" # child under 1:2 PING_BYTES="${PING_BYTES:-48}" PING_COUNT="${PING_COUNT:-200000}" PING_DST="${PING_DST:-127.0.0.1}" SLOW_TBF_RATE="${SLOW_TBF_RATE:-8bit}" SLOW_TBF_BURST="${SLOW_TBF_BURST:-100b}" SLOW_TBF_LAT="${SLOW_TBF_LAT:-1s}" cleanup() { tc qdisc del dev "$DEV" root 2>/dev/null } trap cleanup EXIT ip link set "$DEV" up tc qdisc del dev "$DEV" root 2>/dev/null || true tc qdisc add dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2 tc qdisc add dev "$DEV" parent 1:2 handle "$BAND2_HANDLE" \ tbf rate "$SLOW_TBF_RATE" burst "$SLOW_TBF_BURST" latency "$SLOW_TBF_LAT" tc filter add dev "$DEV" parent 1: protocol all prio 1 u32 match u32 0 0 flowid 1:2 tc -s qdisc ls dev $DEV ping -I "$DEV" -f -c "$PING_COUNT" -s "$PING_BYTES" -W 0.001 "$PING_DST" \ >/dev/null 2>&1 & tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 0 tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2 tc -s qdisc ls dev $DEV tc qdisc del dev "$DEV" parent 1:2 || true tc -s qdisc ls dev $DEV tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 1 strict 1 ``` KASAN report ``` ================================================================== BUG: KASAN: slab-use-after-free in ets_qdisc_dequeue+0x1071/0x11b0 kernel/net/sched/sch_ets.c:481 Read of size 8 at addr ffff8880502fc018 by task ping/12308 > CPU: 0 UID: 0 PID: 12308 Comm: ping Not tainted 6.18.0-rc4-dirty #1 PREEMPT(full) Hardware name: QEMU Ubuntu 25.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <IRQ> __dump_stack kernel/lib/dump_stack.c:94 dump_stack_lvl+0x100/0x190 kernel/lib/dump_stack.c:120 print_address_description kernel/mm/kasan/report.c:378 print_report+0x156/0x4c9 kernel/mm/kasan/report.c:482 kasan_report+0xdf/0x110 kernel/mm/kasan/report.c:595 ets_qdisc_dequeue+0x1071/0x11b0 kernel/net/sched/sch_ets.c:481 dequeue_skb kernel/net/sched/sch_generic.c:294 qdisc_restart kernel/net/sched/sch_generic.c:399 __qdisc_run+0x1c9/0x1b00 kernel/net/sched/sch_generic.c:417 __dev_xmit_skb kernel/net/core/dev.c:4221 __dev_queue_xmit+0x2848/0x4410 kernel/net/core/dev.c:4729 dev_queue_xmit kernel/./include/linux/netdevice.h:3365 [...] Allocated by task 17115: kasan_save_stack+0x30/0x50 kernel/mm/kasan/common.c:56 kasan_save_track+0x14/0x30 kernel/mm/kasan/common.c:77 poison_kmalloc_redzone kernel/mm/kasan/common.c:400 __kasan_kmalloc+0xaa/0xb0 kernel/mm/kasan/common.c:417 kasan_kmalloc kernel/./include/linux/kasan.h:262 __do_kmalloc_node kernel/mm/slub.c:5642 __kmalloc_node_noprof+0x34e/0x990 kernel/mm/slub.c:5648 kmalloc_node_noprof kernel/./include/linux/slab.h:987 qdisc_alloc+0xb8/0xc30 kernel/net/sched/sch_generic.c:950 qdisc_create_dflt+0x93/0x490 kernel/net/sched/sch_generic.c:1012 ets_class_graft+0x4fd/0x800 kernel/net/sched/sch_ets.c:261 qdisc_graft+0x3e4/0x1780 kernel/net/sched/sch_api.c:1196 [...] Freed by task 9905: kasan_save_stack+0x30/0x50 kernel/mm/kasan/common.c:56 kasan_save_track+0x14/0x30 kernel/mm/kasan/common.c:77 __kasan_save_free_info+0x3b/0x70 kernel/mm/kasan/generic.c:587 kasan_save_free_info kernel/mm/kasan/kasan.h:406 poison_slab_object kernel/mm/kasan/common.c:252 __kasan_slab_free+0x5f/0x80 kernel/mm/kasan/common.c:284 kasan_slab_free kernel/./include/linux/kasan.h:234 slab_free_hook kernel/mm/slub.c:2539 slab_free kernel/mm/slub.c:6630 kfree+0x144/0x700 kernel/mm/slub.c:6837 rcu_do_batch kernel/kernel/rcu/tree.c:2605 rcu_core+0x7c0/0x1500 kernel/kernel/rcu/tree.c:2861 handle_softirqs+0x1ea/0x8a0 kernel/kernel/softirq.c:622 __do_softirq kernel/kernel/softirq.c:656 [...] Commentary: 1. Maher Azzouzi working with Trend Micro Zero Day Initiative was reported as the person who found the issue. I requested to get a proper email to add to the reported-by tag but got no response. For this reason i will credit the person i exchanged emails with i.e [email protected] 2. Neither i nor Victor who did a much more thorough testing was able to reproduce a UAF with the PoC or other approaches we tried. We were both able to reproduce a null ptr deref. After exchange with [email protected] they sent a small change to be made to the code to add an extra delay which was able to simulate the UAF. i.e, this: qdisc_put(q->classes[i].qdisc); mdelay(90); q->classes[i].qdisc = NULL; I was informed by Thomas Gleixner([email protected]) that adding delays was acceptable approach for demonstrating the bug, quote: "Adding such delays is common exploit validation practice" The equivalent delay could happen "by virt scheduling the vCPU out, SMIs, NMIs, PREEMPT_RT enabled kernel" 3. I asked the OP to test and report back but got no response and after a few days gave up and proceeded to submit this fix. Fixes: de6d259 ("net/sched: sch_ets: don't peek at classes beyond 'nbands'") Reported-by: [email protected] Tested-by: Victor Nogueira <[email protected]> Signed-off-by: Jamal Hadi Salim <[email protected]> Reviewed-by: Davide Caratti <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
Clear hca_devcom_comp in device's private data after unregistering it in LAG teardown. Otherwise a slightly lagging second pass through mlx5_unload_one() might try to unregister it again and trip over use-after-free. On s390 almost all PCI level recovery events trigger two passes through mxl5_unload_one() - one through the poll_health() method and one through mlx5_pci_err_detected() as callback from generic PCI error recovery. While testing PCI error recovery paths with more kernel debug features enabled, this issue reproducibly led to kernel panics with the following call chain: Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6803 ESOP-2 FSI Fault in home space mode while using kernel ASCE. AS:00000000705c4007 R3:0000000000000024 Oops: 0038 ilc:3 [#1]SMP CPU: 14 UID: 0 PID: 156 Comm: kmcheck Kdump: loaded Not tainted 6.18.0-20251130.rc7.git0.16131a59cab1.300.fc43.s390x+debug #1 PREEMPT Krnl PSW : 0404e00180000000 0000020fc86aa1dc (__lock_acquire+0x5c/0x15f0) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0000000000000000 0000020f00000001 6b6b6b6b6b6b6c33 0000000000000000 0000000000000000 0000000000000000 0000000000000001 0000000000000000 0000000000000000 0000020fca28b820 0000000000000000 0000010a1ced8100 0000010a1ced8100 0000020fc9775068 0000018fce14f8b8 0000018fce14f7f8 Krnl Code: 0000020fc86aa1cc: e3b003400004 lg %r11,832 0000020fc86aa1d2: a7840211 brc 8,0000020fc86aa5f4 *0000020fc86aa1d6: c09000df0b25 larl %r9,0000020fca28b820 >0000020fc86aa1dc: d50790002000 clc 0(8,%r9),0(%r2) 0000020fc86aa1e2: a7840209 brc 8,0000020fc86aa5f4 0000020fc86aa1e6: c0e001100401 larl %r14,0000020fca8aa9e8 0000020fc86aa1ec: c01000e25a00 larl %r1,0000020fca2f55ec 0000020fc86aa1f2: a7eb00e8 aghi %r14,232 Call Trace: __lock_acquire+0x5c/0x15f0 lock_acquire.part.0+0xf8/0x270 lock_acquire+0xb0/0x1b0 down_write+0x5a/0x250 mlx5_detach_device+0x42/0x110 [mlx5_core] mlx5_unload_one_devl_locked+0x50/0xc0 [mlx5_core] mlx5_unload_one+0x42/0x60 [mlx5_core] mlx5_pci_err_detected+0x94/0x150 [mlx5_core] zpci_event_attempt_error_recovery+0xcc/0x388 Fixes: 5a977b5 ("net/mlx5: Lag, move devcom registration to LAG layer") Signed-off-by: Gerd Bayer <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Acked-by: Tariq Toukan <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
…stats Cited commit added a dedicated mutex (instead of RTNL) to protect the multicast route list, so that it will not change while the driver periodically traverses it in order to update the kernel about multicast route stats that were queried from the device. One instance of list entry deletion (during route replace) was missed and it can result in a use-after-free [1]. Fix by acquiring the mutex before deleting the entry from the list and releasing it afterwards. [1] BUG: KASAN: slab-use-after-free in mlxsw_sp_mr_stats_update+0x4a5/0x540 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:1006 [mlxsw_spectrum] Read of size 8 at addr ffff8881523c2fa8 by task kworker/2:5/22043 CPU: 2 UID: 0 PID: 22043 Comm: kworker/2:5 Not tainted 6.18.0-rc1-custom-g1a3d6d7cd014 #1 PREEMPT(full) Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017 Workqueue: mlxsw_core mlxsw_sp_mr_stats_update [mlxsw_spectrum] Call Trace: <TASK> dump_stack_lvl+0xba/0x110 print_report+0x174/0x4f5 kasan_report+0xdf/0x110 mlxsw_sp_mr_stats_update+0x4a5/0x540 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:1006 [mlxsw_spectrum] process_one_work+0x9cc/0x18e0 worker_thread+0x5df/0xe40 kthread+0x3b8/0x730 ret_from_fork+0x3e9/0x560 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 29933: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 mlxsw_sp_mr_route_add+0xd8/0x4770 [mlxsw_spectrum] mlxsw_sp_router_fibmr_event_work+0x371/0xad0 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:7965 [mlxsw_spectrum] process_one_work+0x9cc/0x18e0 worker_thread+0x5df/0xe40 kthread+0x3b8/0x730 ret_from_fork+0x3e9/0x560 ret_from_fork_asm+0x1a/0x30 Freed by task 29933: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_save_free_info+0x3b/0x70 __kasan_slab_free+0x43/0x70 kfree+0x14e/0x700 mlxsw_sp_mr_route_add+0x2dea/0x4770 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:444 [mlxsw_spectrum] mlxsw_sp_router_fibmr_event_work+0x371/0xad0 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:7965 [mlxsw_spectrum] process_one_work+0x9cc/0x18e0 worker_thread+0x5df/0xe40 kthread+0x3b8/0x730 ret_from_fork+0x3e9/0x560 ret_from_fork_asm+0x1a/0x30 Fixes: f38656d ("mlxsw: spectrum_mr: Protect multicast route list with a lock") Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/f996feecfd59fde297964bfc85040b6d83ec6089.1764695650.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <[email protected]>
Jakub reported an MPTCP deadlock at fallback time: WARNING: possible recursive locking detected 6.18.0-rc7-virtme #1 Not tainted -------------------------------------------- mptcp_connect/20858 is trying to acquire lock: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_try_fallback+0xd8/0x280 but task is already holding lock: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&msk->fallback_lock); lock(&msk->fallback_lock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by mptcp_connect/20858: #0: ff1100001da18290 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x114/0x1bc0 #1: ff1100001db40fd0 (k-sk_lock-AF_INET#2){+.+.}-{0:0}, at: __mptcp_retrans+0x2cb/0xaa0 #2: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0 stack backtrace: CPU: 0 UID: 0 PID: 20858 Comm: mptcp_connect Not tainted 6.18.0-rc7-virtme #1 PREEMPT(full) Hardware name: Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x6f/0xa0 print_deadlock_bug.cold+0xc0/0xcd validate_chain+0x2ff/0x5f0 __lock_acquire+0x34c/0x740 lock_acquire.part.0+0xbc/0x260 _raw_spin_lock_bh+0x38/0x50 __mptcp_try_fallback+0xd8/0x280 mptcp_sendmsg_frag+0x16c2/0x3050 __mptcp_retrans+0x421/0xaa0 mptcp_release_cb+0x5aa/0xa70 release_sock+0xab/0x1d0 mptcp_sendmsg+0xd5b/0x1bc0 sock_write_iter+0x281/0x4d0 new_sync_write+0x3c5/0x6f0 vfs_write+0x65e/0xbb0 ksys_write+0x17e/0x200 do_syscall_64+0xbb/0xfd0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fa5627cbc5e Code: 4d 89 d8 e8 14 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 002b:00007fff1fe14700 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fa5627cbc5e RDX: 0000000000001f9c RSI: 00007fff1fe16984 RDI: 0000000000000005 RBP: 00007fff1fe14710 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff1fe16920 R13: 0000000000002000 R14: 0000000000001f9c R15: 0000000000001f9c The packet scheduler could attempt a reinjection after receiving an MP_FAIL and before the infinite map has been transmitted, causing a deadlock since MPTCP needs to do the reinjection atomically from WRT fallback. Address the issue explicitly avoiding the reinjection in the critical scenario. Note that this is the only fallback critical section that could potentially send packets and hit the double-lock. Reported-by: Jakub Kicinski <[email protected]> Closes: https://netdev-ctrl.bots.linux.dev/logs/vmksft/mptcp-dbg/results/412720/1-mptcp-join-sh/stderr Fixes: f8a1d9b ("mptcp: make fallback action and fallback decision atomic") Cc: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20251205-net-mptcp-misc-fixes-6-19-rc1-v1-4-9e4781a6c1b8@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
Petr Machata says: ==================== selftests: forwarding: vxlan_bridge_1q_mc_ul: Fix flakiness The net/forwarding/vxlan_bridge_1q_mc_ul selftest runs an overlay traffic, forwarded over a multicast-routed VXLAN underlay. In order to determine whether packets reach their intended destination, it uses a TC match. For convenience, it uses a flower match, which however does not allow matching on the encapsulated packet. So various service traffic ends up being indistinguishable from the test packets, and ends up confusing the test. To alleviate the problem, the test uses sleep to allow the necessary service traffic to run and clear the channel, before running the test traffic. This worked for a while, but lately we have nevertheless seen flakiness of the test in the CI. In this patchset, first generalize tc_rule_stats_get() to support u32 in patch #1, then in patch #2 convert the test to use u32 to allow parsing deeper into the packet, and in #3 drop the now-unnecessary sleep. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
With the RAPL PMU addition, there is a recursive locking when CPU online callback function calls rapl_package_add_pmu(). Here cpu_hotplug_lock is already acquired by cpuhp_thread_fun() and rapl_package_add_pmu() tries to acquire again. <4>[ 8.197433] ============================================ <4>[ 8.197437] WARNING: possible recursive locking detected <4>[ 8.197440] 6.19.0-rc1-lgci-xe-xe-4242-05b7c58b3367dca84+ #1 Not tainted <4>[ 8.197444] -------------------------------------------- <4>[ 8.197447] cpuhp/0/20 is trying to acquire lock: <4>[ 8.197450] ffffffff83487870 (cpu_hotplug_lock){++++}-{0:0}, at: rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197463] but task is already holding lock: <4>[ 8.197466] ffffffff83487870 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x6d/0x290 <4>[ 8.197477] other info that might help us debug this: <4>[ 8.197480] Possible unsafe locking scenario: <4>[ 8.197483] CPU0 <4>[ 8.197485] ---- <4>[ 8.197487] lock(cpu_hotplug_lock); <4>[ 8.197490] lock(cpu_hotplug_lock); <4>[ 8.197493] *** DEADLOCK *** .. .. <4>[ 8.197542] __lock_acquire+0x146e/0x2790 <4>[ 8.197548] lock_acquire+0xc4/0x2c0 <4>[ 8.197550] ? rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197556] cpus_read_lock+0x41/0x110 <4>[ 8.197558] ? rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197561] rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197565] rapl_cpu_online+0x85/0x87 [intel_rapl_msr] <4>[ 8.197568] ? __pfx_rapl_cpu_online+0x10/0x10 [intel_rapl_msr] <4>[ 8.197570] cpuhp_invoke_callback+0x41f/0x6c0 <4>[ 8.197573] ? cpuhp_thread_fun+0x6d/0x290 <4>[ 8.197575] cpuhp_thread_fun+0x1e2/0x290 <4>[ 8.197578] ? smpboot_thread_fn+0x26/0x290 <4>[ 8.197581] smpboot_thread_fn+0x12f/0x290 <4>[ 8.197584] ? __pfx_smpboot_thread_fn+0x10/0x10 <4>[ 8.197586] kthread+0x11f/0x250 <4>[ 8.197589] ? __pfx_kthread+0x10/0x10 <4>[ 8.197592] ret_from_fork+0x344/0x3a0 <4>[ 8.197595] ? __pfx_kthread+0x10/0x10 <4>[ 8.197597] ret_from_fork_asm+0x1a/0x30 <4>[ 8.197604] </TASK> Fix this issue in the same way as rapl powercap package domain is added from the same CPU online callback by introducing another interface which doesn't call cpus_read_lock(). Add rapl_package_add_pmu_locked() and rapl_package_remove_pmu_locked() which don't call cpus_read_lock(). Fixes: 748d6ba ("powercap: intel_rapl: Enable MSR-based RAPL PMU support") Reported-by: Borah, Chaitanya Kumar <[email protected]> Closes: https://lore.kernel.org/linux-pm/[email protected]/T/#u Tested-by: Kuppuswamy Sathyanarayanan <[email protected]> Tested-by: RavitejaX Veesam <[email protected]> Signed-off-by: Srinivas Pandruvada <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Rafael J. Wysocki <[email protected]>
Fix a loop scenario of ethx:egress->ethx:egress
Example setup to reproduce:
tc qdisc add dev ethx root handle 1: drr
tc filter add dev ethx parent 1: protocol ip prio 1 matchall \
action mirred egress redirect dev ethx
Now ping out of ethx and you get a deadlock:
[ 116.892898][ T307] ============================================
[ 116.893182][ T307] WARNING: possible recursive locking detected
[ 116.893418][ T307] 6.18.0-rc6-01205-ge05021a829b8-dirty #204 Not tainted
[ 116.893682][ T307] --------------------------------------------
[ 116.893926][ T307] ping/307 is trying to acquire lock:
[ 116.894133][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.894517][ T307]
[ 116.894517][ T307] but task is already holding lock:
[ 116.894836][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.895252][ T307]
[ 116.895252][ T307] other info that might help us debug this:
[ 116.895608][ T307] Possible unsafe locking scenario:
[ 116.895608][ T307]
[ 116.895901][ T307] CPU0
[ 116.896057][ T307] ----
[ 116.896200][ T307] lock(&sch->root_lock_key);
[ 116.896392][ T307] lock(&sch->root_lock_key);
[ 116.896605][ T307]
[ 116.896605][ T307] *** DEADLOCK ***
[ 116.896605][ T307]
[ 116.896864][ T307] May be due to missing lock nesting notation
[ 116.896864][ T307]
[ 116.897123][ T307] 6 locks held by ping/307:
[ 116.897302][ T307] #0: ffff88800b4b0250 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0xb20/0x2cf0
[ 116.897808][ T307] #1: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_output+0xa9/0x600
[ 116.898138][ T307] #2: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x2c6/0x1ee0
[ 116.898459][ T307] #3: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[ 116.898782][ T307] #4: ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.899132][ T307] #5: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[ 116.899442][ T307]
[ 116.899442][ T307] stack backtrace:
[ 116.899667][ T307] CPU: 2 UID: 0 PID: 307 Comm: ping Not tainted 6.18.0-rc6-01205-ge05021a829b8-dirty #204 PREEMPT(voluntary)
[ 116.899672][ T307] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 116.899675][ T307] Call Trace:
[ 116.899678][ T307] <TASK>
[ 116.899680][ T307] dump_stack_lvl+0x6f/0xb0
[ 116.899688][ T307] print_deadlock_bug.cold+0xc0/0xdc
[ 116.899695][ T307] __lock_acquire+0x11f7/0x1be0
[ 116.899704][ T307] lock_acquire+0x162/0x300
[ 116.899707][ T307] ? __dev_queue_xmit+0x2210/0x3b50
[ 116.899713][ T307] ? srso_alias_return_thunk+0x5/0xfbef5
[ 116.899717][ T307] ? stack_trace_save+0x93/0xd0
[ 116.899723][ T307] _raw_spin_lock+0x30/0x40
[ 116.899728][ T307] ? __dev_queue_xmit+0x2210/0x3b50
[ 116.899731][ T307] __dev_queue_xmit+0x2210/0x3b50
Fixes: 178ca30 ("Revert "net/sched: Fix mirred deadlock on device recursion"")
Tested-by: Victor Nogueira <[email protected]>
Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Avoid a possible UAF in GPU recovery due to a race between the sched timeout callback and the tdr work queue. The gpu recovery function calls drm_sched_stop() and later drm_sched_start(). drm_sched_start() restarts the tdr queue which will eventually free the job. If the tdr queue frees the job before time out callback completes, the job will be freed and we'll get a UAF when accessing the pasid. Cache it early to avoid the UAF. Example KASAN trace: [ 493.058141] BUG: KASAN: slab-use-after-free in amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.067530] Read of size 4 at addr ffff88b0ce3f794c by task kworker/u128:1/323 [ 493.074892] [ 493.076485] CPU: 9 UID: 0 PID: 323 Comm: kworker/u128:1 Tainted: G E 6.16.0-1289896.2.zuul.bf4f11df81c1410bbe901c4373305a31 #1 PREEMPT(voluntary) [ 493.076493] Tainted: [E]=UNSIGNED_MODULE [ 493.076495] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019 [ 493.076500] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched] [ 493.076512] Call Trace: [ 493.076515] <TASK> [ 493.076518] dump_stack_lvl+0x64/0x80 [ 493.076529] print_report+0xce/0x630 [ 493.076536] ? _raw_spin_lock_irqsave+0x86/0xd0 [ 493.076541] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 493.076545] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077253] kasan_report+0xb8/0xf0 [ 493.077258] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077965] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.078672] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu] [ 493.079378] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu] [ 493.080111] amdgpu_job_timedout+0x642/0x1400 [amdgpu] [ 493.080903] ? pick_task_fair+0x24e/0x330 [ 493.080910] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu] [ 493.081702] ? _raw_spin_lock+0x75/0xc0 [ 493.081708] ? __pfx__raw_spin_lock+0x10/0x10 [ 493.081712] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched] [ 493.081721] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081725] process_one_work+0x679/0xff0 [ 493.081732] worker_thread+0x6ce/0xfd0 [ 493.081736] ? __pfx_worker_thread+0x10/0x10 [ 493.081739] kthread+0x376/0x730 [ 493.081744] ? __pfx_kthread+0x10/0x10 [ 493.081748] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081751] ? __pfx_kthread+0x10/0x10 [ 493.081755] ret_from_fork+0x247/0x330 [ 493.081761] ? __pfx_kthread+0x10/0x10 [ 493.081764] ret_from_fork_asm+0x1a/0x30 [ 493.081771] </TASK> Fixes: a72002c ("drm/amdgpu: Make use of drm_wedge_task_info") Link: HansKristian-Work/vkd3d-proton#2670 Cc: [email protected] Cc: [email protected] Cc: [email protected] Suggested-by: Matthew Brost <[email protected]> Reviewed-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 20880a3)
Test pull request, patches from Liu Bo on the list.