Skip to content

Jit BPF_CALL to direct call when possible #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

danielocfb
Copy link
Owner

Pull request for series with
subject: Jit BPF_CALL to direct call when possible
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=673664

@danielocfb
Copy link
Owner Author

Master branch: 9fad7fe
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=673664
version: 1

Pull request is NOT updated. Failed to apply https://patchwork.kernel.org/project/netdevbpf/list/?series=673664
error message:

Cmd('git') failed due to: exit code(128)
  cmdline: git am -3
  stdout: 'Applying: bpf, arm64: Jit BPF_CALL to direct call when possible
Applying: bpf, arm64: Eliminate false -EFBIG error in bpf trampoline
Patch failed at 0002 bpf, arm64: Eliminate false -EFBIG error in bpf trampoline
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".'
  stderr: 'error: sha1 information is lacking or useless (arch/arm64/net/bpf_jit_comp.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch'

conflict:


@danielocfb danielocfb closed this Sep 7, 2022
@danielocfb danielocfb deleted the series/673664=>bpf-next branch September 7, 2022 18:52
danielocfb pushed a commit that referenced this pull request Oct 19, 2022
Inspired by commit 9fb7410("arm64/BUG: Use BRK instruction for
generic BUG traps"), do similar for LoongArch to use generic BUG()
handler.

This patch uses the BREAK software breakpoint instruction to generate
a trap instead, similarly to most other arches, with the generic BUG
code generating the dmesg boilerplate.

This allows bug metadata to be moved to a separate table and reduces
the amount of inline code at BUG() and WARN() sites. This also avoids
clobbering any registers before they can be dumped.

To mitigate the size of the bug table further, this patch makes use of
the existing infrastructure for encoding addresses within the bug table
as 32-bit relative pointers instead of absolute pointers.

(Note: this limits the max kernel size to 2GB.)

Before patch:
[ 3018.338013] lkdtm: Performing direct entry BUG
[ 3018.342445] Kernel bug detected[#5]:
[ 3018.345992] CPU: 2 PID: 865 Comm: cat Tainted: G D 6.0.0-rc6+ #35

After patch:
[  125.585985] lkdtm: Performing direct entry BUG
[  125.590433] ------------[ cut here ]------------
[  125.595020] kernel BUG at drivers/misc/lkdtm/bugs.c:78!
[  125.600211] Oops - BUG[#1]:
[  125.602980] CPU: 3 PID: 410 Comm: cat Not tainted 6.0.0-rc6+ #36

Out-of-line file/line data information obtained compared to before.

Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 19, 2022
ASAN reports an use-after-free in btf_dump_name_dups:

ERROR: AddressSanitizer: heap-use-after-free on address 0xffff927006db at pc 0xaaaab5dfb618 bp 0xffffdd89b890 sp 0xffffdd89b928
READ of size 2 at 0xffff927006db thread T0
    #0 0xaaaab5dfb614 in __interceptor_strcmp.part.0 (test_progs+0x21b614)
    #1 0xaaaab635f144 in str_equal_fn tools/lib/bpf/btf_dump.c:127
    #2 0xaaaab635e3e0 in hashmap_find_entry tools/lib/bpf/hashmap.c:143
    #3 0xaaaab635e72c in hashmap__find tools/lib/bpf/hashmap.c:212
    #4 0xaaaab6362258 in btf_dump_name_dups tools/lib/bpf/btf_dump.c:1525
    #5 0xaaaab636240c in btf_dump_resolve_name tools/lib/bpf/btf_dump.c:1552
    #6 0xaaaab6362598 in btf_dump_type_name tools/lib/bpf/btf_dump.c:1567
    #7 0xaaaab6360b48 in btf_dump_emit_struct_def tools/lib/bpf/btf_dump.c:912
    #8 0xaaaab6360630 in btf_dump_emit_type tools/lib/bpf/btf_dump.c:798
    #9 0xaaaab635f720 in btf_dump__dump_type tools/lib/bpf/btf_dump.c:282
    #10 0xaaaab608523c in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:236
    #11 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #12 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #13 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #14 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #15 0xaaaab5d65990  (test_progs+0x185990)

0xffff927006db is located 11 bytes inside of 16-byte region [0xffff927006d0,0xffff927006e0)
freed by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353e10 in btf__add_field tools/lib/bpf/btf.c:2032
    #7 0xaaaab6084fcc in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:232
    #8 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #9 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #10 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #11 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #12 0xaaaab5d65990  (test_progs+0x185990)

previously allocated by thread T0 here:
    #0 0xaaaab5e2c7c4 in realloc (test_progs+0x24c7c4)
    #1 0xaaaab634f4a0 in libbpf_reallocarray tools/lib/bpf/libbpf_internal.h:191
    #2 0xaaaab634f840 in libbpf_add_mem tools/lib/bpf/btf.c:163
    #3 0xaaaab636643c in strset_add_str_mem tools/lib/bpf/strset.c:106
    #4 0xaaaab6366560 in strset__add_str tools/lib/bpf/strset.c:157
    #5 0xaaaab6352d70 in btf__add_str tools/lib/bpf/btf.c:1519
    #6 0xaaaab6353ff0 in btf_add_enum_common tools/lib/bpf/btf.c:2070
    #7 0xaaaab6354080 in btf__add_enum tools/lib/bpf/btf.c:2102
    #8 0xaaaab6082f50 in test_btf_dump_incremental tools/testing/selftests/bpf/prog_tests/btf_dump.c:162
    #9 0xaaaab6097530 in test_btf_dump tools/testing/selftests/bpf/prog_tests/btf_dump.c:875
    #10 0xaaaab6314ed0 in run_one_test tools/testing/selftests/bpf/test_progs.c:1062
    #11 0xaaaab631a0a8 in main tools/testing/selftests/bpf/test_progs.c:1697
    #12 0xffff9676d214 in __libc_start_main ../csu/libc-start.c:308
    #13 0xaaaab5d65990  (test_progs+0x185990)

The reason is that the key stored in hash table name_map is a string
address, and the string memory is allocated by realloc() function, when
the memory is resized by realloc() later, the old memory may be freed,
so the address stored in name_map references to a freed memory, causing
use-after-free.

Fix it by storing duplicated string address in name_map.

Fixes: 919d2b1 ("libbpf: Allow modification of BTF and add btf__add_str API")
Signed-off-by: Xu Kuohai <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
danielocfb pushed a commit that referenced this pull request Nov 3, 2022
Daniel Machon says:

====================
Add new PCP and APPTRUST attributes to dcbnl

This patch series adds new extension attributes to dcbnl, to support PCP
prioritization (and thereby hw offloadable pcp-based queue
classification) and per-selector trust and trust order. Additionally,
the microchip sparx5 driver has been dcb-enabled to make use of the new
attributes to offload PCP, DSCP and Default prio to the switch, and
implement trust order of selectors.

For pre-RFC discussion see:
https://lore.kernel.org/netdev/Yv9VO1DYAxNduw6A@DEN-LT-70577/

For RFC series see:
https://lore.kernel.org/netdev/[email protected]/

In summary: there currently exist no convenient way to offload per-port
PCP-based queue classification to hardware. The DCB subsystem offers
different ways to prioritize through its APP table, but lacks an option
for PCP. Similarly, there is no way to indicate the notion of trust for
APP table selectors. This patch series addresses both topics.

PCP based queue classification:
  - 8021Q standardizes the Priority Code Point table (see 6.9.3 of IEEE
    Std 802.1Q-2018).  This patch series makes it possible, to offload
    the PCP classification to said table.  The new PCP selector is not a
    standard part of the APP managed object, therefore it is
    encapsulated in a new non-std extension attribute.

Selector trust:
  - ASIC's often has the notion of trust DSCP and trust PCP. The new
    attribute makes it possible to specify a trust order of app
    selectors, which drivers can then react on.

DCB-enable sparx5 driver:
 - Now supports offloading of DSCP, PCP and default priority. Only one
   mapping of protocol:priority is allowed. Consecutive mappings of the
   same protocol to some new priority, will overwrite the previous. This
   is to keep a consistent view of the app table and the hardware.
 - Now supports dscp and pcp trust, by use of the introduced
   dcbnl_set/getapptrust ops. Sparx5 supports trust orders: [], [dscp],
   [pcp] and [dscp, pcp]. For now, only DSCP and PCP selectors are
   supported by the driver, everything else is bounced.

Patch #1 introduces a new PCP selector to the APP object, which makes it
possible to encode PCP and DEI in the app triplet and offload it to the
PCP table of the ASIC.

Patch #2 Introduces the new extension attributes
DCB_ATTR_DCB_APP_TRUST_TABLE and DCB_ATTR_DCB_APP_TRUST. Trusted
selectors are passed in the nested DCB_ATTR_DCB_APP_TRUST_TABLE
attribute, and assembled into an array of selectors:

  u8 selectors[256];

where lower indexes has higher precedence.  In the array, selectors are
stored consecutively, starting from index zero. With a maximum number of
256 unique selectors, the list has the same maximum size.

Patch #3 Sets up the dcbnl ops hook, and adds support for offloading pcp
app entries, to the PCP table of the switch.

Patch #4 Makes use of the dcbnl_set/getapptrust ops, to set a per-port
trust order.

Patch #5 Adds support for offloading dscp app entries to the DSCP table
of the switch.

Patch #6 Adds support for offloading default prio app entries to the
switch.

====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
danielocfb pushed a commit that referenced this pull request Nov 4, 2022
Andrii Nakryiko says:

====================

This patch set fixes and improves BPF verifier's precision tracking logic for
SCALAR registers.

Patches #1 and #2 are bug fixes discovered while working on these changes.

Patch #3 enables precision tracking for BPF programs that contain subprograms.
This was disabled before and prevent any modern BPF programs that use
subprograms from enjoying the benefits of SCALAR (im)precise logic.

Patch #4 is few lines of code changes and many lines of explaining why those
changes are correct. We establish why ignoring precise markings in current
state is OK.

Patch #5 build on explanation in patch #4 and pushes it to the limit by
forcefully forgetting inherited precise markins. Patch #4 by itself doesn't
prevent current state from having precise=true SCALARs, so patch #5 is
necessary to prevent such stray precise=true registers from creeping in.

Patch #6 adjusts test_align selftests to work around BPF verifier log's
limitations when it comes to interactions between state output and precision
backtracking output.

Overall, the goal of this patch set is to make BPF verifier's state tracking
a bit more efficient by trying to preserve as much generality in checkpointed
states as possible.

v1->v2:
- adjusted patch #1 commit message to make it clear we are fixing forward
  step, not precision backtracking (Alexei);
- moved last_idx/first_idx verbose logging up to make it clear when global
  func reaches the first empty state (Alexei).
====================

Signed-off-by: Alexei Starovoitov <[email protected]>
danielocfb pushed a commit that referenced this pull request Nov 14, 2022
The btrfs_alloc_dummy_root() uses ERR_PTR as the error return value
rather than NULL, if error happened, there will be a NULL pointer
dereference:

  BUG: KASAN: null-ptr-deref in btrfs_free_dummy_root+0x21/0x50 [btrfs]
  Read of size 8 at addr 000000000000002c by task insmod/258926

  CPU: 2 PID: 258926 Comm: insmod Tainted: G        W          6.1.0-rc2+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   kasan_report+0xb7/0x140
   kasan_check_range+0x145/0x1a0
   btrfs_free_dummy_root+0x21/0x50 [btrfs]
   btrfs_test_free_space_cache+0x1a8c/0x1add [btrfs]
   btrfs_run_sanity_tests+0x65/0x80 [btrfs]
   init_btrfs_fs+0xec/0x154 [btrfs]
   do_one_initcall+0x87/0x2a0
   do_init_module+0xdf/0x320
   load_module+0x3006/0x3390
   __do_sys_finit_module+0x113/0x1b0
   do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: aaedb55 ("Btrfs: add tests for btrfs_get_extent")
CC: [email protected] # 4.9+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Zhang Xiaoxu <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
danielocfb pushed a commit that referenced this pull request Nov 14, 2022
Petr Machata says:

====================
mlxsw: Add 802.1X and MAB offload support

This patchset adds 802.1X [1] and MAB [2] offload support in mlxsw.

Patches #1-#3 add the required switchdev interfaces.

Patches #4-#5 add the required packet traps for 802.1X.

Patches #6-#10 are small preparations in mlxsw.

Patch #11 adds locked bridge port support in mlxsw.

Patches #12-#15 add mlxsw selftests. The patchset was also tested with
the generic forwarding selftest ('bridge_locked_port.sh').

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a21d9a670d81103db7f788de1a4a4a6e4b891a0b
[2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a35ec8e38cdd1766f29924ca391a01de20163931
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
danielocfb pushed a commit that referenced this pull request Nov 21, 2022
Sahid Orentino Ferdjaoui says:

====================

As part of commit 93b8952 ("libbpf: deprecate legacy BPF map
definitions") and commit bd05410 ("libbpf: enforce strict libbpf
1.0 behaviors") The --legacy option is not relevant anymore. #1 is
removing it. #4 is cleaning the code from using libbpf_get_error().

About patches #2 and #3 They are changes discovered while working on
this series (credits to Quentin Monnet). #2 is cleaning-up usage of an
unnecessary PTR_ERR(NULL), finally #3 is fixing an invalid value
passed to strerror().

v1 -> v2:
   - Addressed review comments from Yonghong Song on patch #4
   - Added a patch #5 that removes unwanted function noticed by
     Yonghong Song
v2 -> v3
   - Addressed review comments from Andrii Nakryiko on patch #2, #3, #4
     * clean-up usage of libbpf_get_error() (#2, #3)
     * fix possible return of an uninitialized local variable err
     * fix returned errors using errno
v3 -> v4
   - Addressed review comments from Quentin Monnet
     * fix line moved from patch #2 to patch #3
     * fix missing returned errors using errno
     * fix some returned values to errno instead of -1
====================

Reviewed-by: Quentin Monnet <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
danielocfb pushed a commit that referenced this pull request Nov 24, 2022
test_bpf tail call tests end up as:

  test_bpf: #0 Tail call leaf jited:1 85 PASS
  test_bpf: #1 Tail call 2 jited:1 111 PASS
  test_bpf: #2 Tail call 3 jited:1 145 PASS
  test_bpf: #3 Tail call 4 jited:1 170 PASS
  test_bpf: #4 Tail call load/store leaf jited:1 190 PASS
  test_bpf: #5 Tail call load/store jited:1
  BUG: Unable to handle kernel data access on write at 0xf1b4e000
  Faulting instruction address: 0xbe86b710
  Oops: Kernel access of bad area, sig: 11 [#1]
  BE PAGE_SIZE=4K MMU=Hash PowerMac
  Modules linked in: test_bpf(+)
  CPU: 0 PID: 97 Comm: insmod Not tainted 6.1.0-rc4+ #195
  Hardware name: PowerMac3,1 750CL 0x87210 PowerMac
  NIP:  be86b710 LR: be857e88 CTR: be86b704
  REGS: f1b4df20 TRAP: 0300   Not tainted  (6.1.0-rc4+)
  MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 28008242  XER: 00000000
  DAR: f1b4e000 DSISR: 42000000
  GPR00: 00000001 f1b4dfe0 c11d2280 00000000 00000000 00000000 00000002 00000000
  GPR08: f1b4e000 be86b704 f1b4e000 00000000 00000000 100d816a f2440000 fe73baa8
  GPR16: f2458000 00000000 c1941ae4 f1fe2248 00000045 c0de0000 f2458030 00000000
  GPR24: 000003e8 0000000f f2458000 f1b4dc90 3e584b46 00000000 f24466a0 c1941a00
  NIP [be86b710] 0xbe86b710
  LR [be857e88] __run_one+0xec/0x264 [test_bpf]
  Call Trace:
  [f1b4dfe0] [00000002] 0x2 (unreliable)
  Instruction dump:
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  ---[ end trace 0000000000000000 ]---

This is a tentative to write above the stack. The problem is encoutered
with tests added by commit 38608ee ("bpf, tests: Add load store
test case for tail call")

This happens because tail call is done to a BPF prog with a different
stack_depth. At the time being, the stack is kept as is when the caller
tail calls its callee. But at exit, the callee restores the stack based
on its own properties. Therefore here, at each run, r1 is erroneously
increased by 32 - 16 = 16 bytes.

This was done that way in order to pass the tail call count from caller
to callee through the stack. As powerpc32 doesn't have a red zone in
the stack, it was necessary the maintain the stack as is for the tail
call. But it was not anticipated that the BPF frame size could be
different.

Let's take a new approach. Use register r4 to carry the tail call count
during the tail call, and save it into the stack at function entry if
required. This means the input parameter must be in r3, which is more
correct as it is a 32 bits parameter, then tail call better match with
normal BPF function entry, the down side being that we move that input
parameter back and forth between r3 and r4. That can be optimised later.

Doing that also has the advantage of maximising the common parts between
tail calls and a normal function exit.

With the fix, tail call tests are now successfull:

  test_bpf: #0 Tail call leaf jited:1 53 PASS
  test_bpf: #1 Tail call 2 jited:1 115 PASS
  test_bpf: #2 Tail call 3 jited:1 154 PASS
  test_bpf: #3 Tail call 4 jited:1 165 PASS
  test_bpf: #4 Tail call load/store leaf jited:1 101 PASS
  test_bpf: #5 Tail call load/store jited:1 141 PASS
  test_bpf: #6 Tail call error path, max count reached jited:1 994 PASS
  test_bpf: #7 Tail call count preserved across function calls jited:1 140975 PASS
  test_bpf: #8 Tail call error path, NULL target jited:1 110 PASS
  test_bpf: #9 Tail call error path, index out of range jited:1 69 PASS
  test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

Suggested-by: Naveen N. Rao <[email protected]>
Fixes: 51c66ad ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: [email protected]
Signed-off-by: Christophe Leroy <[email protected]>
danielocfb pushed a commit that referenced this pull request Dec 11, 2022
test_bpf tail call tests end up as:

  test_bpf: #0 Tail call leaf jited:1 85 PASS
  test_bpf: #1 Tail call 2 jited:1 111 PASS
  test_bpf: #2 Tail call 3 jited:1 145 PASS
  test_bpf: #3 Tail call 4 jited:1 170 PASS
  test_bpf: #4 Tail call load/store leaf jited:1 190 PASS
  test_bpf: #5 Tail call load/store jited:1
  BUG: Unable to handle kernel data access on write at 0xf1b4e000
  Faulting instruction address: 0xbe86b710
  Oops: Kernel access of bad area, sig: 11 [#1]
  BE PAGE_SIZE=4K MMU=Hash PowerMac
  Modules linked in: test_bpf(+)
  CPU: 0 PID: 97 Comm: insmod Not tainted 6.1.0-rc4+ #195
  Hardware name: PowerMac3,1 750CL 0x87210 PowerMac
  NIP:  be86b710 LR: be857e88 CTR: be86b704
  REGS: f1b4df20 TRAP: 0300   Not tainted  (6.1.0-rc4+)
  MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 28008242  XER: 00000000
  DAR: f1b4e000 DSISR: 42000000
  GPR00: 00000001 f1b4dfe0 c11d2280 00000000 00000000 00000000 00000002 00000000
  GPR08: f1b4e000 be86b704 f1b4e000 00000000 00000000 100d816a f2440000 fe73baa8
  GPR16: f2458000 00000000 c1941ae4 f1fe2248 00000045 c0de0000 f2458030 00000000
  GPR24: 000003e8 0000000f f2458000 f1b4dc90 3e584b46 00000000 f24466a0 c1941a00
  NIP [be86b710] 0xbe86b710
  LR [be857e88] __run_one+0xec/0x264 [test_bpf]
  Call Trace:
  [f1b4dfe0] [00000002] 0x2 (unreliable)
  Instruction dump:
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  ---[ end trace 0000000000000000 ]---

This is a tentative to write above the stack. The problem is encoutered
with tests added by commit 38608ee ("bpf, tests: Add load store
test case for tail call")

This happens because tail call is done to a BPF prog with a different
stack_depth. At the time being, the stack is kept as is when the caller
tail calls its callee. But at exit, the callee restores the stack based
on its own properties. Therefore here, at each run, r1 is erroneously
increased by 32 - 16 = 16 bytes.

This was done that way in order to pass the tail call count from caller
to callee through the stack. As powerpc32 doesn't have a red zone in
the stack, it was necessary the maintain the stack as is for the tail
call. But it was not anticipated that the BPF frame size could be
different.

Let's take a new approach. Use register r4 to carry the tail call count
during the tail call, and save it into the stack at function entry if
required. This means the input parameter must be in r3, which is more
correct as it is a 32 bits parameter, then tail call better match with
normal BPF function entry, the down side being that we move that input
parameter back and forth between r3 and r4. That can be optimised later.

Doing that also has the advantage of maximising the common parts between
tail calls and a normal function exit.

With the fix, tail call tests are now successfull:

  test_bpf: #0 Tail call leaf jited:1 53 PASS
  test_bpf: #1 Tail call 2 jited:1 115 PASS
  test_bpf: #2 Tail call 3 jited:1 154 PASS
  test_bpf: #3 Tail call 4 jited:1 165 PASS
  test_bpf: #4 Tail call load/store leaf jited:1 101 PASS
  test_bpf: #5 Tail call load/store jited:1 141 PASS
  test_bpf: #6 Tail call error path, max count reached jited:1 994 PASS
  test_bpf: #7 Tail call count preserved across function calls jited:1 140975 PASS
  test_bpf: #8 Tail call error path, NULL target jited:1 110 PASS
  test_bpf: #9 Tail call error path, index out of range jited:1 69 PASS
  test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

Suggested-by: Naveen N. Rao <[email protected]>
Fixes: 51c66ad ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: [email protected]
Signed-off-by: Christophe Leroy <[email protected]>
Tested-by: Naveen N. Rao <[email protected]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/757acccb7fbfc78efa42dcf3c974b46678198905.1669278887.git.christophe.leroy@csgroup.eu
danielocfb pushed a commit that referenced this pull request Dec 11, 2022
QAT devices on Intel Sapphire Rapids and Emerald Rapids have a defect in
address translation service (ATS). These devices may inadvertently issue
ATS invalidation completion before posted writes initiated with
translated address that utilized translations matching the invalidation
address range, violating the invalidation completion ordering.

This patch adds an extra device TLB invalidation for the affected devices,
it is needed to ensure no more posted writes with translated address
following the invalidation completion. Therefore, the ordering is
preserved and data-corruption is prevented.

Device TLBs are invalidated under the following six conditions:
1. Device driver does DMA API unmap IOVA
2. Device driver unbind a PASID from a process, sva_unbind_device()
3. PASID is torn down, after PASID cache is flushed. e.g. process
exit_mmap() due to crash
4. Under SVA usage, called by mmu_notifier.invalidate_range() where
VM has to free pages that were unmapped
5. userspace driver unmaps a DMA buffer
6. Cache invalidation in vSVA usage (upcoming)

For #1 and #2, device drivers are responsible for stopping DMA traffic
before unmap/unbind. For #3, iommu driver gets mmu_notifier to
invalidate TLB the same way as normal user unmap which will do an extra
invalidation. The dTLB invalidation after PASID cache flush does not
need an extra invalidation.

Therefore, we only need to deal with #4 and #5 in this patch. #1 is also
covered by this patch due to common code path with #5.

Tested-by: Yuzhang Luo <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
danielocfb pushed a commit that referenced this pull request Dec 12, 2022
We see kernel crashes and lockups and KASAN errors related to ax210
firmware crashes.  One of the KASAN dumps pointed at the tx path,
and it appears there is indeed a way to double-free an skb.

If iwl_mvm_tx_skb_sta returns non-zero, then the 'skb' sent into the
method will be freed.  But, in case where we build TSO skb buffer,
the skb may also be freed in error case.  So, return 0 in that particular
error case and do cleanup manually.

BUG: KASAN: use-after-free in __list_del_entry_valid+0x12/0x90
iwlwifi 0000:06:00.0: 0x00000000 | tsf hi
Read of size 8 at addr ffff88813cfa4ba0 by task btserver/9650

CPU: 4 PID: 9650 Comm: btserver Tainted: G        W         5.19.8+ #5
iwlwifi 0000:06:00.0: 0x00000000 | time gp1
Hardware name: Default string Default string/SKYBAY, BIOS 5.12 02/19/2019
Call Trace:
 <TASK>
 dump_stack_lvl+0x55/0x6d
 print_report.cold.12+0xf2/0x684
iwlwifi 0000:06:00.0: 0x1D0915A8 | time gp2
 ? __list_del_entry_valid+0x12/0x90
 kasan_report+0x8b/0x180
iwlwifi 0000:06:00.0: 0x00000001 | uCode revision type
 ? __list_del_entry_valid+0x12/0x90
 __list_del_entry_valid+0x12/0x90
iwlwifi 0000:06:00.0: 0x00000048 | uCode version major
 tcp_update_skb_after_send+0x5d/0x170
 __tcp_transmit_skb+0xb61/0x15c0
iwlwifi 0000:06:00.0: 0xDAA05125 | uCode version minor
 ? __tcp_select_window+0x490/0x490
iwlwifi 0000:06:00.0: 0x00000420 | hw version
 ? trace_kmalloc_node+0x29/0xd0
 ? __kmalloc_node_track_caller+0x12a/0x260
 ? memset+0x1f/0x40
 ? __build_skb_around+0x125/0x150
 ? __alloc_skb+0x1d4/0x220
 ? skb_zerocopy_clone+0x55/0x230
iwlwifi 0000:06:00.0: 0x00489002 | board version
 ? kmalloc_reserve+0x80/0x80
 ? rcu_read_lock_bh_held+0x60/0xb0
 tcp_write_xmit+0x3f1/0x24d0
iwlwifi 0000:06:00.0: 0x034E001C | hcmd
 ? __check_object_size+0x180/0x350
iwlwifi 0000:06:00.0: 0x24020000 | isr0
 tcp_sendmsg_locked+0x8a9/0x1520
iwlwifi 0000:06:00.0: 0x01400000 | isr1
 ? tcp_sendpage+0x50/0x50
iwlwifi 0000:06:00.0: 0x48F0000A | isr2
 ? lock_release+0xb9/0x400
 ? tcp_sendmsg+0x14/0x40
iwlwifi 0000:06:00.0: 0x00C3080C | isr3
 ? lock_downgrade+0x390/0x390
 ? do_raw_spin_lock+0x114/0x1d0
iwlwifi 0000:06:00.0: 0x00200000 | isr4
 ? rwlock_bug.part.2+0x50/0x50
iwlwifi 0000:06:00.0: 0x034A001C | last cmd Id
 ? rwlock_bug.part.2+0x50/0x50
 ? lockdep_hardirqs_on_prepare+0xe/0x200
iwlwifi 0000:06:00.0: 0x0000C2F0 | wait_event
 ? __local_bh_enable_ip+0x87/0xe0
 ? inet_send_prepare+0x220/0x220
iwlwifi 0000:06:00.0: 0x000000C4 | l2p_control
 tcp_sendmsg+0x22/0x40
 sock_sendmsg+0x5f/0x70
iwlwifi 0000:06:00.0: 0x00010034 | l2p_duration
 __sys_sendto+0x19d/0x250
iwlwifi 0000:06:00.0: 0x00000007 | l2p_mhvalid
 ? __ia32_sys_getpeername+0x40/0x40
iwlwifi 0000:06:00.0: 0x00000000 | l2p_addr_match
 ? rcu_read_lock_held_common+0x12/0x50
 ? rcu_read_lock_sched_held+0x5a/0xd0
 ? rcu_read_lock_bh_held+0xb0/0xb0
 ? rcu_read_lock_sched_held+0x5a/0xd0
 ? rcu_read_lock_sched_held+0x5a/0xd0
 ? lock_release+0xb9/0x400
 ? lock_downgrade+0x390/0x390
 ? ktime_get+0x64/0x130
 ? ktime_get+0x8d/0x130
 ? rcu_read_lock_held_common+0x12/0x50
 ? rcu_read_lock_sched_held+0x5a/0xd0
 ? rcu_read_lock_held_common+0x12/0x50
 ? rcu_read_lock_sched_held+0x5a/0xd0
 ? rcu_read_lock_bh_held+0xb0/0xb0
 ? rcu_read_lock_bh_held+0xb0/0xb0
 __x64_sys_sendto+0x6f/0x80
 do_syscall_64+0x34/0xb0
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f1d126e4531
Code: 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 35 80 0c 00 41 89 ca 8b 00 85 c0 75 1c 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 67 c3 66 0f 1f 44 00 00 55 48 83 ec 20 48 89
RSP: 002b:00007ffe21a679d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000000000000ffdc RCX: 00007f1d126e4531
RDX: 0000000000010000 RSI: 000000000374acf0 RDI: 0000000000000014
RBP: 00007ffe21a67ac0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000010
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
 </TASK>

Allocated by task 9650:
 kasan_save_stack+0x1c/0x40
 __kasan_slab_alloc+0x6d/0x90
 kmem_cache_alloc_node+0xf3/0x2b0
 __alloc_skb+0x191/0x220
 tcp_stream_alloc_skb+0x3f/0x330
 tcp_sendmsg_locked+0x67c/0x1520
 tcp_sendmsg+0x22/0x40
 sock_sendmsg+0x5f/0x70
 __sys_sendto+0x19d/0x250
 __x64_sys_sendto+0x6f/0x80
 do_syscall_64+0x34/0xb0
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 9650:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 kasan_set_free_info+0x20/0x30
 __kasan_slab_free+0x102/0x170
 kmem_cache_free+0xc8/0x3e0
 iwl_mvm_mac_itxq_xmit+0x124/0x270 [iwlmvm]
 ieee80211_queue_skb+0x874/0xd10 [mac80211]
 ieee80211_xmit_fast+0xf80/0x1180 [mac80211]
 __ieee80211_subif_start_xmit+0x287/0x680 [mac80211]
 ieee80211_subif_start_xmit+0xcd/0x730 [mac80211]
 dev_hard_start_xmit+0xf6/0x420
 __dev_queue_xmit+0x165b/0x1b50
 ip_finish_output2+0x66e/0xfb0
 __ip_finish_output+0x487/0x6d0
 ip_output+0x11c/0x350
 __ip_queue_xmit+0x36b/0x9d0
 __tcp_transmit_skb+0xb35/0x15c0
 tcp_write_xmit+0x3f1/0x24d0
 tcp_sendmsg_locked+0x8a9/0x1520
 tcp_sendmsg+0x22/0x40
 sock_sendmsg+0x5f/0x70
 __sys_sendto+0x19d/0x250
 __x64_sys_sendto+0x6f/0x80
 do_syscall_64+0x34/0xb0
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

The buggy address belongs to the object at ffff88813cfa4b40
 which belongs to the cache skbuff_fclone_cache of size 472
The buggy address is located 96 bytes inside of
 472-byte region [ffff88813cfa4b40, ffff88813cfa4d18)

The buggy address belongs to the physical page:
page:ffffea0004f3e900 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88813cfa6c40 pfn:0x13cfa4
head:ffffea0004f3e900 order:2 compound_mapcount:0 compound_pincount:0
flags: 0x5fff8000010200(slab|head|node=0|zone=2|lastcpupid=0x3fff)
raw: 005fff8000010200 ffffea0004656b08 ffffea0008e8cf08 ffff8881081a5240
raw: ffff88813cfa6c40 0000000000170015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88813cfa4a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88813cfa4b00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff88813cfa4b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                               ^
 ffff88813cfa4c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88813cfa4c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: 08f7d8b ("iwlwifi: mvm: bring back mvm GSO code")
Link: https://lore.kernel.org/linux-wireless/[email protected]/
Tested-by: Amol Jawale <[email protected]>
Signed-off-by: Ben Greear <[email protected]>
Link: https://lore.kernel.org/r/20221123225313.21b1ee31d666.I3b3ba184433dd2a544d91eeeda29b467021824ae@changeid
Signed-off-by: Gregory Greenman <[email protected]>
danielocfb pushed a commit that referenced this pull request Dec 12, 2022
Vladimir Oltean says:

====================
Fix rtnl_mutex deadlock with DPAA2 and SFP modules

This patch set deliberately targets net-next and lacks Fixes: tags due
to caution on my part.

While testing some SFP modules on the Solidrun Honeycomb LX2K platform,
I noticed that rebooting causes a deadlock:

============================================
WARNING: possible recursive locking detected
6.1.0-rc5-07010-ga9b9500ffaac-dirty #656 Not tainted
--------------------------------------------
systemd-shutdow/1 is trying to acquire lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30

but task is already holding lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(rtnl_mutex);
  lock(rtnl_mutex);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

6 locks held by systemd-shutdow/1:
 #0: ffffa62db6863c70 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xd4/0x260
 #1: ffff2f2b0176f100 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xf4/0x260
 #2: ffff2f2b017be900 (&dev->mutex){....}-{4:4}, at: device_shutdown+0x104/0x260
 #3: ffff2f2b017680f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
 #4: ffff2f2b0e1608f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
 #5: ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30

stack backtrace:
CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 6.1.0-rc5-07010-ga9b9500ffaac-dirty #656
Hardware name: SolidRun LX2160A Honeycomb (DT)
Call trace:
 lock_acquire+0x68/0x84
 __mutex_lock+0x98/0x460
 mutex_lock_nested+0x2c/0x40
 rtnl_lock+0x1c/0x30
 sfp_bus_del_upstream+0x1c/0xac
 phylink_destroy+0x1c/0x50
 dpaa2_mac_disconnect+0x28/0x70
 dpaa2_eth_remove+0x1dc/0x1f0
 fsl_mc_driver_remove+0x24/0x60
 device_remove+0x70/0x80
 device_release_driver_internal+0x1f0/0x260
 device_links_unbind_consumers+0xe0/0x110
 device_release_driver_internal+0x138/0x260
 device_release_driver+0x18/0x24
 bus_remove_device+0x12c/0x13c
 device_del+0x16c/0x424
 fsl_mc_device_remove+0x28/0x40
 __fsl_mc_device_remove+0x10/0x20
 device_for_each_child+0x5c/0xac
 dprc_remove+0x94/0xb4
 fsl_mc_driver_remove+0x24/0x60
 device_remove+0x70/0x80
 device_release_driver_internal+0x1f0/0x260
 device_release_driver+0x18/0x24
 bus_remove_device+0x12c/0x13c
 device_del+0x16c/0x424
 fsl_mc_bus_remove+0x8c/0x10c
 fsl_mc_bus_shutdown+0x10/0x20
 platform_shutdown+0x24/0x3c
 device_shutdown+0x15c/0x260
 kernel_restart+0x40/0xa4
 __do_sys_reboot+0x1e4/0x260
 __arm64_sys_reboot+0x24/0x30

But fixing this appears to be not so simple. The patch set represents my
attempt to address it.

In short, the problem is that dpaa2_mac_connect() and dpaa2_mac_disconnect()
call 2 phylink functions in a row, one takes rtnl_lock() itself -
phylink_create(), and one which requires rtnl_lock() to be held by the
caller - phylink_fwnode_phy_connect(). The existing approach in the
drivers is too simple. We take rtnl_lock() when calling dpaa2_mac_connect(),
which is what results in the deadlock.

Fixing just that creates another problem. The drivers make use of
rtnl_lock() for serializing with other code paths too. I think I've
found all those code paths, and established other mechanisms for
serializing with them.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
danielocfb pushed a commit that referenced this pull request Dec 12, 2022
Petr Machata says:

====================
mlxsw: Add Spectrum-1 ip6gre support

Ido Schimmel writes:

Currently, mlxsw only supports ip6gre offload on Spectrum-2 and newer
ASICs. Spectrum-1 can also offload ip6gre tunnels, but it needs double
entry router interfaces (RIFs) for the RIFs representing these tunnels.
In addition, the RIF index needs to be even. This is handled in
patches #1-#3.

The implementation can otherwise be shared between all Spectrum
generations. This is handled in patches #4-#5.

Patch #6 moves a mlxsw ip6gre selftest to a shared directory, as ip6gre
is no longer only supported on Spectrum-2 and newer ASICs.

This work is motivated by users that require multiple GRE tunnels that
all share the same underlay VRF. Currently, mlxsw only supports
decapsulation based on the underlay destination IP (i.e., not taking the
GRE key into account), so users need to configure these tunnels with
different source IPs and IPv6 addresses are easier to spare than IPv4.

Tested using existing ip6gre forwarding selftests.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
danielocfb pushed a commit that referenced this pull request Dec 12, 2022
Biju Das <[email protected]> says:

The CAN FD IP found on RZ/G2L SoC has some HW features different to
that of R-Car. For example, it has multiple resets, dedicated channel
tx and error interrupts, separate global rx and error interrupts
compared to shared irq for R-Car. it does not s ECC error flag
registers and clk post divider present on R-Car.

Similarly, R-Car V3U has 8 channels whereas other SoCs has only 2
channels. Currently all the HW differences are handled by comparing
with chip_id enum.

This patch series aims to replace chip_id with struct
rcar_canfd_hw_info to handle the HW feature differences and driver
data present on both IPs.

The changes are trivial and tested on RZ/G2L SMARC EVK.

This patch series depend upon [1].

[1] https://lore.kernel.org/all/[email protected]

changes since v2: https://lore.kernel.org/all/[email protected]
 * Replaced data type of max_channels from unsigned int->u8 to save memory.
 * Replaced data type of postdiv from unsigned int->u8 to save memory.

changes since v1: https://lore.kernel.org/all/[email protected]
 * Updated commit description for R-Car V3U SoC detection using
   driver data.
 * Replaced data type of max_channels from u32->unsigned int.
 * Replaced multi_global_irqs->shared_global_irqs to make it
   positive checks.
 * Replaced clk_postdiv->postdiv driver data variable.
 * Simplified the calcualtion for fcan_freq.
 * Replaced info->has_gerfl to gpriv->info->has_gerfl and wrapped
   the ECC error flag checks inside a single if statement.
 * Added Rb tag from Geert patch#1,#2,#3 and #5

Link: https://lore.kernel.org/all/[email protected]
[mkl: only take patches 1...5 to avoid merge conflict]
Signed-off-by: Marc Kleine-Budde <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
Commit 9130b8d ("SUNRPC: allow for upcalls for the same uid
but different gss service") introduced `auth` argument to
__gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
since it (and auth->service) was not (yet) determined.

When multiple upcalls with the same uid and different service are
ongoing, it could happen that __gss_find_upcall(), which returns the
first match found in the pipe->in_downcall list, could not find the
correct gss_msg corresponding to the downcall we are looking for.
Moreover, it might return a msg which is not sent to rpc.gssd yet.

We could see mount.nfs process hung in D state with multiple mount.nfs
are executed in parallel.  The call trace below is of CentOS 7.9
kernel-3.10.0-1160.24.1.el7.x86_64 but we observed the same hang w/
elrepo kernel-ml-6.0.7-1.el7.

PID: 71258  TASK: ffff91ebd4be0000  CPU: 36  COMMAND: "mount.nfs"
 #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
 #1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
 #2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
 #3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc
[sunrpc]
 #4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
 #5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
 #6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
 #7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
 #8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
 #9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]

The scenario is like this. Let's say there are two upcalls for
services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.

When rpc.gssd reads pipe to get the upcall msg corresponding to
service B from pipe->pipe and then writes the response, in
gss_pipe_downcall the msg corresponding to service A will be picked
because only uid is used to find the msg and it is before the one for
B in pipe->in_downcall.  And the process waiting for the msg
corresponding to service A will be woken up.

Actual scheduing of that process might be after rpc.gssd processes the
next msg.  In rpc_pipe_generic_upcall it clears msg->errno (for A).
The process is scheduled to see gss_msg->ctx == NULL and
gss_msg->msg.errno == 0, therefore it cannot break the loop in
gss_create_upcall and is never woken up after that.

This patch adds a simple check to ensure that a msg which is not
sent to rpc.gssd yet is not chosen as the matching upcall upon
receiving a downcall.

Signed-off-by: minoura makoto <[email protected]>
Signed-off-by: Hiroshi Shimamoto <[email protected]>
Tested-by: Hiroshi Shimamoto <[email protected]>
Cc: Trond Myklebust <[email protected]>
Fixes: 9130b8d ("SUNRPC: allow for upcalls for same uid but different gss service")
Signed-off-by: Trond Myklebust <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
…ata and not linked with libtraceevent

When we have a perf.data file with tracepoints, such as:

  # perf evlist -f
  probe_perf:lzma_decompress_to_file
  # Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events
  #

We end up segfaulting when using perf built with NO_LIBTRACEEVENT=1 by
trying to find an evsel with a NULL 'event_name' variable:

  (gdb) run report --stdio -f
  Starting program: /root/bin/perf report --stdio -f

  Program received signal SIGSEGV, Segmentation fault.
  0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830
  warning: Source file is more recent than executable.
  2830		if (event_name[0] == '%') {
  Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-11.fc36.x86_64 cyrus-sasl-lib-2.1.27-18.fc36.x86_64 elfutils-debuginfod-client-0.188-3.fc36.x86_64 elfutils-libelf-0.188-3.fc36.x86_64 elfutils-libs-0.188-3.fc36.x86_64 glibc-2.35-20.fc36.x86_64 keyutils-libs-1.6.1-4.fc36.x86_64 krb5-libs-1.19.2-12.fc36.x86_64 libbrotli-1.0.9-7.fc36.x86_64 libcap-2.48-4.fc36.x86_64 libcom_err-1.46.5-2.fc36.x86_64 libcurl-7.82.0-12.fc36.x86_64 libevent-2.1.12-6.fc36.x86_64 libgcc-12.2.1-4.fc36.x86_64 libidn2-2.3.4-1.fc36.x86_64 libnghttp2-1.51.0-1.fc36.x86_64 libpsl-0.21.1-5.fc36.x86_64 libselinux-3.3-4.fc36.x86_64 libssh-0.9.6-4.fc36.x86_64 libstdc++-12.2.1-4.fc36.x86_64 libunistring-1.0-1.fc36.x86_64 libunwind-1.6.2-2.fc36.x86_64 libxcrypt-4.4.33-4.fc36.x86_64 libzstd-1.5.2-2.fc36.x86_64 numactl-libs-2.0.14-5.fc36.x86_64 opencsd-1.2.0-1.fc36.x86_64 openldap-2.6.3-1.fc36.x86_64 openssl-libs-3.0.5-2.fc36.x86_64 slang-2.3.2-11.fc36.x86_64 xz-libs-5.2.5-9.fc36.x86_64 zlib-1.2.11-33.fc36.x86_64
  (gdb) bt
  #0  0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830
  #1  0x0000000000552416 in add_dynamic_entry (evlist=0xfda7b0, tok=0xffb6eb "trace", level=2) at util/sort.c:2976
  #2  0x0000000000552d26 in sort_dimension__add (list=0xf93e00 <perf_hpp_list>, tok=0xffb6eb "trace", evlist=0xfda7b0, level=2) at util/sort.c:3193
  #3  0x0000000000552e1c in setup_sort_list (list=0xf93e00 <perf_hpp_list>, str=0xffb6eb "trace", evlist=0xfda7b0) at util/sort.c:3227
  #4  0x00000000005532fa in __setup_sorting (evlist=0xfda7b0) at util/sort.c:3381
  #5  0x0000000000553cdc in setup_sorting (evlist=0xfda7b0) at util/sort.c:3608
  #6  0x000000000042eb9f in cmd_report (argc=0, argv=0x7fffffffe470) at builtin-report.c:1596
  #7  0x00000000004aee7e in run_builtin (p=0xf64ca0 <commands+288>, argc=3, argv=0x7fffffffe470) at perf.c:330
  #8  0x00000000004af0f2 in handle_internal_command (argc=3, argv=0x7fffffffe470) at perf.c:384
  #9  0x00000000004af241 in run_argv (argcp=0x7fffffffe29c, argv=0x7fffffffe290) at perf.c:428
  #10 0x00000000004af5fc in main (argc=3, argv=0x7fffffffe470) at perf.c:562
  (gdb)

So check if we have tracepoint events in add_dynamic_entry() and bail
out instead:

  # perf report --stdio -f
  This perf binary isn't linked with libtraceevent, can't process probe_perf:lzma_decompress_to_file
  Error:
  Unknown --sort key: `trace'
  #

Fixes: 378ef0f ("perf build: Use libtraceevent from the system")
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
…ed_text_end" symbol on s/390

The test case perf lock contention dumps core on s390. Run the following
commands:

  # ./perf lock record -- ./perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

      Total time: 2.799 [sec]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.073 MB perf.data (100 samples) ]
  #
  # ./perf lock contention
  Segmentation fault (core dumped)
  #

The function call stack is lengthy, here are the top 5 functions:

  # gdb ./perf core.24048
  GNU gdb (GDB) Fedora Linux 12.1-6.fc37
  Core was generated by `./perf lock contention'.
  Program terminated with signal SIGSEGV, Segmentation fault.
  #0  0x00000000011dd25c in machine__is_lock_function (machine=0x3029e28, addr=1789230) at util/machine.c:3356
         3356 machine->sched.text_end = kmap->unmap_ip(kmap, sym->start);

 (gdb) where
  #0  0x00000000011dd25c in machine__is_lock_function (machine=0x3029e28, addr=1789230) at util/machine.c:3356
  #1  0x000000000109f244 in callchain_id (evsel=0x30313e0, sample=0x3ffea4f77d0) at builtin-lock.c:957
  #2  0x000000000109e094 in get_key_by_aggr_mode (key=0x3ffea4f7290, addr=27758136, evsel=0x30313e0, sample=0x3ffea4f77d0) at builtin-lock.c:586
  #3  0x000000000109f4d0 in report_lock_contention_begin_event (evsel=0x30313e0, sample=0x3ffea4f77d0) at builtin-lock.c:1004
  #4  0x00000000010a00ae in evsel__process_contention_begin (evsel=0x30313e0, sample=0x3ffea4f77d0) at builtin-lock.c:1254
  #5  0x00000000010a0e14 in process_sample_event (tool=0x3ffea4f8480, event=0x3ff85601ef8, sample=0x3ffea4f77d0, evsel=0x30313e0, machine=0x3029e28) at builtin-lock.c:1464
  .....

The issue is in function machine__is_lock_function() in file
./util/machine.c lines 3355:

   /* should not fail from here */
   sym = machine__find_kernel_symbol_by_name(machine, "__sched_text_end", &kmap);
   machine->sched.text_end = kmap->unmap_ip(kmap, sym->start)

On s390 the symbol __sched_text_end is *NOT* in the symbol list and the
resulting pointer sym is set to NULL. The sym->start is then a NULL pointer
access and generates the core dump.

The reason why __sched_text_end is not in the symbol list on s390 is
simple:

When the symbol list is created at perf start up with function calls

  dso__load
  +--> dso__load_vmlinux_path
       +--> dso__load_vmlinux
            +--> dso__load_sym
	         +--> dso__load_sym_internal (reads kernel symbols)
		 +--> symbols__fixup_end
		 +--> symbols__fixup_duplicate

The issue is in function symbols__fixup_duplicate(). It deletes all
symbols with have the same address. On s390:

  # nm -g  ~/linux/vmlinux| fgrep c68390
  0000000000c68390 T __cpuidle_text_start
  0000000000c68390 T __sched_text_end
  #

two symbols have identical addresses and __sched_text_end is considered
duplicate (in ascending sort order) and removed from the symbol list.
Therefore it is missing and an invalid pointer reference occurs.  The
code checks for symbol __sched_text_start and when it exists assumes
symbol __sched_text_end is also in the symbol table. However this is not
the case on s390.

Same situation exists for symbol __lock_text_start:

0000000000c68770 T __cpuidle_text_end
0000000000c68770 T __lock_text_start

This symbol is also removed from the symbol table but used in function
machine__is_lock_function().

To fix this and keep duplicate symbols in the symbol table, set
symbol_conf.allow_aliases to true. This prevents the removal of
duplicate symbols in function symbols__fixup_duplicate().

Output After:

 # ./perf lock contention
 contended total wait  max wait  avg wait    type   caller

        48   124.39 ms 123.99 ms   2.59 ms rwsem:W unlink_anon_vmas+0x24a
        47    83.68 ms  83.26 ms   1.78 ms rwsem:W free_pgtables+0x132
         5    41.22 us  10.55 us   8.24 us rwsem:W free_pgtables+0x140
         4    40.12 us  20.55 us  10.03 us rwsem:W copy_process+0x1ac8
 #

Fixes: 0d2997f ("perf lock: Look up callchain for the contended locks")
Signed-off-by: Thomas Richter <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Sumanth Korikkar <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
Depending on the compiler used and the optimization options, the sbrk()
test was crashing, both on real hardware (mips-24kc) and in qemu. One
such example is kernel.org toolchain in version 11.3 optimizing at -Os.

Inspecting the sys_brk() call shows the following code:

  0040047c <sys_brk>:
    40047c:       24020fcd        li      v0,4045
    400480:       27bdffe0        addiu   sp,sp,-32
    400484:       0000000c        syscall
    400488:       27bd0020        addiu   sp,sp,32
    40048c:       10e00001        beqz    a3,400494 <sys_brk+0x18>
    400490:       00021023        negu    v0,v0
    400494:       03e00008        jr      ra

It is obviously wrong, the "negu" instruction is placed in beqz's
delayed slot, and worse, there's no nop nor instruction after the
return, so the next function's first instruction (addiu sip,sip,-32)
will also be executed as part of the delayed slot that follows the
return.

This is caused by the ".set noreorder" directive in the _start block,
that applies to the whole program. The compiler emits code without the
delayed slots and relies on the compiler to swap instructions when this
option is not set. Removing the option would require to change the
startup code in a way that wouldn't make it look like the resulting
code, which would not be easy to debug. Instead let's just save the
default ordering before changing it, and restore it at the end of the
_start block. Now the code is correct:

  0040047c <sys_brk>:
    40047c:       24020fcd        li      v0,4045
    400480:       27bdffe0        addiu   sp,sp,-32
    400484:       0000000c        syscall
    400488:       10e00002        beqz    a3,400494 <sys_brk+0x18>
    40048c:       27bd0020        addiu   sp,sp,32
    400490:       00021023        negu    v0,v0
    400494:       03e00008        jr      ra
    400498:       00000000        nop

Fixes: 66b6f75 ("rcutorture: Import a copy of nolibc") #5.0
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
When RISCV port was imported in 5.2, the O_* macros were taken with
their octal value and written as-is in hex, resulting in the getdents64()
to fail in nolibc-test.

Fixes: 582e84f ("tool headers nolibc: add RISCV support") #5.2
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
Daniel Machon says:

====================
net: Introduce new DCB rewrite table

There is currently no support for per-port egress mapping of priority to PCP and
priority to DSCP. Some support for expressing egress mapping of PCP is supported
through ip link, with the 'egress-qos-map', however this command only maps
priority to PCP, and for vlan interfaces only. DCB APP already has support for
per-port ingress mapping of PCP/DEI, DSCP and a bunch of other stuff. So why not
take advantage of this fact, and add a new table that does the reverse.

This patch series introduces the new DCB rewrite table. Whereas the DCB
APP table deals with ingress mapping of PID (protocol identifier) to priority,
the rewrite table deals with egress mapping of priority to PID.

It is indeed possible to integrate rewrite in the existing APP table, by
introducing new dedicated rewrite selectors, and altering existing functions
to treat rewrite entries specially. However, I feel like this is not a good
solution, and will pollute the APP namespace. APP is well-defined in IEEE, and
some userspace relies of advertised entries - for this fact, separating APP and
rewrite into to completely separate objects, seems to me the best solution.

The new table shares much functionality with the APP table, and as such, much
existing code is reused, or slightly modified, to work for both.

================================================================================
DCB rewrite table in a nutshell
================================================================================
The table is implemented as a simple linked list, and uses the same lock as the
APP table. New functions for getting, setting and deleting entries have been
added, and these are exported, so they can be used by the stack or drivers.
Additionnaly, new dcbnl_setrewr and dcnl_delrewr hooks has been added, to
support hardware offload of the entries.

================================================================================
Sparx5 per-port PCP rewrite support
================================================================================
Sparx5 supports PCP egress mapping through two eight-entry switch tables.
One table maps QoS class 0-7 to PCP for DE0 (DP levels mapped to
drop-eligibility 0) and the other for DE1. DCB does currently not have support
for expressing DP/color, so instead, the tagged DEI bit will reflect the DP
levels, for any rewrite entries> 7 ('de').

The driver will take apptrust (contributed earlier) into consideration, so
that the mapping tables only be used, if PCP is trusted *and* the rewrite table
has active mappings, otherwise classified PCP (same as frame PCP) will be used
instead.

================================================================================
Sparx5 per-port DSCP rewrite support
================================================================================
Sparx5 support DSCP egress mapping through a single 32-entry table. This table
maps classified QoS class and DP level to classified DSCP, and is consulted by
the switch Analyzer Classifier at ingress. At egress, the frame DSCP can either
be rewritten to classified DSCP to frame DSCP.

The driver will take apptrust into consideration, so that the mapping tables
only be used, if DSCP is trusted *and* the rewrite table has active mappings,
otherwise frame DSCP will be used instead.

================================================================================
Patches
================================================================================
Patch #1 modifies dcb_app_add to work for both APP and rewrite

Patch #2 adds dcbnl_app_table_setdel() for setting and deleting both APP and
         rewrite entries.

Patch #3 adds the rewrite table and all required functions, offload hooks and
         bookkeeping for maintaining it.

Patch #4 adds two new helper functions for getting a priority to PCP bitmask
         map, and a priority to DSCP bitmask map.

Patch #5 adds support for PCP rewrite in the Sparx5 driver.
Patch #6 adds support for DSCP rewrite in the Sparx5 driver.

================================================================================
v2 -> v3:
  in dcbnl_ieee_fill() use nla_nest_start() instead of the _noflag() version.
  Also, cancel the rewrite nest in case of an error (Petr Machata).

v1 -> v2:
  In dcb_setrewr() change proto to u16 as it ought to be, and remove zero
  initialization of err. (Dan Carpenter).
  Change name of dcbnl_apprewr_setdel -> dcbnl_app_table_setdel and change the
  function signature to take a single function pointer. Update uses accordingly
  (Petr Machata).

====================

Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
Petr Machata says:

====================
mlxsw: Add support of latency TLV

Amit Cohen writes:

Ethernet Management Datagrams (EMADs) are Ethernet packets sent between
the driver and device's firmware. They are used to pass various
configurations to the device, but also to get events (e.g., port up)
from it. After the Ethernet header, these packets are built in a TLV
format.

This is the structure of EMADs:
* Ethernet header
* Operation TLV
* String TLV (optional)
* Latency TLV (optional)
* Reg TLV
* End TLV

The latency of each EMAD is measured by firmware. The driver can get the
measurement via latency TLV which can be added to each EMAD. This TLV is
optional, when EMAD is sent with this TLV, the EMAD's response will include
the TLV and will contain the firmware measurement.

Add support for Latency TLV and use it by default for all EMADs (see
more information in commit messages). The latency measurements can be
processed using BPF program for example, to create a histogram and average
of the latency per register. In addition, it is possible to measure the
end-to-end latency, so then the latency of the software overhead can be
calculated. This information can be useful to improve the driver
performance.

See an example of output of BPF tool which presents these measurements:

$ ./emadlatency -f -a
    Tracing EMADs... Hit Ctrl-C to end.
    Register write = RALUE (0x8013)
    E2E Measurements:
    average = 23 usecs, total = 32052693 usecs, count = 1337061
         usecs               : count    distribution
             0 -> 1          : 0        |                                 |
             2 -> 3          : 0        |                                 |
             4 -> 7          : 0        |                                 |
             8 -> 15         : 0        |                                 |
            16 -> 31         : 1290814  |*********************************|
            32 -> 63         : 45339    |*                                |
            64 -> 127        : 532      |                                 |
           128 -> 255        : 247      |                                 |
           256 -> 511        : 57       |                                 |
           512 -> 1023       : 26       |                                 |
          1024 -> 2047       : 33       |                                 |
          2048 -> 4095       : 0        |                                 |
          4096 -> 8191       : 10       |                                 |
          8192 -> 16383      : 1        |                                 |
         16384 -> 32767      : 1        |                                 |
         32768 -> 65535      : 1        |                                 |

    Firmware Measurements:
    average = 10 usecs, total = 13884128 usecs, count = 1337061
         usecs               : count    distribution
             0 -> 1          : 0        |                                 |
             2 -> 3          : 0        |                                 |
             4 -> 7          : 0        |                                 |
             8 -> 15         : 1337035  |*********************************|
            16 -> 31         : 17       |                                 |
            32 -> 63         : 7        |                                 |
            64 -> 127        : 0        |                                 |
           128 -> 255        : 2        |                                 |

    Diff between measurements: 13 usecs

Patch set overview:
Patches #1-#3 add support for querying MGIR, to know if string TLV and
latency TLV are supported
Patches #4-#5 add some relevant fields to support latency TLV
Patch #6 adds support of latency TLV
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 2, 2023
During EEH error injection testing, a deadlock was encountered in the tg3
driver when tg3_io_error_detected() was attempting to cancel outstanding
reset tasks:

crash> foreach UN bt
...
PID: 159    TASK: c0000000067c6000  CPU: 8   COMMAND: "eehd"
...
 #5 [c00000000681f990] __cancel_work_timer at c00000000019fd18
 #6 [c00000000681fa30] tg3_io_error_detected at c00800000295f098 [tg3]
 #7 [c00000000681faf0] eeh_report_error at c00000000004e25c
...

PID: 290    TASK: c000000036e5f800  CPU: 6   COMMAND: "kworker/6:1"
...
 #4 [c00000003721fbc0] rtnl_lock at c000000000c940d8
 #5 [c00000003721fbe0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c00000003721fc60] process_one_work at c00000000019e5c4
...

PID: 296    TASK: c000000037a65800  CPU: 21  COMMAND: "kworker/21:1"
...
 #4 [c000000037247bc0] rtnl_lock at c000000000c940d8
 #5 [c000000037247be0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c000000037247c60] process_one_work at c00000000019e5c4
...

PID: 655    TASK: c000000036f49000  CPU: 16  COMMAND: "kworker/16:2"
...:1

 #4 [c0000000373ebbc0] rtnl_lock at c000000000c940d8
 #5 [c0000000373ebbe0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c0000000373ebc60] process_one_work at c00000000019e5c4
...

Code inspection shows that both tg3_io_error_detected() and
tg3_reset_task() attempt to acquire the RTNL lock at the beginning of
their code blocks.  If tg3_reset_task() should happen to execute between
the times when tg3_io_error_deteced() acquires the RTNL lock and
tg3_reset_task_cancel() is called, a deadlock will occur.

Moving tg3_reset_task_cancel() call earlier within the code block, prior
to acquiring RTNL, prevents this from happening, but also exposes another
deadlock issue where tg3_reset_task() may execute AFTER
tg3_io_error_detected() has executed:

crash> foreach UN bt
PID: 159    TASK: c0000000067d2000  CPU: 9   COMMAND: "eehd"
...
 #4 [c000000006867a60] rtnl_lock at c000000000c940d8
 #5 [c000000006867a80] tg3_io_slot_reset at c0080000026c2ea8 [tg3]
 #6 [c000000006867b00] eeh_report_reset at c00000000004de88
...
PID: 363    TASK: c000000037564000  CPU: 6   COMMAND: "kworker/6:1"
...
 #3 [c000000036c1bb70] msleep at c000000000259e6c
 #4 [c000000036c1bba0] napi_disable at c000000000c6b848
 #5 [c000000036c1bbe0] tg3_reset_task at c0080000026d942c [tg3]
 #6 [c000000036c1bc60] process_one_work at c00000000019e5c4
...

This issue can be avoided by aborting tg3_reset_task() if EEH error
recovery is already in progress.

Fixes: db84bf4 ("tg3: tg3_reset_task() needs to use rtnl_lock to synchronize")
Signed-off-by: David Christensen <[email protected]>
Reviewed-by: Pavan Chebbi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 15, 2023
Daniel Machon says:

====================
net: Add support for PSFP in Sparx5

================================================================================
Add support for Per-Stream Filtering and Policing (802.1Q-2018, 8.6.5.1).
================================================================================

The VCAP CLM (VCAP IS0 ingress classifier) classifies streams,
identified by ISDX (Ingress Service Index, frame metadata), and maps
ISDX to streams.

Flow meters are also classified by ISDX, and implemented using service
policers (Service Dual Leacky Buckets, SDLB). Leacky buckets are linked
together in a leak chain of a leak group. Leak groups a preconfigured to serve
buckets within a certain rate interval.

Stream gates are time-based policers used by PSFP. Frames are dropped
based on the gate state (OPEN/ CLOSE), whose state will be altered based
on the Gate Control List (GCL) and current PTP time. Apart from
time-based policing, stream gates can alter egress queue selection for
the frames that pass through the Gate. This is done through Internal
Priority Selector (IPS). Stream gates are mapped from stream filters.

Support for tc actions gate and police, have been added to the VCAP IS0 set of
supported actions.

Examples:

// tc filter with gate action
$ tc filter add dev eth1 ingress chain 1100000 prio 1 handle 1001 protocol \
802.1q flower skip_sw vlan_id 100 action gate base-time 0 sched-entry open \
700000 7 8m sched-entry close 300000 action goto chain 1200000

// tc filter with police action
$ tc filter add dev eth1 ingress chain 1100000 prio 1 handle 1002 protocol \
802.1q flower skip_sw vlan_id 100 action police rate 1gbit burst 8096      \
conform-exceed drop action goto chain 1200000

================================================================================
Patches
================================================================================
Patch #1:  Adds new register needed for PSFP.
Patch #2:  Adds resource pools to control PSFP needed chip resources.
Patch #3:  Adds support for SDLB's needed for flow-meters.
Patch #4:  Adds support for service policers.
Patch #5:  Adds support for PSFP flow-meters, using service policers.
Patch #6:  Adds a new function to calculate basetime, required by flow-meters.
Patch #7:  Adds support for PSFP stream gates.
Patch #8:  Adds support for PSFP stream filters.
Patch #9:  Adds a function to initialize flow-meters, stream gates and stream
           filters.
Patch #10: Adds the required flower code to configure PSFP using the tc command.
====================

Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 15, 2023
Petr Machata says:

====================
bridge: Limit number of MDB entries per port, port-vlan

The MDB maintained by the bridge is limited. When the bridge is configured
for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
capacity. In SW datapath, the capacity is configurable through the
IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a
similar limit exists in the HW datapath for purposes of offloading.

In order to prevent the issue of unilateral exhaustion of MDB resources,
introduce two parameters in each of two contexts:

- Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
  per-port-VLAN number of MDB entries that the port is member in.

- Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
  per-port-VLAN maximum permitted number of MDB entries, or 0 for
  no limit.

Per-port number of entries keeps track of the total number of MDB entries
configured on a given port. The per-port-VLAN value then keeps track of the
subset of MDB entries configured specifically for the given VLAN, on that
port. The number is adjusted as port_groups are created and deleted, and
therefore under multicast lock.

A maximum value, if non-zero, then places a limit on the number of entries
that can be configured in a given context. Attempts to add entries above
the maximum are rejected.

Rejection reason of netlink-based requests to add MDB entries is
communicated through extack. This channel is unavailable for rejections
triggered from the control path. To address this lack of visibility, the
patchset adds a tracepoint, bridge:br_mdb_full:

	# perf record -e bridge:br_mdb_full &
	# [...]
	# perf script | cut -d: -f4-
	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
	 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
	 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
	 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10

Another option to consume the tracepoint is e.g. through the bpftrace tool:

	# bpftrace -e ' tracepoint:bridge:br_mdb_full /args->af != 0/ {
			    printf("dev %s src %s grp %s vid %u\n",
				   str(args->dev), ntop(args->src),
				   ntop(args->grp), args->vid);
			}
			tracepoint:bridge:br_mdb_full /args->af == 0/ {
			    printf("dev %s grp %s vid %u\n",
				   str(args->dev),
				   macaddr(args->grpmac), args->vid);
			}'

This tracepoint is triggered for mcast_hash_max exhaustions as well.

The following is an example of how the feature is used. A more extensive
example is available in patch #8:

	# bridge vlan set dev v1 vid 1 mcast_max_groups 1
	# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
	# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
	Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.

The patchset progresses as follows:

- In patch #1, set strict_start_type at two bridge-related policies. The
  reason is we are adding a new attribute to one of these, and want the new
  attribute to be parsed strictly. The other was adjusted for completeness'
  sake.

- In patches #2 to #5, br_mdb and br_multicast code is adjusted to make the
  following additions smoother.

- In patch #6, add the tracepoint.

- In patch #7, the code to maintain number of MDB entries is added as
  struct net_bridge_mcast_port::mdb_n_entries. The maximum is added, too,
  as struct net_bridge_mcast_port::mdb_max_entries, however at this point
  there is no way to set the value yet, and since 0 is treated as "no
  limit", the functionality doesn't change at this point. Note however,
  that mcast_hash_max violations already do trigger at this point.

- In patch #8, netlink plumbing is added: reading of number of entries, and
  reading and writing of maximum.

  The per-port values are passed through RTM_NEWLINK / RTM_GETLINK messages
  in IFLA_BRPORT_MCAST_N_GROUPS and _MAX_GROUPS, inside IFLA_PROTINFO nest.

  The per-port-vlan values are passed through RTM_GETVLAN / RTM_NEWVLAN
  messages in BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS, _MAX_GROUPS, inside
  BRIDGE_VLANDB_ENTRY.

The following patches deal with the selftest:

- Patches #9 and #10 clean up and move around some selftest code.

- Patches #11 to #14 add helpers and generalize the existing IGMP / MLD
  support to allow generating packets with configurable group addresses and
  varying source lists for (S,G) memberships.

- Patch #15 adds code to generate IGMP leave and MLD done packets.

- Patch #16 finally adds the selftest itself.

v3:
- Patch #7:
    - Access mdb_max_/_n_entries through READ_/WRITE_ONCE
    - Move extack setting to br_multicast_port_ngroups_inc_one().
      Since we use NL_SET_ERR_MSG_FMT_MOD, the correct context
      (port / port-vlan) can be passed through an argument.
      This also removes the need for more READ/WRITE_ONCE's
      at the extack-setting site.
- Patch #8:
    - Move the br_multicast_port_ctx_vlan_disabled() check
      out to the _vlan_ helpers callers. Thus these helpers
      cannot fail, which makes them very similar to the
      _port_ helpers. Have them take the MC context directly
      and unify them.

v2:
- Cover letter:
    - Add an example of a bpftrace-based probe script
- Patch #6:
    - Report IPv4 as an IPv6-mapped address through the IPv6 buffer
      as well, to save ring buffer space.
- Patch #7:
    - In br_multicast_port_ngroups_inc_one(), bounce
      if n>=max, not if n==max
    - Adjust extack messages to mention ngroups, now
      that the bounces appear when n>=max, not n==max
    - In __br_multicast_enable_port_ctx(), do not reset
      max to 0. Also do not count number of entries by
      going through _inc, as that would end up incorrectly
      bouncing the entries.
- Patch #8:
    - Drop locks around accesses in
      br_multicast_{port,vlan}_ngroups_{get,set_max}(),
    - Drop bounces due to max<n in
      br_multicast_{port,vlan}_ngroups_set_max().
- Patch #12:
    - In the comment at payload_template_calc_checksum(),
      s/%#02x/%02x/, that's the mausezahn payload format.
- Patch #16:
    - Adjust the tests that check setting max below n and
      reset of max on VLAN snooping enablement
    - Make test naming uniform
    - Enable testing of control path (IGMP/MLD) in
      mcast_vlan_snooping bridge
    - Reorganize the code so that test instances (per bridge
      type and configuration type) always come right after
      the test, in order of {d,q,qvs}{4,6}{cfg,ctl}.
      Then groups of selftests are at the end of the file.
      Similarly adjust invocation order of the tests.
====================

Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 15, 2023
Petr Machata says:

====================
mlxsw: Misc devlink changes

This patchset adjusts mlxsw to recent devlink changes in net-next.

Patch #1 removes a devl_param_driverinit_value_set() call that was
unnecessary, but now additionally triggers a WARN_ON.

Patches #2-#4 are non-functional preparations for the following patches.

Patch #5 fixes a use-after-free that is triggered while changing network
namespaces.

Patch #6 makes mlxsw consistent with netdevsim by having mlxsw register
its devlink instance before its sub-objects. It helps us avoid a warning
described in the commit message.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 17, 2023
Eduard Zingerman says:

====================

This patch-set is a part of preparation work for -mcpu=v4 option for
BPF C compiler (discussed in [1]). Among other things -mcpu=v4 should
enable generation of BPF_ST instruction by the compiler.

- Patches #1,2 adjust verifier to track values of constants written to
  stack using BPF_ST. Currently these are tracked imprecisely, unlike
  the writes using BPF_STX, e.g.:

    fp[-8] = 42;   currently verifier assumes that fp[-8]=mmmmmmmm
                   after such instruction, where m stands for "misc",
                   just a note that something is written at fp[-8].

    r1 = 42;       verifier tracks r1=42 after this instruction.
    fp[-8] = r1;   verifier tracks fp[-8]=42 after this instruction.

  This patch makes both cases equivalent.

- Patches #3,4 adjust verifier.c:check_stack_write_fixed_off() to
  preserve STACK_ZERO marks when BPF_ST writes zero. Currently these
  are replaced by STACK_MISC, unlike zero writes using BPF_STX, e.g.:

    ... stack range [X,Y] is marked as STACK_ZERO ...
    r0 = ... variable offset pointer to stack with range [X,Y] ...

    fp[r0] = 0;    currently verifier marks range [X,Y] as
                   STACK_MISC for such instructions.

    r1 = 0;
    fp[r0] = r1;   verifier keeps STACK_ZERO marks for range [X,Y].

  This patch makes both cases equivalent.

Motivating example for patch #1 could be found at [3].

Previous version of the patch-set is here [2], the changes are:
- Explicit initialization of fake register parent link is removed from
  verifier.c:check_stack_write_fixed_off() as parent links are now
  correctly handled by verifier.c:save_register_state().
- Original patch #1 is split in patches #1 & #3.
- Missing test case added for patch #3
  verifier.c:check_stack_write_fixed_off() adjustment.
- Test cases are updated to use .prog_type = BPF_PROG_TYPE_SK_LOOKUP,
  which requires return value to be in the range [0,1] (original test
  cases assumed that such range is always required, which is not true).
- Original patch #3 with changes allowing BPF_ST writes to context is
  withheld for now, w/o compiler support for BPF_ST it requires some
  creative testing.
- Original patch #5 is removed from the patch-set. This patch
  contained adjustments to expected verifier error messages in some
  tests, necessary when C compiler generates BPF_ST instruction
  instead of BPF_STX (changes to expected instruction indices). These
  changes are not necessary yet.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/[email protected]/
[3] https://lore.kernel.org/bpf/[email protected]/
====================

Signed-off-by: Alexei Starovoitov <[email protected]>
danielocfb pushed a commit that referenced this pull request Feb 22, 2023
There is a certain chance to trigger the following panic:

PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
 #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
 #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
 #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
 #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
 #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
 #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
 #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
    [exception RIP: ib_alloc_mr+19]
    RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
    RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
 #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
 #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]

The reason here is that when the server tries to create a second link,
smc_llc_srv_add_link() has no protection and may add a new link to
link group. This breaks the security environment protected by
llc_conf_mutex.

Fixes: 2d2209f ("net/smc: first part of add link processing as SMC server")
Signed-off-by: D. Wythe <[email protected]>
Reviewed-by: Larysa Zaremba <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 20, 2023
Noticed with:

  make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

Direct leak of 45 byte(s) in 1 object(s) allocated from:
    #0 0x7f213f87243b in strdup (/lib64/libasan.so.8+0x7243b)
    #1 0x63d15f in evsel__set_filter util/evsel.c:1371
    #2 0x63d15f in evsel__append_filter util/evsel.c:1387
    #3 0x63d15f in evsel__append_tp_filter util/evsel.c:1400
    #4 0x62cd52 in evlist__append_tp_filter util/evlist.c:1145
    #5 0x62cd52 in evlist__append_tp_filter_pids util/evlist.c:1196
    #6 0x541e49 in trace__set_filter_loop_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3646
    #7 0x541e49 in trace__set_filter_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3670
    #8 0x541e49 in trace__run /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3970
    #9 0x541e49 in cmd_trace /home/acme/git/perf-tools/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools/tools/perf/perf.c:421
    #13 0x4196da in main /home/acme/git/perf-tools/tools/perf/perf.c:537
    #14 0x7f213e84a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

Free it on evsel__exit().

Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 20, 2023
To plug these leaks detected with:

  $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

  =================================================================
  ==473890==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 112 byte(s) in 1 object(s) allocated from:
    #0 0x7fdf19aba097 in calloc (/lib64/libasan.so.8+0xba097)
    #1 0x987836 in zalloc (/home/acme/bin/perf+0x987836)
    #2 0x5367ae in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1289
    #3 0x5367ae in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307
    #4 0x5367ae in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468
    #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177
    #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685
    #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712
    #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055
    #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    #13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    #14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Direct leak of 2048 byte(s) in 1 object(s) allocated from:
    #0 0x7f788fcba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af)
    #1 0x5337c0 in trace__sys_enter /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2342
    #2 0x52bfb4 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3191
    #3 0x52bfb4 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3699
    #4 0x542883 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3726
    #5 0x542883 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4069
    #6 0x542883 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5155
    #7 0x5ef232 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #8 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #9 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    #10 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    #11 0x7f788ec4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Indirect leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7fdf19aba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af)
    #1 0x77b335 in intlist__new util/intlist.c:116
    #2 0x5367fd in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1293
    #3 0x5367fd in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307
    #4 0x5367fd in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468
    #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177
    #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685
    #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712
    #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055
    #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    #13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    #14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 20, 2023
In 3cb4d5e ("perf trace: Free syscall tp fields in
evsel->priv") it only was freeing if strcmp(evsel->tp_format->system,
"syscalls") returned zero, while the corresponding initialization of
evsel->priv was being performed if it was _not_ zero, i.e. if the tp
system wasn't 'syscalls'.

Just stop looking for that and free it if evsel->priv was set, which
should be equivalent.

Also use the pre-existing evsel_trace__delete() function.

This resolves these leaks, detected with:

  $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

  =================================================================
  ==481565==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
      #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966)
      #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307
      #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333
      #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458
      #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480
      #6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212
      #7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891
      #8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156
      #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
      #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
      #13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
      #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966)
      #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307
      #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333
      #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458
      #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480
      #6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205
      #7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891
      #8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156
      #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
      #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
      #13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s).
  [root@quaco ~]#

With this we plug all leaks with "perf trace sleep 1".

Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv")
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 20, 2023
…failure to add a probe

Building perf with EXTRA_CFLAGS="-fsanitize=address" a leak is detect
when trying to add a probe to a non-existent function:

  # perf probe -x ~/bin/perf dso__neW
  Probe point 'dso__neW' not found.
    Error: Failed to add events.

  =================================================================
  ==296634==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
      #0 0x7f67642ba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x7f67641a76f1 in allocate_cfi (/lib64/libdw.so.1+0x3f6f1)

  Direct leak of 65 byte(s) in 1 object(s) allocated from:
      #0 0x7f67642b95b5 in __interceptor_realloc.part.0 (/lib64/libasan.so.8+0xb95b5)
      #1 0x6cac75 in strbuf_grow util/strbuf.c:64
      #2 0x6ca934 in strbuf_init util/strbuf.c:25
      #3 0x9337d2 in synthesize_perf_probe_point util/probe-event.c:2018
      #4 0x92be51 in try_to_find_probe_trace_events util/probe-event.c:964
      #5 0x93d5c6 in convert_to_probe_trace_events util/probe-event.c:3512
      #6 0x93d6d5 in convert_perf_probe_events util/probe-event.c:3529
      #7 0x56f37f in perf_add_probe_events /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:354
      #8 0x572fbc in __cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:738
      #9 0x5730f2 in cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:766
      #10 0x635d81 in run_builtin /var/home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #11 0x6362c1 in handle_internal_command /var/home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #12 0x63667a in run_argv /var/home/acme/git/perf-tools-next/tools/perf/perf.c:421
      #13 0x636b8d in main /var/home/acme/git/perf-tools-next/tools/perf/perf.c:537
      #14 0x7f676302950f in __libc_start_call_main (/lib64/libc.so.6+0x2950f)

  SUMMARY: AddressSanitizer: 193 byte(s) leaked in 2 allocation(s).
  #

synthesize_perf_probe_point() returns a "detachec" strbuf, i.e. a
malloc'ed string that needs to be free'd.

An audit will be performed to find other such cases.

Acked-by: Masami Hiramatsu <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 20, 2023
While debugging a segfault on 'perf lock contention' without an
available perf.data file I noticed that it was basically calling:

	perf_session__delete(ERR_PTR(-1))

Resulting in:

  (gdb) run lock contention
  Starting program: /root/bin/perf lock contention
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib64/libthread_db.so.1".
  failed to open perf.data: No such file or directory  (try 'perf record' first)
  Initializing perf session failed

  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  2858		if (!session->auxtrace)
  (gdb) p session
  $1 = (struct perf_session *) 0xffffffffffffffff
  (gdb) bt
  #0  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  #1  0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300
  #2  0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161
  #3  0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604
  #4  0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322
  #5  0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375
  #6  0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419
  #7  0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535
  (gdb)

So just set it to NULL after using PTR_ERR(session) to decode the error
as perf_session__delete(NULL) is supported.

The same problem was found in 'perf top' after an audit of all
perf_session__new() failure handling.

Fixes: 6ef81c5 ("perf session: Return error code for perf_session__new() function on failure")
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Budankov <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jeremie Galarneau <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kate Stewart <[email protected]>
Cc: Mamatha Inamdar <[email protected]>
Cc: Mukesh Ojha <[email protected]>
Cc: Nageswara R Sastry <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Shawn Landden <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tzvetomir Stoyanov <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 20, 2023
While debugging a segfault on 'perf lock contention' without an
available perf.data file I noticed that it was basically calling:

	perf_session__delete(ERR_PTR(-1))

Resulting in:

  (gdb) run lock contention
  Starting program: /root/bin/perf lock contention
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib64/libthread_db.so.1".
  failed to open perf.data: No such file or directory  (try 'perf record' first)
  Initializing perf session failed

  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  2858		if (!session->auxtrace)
  (gdb) p session
  $1 = (struct perf_session *) 0xffffffffffffffff
  (gdb) bt
  #0  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  #1  0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300
  #2  0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161
  #3  0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604
  #4  0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322
  #5  0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375
  #6  0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419
  #7  0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535
  (gdb)

So just set it to NULL after using PTR_ERR(session) to decode the error
as perf_session__delete(NULL) is supported.

Fixes: eef4fee ("perf lock: Dynamically allocate lockhash_table")
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mamatha Inamdar <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Tiezhu Yang <[email protected]>
Cc: Yang Jihong <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 20, 2023
Jiri Pirko says:

====================
expose devlink instances relationships

From: Jiri Pirko <[email protected]>

Currently, the user can instantiate new SF using "devlink port add"
command. That creates an E-switch representor devlink port.

When user activates this SF, there is an auxiliary device created and
probed for it which leads to SF devlink instance creation.

There is 1:1 relationship between E-switch representor devlink port and
the SF auxiliary device devlink instance.

Also, for example in mlx5, one devlink instance is created for
PCI device and one is created for an auxiliary device that represents
the uplink port. The relation between these is invisible to the user.

Patches #1-#3 and #5 are small preparations.

Patch #4 adds netnsid attribute for nested devlink if that in a
different namespace.

Patch #5 is the main one in this set, introduces the relationship
tracking infrastructure later on used to track SFs, linecards and
devlink instance relationships with nested devlink instances.

Expose the relation to the user by introducing new netlink attribute
DEVLINK_PORT_FN_ATTR_DEVLINK which contains the devlink instance related
to devlink port function. This is done by patch #8.
Patch #9 implements this in mlx5 driver.

Patch #10 converts the linecard nested devlink handling to the newly
introduced rel infrastructure.

Patch #11 benefits from the rel infra and introduces possiblitily to
have relation between devlink instances.
Patch #12 implements this in mlx5 driver.

Examples:
$ devlink dev
pci/0000:08:00.0: nested_devlink auxiliary/mlx5_core.eth.0
pci/0000:08:00.1: nested_devlink auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.0

$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 106
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached roce enable
$ devlink port function set pci/0000:08:00.0/32768 state active
$ devlink port show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2

$ devlink port show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2 nested_devlink_netns ns1
====================

Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Sep 27, 2023
Sebastian Andrzej Siewior says:

====================
net: hsr: Properly parse HSRv1 supervisor frames.

this is a follow-up to
	https://lore.kernel.org/all/[email protected]/
replacing
	https://lore.kernel.org/all/[email protected]/

by grabing/ adding tags and reposting with a commit message plus a
missing __packed to a struct (#2) plus extending the testsuite to sover
HSRv1 which is what broke here (#3-#5).

HSRv0 is (was) not affected.
====================

Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 16, 2023
Andrii Nakryiko says:

====================
This patch set fixes ambiguity in BPF verifier log output of SCALAR register
in the parts that emit umin/umax, smin/smax, etc ranges. See patch #4 for
details.

Also, patch #5 fixes an issue with verifier log missing instruction context
(state) output for conditionals that trigger precision marking. See details in
the patch.

First two patches are just improvements to two selftests that are very flaky
locally when run in parallel mode.

Patch #3 changes 'align' selftest to be less strict about exact verifier log
output (which patch #4 changes, breaking lots of align tests as written). Now
test does more of a register substate checks, mostly around expected var_off()
values. This 'align' selftests is one of the more brittle ones and requires
constant adjustment when verifier log output changes, without really catching
any new issues. So hopefully these changes can minimize future support efforts
for this specific set of tests.
====================

Signed-off-by: Daniel Borkmann <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 17, 2023
Fix the deadlock by refactoring the MR cache cleanup flow to flush the
workqueue without holding the rb_lock.
This adds a race between cache cleanup and creation of new entries which
we solve by denied creation of new entries after cache cleanup started.

Lockdep:
WARNING: possible circular locking dependency detected
 [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted
 [ 2785.339778 ] ------------------------------------------------------
 [ 2785.340848 ] devlink/53872 is trying to acquire lock:
 [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900
 [ 2785.343403 ]
 [ 2785.343403 ] but task is already holding lock:
 [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]
 [ 2785.346273 ]
 [ 2785.346273 ] which lock already depends on the new lock.
 [ 2785.346273 ]
 [ 2785.347720 ]
 [ 2785.347720 ] the existing dependency chain (in reverse order) is:
 [ 2785.349003 ]
 [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}:
 [ 2785.350160 ]        __mutex_lock+0x14c/0x15c0
 [ 2785.350962 ]        delayed_cache_work_func+0x2d1/0x610 [mlx5_ib]
 [ 2785.352044 ]        process_one_work+0x7c2/0x1310
 [ 2785.352879 ]        worker_thread+0x59d/0xec0
 [ 2785.353636 ]        kthread+0x28f/0x330
 [ 2785.354370 ]        ret_from_fork+0x1f/0x30
 [ 2785.355135 ]
 [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}:
 [ 2785.356515 ]        __lock_acquire+0x2d8a/0x5fe0
 [ 2785.357349 ]        lock_acquire+0x1c1/0x540
 [ 2785.358121 ]        __flush_work+0xe8/0x900
 [ 2785.358852 ]        __cancel_work_timer+0x2c7/0x3f0
 [ 2785.359711 ]        mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib]
 [ 2785.360781 ]        mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib]
 [ 2785.361969 ]        __mlx5_ib_remove+0x68/0x120 [mlx5_ib]
 [ 2785.362960 ]        mlx5r_remove+0x63/0x80 [mlx5_ib]
 [ 2785.363870 ]        auxiliary_bus_remove+0x52/0x70
 [ 2785.364715 ]        device_release_driver_internal+0x3c1/0x600
 [ 2785.365695 ]        bus_remove_device+0x2a5/0x560
 [ 2785.366525 ]        device_del+0x492/0xb80
 [ 2785.367276 ]        mlx5_detach_device+0x1a9/0x360 [mlx5_core]
 [ 2785.368615 ]        mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core]
 [ 2785.369934 ]        mlx5_devlink_reload_down+0x292/0x580 [mlx5_core]
 [ 2785.371292 ]        devlink_reload+0x439/0x590
 [ 2785.372075 ]        devlink_nl_cmd_reload+0xaef/0xff0
 [ 2785.372973 ]        genl_family_rcv_msg_doit.isra.0+0x1bd/0x290
 [ 2785.374011 ]        genl_rcv_msg+0x3ca/0x6c0
 [ 2785.374798 ]        netlink_rcv_skb+0x12c/0x360
 [ 2785.375612 ]        genl_rcv+0x24/0x40
 [ 2785.376295 ]        netlink_unicast+0x438/0x710
 [ 2785.377121 ]        netlink_sendmsg+0x7a1/0xca0
 [ 2785.377926 ]        sock_sendmsg+0xc5/0x190
 [ 2785.378668 ]        __sys_sendto+0x1bc/0x290
 [ 2785.379440 ]        __x64_sys_sendto+0xdc/0x1b0
 [ 2785.380255 ]        do_syscall_64+0x3d/0x90
 [ 2785.381031 ]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
 [ 2785.381967 ]
 [ 2785.381967 ] other info that might help us debug this:
 [ 2785.381967 ]
 [ 2785.383448 ]  Possible unsafe locking scenario:
 [ 2785.383448 ]
 [ 2785.384544 ]        CPU0                    CPU1
 [ 2785.385383 ]        ----                    ----
 [ 2785.386193 ]   lock(&dev->cache.rb_lock);
 [ 2785.386940 ]				lock((work_completion)(&(&ent->dwork)->work));
 [ 2785.388327 ]				lock(&dev->cache.rb_lock);
 [ 2785.389425 ]   lock((work_completion)(&(&ent->dwork)->work));
 [ 2785.390414 ]
 [ 2785.390414 ]  *** DEADLOCK ***
 [ 2785.390414 ]
 [ 2785.391579 ] 6 locks held by devlink/53872:
 [ 2785.392341 ]  #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
 [ 2785.393630 ]  #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0
 [ 2785.395324 ]  #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core]
 [ 2785.397322 ]  #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core]
 [ 2785.399231 ]  #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600
 [ 2785.400864 ]  #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]

Fixes: b958451 ("RDMA/mlx5: Change the cache structure to an RB-tree")
Signed-off-by: Shay Drory <[email protected]>
Signed-off-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 17, 2023
The following call trace shows a deadlock issue due to recursive locking of
mutex "device_mutex". First lock acquire is in target_for_each_device() and
second in target_free_device().

 PID: 148266   TASK: ffff8be21ffb5d00  CPU: 10   COMMAND: "iscsi_ttx"
  #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f
  #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224
  #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee
  #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7
  #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3
  #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c
  #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod]
  #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod]
  #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f
  #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583
 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod]
 #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc
 #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod]
 #13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod]
 #14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod]
 #15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod]
 #16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07
 #17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod]
 #18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod]
 #19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080
 #20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364

Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion")
Signed-off-by: Junxiao Bi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 17, 2023
Petr Machata says:

====================
mlxsw: Control the order of blocks in ACL region

Amit Cohen writes:

For 12 key blocks in the A-TCAM, rules are split into two records, which
constitute two lookups. The two records are linked using a
"large entry key ID".

Due to a Spectrum-4 hardware issue, KVD entries that correspond to key
blocks 0 to 5 of 12 key blocks will be placed in the same KVD pipe if they
only differ in their "large entry key ID", as it is ignored. This results
in a reduced scale, we can insert less than 20k filters and get an error:

    $ tc -b flower.batch
    RTNETLINK answers: Input/output error
    We have an error talking to the kernel

To reduce the probability of this issue, we can place key blocks with
high entropy in blocks 0 to 5. The idea is to place blocks that are often
changed in blocks 0 to 5, for example, key blocks that match on IPv4
addresses or the LSBs of IPv6 addresses. Such placement will reduce the
probability of these blocks to be same.

Mark several blocks with 'high_entropy' flag and place them in blocks 0
to 5. Note that the list of the blocks is just a suggestion, I will verify
it with architects.

Currently, there is a one loop that chooses which blocks should be used
for a given list of elements and fills the blocks - when a block is
chosen, it fills it in the region. To be able to control the order of
the blocks, separate between searching blocks and filling them. Several
pre-changes are required.

Patch set overview:
Patch #1 marks several blocks with 'high_entropy' flag.
Patches #2-#4 prepare the code for filling blocks at the end of the search.
Patch #5 changes the loop to just choose the blocks and fill the blocks at
the end.
====================

Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 17, 2023
Amit Cohen says:

====================
Extend VXLAN driver to support FDB flushing

The merge commit 9271686 ("Merge branch 'br-flush-filtering'") added
support for FDB flushing in bridge driver. Extend VXLAN driver to support
FDB flushing also. Add support for filtering by fields which are relevant
for VXLAN FDBs:
* Source VNI
* Nexthop ID
* 'router' flag
* Destination VNI
* Destination Port
* Destination IP

Without this set, flush for VXLAN device fails:
$ bridge fdb flush dev vx10
RTNETLINK answers: Operation not supported

With this set, such flush works with the relevant arguments, for example:
$ bridge fdb flush dev vx10 vni 5000 dst 193.2.2.1
< flush all vx10 entries with VNI 5000 and destination IP 193.2.2.1>

Some preparations are required, handle them before adding flushing support
in VXLAN driver. See more details in commit messages.

Patch set overview:
Patch #1 prepares flush policy to be used by VXLAN driver
Patches #2-#3 are preparations in VXLAN driver
Patch #4 adds an initial support for flushing in VXLAN driver
Patches #5-#9 add support for filtering by several attributes
Patch #10 adds a test for FDB flush with VXLAN
Patch #11 extends the test to check FDB flush with bridge
====================

Acked-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 26, 2023
Chuyi Zhou says:

====================
Add Open-coded task, css_task and css iters

This is version 6 of task, css_task and css iters support.

--- Changelog ---

v5 -> v6:

Patch #3:
 * In bpf_iter_task_next, return pos rather than goto out. (Andrii)
Patch #2, #3, #4:
 * Add the missing __diag_ignore_all to avoid kernel build warning
Patch #5, #6, #7:
 * Add Andrii's ack

Patch #8:
 * In BPF prog iter_css_task_for_each, return -EPERM rather than 0, and
   ensure stack_mprotect() in iters.c not success. If not, it would cause
   the subsequent 'test_lsm' fail, since the 'is_stack' check in
   test_int_hook(lsm.c) would not be guaranteed.
   (https://github.com/kernel-patches/bpf/actions/runs/6489662214/job/17624665086?pr=5790)

v4 -> v5:https://lore.kernel.org/lkml/[email protected]/

Patch 3~4:
 * Relax the BUILD_BUG_ON check in bpf_iter_task_new and bpf_iter_css_new to avoid
   netdev/build_32bit CI error.
   (https://netdev.bots.linux.dev/static/nipa/790929/13412333/build_32bit/stderr)
Patch 8:
 * Initialize skel pointer to fix the LLVM-16 build CI error
   (https://github.com/kernel-patches/bpf/actions/runs/6462875618/job/17545170863)

v3 -> v4:https://lore.kernel.org/all/[email protected]/

* Address all the comments from Andrii in patch-3 ~ patch-6
* Collect Tejun's ack
* Add a extra patch to rename bpf_iter_task.c to bpf_iter_tasks.c
* Seperate three BPF program files for selftests (iters_task.c iters_css_task.c iters_css.c)

v2 -> v3:https://lore.kernel.org/lkml/[email protected]/

Patch 1 (cgroup: Prepare for using css_task_iter_*() in BPF)
  * Add tj's ack and Alexei's suggest-by.
Patch 2 (bpf: Introduce css_task open-coded iterator kfuncs)
  * Use bpf_mem_alloc/bpf_mem_free rather than kzalloc()
  * Add KF_TRUSTED_ARGS for bpf_iter_css_task_new (Alexei)
  * Move bpf_iter_css_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 3 (Introduce task open coded iterator kfuncs)
  * Change th API design keep consistent with SEC("iter/task"), support
    iterating all threads(BPF_TASK_ITERATE_ALL) and threads of a
    specific task (BPF_TASK_ITERATE_THREAD).(Andrii)
  * Move bpf_iter_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 4 (Introduce css open-coded iterator kfuncs)
  * Change th API design keep consistent with cgroup_iters, reuse
    BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST
    /BPF_CGROUP_ITER_ANCESTORS_UP(Andrii)
  * Add KF_TRUSTED_ARGS for bpf_iter_css_new
  * Move bpf_iter_css's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 5 (teach the verifier to enforce css_iter and task_iter in RCU CS)
  * Add KF flag KF_RCU_PROTECTED to maintain kfuncs which need RCU CS.(Andrii)
  * Consider STACK_ITER when using bpf_for_each_spilled_reg.
Patch 6 (Let bpf_iter_task_new accept null task ptr)
  * Add this extra patch to let bpf_iter_task_new accept a 'nullable'
  * task pointer(Andrii)
Patch 7 (selftests/bpf: Add tests for open-coded task and css iter)
  * Add failure testcase(Alexei)

Changes from v1(https://lore.kernel.org/lkml/[email protected]/):
- Add a pre-patch to make some preparations before supporting css_task
  iters.(Alexei)
- Add an allowlist for css_task iters(Alexei)
- Let bpf progs do explicit bpf_rcu_read_lock() when using process
  iters and css_descendant iters.(Alexei)
---------------------

In some BPF usage scenarios, it will be useful to iterate the process and
css directly in the BPF program. One of the expected scenarios is
customizable OOM victim selection via BPF[1].

Inspired by Dave's task_vma iter[2], this patchset adds three types of
open-coded iterator kfuncs:

1. bpf_task_iters. It can be used to
1) iterate all process in the system, like for_each_forcess() in kernel.
2) iterate all threads in the system.
3) iterate all threads of a specific task

2. bpf_css_iters. It works like css_task_iter_{start, next, end} and would
be used to iterating tasks/threads under a css.

3. css_iters. It works like css_next_descendant_{pre, post} to iterating all
descendant css.

BPF programs can use these kfuncs directly or through bpf_for_each macro.

link[1]: https://lore.kernel.org/lkml/[email protected]/
link[2]: https://lore.kernel.org/all/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
danielocfb pushed a commit that referenced this pull request Oct 26, 2023
Eduard Zingerman says:

====================
exact states comparison for iterator convergence checks

Iterator convergence logic in is_state_visited() uses state_equals()
for states with branches counter > 0 to check if iterator based loop
converges. This is not fully correct because state_equals() relies on
presence of read and precision marks on registers. These marks are not
guaranteed to be finalized while state has branches.
Commit message for patch #3 describes a program that exhibits such
behavior.

This patch-set aims to fix iterator convergence logic by adding notion
of exact states comparison. Exact comparison does not rely on presence
of read or precision marks and thus is more strict.
As explained in commit message for patch #3 exact comparisons require
addition of speculative register bounds widening. The end result for
BPF verifier users could be summarized as follows:

(!) After this update verifier would reject programs that conjure an
    imprecise value on the first loop iteration and use it as precise
    on the second (for iterator based loops).

I urge people to at least skim over the commit message for patch #3.

Patches are organized as follows:
- patches #1,2: moving/extracting utility functions;
- patch #3: introduces exact mode for states comparison and adds
  widening heuristic;
- patch #4: adds test-cases that demonstrate why the series is
  necessary;
- patch #5: extends patch #3 with a notion of state loop entries,
  these entries have to be tracked to correctly identify that
  different verifier states belong to the same states loop;
- patch #6: adds a test-case that demonstrates a program
  which requires loop entry tracking for correct verification;
- patch #7: just adds a few debug prints.

The following actions are planned as a followup for this patch-set:
- implementation has to be adapted for callbacks handling logic as a
  part of a fix for [1];
- it is necessary to explore ways to improve widening heuristic to
  handle iters_task_vma test w/o need to insert barrier_var() calls;
- explored states eviction logic on cache miss has to be extended
  to either:
  - allow eviction of checkpoint states -or-
  - be sped up in case if there are many active checkpoints associated
    with the same instruction.

The patch-set is a followup for mailing list discussion [1].

Changelog:
- V2 [3] -> V3:
  - correct check for stack spills in widen_imprecise_scalars(),
    added test case progs/iters.c:widen_spill to check the behavior
    (suggested by Andrii);
  - allow eviction of checkpoint states in is_state_visited() to avoid
    pathological verifier performance when iterator based loop does not
    converge (discussion with Alexei).
- V1 [2] -> V2, applied changes suggested by Alexei offlist:
  - __explored_state() function removed;
  - same_callsites() function is now used in clean_live_states();
  - patches #1,2 are added as preparatory code movement;
  - in process_iter_next_call() a safeguard is added to verify that
    cur_st->parent exists and has expected insn index / call sites.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/[email protected]/
[3] https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
danielocfb pushed a commit that referenced this pull request Nov 1, 2023
Jiri Pirko says:

====================
devlink: fix a deadlock when taking devlink instance lock while holding RTNL lock

devlink_port_fill() may be called sometimes with RTNL lock held.
When putting the nested port function devlink instance attrs,
current code takes nested devlink instance lock. In that case lock
ordering is wrong.

Patch #1 is a dependency of patch #2.
Patch #2 converts the peernet2id_alloc() call to rely in RCU so it could
         called without devlink instance lock.
Patch #3 takes device reference for devlink instance making sure that
         device does not disappear before devlink_release() is called.
Patch #4 benefits from the preparations done in patches #2 and #3 and
         removes the problematic nested devlink lock aquisition.
Patched #5-#7 improve documentation to reflect this issue so it is
              avoided in the future.
====================

Signed-off-by: David S. Miller <[email protected]>
danielocfb pushed a commit that referenced this pull request Nov 1, 2023
Petr Machata says:

====================
mlxsw: Move allocation of LAG table to the driver

PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.

Within the PGT is placed a LAG table. That is a contiguous block of PGT
memory where each entry describes which ports are members of the
corresponding LAG port.

The PGT is split to two parts: one managed by the FW, and one managed by
the driver. Historically, the FW part included also the LAG table, referred
to as FW LAG mode. Giving the responsibility for placement of the LAG table
to the driver, referred to as SW LAG mode, makes the whole system more
flexible. The FW currently supports both FW and SW LAG modes. To shed
complexity, the FW should in the future only support SW LAG mode.

Hence this patchset, where support for placement of LAG is added to mlxsw.

There are FW versions out there that do not support SW LAG mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both modes of operation.

Another aspect is that at least on Spectrum-1, there are FW versions out
there that claim to support driver-placed LAG table, but then reject or
ignore configurations enabling the same. The driver thus has to have a say
in whether an attempt to configure SW LAG mode should even be done.

The feature is therefore expressed in terms of "does the driver prefer SW
LAG mode?", and "what LAG mode the PCI module managed to configure the FW
with". This is unlike current flood mode configuration, where the driver
can give a strict value, and that's what gets configured. But it gives a
chance to the driver to determine whether LAG mode should be enabled at
all.

The "does the driver prefer SW LAG mode?" bit is expressed as a boolean
lag_mode_prefer_sw. The reason for this is largely another feature that
will be introduced in a follow-up patchset: support for CFF flood mode. The
driver currently requires that the FW be configured with what is called
controlled flood mode. But on capable systems, CFF would be preferred. So
there are two values in flight: the preferred flood mode, and the fallback.
This could be expressed with an array of flood modes ordered by preference,
but that looks like an overkill in comparison. This flag/value model is
then reused for LAG mode as well, except the fallback value is absent and
implied to be FW, because there are no other values to choose from.

The patchset progresses as follows:

- Patches #1 to #5 adjust reg.h and cmd.h with new register fields,
  constants and remarks.

- Patches #6 and #7 add the ability to request SW LAG mode and to query the
  LAG mode that was actually negotiated. This is where the abovementioned
  lag_mode_prefer_sw flag is added.

- Patches #7 to #9 generalize PGT allocations to make it possible to
  allocate the LAG table, which is done in patch #10.

- In patch #11, toggle lag_mode_prefer_sw on Spectrum-2 and above, which
  makes the newly-added code live.
====================

Signed-off-by: David S. Miller <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants