
bpf: tracing multi-link support #5383

Closed

Conversation

kernel-patches-daemon-bpf-rc[bot]

Pull request for series with
subject: bpf: tracing multi-link support
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=966845

Kernel Patches Daemon and others added 26 commits May 27, 2025 20:21
For now, there isn't a way to set and get per-function metadata with low
overhead, which is inconvenient in some situations. Take the BPF
trampoline as an example: we need to create a trampoline for each kernel
function, as we have to store some information about the function in the
trampoline, such as the BPF progs, the function arg count, etc. Creating
all of these trampolines can add significant performance overhead and
memory consumption. With per-function metadata storage, we can store this
information in the metadata and create one global BPF trampoline for all
the kernel functions. In the global trampoline, we get the information
that we need from the function metadata through the ip (function address)
with almost no overhead.

Another beneficiary can be fprobe. For now, fprobe adds all the functions
that it hooks to a hash table, and fprobe_entry() looks up all the
handlers of the function in that hash table. The performance can suffer
from the hash table lookup. We can optimize this by storing the handlers
in the function metadata instead.

Support per-function metadata storage in the function padding; the
previous discussion can be found in [1]. Generally speaking, we have two
ways to implement this feature:

1. Create a function metadata array, and prepend an insn which holds the
index of the function metadata in the array. Store the insn in the
function padding.

2. Allocate the function metadata with kmalloc(), and prepend an insn
which holds the pointer to the metadata. Store the insn in the function
padding.

Compared with way 2, way 1 consumes less space, but we need to do more
work to manage the global function metadata array. We implement this
feature with way 1.
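
The following is a minimal sketch of the array-based scheme (way 1). The
names are modeled on this series, but the layout and the
KFUNC_MD_INSN_OFFSET constant are illustrative assumptions, not the exact
implementation:

  /* sketch: a global metadata array indexed by the immediate that
   * the prepended insn stores in the function padding
   */
  struct kfunc_md {
          unsigned long func;       /* kernel function address (ip) */
          u8 nr_args;               /* function arg count */
          /* BPF progs, flags, ... */
  };

  static struct kfunc_md *kfunc_mds;  /* global metadata array */

  static inline struct kfunc_md *kfunc_md_get_by_ip(unsigned long ip)
  {
          u32 index;

          /* the 4-byte index is the immediate of the insn stored in
           * the padding right before the function entry (assumed
           * offset; see the next patch for the real layout)
           */
          index = *(u32 *)(ip - KFUNC_MD_INSN_OFFSET);
          return &kfunc_mds[index];
  }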

Link: https://lore.kernel.org/bpf/CADxym3anLzM6cAkn_z71GDd_VeKiqqk1ts=xuiP7pr4PO6USPA@mail.gmail.com/ [1]
Signed-off-by: Menglong Dong <[email protected]>
With CONFIG_CALL_PADDING enabled, there is a 16-byte padding space before
every kernel function, and some kernel features can use it, such as
MITIGATION_CALL_DEPTH_TRACKING, CFI_CLANG, FINEIBT, etc.

In my research, MITIGATION_CALL_DEPTH_TRACKING consumes the tail 9 bytes
of the function padding, CFI_CLANG consumes the head 5 bytes, and FINEIBT
consumes all 16 bytes if it is enabled. So there is no space for us if
MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled, or if
FINEIBT is enabled.

In order to implement the padding-based function metadata, we need 5 bytes
on x86_64 to prepend a "mov %eax, xxx" insn, which can hold a 4-byte
index. So we have the following logic:

1. use the head 5 bytes if CFI_CLANG is not enabled
2. use the tail 5 bytes if MITIGATION_CALL_DEPTH_TRACKING and FINEIBT are
   not enabled
3. otherwise, probe dynamically after the kernel boots whether FineIBT or
   the call thunks are enabled

In the third case, we implement the function metadata with a hash table if
"cfi_mode == CFI_FINEIBT || thunks_initialized". Therefore, we need to
make thunks_initialized global in arch/x86/kernel/callthunks.c.
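
The following is a rough sketch of how the 5-byte insn could be written
into the head of the padding on x86_64 (the CFI_CLANG-disabled case). The
helper name and the plain memcpy() are illustrative assumptions; real code
has to patch live kernel text:

  /* "mov $imm32, %eax" is encoded as B8 <imm32>, i.e. 5 bytes */
  #define KFUNC_MD_INSN_SIZE  5

  static void kfunc_md_write_index(void *padding, u32 index)
  {
          u8 insn[KFUNC_MD_INSN_SIZE];

          insn[0] = 0xb8;                          /* mov imm32, %eax */
          memcpy(&insn[1], &index, sizeof(index));
          /* assumption: a real implementation would go through
           * text_poke()-style helpers instead of a plain memcpy()
           */
          memcpy(padding, insn, sizeof(insn));
  }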

Signed-off-by: Menglong Dong <[email protected]>
Per-function metadata storage is already used by ftrace if
CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS is enabled: it stores the pointer to
the callback directly in the function padding, which consumes 8 bytes,
since commit
baaf553 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS").
So we can store the index directly in the function padding too, without
prepending an insn. With CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS enabled,
functions are 8-byte aligned, and we compile the kernel with an extra
8 bytes (2 NOPs) of padding space. Otherwise, functions are 4-byte
aligned, and only an extra 4 bytes (1 NOP) is needed.

However, we have the same problem that Mark describes in the commit above:
we can't use the function padding together with CFI_CLANG, as it can make
clang compute a wrong offset to the pre-function type hash. So we fall
back to the hash table mode for function metadata if CFI_CLANG is enabled.

Signed-off-by: Menglong Dong <[email protected]>
Introduce the struct kfunc_md_tramp_prog for BPF_PROG_TYPE_TRACING, and
add the field "bpf_progs" to struct kfunc_md. These fields will be used
in the next patch for the bpf global trampoline.

The flag KFUNC_MD_FL_TRACING_ORIGIN is introduced to indicate that the
origin call is needed for this function.

Add the functions kfunc_md_bpf_link() and kfunc_md_bpf_unlink() to add a
bpf prog to, or remove it from, a kfunc_md. Meanwhile, introduce
kfunc_md_bpf_ips() to get all the kernel functions in kfunc_mds that
contain bpf progs.

The flag KFUNC_MD_FL_BPF_REMOVING indicates that a removal is in
progress, and we shouldn't return the function in kfunc_md_bpf_ips() if
"bpf_prog_cnt <= 1".
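
As a rough illustration, the prog lists could hang off the kfunc_md as
sketched below; the field names are assumptions based on the description
above, not the exact struct from the patch:

  struct kfunc_md_tramp_prog {
          struct kfunc_md_tramp_prog *next;   /* per-type prog list */
          struct bpf_prog *prog;
          u64 cookie;
  };

  struct kfunc_md {
          unsigned long func;                 /* traced function address */
          u32 flags;                          /* KFUNC_MD_FL_* */
          u16 bpf_prog_cnt;
          u8 nr_args;
          /* one list each for fentry, modify_return and fexit */
          struct kfunc_md_tramp_prog *bpf_progs[BPF_TRAMP_MAX];
  };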

Signed-off-by: Menglong Dong <[email protected]>
Implement the bpf global trampoline "bpf_global_caller" for x86_64. The
logic of it is similar to the bpf trampoline:

1. save the regs for the function args. For now, only functions with no
   more than 6 args are supported
2. save rbx and r12, which will be used to store the prog list and the
   return value of __bpf_prog_enter_recur
3. get the origin function address from the stack. To get the real
   function address, we make it "&= $0xfffffffffffffff0", as it is always
   16-bytes aligned
4. get the function metadata by calling kfunc_md_get_noref()
5. get the function args count from the kfunc_md and store it on the
   stack
6. get the kfunc_md flags and store them on the stack. Call
   kfunc_md_enter() if the origin call is needed
7. get the prog list for FENTRY, and run all the progs in the list with
   bpf_caller_prog_run
8. go to the end if the origin call is not necessary
9. get the prog list for MODIFY_RETURN, and run all the progs in the list
   with bpf_caller_prog_run
10. restore the regs and do the origin call. We get the ip of the origin
    function from the rip on the stack
11. save the return value of the origin call to the stack
12. get the prog list for FEXIT, and run all the progs in the list with
    bpf_caller_prog_run
13. restore rbx, r12 and r13. In order to rebalance the RSB, we call
    bpf_global_caller_rsb here.
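
In C-like pseudocode, the flow above might look roughly like this. The
real implementation is hand-written x86_64 assembly, and run_prog() /
call_origin() here are placeholders for the enter/exit bookkeeping and the
register restore + call, not real helpers:

  /* C-level sketch of bpf_global_caller; illustrative only */
  static notrace u64 bpf_global_caller_c(unsigned long ip, u64 *args)
  {
          struct kfunc_md *md = kfunc_md_get_noref(ip & ~0xfUL);
          struct kfunc_md_tramp_prog *p;
          u64 ret = 0;

          for (p = md->bpf_progs[BPF_TRAMP_FENTRY]; p; p = p->next)
                  run_prog(p->prog, args);

          if (md->flags & KFUNC_MD_FL_TRACING_ORIGIN) {
                  for (p = md->bpf_progs[BPF_TRAMP_MODIFY_RETURN]; p; p = p->next)
                          run_prog(p->prog, args);

                  ret = call_origin(ip, args);    /* restore regs, call func */
                  args[md->nr_args] = ret;        /* return value slot */

                  for (p = md->bpf_progs[BPF_TRAMP_FEXIT]; p; p = p->next)
                          run_prog(p->prog, args);
          }
          return ret;
  }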

An indirect call is used in bpf_caller_prog_run, as we load the function
address from the stack and call it in the origin call case. What's more,
we get the bpf progs from the kfunc_md and call them indirectly. We make
these indirect calls with CALL_NOSPEC, and I'm not sure if that is enough
to prevent Spectre. I just saw others do it the same way :/

We use r13 to keep the address on the stack where we put the return value
of the origin call. Its offset is "FUNC_ARGS_OFFSET + 8 * nr_args".

The call to kfunc_md_get_noref() should be within rcu_read_lock(), which
I don't do, as that would add the overhead of a function call. I'm
considering running the bpf prog lists within the rcu lock instead:

  rcu_read_lock()
  kfunc_md_get_noref()
  call fentry progs
  call modify_return progs
  rcu_read_unlock()

  call origin

  rcu_read_lock()
  call fexit progs
  rcu_read_unlock()

I'm not sure why the regular bpf trampoline doesn't do it this way. Is it
because this would make the trampoline hold the rcu lock for too long?

Signed-off-by: Menglong Dong <[email protected]>
Factor out ftrace_direct_update() from register_ftrace_direct(); it is
used to add new entries to direct_functions. This function will be used
in a later patch.

Signed-off-by: Menglong Dong <[email protected]>
For now, we can change the address of a direct ftrace_ops with
modify_ftrace_direct(). However, we can't change the functions that a
direct ftrace_ops filters on. Therefore, we introduce the function
reset_ftrace_direct_ips(), which resets the filtered functions of a
direct ftrace_ops.

This function works in the following steps:

1. pick out the functions in ips that don't exist in
   ops->func_hash->filter_hash and add them to a new hash.
2. add all the functions in the new ftrace_hash to direct_functions with
   ftrace_direct_update().
3. reset the filtered functions of the ftrace_ops to ips with
   ftrace_set_filter_ips().
4. remove the functions that are in the old ftrace_hash, but not in the
   new ftrace_hash, from direct_functions.
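
A hedged sketch of how a caller might use the new helper; the exact
prototype below is an assumption modeled on ftrace_set_filter_ips(), not
taken from the patch:

  /* assumed prototype, mirroring ftrace_set_filter_ips() */
  int reset_ftrace_direct_ips(struct ftrace_ops *ops, unsigned long *ips,
                              unsigned int cnt);

  /* e.g. change the set of functions a direct ops is attached to,
   * without unregistering and re-registering the ops
   */
  static int update_direct_filter(struct ftrace_ops *ops,
                                  unsigned long *new_ips, unsigned int cnt)
  {
          /* swap the ops' filter to exactly new_ips, updating
           * direct_functions as described in the steps above
           */
          return reset_ftrace_direct_ips(ops, new_ips, cnt);
  }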

Signed-off-by: Menglong Dong <[email protected]>
Introduce the struct bpf_gtramp_link, which is used to attach a bpf prog
to multiple functions. Meanwhile, introduce the corresponding functions
bpf_gtrampoline_{link,unlink}_prog.

The lock global_tr_lock is held during global trampoline link and unlink.
Why do we define global_tr_lock as a rw_semaphore? Well, it should be a
mutex here, but we will use the rw_semaphore in a later patch for the
trampoline override case :/

When unlinking the global trampoline link, we mark all the functions in
the bpf_gtramp_link with KFUNC_MD_FL_BPF_REMOVING and update the global
trampoline with bpf_gtrampoline_update(). If this is the last bpf prog
in the kfunc_md, the function will be removed from the filter_hash of the
ftrace_ops of bpf_global_trampoline. Then, we remove the bpf prog from
the kfunc_md, and free the kfunc_md if necessary.

Signed-off-by: Menglong Dong <[email protected]>
In this commit, we add the 'accessed_args' field to struct bpf_prog_aux,
which is used to record the indexes of the function args accessed in
btf_ctx_access().

Meanwhile, we add the function btf_check_func_part_match() to compare the
accessed function args of two function prototypes. This function will be
used in a following commit.

Signed-off-by: Menglong Dong <[email protected]>
Refactor the struct modules_array into the more general struct ptr_array,
which is used to store pointers.

Meanwhile, introduce bpf_try_add_ptr(), which checks whether the ptr
already exists in the array before adding it.

It seems this should live in some file under "lib", but I'm not sure
where to add it yet, so let's move it to kernel/bpf/syscall.c for now.
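
A minimal sketch of the ptr_array / bpf_try_add_ptr() idea described
above; the field names and the growth step are assumptions, not the exact
code from the patch:

  struct ptr_array {
          void **ptrs;
          int cnt;
          int cap;
  };

  /* add ptr to the array only if it is not already there */
  static int bpf_try_add_ptr(struct ptr_array *arr, void *ptr)
  {
          int i;

          for (i = 0; i < arr->cnt; i++) {
                  if (arr->ptrs[i] == ptr)
                          return -EEXIST;
          }

          if (arr->cnt == arr->cap) {
                  void **tmp;

                  tmp = krealloc_array(arr->ptrs, arr->cap + 16,
                                       sizeof(*tmp), GFP_KERNEL);
                  if (!tmp)
                          return -ENOMEM;
                  arr->ptrs = tmp;
                  arr->cap += 16;
          }

          arr->ptrs[arr->cnt++] = ptr;
          return 0;
  }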

Signed-off-by: Menglong Dong <[email protected]>
Add a target btf argument to bpf_check_attach_target(), so that the
caller can specify the btf to check against.

Signed-off-by: Menglong Dong <[email protected]>
Move the checking of btf_id_deny and noreturn_deny from
check_attach_btf_id() to bpf_check_attach_target(). Therefore, we can do
this checking during attach for tracing multi-link in later patches.

Signed-off-by: Menglong Dong <[email protected]>
Factor out the function __arch_get_bpf_regs_nr() to get the number of
regs used by the function args.

arch_get_bpf_regs_nr() will return -ENOTSUPP if there are not enough regs
to hold the function args.
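
A hedged sketch of what this could look like on x86_64, where at most 6
registers carry args; the prototypes are assumptions based on the
description above:

  /* count how many argument registers the function model needs */
  static int __arch_get_bpf_regs_nr(const struct btf_func_model *m)
  {
          int i, nr_regs = 0;

          for (i = 0; i < m->nr_args; i++)
                  /* args wider than 8 bytes occupy two registers */
                  nr_regs += (m->arg_size[i] + 7) / 8;

          return nr_regs;
  }

  int arch_get_bpf_regs_nr(const struct btf_func_model *m)
  {
          int nr_regs = __arch_get_bpf_regs_nr(m);

          /* only 6 registers are used for args on x86_64 */
          return nr_regs > 6 ? -ENOTSUPP : nr_regs;
  }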

Signed-off-by: Menglong Dong <[email protected]>
In this commit, we add support for attaching a tracing BPF program to
multiple hooks, which is similar to BPF_TRACE_KPROBE_MULTI.

The use case is obvious. For now, we have to create a BPF program for
each kernel function that we want to trace, even though all the programs
have the same (or similar) logic. This can consume extra memory and make
program loading slow if we have plenty of kernel functions to trace.
KPROBE_MULTI may be an alternative, but it can't do what TRACING does.
For example, a kretprobe can't obtain the function args, but FEXIT can.

For now, we support creating a multi-link for fentry/fexit/modify_return
with the following new attach types that we introduce:

  BPF_TRACE_FENTRY_MULTI
  BPF_TRACE_FEXIT_MULTI
  BPF_MODIFY_RETURN_MULTI

We introduce the struct bpf_tracing_multi_link for this purpose, which
can hold all the kernel modules, the target bpf program (for attaching to
a bpf program) or the target btf (for attaching to kernel functions) that
we reference.

During loading, the first target is used for verification by the
verifier. And during attaching, we check the consistency of all the
targets with the first target.
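
As a rough illustration of the container described above, such a link
could look something like the sketch below; the field names are
assumptions, not the actual layout from the patch:

  struct bpf_tracing_multi_link {
          struct bpf_link link;          /* embedded generic bpf link */
          enum bpf_attach_type attach_type;
          struct module **mods;          /* referenced kernel modules */
          u32 mod_cnt;
          struct bpf_prog *tgt_prog;     /* target prog, when attaching to a bpf program */
          struct btf *btf;               /* target btf, when attaching to kernel functions */
  };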

Signed-off-by: Menglong Dong <[email protected]>
Factor out __unregister_ftrace_direct, which doesn't hold the direct_mutex
lock.

Signed-off-by: Menglong Dong <[email protected]>
Introduce the function replace_ftrace_direct(). This is used to replace
the direct ftrace_ops for a function, and will be used in the next patch.

Let's call the original ftrace_ops A, and the new ftrace_ops B. First, we
register B directly, and the callbacks of the functions in A and B will
fall back to the ftrace_ops_list case.

Then, we modify the address of the entry in direct_functions to
B->direct_call, and remove it from A. This updates the dyn_rec and makes
the functions call B->direct_call directly. If no functions remain in
A->filter_hash, just unregister it.

So a record can have more than one direct ftrace_ops, and we need to
check whether any direct ops remain for the record before removing
FTRACE_OPS_FL_DIRECT in __ftrace_hash_rec_update().

Signed-off-by: Menglong Dong <[email protected]>
For now, the bpf global trampoline can't work together with the bpf
trampoline. For example, attaching FENTRY_MULTI to a function where
FENTRY already exists will fail, and attaching FENTRY will also fail if
FENTRY_MULTI exists.

We make the global trampoline work together with the trampoline in this
commit.

It is not easy. The most difficult part is the synchronization between
bpf_gtrampoline_link_prog and bpf_trampoline_link_prog, and we use a
rw_semaphore here, which is quite ugly. We hold the write lock in
bpf_gtrampoline_link_prog and the read lock in bpf_trampoline_link_prog.

We introduce the function bpf_gtrampoline_link_tramp() to make
bpf_gtramp_link fit the bpf_trampoline; it is called in
bpf_gtrampoline_link_prog(). If the bpf_trampoline of the function exists
in the kfunc_md, or we find it with bpf_trampoline_lookup_exist(), it
means that we need to do the fitting. The fitting is simple: we create a
bpf_shim_tramp_link for our prog and link it to the bpf_trampoline with
__bpf_trampoline_link_prog().

The bpf_trampoline_link_prog() case is a little more complex. We create a
bpf_shim_tramp_link for each bpf prog in the kfunc_md and add them to the
bpf_trampoline before we call __bpf_trampoline_link_prog() in
bpf_gtrampoline_replace(). And we fall back in
bpf_gtrampoline_replace_finish() if an error is returned by
__bpf_trampoline_link_prog().

In __bpf_gtrampoline_unlink_prog(), we call bpf_gtrampoline_remove() to
release the bpf_shim_tramp_link, and the bpf prog will be unlinked in
bpf_link_free() if it was ever linked successfully.

Another solution is to fit into the existing trampoline. For example,
when we attach a tracing bpf prog, we can add it to the kfunc_md if a
tracing_multi bpf prog is already attached to the target function. And we
can also add the tracing_multi prog to the trampoline if a tracing prog
already exists on the target function. I think this would make the
compatibility much easier.

The code in this part is very ugly and messy, and I think it would be a
relief to split it out into another series :/

Signed-off-by: Menglong Dong <[email protected]>
By default, the kernel btfs that we load during program loading are freed
after the programs are loaded in bpf_object_load(). However, we still
need these btfs for tracing multi-link during attach. Therefore, if any
bpf programs of the multi-link tracing type exist, we don't free the btfs
until the bpf object is closed.

Meanwhile, introduce the new API bpf_object__free_btf() to manually free
the btfs after attaching.
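
A hedged user-space sketch of the intended usage; bpf_object__free_btf()
is the new API named above, while the object file name and the attach
step are placeholders:

  struct bpf_object *obj;

  obj = bpf_object__open_file("tracing_multi.bpf.o", NULL);
  if (!obj || bpf_object__load(obj))
          return -1;

  /* ... attach the multi-link tracing programs here ... */

  /* the kernel btfs were kept alive for the multi-link attach;
   * release them once attaching is done
   */
  bpf_object__free_btf(obj);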

Signed-off-by: Menglong Dong <[email protected]>
Add support for the following attach types:

BPF_TRACE_FENTRY_MULTI
BPF_TRACE_FEXIT_MULTI
BPF_MODIFY_RETURN_MULTI

Signed-off-by: Menglong Dong <[email protected]>
For now, libbpf finds a btf type id by looping over all the btf types and
comparing names, which is inefficient if we have many functions to look
up.

We add a "use_hash" argument to find_kernel_btf_id() to indicate whether
we should look up the btf type id via a hash table. The hash table is
initialized if it hasn't been yet.

Signed-off-by: Menglong Dong <[email protected]>
We add the skip_invalid and attach_tracing options to tracing_multi for
the selftests.

When we try to attach all the functions in available_filter_functions
with tracing_multi, we can't tell whether a target symbol can be attached
successfully, and the attach will fail. When skip_invalid is set to true,
libbpf will check whether each symbol can be attached and skip the
invalid entries.

We will skip the symbols in the following cases:

1. the btf type doesn't exist
2. the btf type is not a function proto
3. the function has more than 6 args
4. the return type is a struct or union
5. any of the function args is a struct or union

The 5th rule may wrongly skip some attachable symbols, but that's OK for
the testing.

"attach_tracing" is used to convert a TRACING prog to TRACING_MULTI. For
example, we can set the attach type to FENTRY_MULTI before we load the
skel. And we can attach the prog with
bpf_program__attach_trace_multi_opts() with "attach_tracing=1". The libbpf
will attach the target btf type of the prog automatically. This is also
used to reuse the selftests of tracing.
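
A hedged sketch of what the selftest-side usage could look like; the opts
struct name and field spelling below are assumptions based on the
description above, not a confirmed libbpf API:

  /* prog was loaded with its attach type set to BPF_TRACE_FENTRY_MULTI */
  LIBBPF_OPTS(bpf_trace_multi_opts, opts,
          .attach_tracing = true,   /* reuse the prog's own target btf id */
          .skip_invalid = true,     /* silently skip symbols that can't attach */
  );
  struct bpf_link *link;

  link = bpf_program__attach_trace_multi_opts(prog, &opts);
  if (!link)
          return -errno;            /* libbpf sets errno on failure */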

(Oh my goodness! What am I doing?)

Signed-off-by: Menglong Dong <[email protected]>
The glob_match() in test_progs.c has almost the same logic as the
glob_match() in libbpf.c, so we replace it to keep the code simple.

Signed-off-by: Menglong Dong <[email protected]>
We sometimes need to get all the kernel functions that can be traced, so
we move get_syms() and get_addrs() from kprobe_multi_test.c to
test_progs.c and rename them to bpf_get_ksyms() and bpf_get_addrs().

Signed-off-by: Menglong Dong <[email protected]>
In this commit, we add some testcases for the following attach types:

BPF_TRACE_FENTRY_MULTI
BPF_TRACE_FEXIT_MULTI
BPF_MODIFY_RETURN_MULTI

We reuse the tests in fentry_test.c, fexit_test.c and modify_return.c by
attaching the tracing bpf progs as tracing_multi.

We also add some functions that tracing progs should skip to
bpf_get_ksyms() in this commit.

Signed-off-by: Menglong Dong <[email protected]>
Add a testcase for the performance of the tracing bpf progs. In this
testcase, bpf_fentry_test1() is called 10000000 times in
bpf_testmod_bench_run, and the time consumed is returned. The following
cases are considered:

- nop: nothing is attached to bpf_fentry_test1()
- fentry: an empty FENTRY bpf program is attached to bpf_fentry_test1()
- fentry_multi_single: an empty FENTRY_MULTI bpf program is attached to
  bpf_fentry_test1()
- fentry_multi_all: an empty FENTRY_MULTI bpf program is attached to all
  the kernel functions
- kprobe_multi_single: an empty KPROBE_MULTI bpf program is attached to
  bpf_fentry_test1()
- kprobe_multi_all: an empty KPROBE_MULTI bpf program is attached to all
  the kernel functions

And we can get the result by running:

  ./test_progs -t tracing_multi_bench -v | grep time

Signed-off-by: Menglong Dong <[email protected]>
@kernel-patches-daemon-bpf-rc
Author

Upstream branch: c5cebb2
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=966845
version: 1

@kernel-patches-daemon-bpf-rc
Author

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=966845 expired. Closing PR.
