Commit 053c8e1

borkmann authored and Alexei Starovoitov committed
bpf: Add generic attach/detach/query API for multi-progs
This adds a generic layer called bpf_mprog which can be reused by different attachment layers to enable multi-program attachment and dependency resolution. In-kernel users of the bpf_mprog don't need to care about the dependency resolution internals, they can just consume it with few API calls.

The initial idea of having a generic API sparked out of discussion [0] from an earlier revision of this work where tc's priority was reused and exposed via BPF uapi as a way to coordinate dependencies among tc BPF programs, similar as-is for classic tc BPF. The feedback was that priority provides a bad user experience and is hard to use [1], e.g.:

  I cannot help but feel that priority logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". [...] Priority gets exposed everywhere in uapi all the way to bpftool when it's right there for users to understand. And that's the main problem with it.

  The user don't want to and don't need to be aware of it, but uapi forces them to pick the priority. [...] Your cover letter [0] example proves that in real life different service pick the same priority. They simply don't know any better. Priority is an unnecessary magic that apps _have_ to pick, so they just copy-paste and everyone ends up using the same.

The course of the discussion showed more and more the need for a generic, reusable API where the "same look and feel" can be applied for various other program types beyond just tc BPF, for example XDP today does not have multi-program support in kernel, but also there was interest around this API for improving management of cgroup program types. Such common multi-program management concept is useful for BPF management daemons or user space BPF applications coordinating internally about their attachments.

Both from Cilium and Meta side [2], we've collected the following requirements for a generic attach/detach/query API for multi-progs which has been implemented as part of this work:

  - Support prog-based attach/detach and link API

  - Dependency directives (can also be combined):
    - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
      - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user space application does not need CAP_SYS_ADMIN to retrieve foreign fds via bpf_*_get_fd_by_id()
      - BPF_F_LINK flag as {prog,link} toggle
      - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and BPF_F_AFTER will just append for attaching
      - Enforced only at attach time
    - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their own infra for replacing their internal prog
    - If no flags are set, then it's default append behavior for attaching

  - Internal revision counter and optionally being able to pass expected_revision

  - User space application can query current state with revision, and pass it along for attachment to assert current state before doing updates

  - Query also gets extension for link_ids array and link_attach_flags:
    - prog_ids are always filled with program IDs
    - link_ids are filled with link IDs when link was used, otherwise 0
    - {prog,link}_attach_flags for holding {prog,link}-specific flags

  - Must be easy to integrate/reuse for in-kernel users

The uapi-side changes needed for supporting bpf_mprog are rather minimal, consisting of the additions of the attachment flags, revision counter, and expanding existing union with relative_{fd,id} member.

The bpf_mprog framework consists of an bpf_mprog_entry object which holds an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path structure) is part of bpf_mprog_bundle. Both have been separated, so that fast-path gets efficient packing of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen instead of linked list or other structures to remove unnecessary indirections for a fast point-to-entry in tc for BPF.

The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry is populated and then just swapped which avoids additional allocations that could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to dynamic allocation if necessary at a point in future. Locking is deferred to the in-kernel user of bpf_mprog, for example, in case of tcx which uses this API in the next patch, it piggybacks on rtnl.

An extensive test suite for checking all aspects of this API for prog-based attach/detach and link API comes as BPF selftests in this series.

Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog management.

  [0] https://lore.kernel.org/bpf/[email protected]
  [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
  [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
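
The placement rules listed in the requirements can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel's resolution code in kernel/bpf/mprog.c: programs are reduced to plain integer IDs, `struct mprog_model`, `model_find()`, `model_attach()` and `MODEL_MAX` are hypothetical names, the error codes are illustrative choices, and combining BPF_F_BEFORE with BPF_F_AFTER is simply rejected here to keep the model short. Only the two flag values are taken from the uapi change in this commit.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Flag values copied from the uapi hunk of this commit. */
#define MODEL_F_BEFORE	(1U << 3)
#define MODEL_F_AFTER	(1U << 4)

#define MODEL_MAX	8	/* hypothetical capacity for the model */

struct mprog_model {
	unsigned int ids[MODEL_MAX];	/* attached program IDs, in run order */
	int count;
};

/* Return the index of @id, or -1 if it is not attached. */
static int model_find(const struct mprog_model *m, unsigned int id)
{
	for (int i = 0; i < m->count; i++)
		if (m->ids[i] == id)
			return i;
	return -1;
}

/*
 * Insert @new_id according to @flags; @relative_id == 0 means "none".
 * Returns the index used, or a negative errno-style value (the specific
 * error codes are assumptions of this model).
 */
static int model_attach(struct mprog_model *m, unsigned int new_id,
			unsigned int flags, unsigned int relative_id)
{
	int idx, rel;

	if (flags & ~(MODEL_F_BEFORE | MODEL_F_AFTER))
		return -EINVAL;
	if ((flags & MODEL_F_BEFORE) && (flags & MODEL_F_AFTER))
		return -EINVAL;	/* combined directives are out of scope here */
	if (relative_id && !flags)
		return -EINVAL;	/* a relative object needs a direction */
	if (m->count >= MODEL_MAX)
		return -ERANGE;
	if (model_find(m, new_id) >= 0)
		return -EEXIST;

	if (relative_id) {
		rel = model_find(m, relative_id);
		if (rel < 0)
			return -ENOENT;
		/* Directly before or directly after the relative program. */
		idx = (flags & MODEL_F_BEFORE) ? rel : rel + 1;
	} else {
		/* No relative: BEFORE prepends, AFTER or no flags append. */
		idx = (flags & MODEL_F_BEFORE) ? 0 : m->count;
	}

	/* Open a gap at @idx, as an array-based layout requires. */
	memmove(&m->ids[idx + 1], &m->ids[idx],
		(size_t)(m->count - idx) * sizeof(m->ids[0]));
	m->ids[idx] = new_id;
	m->count++;
	return idx;
}
```

In this model, attaching IDs 10 and 20 without flags, then 30 with `MODEL_F_BEFORE` and no relative, then 40 with `MODEL_F_AFTER` relative to 10, leaves the order 30, 10, 40, 20.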
1 parent 3226e31 commit 053c8e1

6 files changed, +821 -17 lines changed

MAINTAINERS

Lines changed: 1 addition & 0 deletions
@@ -3684,6 +3684,7 @@ F: include/linux/filter.h
 F:	include/linux/tnum.h
 F:	kernel/bpf/core.c
 F:	kernel/bpf/dispatcher.c
+F:	kernel/bpf/mprog.c
 F:	kernel/bpf/syscall.c
 F:	kernel/bpf/tnum.c
 F:	kernel/bpf/trampoline.c

include/linux/bpf_mprog.h

Lines changed: 318 additions & 0 deletions
@@ -0,0 +1,318 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Isovalent */
+#ifndef __BPF_MPROG_H
+#define __BPF_MPROG_H
+
+#include <linux/bpf.h>
+
+/* bpf_mprog framework:
+ *
+ * bpf_mprog is a generic layer for multi-program attachment. In-kernel users
+ * of the bpf_mprog don't need to care about the dependency resolution
+ * internals, they can just consume it with few API calls. Currently available
+ * dependency directives are BPF_F_{BEFORE,AFTER} which enable insertion of
+ * a BPF program or BPF link relative to an existing BPF program or BPF link
+ * inside the multi-program array as well as prepend and append behavior if
+ * no relative object was specified, see corresponding selftests for concrete
+ * examples (e.g. tc_links and tc_opts test cases of test_progs).
+ *
+ * Usage of bpf_mprog_{attach,detach,query}() core APIs with pseudo code:
+ *
+ *  Attach case:
+ *
+ *   struct bpf_mprog_entry *entry, *entry_new;
+ *   int ret;
+ *
+ *   // bpf_mprog user-side lock
+ *   // fetch active @entry from attach location
+ *   [...]
+ *   ret = bpf_mprog_attach(entry, &entry_new, [...]);
+ *   if (!ret) {
+ *       if (entry != entry_new) {
+ *           // swap @entry to @entry_new at attach location
+ *           // ensure there are no inflight users of @entry:
+ *           synchronize_rcu();
+ *       }
+ *       bpf_mprog_commit(entry);
+ *   } else {
+ *       // error path, bail out, propagate @ret
+ *   }
+ *   // bpf_mprog user-side unlock
+ *
+ *  Detach case:
+ *
+ *   struct bpf_mprog_entry *entry, *entry_new;
+ *   int ret;
+ *
+ *   // bpf_mprog user-side lock
+ *   // fetch active @entry from attach location
+ *   [...]
+ *   ret = bpf_mprog_detach(entry, &entry_new, [...]);
+ *   if (!ret) {
+ *       // all (*) marked is optional and depends on the use-case
+ *       // whether bpf_mprog_bundle should be freed or not
+ *       if (!bpf_mprog_total(entry_new))     (*)
+ *           entry_new = NULL                 (*)
+ *       // swap @entry to @entry_new at attach location
+ *       // ensure there are no inflight users of @entry:
+ *       synchronize_rcu();
+ *       bpf_mprog_commit(entry);
+ *       if (!entry_new)                      (*)
+ *           // free bpf_mprog_bundle         (*)
+ *   } else {
+ *       // error path, bail out, propagate @ret
+ *   }
+ *   // bpf_mprog user-side unlock
+ *
+ *  Query case:
+ *
+ *   struct bpf_mprog_entry *entry;
+ *   int ret;
+ *
+ *   // bpf_mprog user-side lock
+ *   // fetch active @entry from attach location
+ *   [...]
+ *   ret = bpf_mprog_query(attr, uattr, entry);
+ *   // bpf_mprog user-side unlock
+ *
+ *  Data/fast path:
+ *
+ *   struct bpf_mprog_entry *entry;
+ *   struct bpf_mprog_fp *fp;
+ *   struct bpf_prog *prog;
+ *   int ret = [...];
+ *
+ *   rcu_read_lock();
+ *   // fetch active @entry from attach location
+ *   [...]
+ *   bpf_mprog_foreach_prog(entry, fp, prog) {
+ *       ret = bpf_prog_run(prog, [...]);
+ *       // process @ret from program
+ *   }
+ *   [...]
+ *   rcu_read_unlock();
+ *
+ * bpf_mprog locking considerations:
+ *
+ * bpf_mprog_{attach,detach,query}() must be protected by an external lock
+ * (like RTNL in case of tcx).
+ *
+ * bpf_mprog_entry pointer can be an __rcu annotated pointer (in case of tcx
+ * the netdevice has tcx_ingress and tcx_egress __rcu pointer) which gets
+ * updated via rcu_assign_pointer() pointing to the active bpf_mprog_entry of
+ * the bpf_mprog_bundle.
+ *
+ * Fast path accesses the active bpf_mprog_entry within RCU critical section
+ * (in case of tcx it runs in NAPI which provides RCU protection there,
+ * other users might need explicit rcu_read_lock()). The bpf_mprog_commit()
+ * assumes that for the old bpf_mprog_entry there are no inflight users
+ * anymore.
+ *
+ * The READ_ONCE()/WRITE_ONCE() pairing for bpf_mprog_fp's prog access is for
+ * the replacement case where we don't swap the bpf_mprog_entry.
+ */
+
+#define bpf_mprog_foreach_tuple(entry, fp, cp, t)			\
+	for (fp = &entry->fp_items[0], cp = &entry->parent->cp_items[0];\
+	     ({								\
+		t.prog = READ_ONCE(fp->prog);				\
+		t.link = cp->link;					\
+		t.prog;							\
+	      });							\
+	     fp++, cp++)
+
+#define bpf_mprog_foreach_prog(entry, fp, p)				\
+	for (fp = &entry->fp_items[0];					\
+	     (p = READ_ONCE(fp->prog));					\
+	     fp++)
+
+#define BPF_MPROG_MAX 64
+
+struct bpf_mprog_fp {
+	struct bpf_prog *prog;
+};
+
+struct bpf_mprog_cp {
+	struct bpf_link *link;
+};
+
+struct bpf_mprog_entry {
+	struct bpf_mprog_fp fp_items[BPF_MPROG_MAX];
+	struct bpf_mprog_bundle *parent;
+};
+
+struct bpf_mprog_bundle {
+	struct bpf_mprog_entry a;
+	struct bpf_mprog_entry b;
+	struct bpf_mprog_cp cp_items[BPF_MPROG_MAX];
+	struct bpf_prog *ref;
+	atomic64_t revision;
+	u32 count;
+};
+
+struct bpf_tuple {
+	struct bpf_prog *prog;
+	struct bpf_link *link;
+};
+
+static inline struct bpf_mprog_entry *
+bpf_mprog_peer(const struct bpf_mprog_entry *entry)
+{
+	if (entry == &entry->parent->a)
+		return &entry->parent->b;
+	else
+		return &entry->parent->a;
+}
+
+static inline void bpf_mprog_bundle_init(struct bpf_mprog_bundle *bundle)
+{
+	BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
+	BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) !=
+		     ARRAY_SIZE(bundle->cp_items));
+
+	memset(bundle, 0, sizeof(*bundle));
+	atomic64_set(&bundle->revision, 1);
+	bundle->a.parent = bundle;
+	bundle->b.parent = bundle;
+}
+
+static inline void bpf_mprog_inc(struct bpf_mprog_entry *entry)
+{
+	entry->parent->count++;
+}
+
+static inline void bpf_mprog_dec(struct bpf_mprog_entry *entry)
+{
+	entry->parent->count--;
+}
+
+static inline int bpf_mprog_max(void)
+{
+	return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
+}
+
+static inline int bpf_mprog_total(struct bpf_mprog_entry *entry)
+{
+	int total = entry->parent->count;
+
+	WARN_ON_ONCE(total > bpf_mprog_max());
+	return total;
+}
+
+static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
+				    struct bpf_prog *prog)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *tmp;
+
+	bpf_mprog_foreach_prog(entry, fp, tmp) {
+		if (tmp == prog)
+			return true;
+	}
+	return false;
+}
+
+static inline void bpf_mprog_mark_for_release(struct bpf_mprog_entry *entry,
+					      struct bpf_tuple *tuple)
+{
+	WARN_ON_ONCE(entry->parent->ref);
+	if (!tuple->link)
+		entry->parent->ref = tuple->prog;
+}
+
+static inline void bpf_mprog_complete_release(struct bpf_mprog_entry *entry)
+{
+	/* In the non-link case prog deletions can only drop the reference
+	 * to the prog after the bpf_mprog_entry got swapped and the
+	 * bpf_mprog ensured that there are no inflight users anymore.
+	 *
+	 * Paired with bpf_mprog_mark_for_release().
+	 */
+	if (entry->parent->ref) {
+		bpf_prog_put(entry->parent->ref);
+		entry->parent->ref = NULL;
+	}
+}
+
+static inline void bpf_mprog_revision_new(struct bpf_mprog_entry *entry)
+{
+	atomic64_inc(&entry->parent->revision);
+}
+
+static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
+{
+	bpf_mprog_complete_release(entry);
+	bpf_mprog_revision_new(entry);
+}
+
+static inline u64 bpf_mprog_revision(struct bpf_mprog_entry *entry)
+{
+	return atomic64_read(&entry->parent->revision);
+}
+
+static inline void bpf_mprog_entry_copy(struct bpf_mprog_entry *dst,
+					struct bpf_mprog_entry *src)
+{
+	memcpy(dst->fp_items, src->fp_items, sizeof(src->fp_items));
+}
+
+static inline void bpf_mprog_entry_grow(struct bpf_mprog_entry *entry, int idx)
+{
+	int total = bpf_mprog_total(entry);
+
+	memmove(entry->fp_items + idx + 1,
+		entry->fp_items + idx,
+		(total - idx) * sizeof(struct bpf_mprog_fp));
+
+	memmove(entry->parent->cp_items + idx + 1,
+		entry->parent->cp_items + idx,
+		(total - idx) * sizeof(struct bpf_mprog_cp));
+}
+
+static inline void bpf_mprog_entry_shrink(struct bpf_mprog_entry *entry, int idx)
+{
+	/* Total array size is needed in this case to ensure the NULL
+	 * entry is copied at the end.
+	 */
+	int total = ARRAY_SIZE(entry->fp_items);
+
+	memmove(entry->fp_items + idx,
+		entry->fp_items + idx + 1,
+		(total - idx - 1) * sizeof(struct bpf_mprog_fp));
+
+	memmove(entry->parent->cp_items + idx,
+		entry->parent->cp_items + idx + 1,
+		(total - idx - 1) * sizeof(struct bpf_mprog_cp));
+}
+
+static inline void bpf_mprog_read(struct bpf_mprog_entry *entry, u32 idx,
+				  struct bpf_mprog_fp **fp,
+				  struct bpf_mprog_cp **cp)
+{
+	*fp = &entry->fp_items[idx];
+	*cp = &entry->parent->cp_items[idx];
+}
+
+static inline void bpf_mprog_write(struct bpf_mprog_fp *fp,
+				   struct bpf_mprog_cp *cp,
+				   struct bpf_tuple *tuple)
+{
+	WRITE_ONCE(fp->prog, tuple->prog);
+	cp->link = tuple->link;
+}
+
+int bpf_mprog_attach(struct bpf_mprog_entry *entry,
+		     struct bpf_mprog_entry **entry_new,
+		     struct bpf_prog *prog_new, struct bpf_link *link,
+		     struct bpf_prog *prog_old,
+		     u32 flags, u32 id_or_fd, u64 revision);
+
+int bpf_mprog_detach(struct bpf_mprog_entry *entry,
+		     struct bpf_mprog_entry **entry_new,
+		     struct bpf_prog *prog, struct bpf_link *link,
+		     u32 flags, u32 id_or_fd, u64 revision);
+
+int bpf_mprog_query(const union bpf_attr *attr, union bpf_attr __user *uattr,
+		    struct bpf_mprog_entry *entry);
+
+#endif /* __BPF_MPROG_H */
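
The paired-entry design of bpf_mprog_bundle can be hard to picture from the header alone, so here is a hedged user-space sketch of the idea. It assumes plain integers in place of bpf_prog pointers and leaves out RCU, locking and the control-path array entirely; `struct sketch_entry`, `struct sketch_bundle`, `sketch_peer()`, `sketch_append()` and `sketch_commit()` are hypothetical names that only mirror the shape of the helpers above, not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX 4	/* hypothetical size; last slot stays 0 as terminator */

struct sketch_bundle;

struct sketch_entry {
	int items[SKETCH_MAX];		/* stand-in for fp_items, 0 ends the list */
	struct sketch_bundle *parent;
};

struct sketch_bundle {
	struct sketch_entry a;
	struct sketch_entry b;
	uint64_t revision;
	int count;
};

/* Same shape as bpf_mprog_bundle_init(): zero everything, revision starts at 1. */
static void sketch_init(struct sketch_bundle *bundle)
{
	memset(bundle, 0, sizeof(*bundle));
	bundle->revision = 1;
	bundle->a.parent = bundle;
	bundle->b.parent = bundle;
}

/* Same shape as bpf_mprog_peer(): return the other entry of the pair. */
static struct sketch_entry *sketch_peer(struct sketch_entry *entry)
{
	return entry == &entry->parent->a ? &entry->parent->b
					  : &entry->parent->a;
}

/*
 * Append a non-zero @item by populating the inactive peer, leaving @active
 * untouched for concurrent readers. Returns the entry to publish, or NULL
 * when the array is full. No allocation happens on this path.
 */
static struct sketch_entry *sketch_append(struct sketch_entry *active, int item)
{
	struct sketch_bundle *bundle = active->parent;
	struct sketch_entry *peer = sketch_peer(active);

	if (bundle->count >= SKETCH_MAX - 1)
		return NULL;
	memcpy(peer->items, active->items, sizeof(peer->items));
	peer->items[bundle->count++] = item;
	return peer;
}

/* Called once the old entry has no readers left; bumps the revision. */
static void sketch_commit(struct sketch_entry *old)
{
	old->parent->revision++;
}
```

A caller would publish the returned entry at its attach location, wait for readers of the old entry to drain (synchronize_rcu() in the kernel's pseudo code), and only then call the commit step.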

include/uapi/linux/bpf.h

Lines changed: 28 additions & 8 deletions
@@ -1113,7 +1113,12 @@ enum bpf_perf_event_type {
  */
 #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
 #define BPF_F_ALLOW_MULTI	(1U << 1)
+/* Generic attachment flags. */
 #define BPF_F_REPLACE		(1U << 2)
+#define BPF_F_BEFORE		(1U << 3)
+#define BPF_F_AFTER		(1U << 4)
+#define BPF_F_ID		(1U << 5)
+#define BPF_F_LINK		BPF_F_LINK /* 1 << 13 */
 
 /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
  * verifier will perform strict alignment checking as if the kernel
@@ -1444,14 +1449,19 @@ union bpf_attr {
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
-		__u32		target_fd;	/* container object to attach to */
-		__u32		attach_bpf_fd;	/* eBPF program to attach */
+		union {
+			__u32	target_fd;	/* target object to attach to or ... */
+			__u32	target_ifindex;	/* target ifindex */
+		};
+		__u32		attach_bpf_fd;
 		__u32		attach_type;
 		__u32		attach_flags;
-		__u32		replace_bpf_fd;	/* previously attached eBPF
-						 * program to replace if
-						 * BPF_F_REPLACE is used
-						 */
+		__u32		replace_bpf_fd;
+		union {
+			__u32	relative_fd;
+			__u32	relative_id;
+		};
+		__u64		expected_revision;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
@@ -1497,16 +1507,26 @@ union bpf_attr {
 	} info;
 
 	struct { /* anonymous struct used by BPF_PROG_QUERY command */
-		__u32		target_fd;	/* container object to query */
+		union {
+			__u32	target_fd;	/* target object to query or ... */
+			__u32	target_ifindex;	/* target ifindex */
+		};
 		__u32		attach_type;
 		__u32		query_flags;
 		__u32		attach_flags;
 		__aligned_u64	prog_ids;
-		__u32		prog_cnt;
+		union {
+			__u32	prog_cnt;
+			__u32	count;
+		};
+		__u32		:32;
 		/* output: per-program attach_flags.
 		 * not allowed to be set during effective query.
 		 */
 		__aligned_u64	prog_attach_flags;
+		__aligned_u64	link_ids;
+		__aligned_u64	link_attach_flags;
+		__u64		revision;
 	} query;
 
 	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
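
The `revision` output of the query struct and the `expected_revision` input of the attach struct are meant to be used together: user space reads the current revision, then passes it along so the update only succeeds if nothing changed in between. The following is a minimal user-space model of that handshake. The part of this commit that performs the check (kernel/bpf/mprog.c) is not shown on this page, so the details here are assumptions: `struct rev_state`, `rev_query()` and `rev_attach()` are hypothetical names, treating 0 as "no expectation" is an assumption, and so is the choice of -ESTALE for a mismatch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct rev_state {
	uint64_t revision;	/* starts at 1, as in bpf_mprog_bundle_init() */
	int nprogs;		/* number of attached programs in this model */
};

/* Stand-in for a query: report the revision user space should remember. */
static uint64_t rev_query(const struct rev_state *state)
{
	return state->revision;
}

/*
 * Stand-in for an attach with an optional expected revision.
 * Assumption: 0 means "do not check"; a mismatch fails with -ESTALE.
 */
static int rev_attach(struct rev_state *state, uint64_t expected_revision)
{
	if (expected_revision && expected_revision != state->revision)
		return -ESTALE;

	state->nprogs++;
	state->revision++;	/* every committed update bumps the revision */
	return 0;
}
```

In this model a second attach that reuses a revision value from before the first attach fails, which is the signal for the caller to query again and retry.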
