|
2 | 2 | BPF Iterators
|
3 | 3 | =============
|
4 | 4 |
|
| 5 | +-------- |
| 6 | +Overview |
| 7 | +-------- |
| 8 | + |
| 9 | +BPF supports two separate entities collectively known as "BPF iterators": BPF |
| 10 | +iterator *program type* and *open-coded* BPF iterators. The former is |
| 11 | +a stand-alone BPF program type which, when attached and activated by user, |
| 12 | +will be called once for each entity (task_struct, cgroup, etc) that is being |
| 13 | +iterated. The latter is a set of BPF-side APIs implementing iterator |
| 14 | +functionality and available across multiple BPF program types. Open-coded |
| 15 | +iterators provide similar functionality to BPF iterator programs, but give |
| 16 | +more flexibility and control to all other BPF program types. BPF iterator |
| 17 | +programs, on the other hand, can be used to implement anonymous or BPF |
| 18 | +FS-mounted special files, whose contents are generated by attached BPF iterator |
| 19 | +program, backed by seq_file functionality. Both are useful depending on |
| 20 | +specific needs. |
| 21 | + |
| 22 | +When adding a new BPF iterator program, it is expected that similar |
| 23 | +functionality will be added as open-coded iterator for maximum flexibility. |
| 24 | +It's also expected that iteration logic and code will be maximally shared and |
| 25 | +reused between two iterator API surfaces. |
5 | 26 |
|
6 |
| ----------- |
7 |
| -Motivation |
8 |
| ----------- |
| 27 | +------------------------ |
| 28 | +Open-coded BPF Iterators |
| 29 | +------------------------ |
| 30 | + |
| 31 | +Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs |
| 32 | +(constructor, next element fetch, destructor) and iterator-specific type |
| 33 | +describing on-the-stack iterator state, which is guaranteed by the BPF |
| 34 | +verifier to not be tampered with outside of the corresponding |
| 35 | +constructor/destructor/next APIs. |
| 36 | + |
| 37 | +Each kind of open-coded BPF iterator has its own associated |
| 38 | +struct bpf_iter_<type>, where <type> denotes a specific type of iterator. |
| 39 | +bpf_iter_<type> state needs to live on BPF program stack, so make sure it's |
| 40 | +small enough to fit on BPF stack. For performance reasons it's best to avoid |
| 41 | +dynamic memory allocation for iterator state and size the state struct big |
| 42 | +enough to fit everything necessary. But if necessary, dynamic memory |
| 43 | +allocation is a way to bypass BPF stack limitations. Note that the state |
| 44 | +struct size is part of the iterator's user-visible API: changing it will |
| 45 | +break backwards compatibility, so be deliberate about designing it. |
| 46 | + |
| 47 | +All kfuncs (constructor, next, destructor) have to be named consistently as |
| 48 | +bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator |
| 49 | +type, and iterator state should be represented as a matching |
| 50 | +`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have |
| 51 | +a pointer to this `struct bpf_iter_<type>` as the very first argument. |
| 52 | + |
| 53 | +Additionally: |
| 54 | + - Constructor, i.e., `bpf_iter_<type>_new()`, can have an arbitrary number |
| 55 | + of extra arguments. Return type is not enforced either. |
| 56 | + - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer |
| 57 | + type and should have exactly one argument: `struct bpf_iter_<type> *` |
| 58 | + (const/volatile/restrict and typedefs are ignored). |
| 59 | + - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and |
| 60 | + should have exactly one argument, similar to the next method. |
| 61 | + - `struct bpf_iter_<type>` size is enforced to be positive and |
| 62 | + a multiple of 8 bytes (to fit stack slots correctly). |
| 63 | + |
| 64 | +Such strictness and consistency make it possible to build generic helpers |
| 65 | +that abstract away important, but boilerplate, details, so open-coded iterators |
| 66 | +can be used effectively and ergonomically (see libbpf's bpf_for_each() macro). |
| 67 | +These rules are enforced by the kernel at kfunc registration time. |
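To illustrate why the consistent naming matters, here is a minimal userspace sketch, not kernel or libbpf code. It assumes a hypothetical iterator type `count` and a hypothetical macro `iter_for_each()` that is only loosely modeled on the idea behind libbpf's bpf_for_each(). It relies on the GCC/Clang `cleanup` attribute and `##__VA_ARGS__` extensions.

```c
#include <stddef.h>

/* Hypothetical iterator state: yields 0, 1, ..., end - 1. */
struct bpf_iter_count {
	int cur; /* last value handed out, -1 before the first one */
	int end; /* exclusive upper bound */
};

/* Counts destructor invocations, for demonstration only. */
int destroy_calls;

int bpf_iter_count_new(struct bpf_iter_count *it, int end)
{
	it->cur = -1;
	it->end = end;
	return 0;
}

int *bpf_iter_count_next(struct bpf_iter_count *it)
{
	if (it->cur + 1 >= it->end)
		return NULL; /* state is unchanged, so later calls return NULL too */
	it->cur++;
	return &it->cur;
}

void bpf_iter_count_destroy(struct bpf_iter_count *it)
{
	(void)it; /* nothing to free in this sketch */
	destroy_calls++;
}

/*
 * Because every iterator follows the bpf_iter_<type>_{new,next,destroy}
 * naming scheme, one macro can drive any of them via token pasting.
 * The cleanup attribute runs the destructor when the loop scope ends,
 * including on break or return.
 */
#define iter_for_each(type, cur, ...)					\
	for (struct bpf_iter_##type it_					\
		__attribute__((cleanup(bpf_iter_##type##_destroy))),	\
	     *unused_ __attribute__((unused)) =				\
		(bpf_iter_##type##_new(&it_, ##__VA_ARGS__), NULL);	\
	     ((cur) = bpf_iter_##type##_next(&it_)) != NULL;)

/* Example use: sum all values in [0, end). */
int sum_below(int end)
{
	int *v, sum = 0;

	iter_for_each(count, v, end)
		sum += *v;
	return sum;
}
```

The caller never names the constructor, next method, or destructor directly; the macro derives all three from the `count` token, which is the kind of boilerplate hiding that the strict naming rules make possible.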
| 68 | + |
| 69 | +Constructor/next/destructor implementation contract is as follows: |
| 70 | + - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on |
| 71 | + the stack. If any of the input arguments are invalid, constructor should |
| 72 | + make sure to still initialize it such that subsequent next() calls will |
| 73 | + return NULL. I.e., on error, *return error and construct empty iterator*. |
| 74 | + Constructor kfunc is marked with KF_ITER_NEW flag. |
| 75 | + |
| 76 | + - next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state |
| 77 | + and produces an element. Next method should always return a pointer. The |
| 78 | + contract with the BPF verifier is that next method *guarantees* that it |
| 79 | + will eventually return NULL when elements are exhausted. Once NULL is |
| 80 | + returned, subsequent next calls *should keep returning NULL*. Next method |
| 81 | + is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as |
| 82 | + NULL-returning kfunc, of course). |
| 83 | + |
| 84 | + - destructor, `bpf_iter_<type>_destroy()`, is always called once, even if |
| 85 | + constructor failed or next returned nothing. Destructor frees up any |
| 86 | + resources and marks stack space used by `struct bpf_iter_<type>` as usable |
| 87 | + for something else. Destructor is marked with KF_ITER_DESTROY flag. |
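The contract above can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel implementation: the `demo` iterator type, its non-negative `[start, end)` range semantics, and the `demo_sum()` helper are all hypothetical, and the KF_ITER_* flags and verifier enforcement have no userspace equivalent here.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical iterator state; in the kernel it would live on the BPF stack. */
struct bpf_iter_demo {
	int cur; /* last value handed out */
	int end; /* exclusive upper bound */
};

/* Mirrors the documented rule: size is positive and a multiple of 8 bytes. */
_Static_assert(sizeof(struct bpf_iter_demo) > 0 &&
	       sizeof(struct bpf_iter_demo) % 8 == 0,
	       "iterator state must cover whole 8-byte stack slots");

/*
 * Constructor: always initializes the state. On invalid input it returns
 * an error AND leaves behind an empty iterator.
 */
int bpf_iter_demo_new(struct bpf_iter_demo *it, int start, int end)
{
	if (start < 0 || start > end) {
		it->cur = it->end = 0; /* empty: next() returns NULL right away */
		return -EINVAL;
	}
	it->cur = start - 1;
	it->end = end;
	return 0;
}

/* Next: returns a pointer to the element, or NULL once exhausted. */
int *bpf_iter_demo_next(struct bpf_iter_demo *it)
{
	if (it->cur + 1 >= it->end) {
		it->cur = it->end = 0; /* pin state so NULL keeps being returned */
		return NULL;
	}
	it->cur++;
	return &it->cur;
}

/* Destructor: called once per constructed iterator, even after an error. */
void bpf_iter_demo_destroy(struct bpf_iter_demo *it)
{
	it->cur = it->end = 0; /* nothing to free in this sketch */
}

/* Typical call sequence: new, next until NULL, destroy. */
int demo_sum(int start, int end)
{
	struct bpf_iter_demo it;
	int *v, sum = 0;

	/* The return value may be ignored: a failed constructor yields nothing. */
	bpf_iter_demo_new(&it, start, end);
	while ((v = bpf_iter_demo_next(&it)) != NULL)
		sum += *v;
	bpf_iter_demo_destroy(&it);
	return sum;
}
```

Because a failed constructor still produces a valid empty iterator, the caller's loop and destructor call look the same on the success and error paths.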
| 88 | + |
| 89 | +Any open-coded BPF iterator implementation has to implement at least these |
| 90 | +three methods. It is enforced that for any given type of iterator only |
| 91 | +applicable constructor/destructor/next are callable. I.e., verifier ensures |
| 92 | +you can't pass number iterator state into, say, cgroup iterator's next method. |
| 93 | + |
| 94 | +From a 10,000-foot BPF verification point of view, next methods are the points |
| 95 | +of forking a verification state, which are conceptually similar to what |
| 96 | +verifier is doing when validating conditional jumps. Verifier branches at the |
| 97 | +`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL |
| 98 | +(iteration is done) and non-NULL (new element is returned). NULL is simulated |
| 99 | +first and is supposed to reach exit without looping. After that non-NULL case |
| 100 | +is validated and it either reaches exit (for trivial examples with no real |
| 101 | +loop), or reaches another `call bpf_iter_<type>_next` instruction with the |
| 102 | +state equivalent to already (partially) validated one. State equivalency at |
| 103 | +that point means we technically are going to be looping forever without |
| 104 | +"breaking out" of established "state envelope" (i.e., subsequent |
| 105 | +iterations don't add any new knowledge or constraints to the verifier state, |
| 106 | +so running 1, 2, 10, or a million of them doesn't matter). But taking into |
| 107 | +account the contract stating that iterator next method *has to* return NULL |
| 108 | +eventually, we can conclude that loop body is safe and will eventually |
| 109 | +terminate. Given we validated logic outside of the loop (NULL case), and |
| 110 | +concluded that loop body is safe (though potentially looping many times), |
| 111 | +verifier can claim safety of the overall program logic. |
| 112 | + |
| 113 | +------------------------ |
| 114 | +BPF Iterators Motivation |
| 115 | +------------------------ |
9 | 116 |
|
10 | 117 | There are a few existing ways to dump kernel data into user space. The most
|
11 | 118 | popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
|
@@ -323,8 +430,8 @@ Now, in the userspace program, pass the pointer of struct to the
|
323 | 430 |
|
324 | 431 | ::
|
325 | 432 |
|
326 |
| - link = bpf_program__attach_iter(prog, &opts); iter_fd = |
327 |
| - bpf_iter_create(bpf_link__fd(link)); |
| 433 | + link = bpf_program__attach_iter(prog, &opts); |
| 434 | + iter_fd = bpf_iter_create(bpf_link__fd(link)); |
328 | 435 |
|
329 | 436 | If both *tid* and *pid* are zero, an iterator created from this struct
|
330 | 437 | ``bpf_iter_attach_opts`` will include every opened file of every task in the
|
|