kernel-patches
diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst
Lines changed: 112 additions & 5 deletions
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
Lines changed: 17 additions & 0 deletions
@@ -2,10 +2,117 @@
 BPF Iterators
 =============
 
+--------
+Overview
+--------
+
+BPF supports two separate entities collectively known as "BPF iterators": the
+BPF iterator *program type* and *open-coded* BPF iterators. The former is
+a stand-alone BPF program type which, when attached and activated by the user,
+is called once for each entity (task_struct, cgroup, etc.) being iterated.
+The latter is a set of BPF-side APIs that implement iterator functionality
+and are available across multiple BPF program types. Open-coded iterators
+provide functionality similar to BPF iterator programs, but give more
+flexibility and control to all other BPF program types. BPF iterator
+programs, on the other hand, can be used to implement anonymous or BPF
+FS-mounted special files whose contents are generated by the attached BPF
+iterator program, backed by seq_file functionality. Both are useful depending
+on specific needs.
+
+When adding a new BPF iterator program, it is expected that similar
+functionality will be added as an open-coded iterator for maximum flexibility.
+It's also expected that iteration logic and code will be maximally shared and
+reused between the two iterator API surfaces.
 
-----------
-Motivation
-----------
+------------------------
+Open-coded BPF Iterators
+------------------------
+
+Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
+(constructor, next element fetch, destructor) and an iterator-specific type
+describing the on-the-stack iterator state, which the BPF verifier guarantees
+cannot be tampered with outside of the corresponding
+constructor/destructor/next APIs.
+
+Each kind of open-coded BPF iterator has its own associated
+struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
+bpf_iter_<type> state needs to live on the BPF program stack, so make sure
+it's small enough to fit there. For performance reasons, it's best to avoid
+dynamic memory allocation for iterator state and to size the state struct big
+enough to fit everything necessary. If that is not feasible, dynamic memory
+allocation is a way to bypass BPF stack limitations. Note that the state
+struct size is part of the iterator's user-visible API, and changing it will
+break backwards compatibility, so be deliberate about designing it.
+
+All kfuncs (constructor, next, destructor) have to be named consistently as
+bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents the
+iterator type, and iterator state should be represented as a matching
+``struct bpf_iter_<type>`` state type. Also, all iter kfuncs should have
+a pointer to this ``struct bpf_iter_<type>`` as the very first argument.
+
+Additionally:
+  - Constructor, i.e., ``bpf_iter_<type>_new()``, can take an arbitrary number
+    of extra arguments. Its return type is not enforced either.
+  - Next method, i.e., ``bpf_iter_<type>_next()``, has to return a pointer
+    type and should have exactly one argument: ``struct bpf_iter_<type> *``
+    (const/volatile/restrict and typedefs are ignored).
+  - Destructor, i.e., ``bpf_iter_<type>_destroy()``, should return void and
+    should have exactly one argument, similar to the next method.
+  - ``struct bpf_iter_<type>`` size is enforced to be positive and
+    a multiple of 8 bytes (to fit stack slots correctly).
+
+Such strictness and consistency make it possible to build generic helpers that
+abstract away important but boilerplate details, so that open-coded iterators
+can be used effectively and ergonomically (see libbpf's bpf_for_each() macro).
+These rules are enforced by the kernel at the kfunc registration point.
+
+The constructor/next/destructor implementation contract is as follows:
+  - The constructor, ``bpf_iter_<type>_new()``, always initializes iterator
+    state on the stack. If any of the input arguments are invalid, it should
+    still initialize the state such that subsequent next() calls will return
+    NULL. I.e., on error, *return error and construct empty iterator*.
+    The constructor kfunc is marked with the KF_ITER_NEW flag.
+
+  - The next method, ``bpf_iter_<type>_next()``, accepts a pointer to iterator
+    state and produces an element. It should always return a pointer. The
+    contract with the BPF verifier is that the next method *guarantees* that
+    it will eventually return NULL when elements are exhausted. Once NULL is
+    returned, subsequent next calls *should keep returning NULL*. The next
+    method is marked with KF_ITER_NEXT (and, being a NULL-returning kfunc,
+    should also have KF_RET_NULL, of course).
+
+  - The destructor, ``bpf_iter_<type>_destroy()``, is always called once, even
+    if the constructor failed or next returned nothing. It frees up any
+    resources and marks stack space used by ``struct bpf_iter_<type>`` as
+    usable for something else. It is marked with the KF_ITER_DESTROY flag.
+
+Any open-coded BPF iterator has to implement at least these three methods.
+It is enforced that, for any given type of iterator, only the applicable
+constructor/destructor/next are callable. I.e., the verifier ensures you
+can't pass number iterator state into, say, a cgroup iterator's next method.
+
+From a 10,000-foot BPF verification point of view, next methods are the points
+at which verification state forks, conceptually similar to what the verifier
+does when validating conditional jumps. The verifier branches at each
+``call bpf_iter_<type>_next`` instruction and simulates two outcomes: NULL
+(iteration is done) and non-NULL (new element is returned). NULL is simulated
+first and is supposed to reach exit without looping. After that, the non-NULL
+case is validated, and it either reaches exit (for trivial examples with no
+real loop) or reaches another ``call bpf_iter_<type>_next`` instruction with
+a state equivalent to an already (partially) validated one. State equivalency
+at that point means we technically are going to be looping forever without
+"breaking out" of the established "state envelope" (i.e., subsequent
+iterations don't add any new knowledge or constraints to the verifier state,
+so running 1, 2, 10, or a million of them doesn't matter). But taking into
+account the contract stating that the iterator next method *has to* return
+NULL eventually, we can conclude that the loop body is safe and will eventually
+terminate. Given we validated the logic outside of the loop (NULL case) and
+concluded that the loop body is safe (though potentially looping many times),
+the verifier can claim safety of the overall program logic.
+
+------------------------
+BPF Iterators Motivation
+------------------------
 
 There are a few existing ways to dump kernel data into user space. The most
 popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
@@ -323,8 +430,8 @@ Now, in the userspace program, pass the pointer of struct to the
 
 ::
 
-  link = bpf_program__attach_iter(prog, &opts); iter_fd =
-  bpf_iter_create(bpf_link__fd(link));
+  link = bpf_program__attach_iter(prog, &opts);
+  iter_fd = bpf_iter_create(bpf_link__fd(link));
 
 If both *tid* and *pid* are zero, an iterator created from this struct
 ``bpf_iter_attach_opts`` will include every opened file of every task in the
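
Note on the open-coded iterator contract documented in the bpf_iterators.rst
hunk above: the following is a minimal user-space sketch of the
constructor/next/destructor rules, not kernel code. All ``demo_iter_num``
names are hypothetical stand-ins invented for illustration; real iterators are
kfuncs whose usage is checked by the BPF verifier, which this plain C program
does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator state; kept on the caller's stack. */
struct demo_iter_num {
	int next;	/* next value to hand out */
	int end;	/* exclusive upper bound */
	int val;	/* storage for the element returned by next() */
	int pad;	/* keeps the size a multiple of 8 bytes */
};

/* Mirrors the documented rule: positive size, multiple of 8 bytes. */
static_assert(sizeof(struct demo_iter_num) > 0 &&
	      sizeof(struct demo_iter_num) % 8 == 0,
	      "iterator state must be a positive multiple of 8 bytes");

/*
 * Constructor: always initializes the state. On invalid input it returns
 * an error AND leaves behind an empty iterator, so next() returns NULL.
 */
static int demo_iter_num_new(struct demo_iter_num *it, int start, int end)
{
	it->val = 0;
	it->pad = 0;
	if (start > end) {
		it->next = 0;
		it->end = 0;
		return -1;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/*
 * Next: returns a pointer to the current element, or NULL once exhausted.
 * After the first NULL, every later call keeps returning NULL.
 */
static int *demo_iter_num_next(struct demo_iter_num *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

/* Destructor: called exactly once, even if the constructor failed. */
static void demo_iter_num_destroy(struct demo_iter_num *it)
{
	(void)it;	/* nothing to release in this sketch */
}

/* Usage: sum the integers in [start, end) with a new/next/destroy loop. */
static int demo_sum(int start, int end)
{
	struct demo_iter_num it;
	int *v, sum = 0;

	demo_iter_num_new(&it, start, end);
	while ((v = demo_iter_num_next(&it)))
		sum += *v;
	demo_iter_num_destroy(&it);
	return sum;
}
```

The loop in ``demo_sum()`` has the shape that a helper such as libbpf's
bpf_for_each() macro is meant to hide: construct, call next until NULL, then
destroy unconditionally.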
 
@@ -160,6 +160,23 @@ Or::
                 ...
         }
 
+2.2.6 __prog Annotation
+---------------------------
+This annotation is used to indicate that the argument needs to be fixed up to
+point to the bpf_prog_aux of the calling BPF program. Any value passed into
+this argument is ignored and rewritten by the verifier.
+
+An example is given below::
+
+        __bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
+                                                 int (callback_fn)(void *map, int *key, void *value),
+                                                 unsigned int flags,
+                                                 void *aux__prog)
+        {
+                struct bpf_prog_aux *aux = aux__prog;
+                ...
+        }
+
 .. _BPF_kfunc_nodef:
 
 2.3 Using an existing kernel function