CP-52911: Negative scheduler for xapi threads. #6416
Conversation
Force-pushed from 5f8574e to 58f39cc
The results are encouraging, but I am nervous about this kind of code. How can we be sure that we are not creating a situation where the system does not make any progress because it deadlocks or does not schedule some jobs?
ocaml/libs/tgroup/tgroup.ml
Outdated
    |> Atomic.compare_and_set tgroups seen
  )
do
  () (* todo: raise exception after n unsuccessful attempts *)
A recursive loop with a counter would be an easy way to track the attempts.
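A minimal sketch of what such a retry loop could look like (the names and the give-up policy are hypothetical; it only assumes an `Atomic.t` value, as in the snippet above):

```ocaml
(* Retry a compare-and-set a bounded number of times; [update] builds the
   new value from the value we last observed. *)
let rec update_with_retry ?(attempts = 0) ~max_attempts tgroups update =
  if attempts >= max_attempts then
    failwith "update_with_retry: too many unsuccessful attempts"
  else
    let seen = Atomic.get tgroups in
    if Atomic.compare_and_set tgroups seen (update seen) then
      ()
    else
      update_with_retry ~attempts:(attempts + 1) ~max_attempts tgroups update
```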
I changed the array to a hashtable, but I will keep this approach in mind.
ocaml/libs/tgroup/tgroup.ml
Outdated
  let () = Cgroup.set_cgroup creator in
  let group = Group.of_creator creator in
  Some group
with _ -> None
Do we want to log this case? How would you ever know what went wrong?
This is already logged: there is a warning above, when the task fails to be written into cgroups.
But that warning doesn't reraise the exception, so if we get to this point, something else must have raised an exception. Adding a `with e -> D.log_backtrace e` might be useful to log this at debug level, at least initially (although you'll need my backtrace PR to get that function).
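A sketch of the suggested handler, assuming a `D.log_backtrace : exn -> unit` helper as mentioned above (the surrounding function name is hypothetical):

```ocaml
let group_of_creator_opt creator =
  try
    let () = Cgroup.set_cgroup creator in
    let group = Group.of_creator creator in
    Some group
  with e ->
    (* log the unexpected exception instead of silently swallowing it *)
    D.log_backtrace e ;
    None
```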
Force-pushed from 5454e23 to 26ccbfb
I understand the concern @lindig, I spent a lot of time manually testing this. All the functionality is guarded by 2 feature flags that are off by default, so we can thoroughly test this before enabling it. I am also restricting the groups that are sleeping to only one, so that when the remaining work is done they can run. Moreover, the amount of time calculated for sleeping is never bigger than the timeslice, and if a thread has already run more than the remaining time in the timeslice, the sleep time is 0. In this scenario, the thread does not even yield. This is to make sure there are always threads that can run. Moreover, xapi core threads only yield (I am not using sleeps for those).
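A minimal sketch of the clamping described above, with hypothetical names (the real code lives in the timeslice/runtime machinery):

```ocaml
(* The sleep never exceeds what is left of the current timeslice; once a
   thread has already used up the remaining time, the sleep is 0 and the
   thread does not yield either, so there are always threads able to run. *)
let throttle ~timeslice ~already_ran ~requested_sleep =
  let remaining = timeslice -. already_ran in
  if remaining <= 0. then
    ()
  else
    Unix.sleepf (Float.min requested_sleep remaining)
```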
Now that we know that the underlying approach helps performance, we could perhaps move the sleeping code (which is really a rate limit) elsewhere. E.g. we could move it to the top-level API call entrypoint, and keep only the yielding in the GC.Memprof callback. We could have some strategically placed […]. That way we know we're not holding any locks and are not preventing other threads from making progress, instead of sleeping from a very deep GC callback. (For some locks we have an atomic counter that tells us whether we are holding any, but not in general for all Mutexes.)
ocaml/xapi/xapi_session.ml
Outdated
let thread_ctx = ThreadRuntimeContext.get () in
(* authenticated_root here should mean a group has not been set yet and
   we should set one. otherwise go with what has already been set.*)
if
  thread_ctx.tgroup = Tgroup.Description.authenticated_root
  || thread_ctx.tgroup = Tgroup.Description.unauthenticated
then
  ThreadRuntimeContext.update
    (fun thread_ctx -> {thread_ctx with tgroup})
    thread_ctx ;
Tgroup.with_one_thread_of_group tgroup f
These lines are repeated many times above; it feels like they should be a function like `Tgroup.with_one_thread_of_unset_group tgroup f` or similar inside the Tgroup module.
I agree about the code duplication. I added a helper function at the top of the file. Unfortunately, I cannot call `Threadext` from `Tgroups` as there is a dependency cycle, i.e. the thread-local storage contains the tgroup associated with the thread.
Force-pushed from e2021b1 to 6e7a4b0
This should enable correct classification of threads calling back into xapi, e.g. a `Pool.get_all` call, to set the correct timer of sessions. Signed-off-by: Gabriel Buica <[email protected]>
Updates the public-facing functions of the `Tgroup` library:
- `Tgroup.Cgroup.init` is now `Tgroup.init`
- `Tgroup.of_req_originator` has a new signature: `val of_req_originator : string option -> Group.t option`
- `Tgroup.of_creator` has a new signature: `val of_creator : Group.Creator.t -> Group.t`
These are some enablers, and the `Group.t` is returned when set, to make it possible to add it to the thread context. Signed-off-by: Gabriel Buica <[email protected]>
Every time we set the `tgroup` of the current thread, we want to add it to the `ThreadRuntimeContext` for ease of read access later on. Signed-off-by: Gabriel Buica <[email protected]>
Adds a new module, `Tgroup.ThreadGroups`, that helps manage threads in buckets. The purpose of this is to facilitate the scheduling calculation for threads in batches instead of per thread. The intention is to iterate over 2-5 classes of threads every timeslice when calculating the ideal time of individual threads, by keeping track of the total number of threads with each respective classification. Signed-off-by: Gabriel Buica <[email protected]>
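An illustrative-only sketch of the bucketing idea described in this commit (this is not the `Tgroup.ThreadGroups` API; the names and the use of a string key are assumptions):

```ocaml
(* Keep a per-classification thread count so the scheduler can iterate over
   a handful of buckets every timeslice instead of over every thread. *)
let counts : (string, int) Hashtbl.t = Hashtbl.create 8

let m = Mutex.create ()

let enter cls =
  Mutex.lock m ;
  let n = Option.value ~default:0 (Hashtbl.find_opt counts cls) in
  Hashtbl.replace counts cls (n + 1) ;
  Mutex.unlock m

let leave cls =
  Mutex.lock m ;
  ( match Hashtbl.find_opt counts cls with
  | Some n when n > 1 ->
      Hashtbl.replace counts cls (n - 1)
  | _ ->
      Hashtbl.remove counts cls
  ) ;
  Mutex.unlock m
```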
Summary of behaviour:
1. Every global timeslice we want to yield and start calculating the ideal times for the next epoch. The resulting values are stored in `Tgroup.ThreadGroups`. At this step we also reset the time counters and increment the epoch counter.
2. Every time we hit the periodic hook, we update the time the current thread has been running inside the epoch. If the running time has exceeded the expected ideal time, we decide whether to yield or sleep for the remaining time in the current timeslice epoch. If the time is exceeded, we also reset the counter for thread_last_yield, which gives a good approximation of how long the next thread has been running.
The ideal time is calculated in a fairly straightforward way: we look at ThreadGroups with more than one thread running, add up their shares to get the new total share, normalize each share, and use the results as weights to get the ideal time for each ThreadGroup. The ideal time for an individual thread is the ThreadGroup's time divided by the number of threads running inside the respective class.
Note: the functions inside the `Timeslice.Runtime` module are performance-critical and are written in such a way as to minimize the number of memory allocations, e.g. avoiding closures, using `Bechamel.monotonic_clock` instead of `mtime`, etc. Signed-off-by: Gabriel Buica <[email protected]>
Adds a benchmarking test for functions used in the `Timeslice.Runtime` module. These functions are meant to have low overhead and should have a minimum of memory allocations. Signed-off-by: Gabriel Buica <[email protected]>
Force-pushed from 7cfd87e to a0b3026
Closing it for now as more investigation is needed to address the concerns about sleeping in the runtime.
This is a bit complex and it makes use of a few recently added changes:
- ocaml/libs/tgroups
- xapi-stdext-threads/threadsext
- ocaml/libs/timeslice

Summary:
We want a fair scheduling of different xapi threads, corresponding to the share allocated for each thread class inside `tgroup.ml`, but we cannot give more time to threads before the kernel decides to schedule a different one. Instead, we take away time from threads that have been running for longer than their defined share. This is achieved by a combination of yielding and sleeping, depending on the thread class: `Unix.sleepf` is used to throttle the requests from `xe`, while critical working threads just yield more often.

1. Every global timeslice we yield and start calculating the ideal times for the next epoch. The resulting values are stored in `Tgroup.ThreadGroups`. At this step we also reset the time counters and increment the epoch counter.
2. Every time we hit the periodic hook, we update the time the current thread has been running inside the epoch. If it has exceeded the expected ideal time, we decide whether to yield or sleep for the remaining time in the current timeslice epoch. If the time is exceeded, we also reset the counter for thread_last_yield, which gives a good approximation of how long the next thread has been running.

The ideal time is calculated in a fairly straightforward way: we look at ThreadGroups with more than one thread running, add up their shares to get the new total share, normalize each share, and use the results as weights to get the ideal time for each ThreadGroup. The ideal time for an individual thread is the ThreadGroup's time divided by the number of threads running inside the respective class.
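As a rough illustration of that arithmetic (the type and names below are made up for the example; this is not the PR's actual data structure):

```ocaml
type thread_group = {shares: float; running_threads: int}

(* Split [timeslice] between the active classes according to their
   normalised shares, then split each class's slice evenly between its
   running threads. Returns (group, ideal time per thread) pairs. *)
let ideal_times ~timeslice groups =
  let active = List.filter (fun g -> g.running_threads > 0) groups in
  let total_shares =
    List.fold_left (fun acc g -> acc +. g.shares) 0. active
  in
  List.map
    (fun g ->
      let group_time = timeslice *. (g.shares /. total_shares) in
      (g, group_time /. float_of_int g.running_threads)
    )
    active
```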
This behaviour is behind 2 flags and both need to be enabled for this to take place: `tgroups-enabled` and `runtime-sched`.

Results:
While timing vm starts under different loads of `xe vm-list`, I managed to get the following results for different timeslices:

In the above, negative percentages are the improvement in time compared with the version without negative scheduling, under the same load and using the same timeslice values.
These are the absolute vm start times in seconds under load (with this configuration):

For these results, I used slightly more skewed shares to illustrate that, under load, the amount of time to start a VM can be close to that of starting a VM without a load. The results are most visible when the load and the test are clearly separable in terms of thread classification.
This is on top of the `ThreadLocalStorage` PR, #6354.

Note: this will probably need changes to `xs-opam` (bechamel.monotonic_clock).