CP-52911: Negative scheduler for xapi threads. #6416
Conversation
Force-pushed from 5f8574e to 58f39cc
The results are encouraging, but I am nervous about this kind of code. How can we be sure that we are not creating a situation where the system does not make any progress because it deadlocks or does not schedule some jobs?
ocaml/libs/tgroup/tgroup.ml
Outdated
    |> Atomic.compare_and_set tgroups seen
  )
do
  () (* todo: raise exception after n unsuccessful attempts *)
A recursive loop with a counter would be an easy way to track the attempts.
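A minimal sketch of what such a retry loop could look like (the names and the give-up policy are hypothetical; it only assumes an `Atomic.t` value, as in the snippet above):

```ocaml
(* Retry a compare-and-set a bounded number of times; [update] builds the
   new value from the value we last observed. *)
let rec update_with_retry ?(attempts = 0) ~max_attempts tgroups update =
  if attempts >= max_attempts then
    failwith "update_with_retry: too many unsuccessful attempts"
  else
    let seen = Atomic.get tgroups in
    if Atomic.compare_and_set tgroups seen (update seen) then
      ()
    else
      update_with_retry ~attempts:(attempts + 1) ~max_attempts tgroups update
```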
I changed the array to a hashtable, but I will keep this approach in mind.
ocaml/libs/tgroup/tgroup.ml
Outdated
  let () = Cgroup.set_cgroup creator in
  let group = Group.of_creator creator in
  Some group
with _ -> None
Do we want to log this case? How would you ever know what went wrong?
This is already logged: there is a warning above, when the task fails to be written into cgroups.
But that warning doesn't reraise the exception, so if we get to this point, something else must have raised an exception. Adding a `with e -> D.log_backtrace e` might be useful to log this at debug level, at least initially (although you'll need my backtrace PR to get that function).
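A sketch of the suggested handler, assuming a `D.log_backtrace : exn -> unit` helper as mentioned above (the surrounding function name is hypothetical):

```ocaml
let group_of_creator_opt creator =
  try
    let () = Cgroup.set_cgroup creator in
    let group = Group.of_creator creator in
    Some group
  with e ->
    (* log the unexpected exception instead of silently swallowing it *)
    D.log_backtrace e ;
    None
```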
Force-pushed from 5454e23 to 26ccbfb
I understand the concern @lindig, I spent a lot of time manually testing this. All the functionality is guarded by 2 feature flags that are off by default, so we can thoroughly test this before enabling it. I am also restricting the groups that are sleeping to only one, so that when the remaining work is done they can run. Moreover, the amount of time calculated for sleeping is never bigger than the timeslice, and if a thread has already run more than the remaining time in the timeslice, the sleep time is 0. In this scenario, the thread does not even yield. This is to make sure there are always threads that can run. Moreover, xapi core threads only yield (I am not using sleeps for those).
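A minimal sketch of the clamping described above, with hypothetical names (the real code lives in the timeslice/runtime machinery):

```ocaml
(* The sleep never exceeds what is left of the current timeslice; once a
   thread has already used up the remaining time, the sleep is 0 and the
   thread does not yield either, so there are always threads able to run. *)
let throttle ~timeslice ~already_ran ~requested_sleep =
  let remaining = timeslice -. already_ran in
  if remaining <= 0. then
    ()
  else
    Unix.sleepf (Float.min requested_sleep remaining)
```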
Now that we know that the underlying approach helps performance, we could perhaps move the sleeping code (which is really a rate limit) elsewhere. E.g. we could move it to the top-level API call entrypoint, and keep only the yielding in the GC.Memprof callback. We could have some strategically placed […]. That way we know we're not holding any locks and are not preventing other threads from making progress, instead of sleeping from a very deep GC callback. (For some locks we have an atomic counter that tells us whether we are holding any, but not in general for all Mutexes.)
ocaml/xapi/xapi_session.ml
Outdated
let thread_ctx = ThreadRuntimeContext.get () in
(* authenticated_root here should mean a group has not been set yet and
   we should set one. otherwise go with what has already been set.*)
if
  thread_ctx.tgroup = Tgroup.Description.authenticated_root
  || thread_ctx.tgroup = Tgroup.Description.unauthenticated
then
  ThreadRuntimeContext.update
    (fun thread_ctx -> {thread_ctx with tgroup})
    thread_ctx ;
Tgroup.with_one_thread_of_group tgroup f
These lines are repeated many times above; it feels like they should be a function like `Tgroup.with_one_thread_of_unset_group tgroup f` or similar inside the Tgroup module.
I agree about the code duplication. I added a helper function at the top of the file. Unfortunately, I cannot call `Threadext` from `Tgroups` as there is a dependency cycle, i.e. the thread-local storage contains the tgroup associated with the thread.
Force-pushed from e2021b1 to 6e7a4b0
This should enable correct classification of threads calling back into xapi, e.g. a `Pool.get_all` call, to set the correct timer of sessions. Signed-off-by: Gabriel Buica <[email protected]>
Updates the public-facing functions of the `Tgroup` library:
- `Tgroup.Cgroup.init` is now `Tgroup.init`
- `Tgroup.of_req_originator` has a new signature: `val of_req_originator : string option -> Group.t option`
- `Tgroup.of_creator` has a new signature: `val of_creator : Group.Creator.t -> Group.t`
These are some enablers, and the `Group.t` is returned when set, to make it possible to add it to the thread context. Signed-off-by: Gabriel Buica <[email protected]>
Every time we set the `tgroup` of the current thread, we want to add it to the `ThreadRuntimeContext` for ease of read access later on. Signed-off-by: Gabriel Buica <[email protected]>
Adds a new module, `Tgroup.ThreadGroups`, that helps manage threads in buckets. The purpose of this is to facilitate the scheduling calculation for threads in batches instead of per thread. The intention is to iterate over 2-5 classes of threads every timeslice when calculating the ideal time of individual threads, by keeping track of the total number of threads with each respective classification. Signed-off-by: Gabriel Buica <[email protected]>
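An illustrative-only sketch of the bucketing idea described in this commit (this is not the `Tgroup.ThreadGroups` API; the names and the use of a string key are assumptions):

```ocaml
(* Keep a per-classification thread count so the scheduler can iterate over
   a handful of buckets every timeslice instead of over every thread. *)
let counts : (string, int) Hashtbl.t = Hashtbl.create 8

let m = Mutex.create ()

let enter cls =
  Mutex.lock m ;
  let n = Option.value ~default:0 (Hashtbl.find_opt counts cls) in
  Hashtbl.replace counts cls (n + 1) ;
  Mutex.unlock m

let leave cls =
  Mutex.lock m ;
  ( match Hashtbl.find_opt counts cls with
  | Some n when n > 1 ->
      Hashtbl.replace counts cls (n - 1)
  | _ ->
      Hashtbl.remove counts cls
  ) ;
  Mutex.unlock m
```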
Summary of behaviour:
1. Every global timeslice we want to yield and start calculating the ideal times for the next epoch. The resulting values are stored in `Tgroup.ThreadGroups`. At this step we also reset the time counters and increment the epoch counter.
2. Every time we hit the periodic hook, we update the time the current thread has been running inside the epoch. If the running time has exceeded the expected ideal time, we decide whether to yield or sleep for the remaining time in the current timeslice epoch. If the time is exceeded, we also reset the counter for thread_last_yield, which gives a good approximation of how long the next thread has been running.
The ideal time is calculated in a fairly straightforward way: we look at ThreadGroups with more than one thread running, add up their shares to get the new total share, normalize each share, and use the results as weights to get the ideal time for each ThreadGroup. The ideal time for an individual thread is the ThreadGroup's time divided by the number of threads running inside the respective class.
Note: the functions inside the `Timeslice.Runtime` module are performance-critical and are written in such a way as to minimize the number of memory allocations, e.g. avoiding closures, using `Bechamel.monotonic_clock` instead of `mtime`, etc. Signed-off-by: Gabriel Buica <[email protected]>
Adds a benchmarking test for functions used in the `Timeslice.Runtime` module. These functions are meant to have low overhead and should have a minimum of memory allocations. Signed-off-by: Gabriel Buica <[email protected]>
Force-pushed from 7cfd87e to a0b3026
Closing it for now as more investigation is needed to address the concerns about sleeping in the runtime.
This is a bit complex and it makes use of a few recently added changes:
- ocaml/libs/tgroups
- xapi-stdext-threads/threadsext
- ocaml/libs/timeslice

Summary:
We want a fair scheduling of different xapi threads, corresponding to the share allocated for each thread class inside `tgroup.ml`, but we cannot give more time to threads before the kernel decides to schedule a different one. Instead, we take away time from threads that have been running for longer than their defined share. This is achieved by a combination of yielding and sleeping, depending on the thread class: `Unix.sleepf` is used to throttle the requests from `xe`, while critical working threads just yield more often.

1. Every global timeslice we yield and start calculating the ideal times for the next epoch. The resulting values are stored in `Tgroup.ThreadGroups`. At this step we also reset the time counters and increment the epoch counter.
2. Every time we hit the periodic hook, we update the time the current thread has been running inside the epoch. If it has exceeded the expected ideal time, we decide whether to yield or sleep for the remaining time in the current timeslice epoch. If the time is exceeded, we also reset the counter for thread_last_yield, which gives a good approximation of how long the next thread has been running.

The ideal time is calculated in a fairly straightforward way: we look at ThreadGroups with more than one thread running, add up their shares to get the new total share, normalize each share, and use the results as weights to get the ideal time for each ThreadGroup. The ideal time for an individual thread is the ThreadGroup's time divided by the number of threads running inside the respective class.
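As a rough illustration of that arithmetic (the type and names below are made up for the example; this is not the PR's actual data structure):

```ocaml
type thread_group = {shares: float; running_threads: int}

(* Split [timeslice] between the active classes according to their
   normalised shares, then split each class's slice evenly between its
   running threads. Returns (group, ideal time per thread) pairs. *)
let ideal_times ~timeslice groups =
  let active = List.filter (fun g -> g.running_threads > 0) groups in
  let total_shares =
    List.fold_left (fun acc g -> acc +. g.shares) 0. active
  in
  List.map
    (fun g ->
      let group_time = timeslice *. (g.shares /. total_shares) in
      (g, group_time /. float_of_int g.running_threads)
    )
    active
```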
This behaviour is behind 2 flags and both need to be enabled for this to take place: `tgroups-enabled` and `runtime-sched`.

Results:
While timing vm starts under different loads of `xe vm-list`, I managed to get the following results for different timeslices:

In the above, negative percentages are the improvement in time compared with the version without negative scheduling, under the same load and using the same timeslice values.
These are the absolute vm start times in seconds under load (with this configuration):

For these results, I used slightly more skewed shares to illustrate that, under load, the amount of time to start a VM can be close to that of starting a VM without a load. The results are most visible when the load and the test are clearly separable in terms of thread classification.
This is on top of the `ThreadLocalStorage` PR, #6354.

Note: this will probably need changes to `xs-opam` (bechamel.monotonic_clock).