Tracing: optimize and reduce overhead #5958
Merged
edwintorok merged 11 commits into xapi-project:master from edwintorok:private/edvint/tracingopt0 on Sep 18, 2024
Conversation
Force-pushed from 6bde2f2 to 0a72408
The overhead is on the order of 10ns when disabled, but substantially higher when enabled. Signed-off-by: Edwin Török <[email protected]>
When max spans or max traces is hit we log debug messages at a very high rate. Limit how many failures we log, and reset the failure limit every time we've cleared the finished-spans table by exporting. Don't call the debug function at all when the counter is too high; this avoids the overhead of formatting the string too (we should really use the Logs module, which already knows how to efficiently disable formatting when logging is turned off).

This improves performance significantly and avoids affecting benchmarks:
```
name                        major-allocated   minor-allocated       monotonic-clock
before:
tracing/max span overflow   0.0000 mjw/run    1195.7273 mnw/run     130818578.6364 ns/run
after:
tracing/max span overflow   0.0000 mjw/run    480.7951 mnw/run      14724.6233 ns/run
```
Signed-off-by: Edwin Török <[email protected]>
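As a rough illustration of the rate-limiting idea this commit describes, the sketch below keeps an atomic failure counter, skips formatting entirely once the limit is exceeded, and resets the counter after an export. The names (`max_failures`, `failures`, `log_failure`, `on_export_complete`) are hypothetical, not the actual identifiers used in the tracing module.

```ocaml
(* A minimal sketch of rate-limited failure logging; illustrative names only. *)
let max_failures = 100

let failures = Atomic.make 0

(* Only format and emit the message while below the limit; when the counter is
   too high, skip the formatting cost entirely by discarding the format args. *)
let log_failure fmt =
  if Atomic.fetch_and_add failures 1 < max_failures then
    Printf.eprintf fmt
  else
    Printf.ifprintf stderr fmt

(* Reset the limit every time the finished-spans table has been exported. *)
let on_export_complete () = Atomic.set failures 0
```

Usage would look like `log_failure "dropped span %s\n" span_id`: the formatting work only happens while the counter is below the limit.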
The 'name' field was unused, and Tracer.t would otherwise be equivalent to TracerProvider.t. This also removes one level of indirection. It also enables further optimizations, where get_tracer could directly return a provider. Signed-off-by: Edwin Török <[email protected]>
Instead of fetching a list of enabled tracers from the locked hashtable every time, look it up directly. Change the 'observer' global boolean to an Atomic.t that stores the currently enabled TracerProvider (if any; a disabled [no_op] tracer otherwise). Now 'Tracer.get_tracer' can run without allocating memory, and without having to take any locks.

When the tracer gets created/destroyed/enabled/disabled we walk the entire list, find the current tracer (if any), and set it appropriately. This is O(number of tracer providers), but we only support 0 or 1 providers, and this is a rare operation that happens at startup or the first time the tracer gets enabled.

tracing/overhead (on, create span) with workloads:
```
before: 14940.958498 ns/run (confidence: 16363.887949 to 13856.836570)
after:  11975.740184 ns/run (confidence: 12710.262010 to 11463.052415)
```
Signed-off-by: Edwin Török <[email protected]>
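A minimal sketch of the lock-free lookup described here, assuming the enabled provider is published through a single Atomic.t; the type and function names below are illustrative, not the exact ones in xapi.

```ocaml
(* Illustrative provider type; the real one carries endpoints, attributes, etc. *)
type provider = {enabled: bool}

let no_op = {enabled= false}

(* The single currently-enabled provider (or the disabled [no_op]). *)
let current : provider Atomic.t = Atomic.make no_op

(* Hot path: one atomic read, no locks, no allocation. *)
let get_tracer () = Atomic.get current

(* Rare path, run when a provider is created/destroyed/enabled/disabled:
   walk all providers (0 or 1 in practice) and publish the enabled one. *)
let refresh_current all_providers =
  let enabled =
    match List.find_opt (fun p -> p.enabled) all_providers with
    | Some p -> p
    | None -> no_op
  in
  Atomic.set current enabled
```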
Signed-off-by: Edwin Török <[email protected]>
We only need to traverse this once, in a direct order.

without workload:
before: 25172.588238 (confidence: 38289.717741 to 14519.713302)
after:  15774.959559 ns/run (confidence: 23866.336958 to 9640.244697)

with workload:
before: 14940.958498 ns/run (confidence: 16363.887949 to 13856.836570)
after:  11028.540184 ns/run (confidence: 11687.173016 to 10454.879814)

Signed-off-by: Edwin Török <[email protected]>
Instead of constructing intermediate seq elements, directly fold the new attribute list into the old set. Signed-off-by: Edwin Török <[email protected]>
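As a small illustration, and under the assumption that span attributes are stored in a string map, folding the new list directly into the existing map looks roughly like this (module and function names are hypothetical):

```ocaml
(* Attributes represented as an immutable string map; an assumption for illustration. *)
module Attributes = Map.Make (String)

(* Insert each (key, value) pair directly into the existing map, with no
   intermediate Seq; later values replace earlier ones for the same key. *)
let merge_attributes existing new_attrs =
  List.fold_left (fun acc (k, v) -> Attributes.add k v acc) existing new_attrs
```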
Tracing shouldn't use locks, because we use it to find hot spots in the application, and introducing a single global lock might introduce lock contention that wouldn't otherwise be present. Use atomic operations on immutable data structures instead.

Although a Map is O(log n), and not O(1) like a hashtable, it doesn't require holding any locks to traverse or update it. We only need to do an atomic compare-and-set operation once we've finished updating it, and if we raced with anyone (unlikely on OCaml 4, unless we got interrupted by the tick thread), then try again with a backoff.

We shouldn't hand-roll atomic data structures like this, but instead use Saturn (skip lists) or Kcas (which has a generic [update] function that implements the above method and also works on OCaml 5).

before: 3827.092437 ns/run (confidence: 4275.705106 to 3550.511099)
after:  2727.247326 ns/run (confidence: 3019.854167 to 2582.316754)

Note: when benchmarking, ensure that /sys/devices/system/clocksource/clocksource0/current_clocksource is set to TSC. If set to Xen, then reading timestamps is very slow.

Signed-off-by: Edwin Török <[email protected]>
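A minimal sketch of the compare-and-set retry loop described above, assuming the span table is an immutable Map held in an Atomic.t; all names are illustrative and the backoff is simplified to an immediate retry.

```ocaml
(* Span table as an immutable map published through an Atomic.t. *)
module SpanMap = Map.Make (String)

let spans : string SpanMap.t Atomic.t = Atomic.make SpanMap.empty

(* Apply [f] to the current map and publish the result atomically; if another
   thread won the race (unlikely on OCaml 4 unless the tick thread interleaved),
   retry. A real implementation would back off between retries, or use Kcas's
   generic [update], which also works on OCaml 5. *)
let rec update f =
  let old_map = Atomic.get spans in
  let new_map = f old_map in
  if not (Atomic.compare_and_set spans old_map new_map) then update f

(* Example: inserting a finished span never takes a lock. *)
let add_span id span = update (SpanMap.add id span)
```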
A small improvement, but could be more on deep span trees.

before: 2727.247326 ns/run (confidence: 3019.854167 to 2582.316754)
after:  2590.000000 ns/run (confidence: 3362.115942 to 2068.393072)

Signed-off-by: Edwin Török <[email protected]>
And print the failed ID. Signed-off-by: Edwin Török <[email protected]>
Force-pushed from 0a72408 to 3347189
max-spans=1000 is small for a VM.start, now that we have more instrumentation. With the other optimizations in this series of commits it should now be safe to increase the number of spans, we no longer perform O(n^2) operations on them. Signed-off-by: Edwin Török <[email protected]>
Tested in koji. Had to also increase |
psafont reviewed Sep 17, 2024
psafont reviewed Sep 17, 2024
psafont approved these changes Sep 17, 2024
lindig approved these changes Sep 17, 2024
We've noticed some delays when tracing is enabled, and suspected this was due to lock contention or GC activity created by the tracing module itself.
I've added some benchmarks to measure the overhead of tracing, and made some improvements (more improvements are possible).
The benchmarks test both a situation with no workload, and a situation with 2 other threads: one running the same workload as the function under test, and another that runs a very GC intensive workload.
Improved:
We are now at ~9µs/nested span on this particular machine, but further improvements are possible by delaying work to the exporter thread and finishing really quickly otherwise (e.g. upstream ocaml-trace claims ~60ns/span).
When tracing is disabled the overhead is very small, just like before (on the order of 20ns without a workload).
Before:
After:
Draft PR; this has only been unit-tested so far.