Conversation

@edwintorok edwintorok commented Aug 29, 2024

We've noticed some delays when tracing is enabled, and suspected this was due to lock contention or GC activity created by the tracing module itself.

I've added some benchmarks to measure the overhead of tracing, and made some improvements (more are possible).
The benchmarks cover two situations: one with no background workload, and one with two additional threads running, one executing the same workload as the function under test and the other a very GC-intensive workload.

Improved:

  • when the max spans limit is hit, don't spam syslog: time per span drops from 202 milliseconds to 14 microseconds
  • reduce lock contention: eliminate one global lock, replacing it with atomic operations
  • reduce allocation

We are now at ~9 µs per nested span on this particular machine. Further improvements are possible by deferring more work to the exporter thread and returning quickly otherwise (for comparison, upstream ocaml-trace claims ~60 ns/span).

When tracing is disabled the overhead is very small, just like before (on the order of 20ns without a workload).
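
For context, the "with workloads" benchmark setup looks roughly like the sketch below. This is a simplified illustration rather than the actual harness; `traced_operation`, thread counts and iteration counts are illustrative names and values, not the real benchmark code.

```ocaml
(* Simplified sketch: time a span-creating function [f] while two background
   threads run, one repeating the same operation and one stressing the GC. *)

let gc_workload stop =
  while not (Atomic.get stop) do
    (* allocate aggressively to keep the GC busy *)
    ignore (Sys.opaque_identity (Array.make 1_000 0.0))
  done

let same_workload f stop =
  while not (Atomic.get stop) do
    ignore (Sys.opaque_identity (f ()))
  done

let time_with_workloads f =
  let stop = Atomic.make false in
  let t1 = Thread.create gc_workload stop in
  let t2 = Thread.create (same_workload f) stop in
  let iterations = 100_000 in
  let start = Unix.gettimeofday () in
  for _ = 1 to iterations do
    ignore (Sys.opaque_identity (f ()))
  done;
  let per_run_ns =
    (Unix.gettimeofday () -. start) *. 1e9 /. float_of_int iterations
  in
  Atomic.set stop true;
  Thread.join t1;
  Thread.join t2;
  per_run_ns
```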

Before:

Running benchmarks (no workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│          1187.0000 mnw/run│       96214066.4652 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            14.0000 mnw/run│             19.0061 ns/run│
│  tracing/overhead(on, create span)  │             2.9250 mjw/run│           496.3888 mnw/run│          25431.5210 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            14.0000 mnw/run│             18.8063 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 18.806282 (confidence: 20.597860 to 16.986898);
   r² = Some 0.63699 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 96214066.465201 (confidence: 125336685.666667 to 70136316.380165);
   r² = Some 0.518256 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 19.006133 (confidence: 20.619355 to 17.390492);
   r² = Some 0.667542 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 25431.521030 (confidence: 37607.254799 to 15151.227932);
   r² = Some 0.0329005 }

Running benchmarks (workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │         80934.7363 mjw/run│      10580588.4725 mnw/run│      202271848.5714 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            16.5479 mnw/run│            212.0958 ns/run│
│  tracing/overhead(on, create span)  │             0.0000 mjw/run│           503.4400 mnw/run│          15633.2400 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            18.3256 mnw/run│            326.9568 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 326.956811; r² = Some -9.40099 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 202271848.571429; r² = Some 0.065489 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 212.095829; r² = Some -4.46794 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 15633.240000; r² = Some 0.805726 }

After:

Running benchmarks (no workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0964 mjw/run│           379.0196 mnw/run│          14479.0371 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            14.0000 mnw/run│             18.2174 ns/run│
│  tracing/overhead(on, create span)  │             0.0000 mjw/run│           397.3440 mnw/run│          14712.1307 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            14.0000 mnw/run│             18.1771 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 18.177093 (confidence: 19.990180 to 16.374104);
   r² = Some 0.600249 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 14479.037101 (confidence: 15625.053708 to 13293.667720);
   r² = Some 0.635766 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 18.217390 (confidence: 19.845841 to 16.442527);
   r² = Some 0.642373 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 14712.130670 (confidence: 23416.444172 to 7426.891454);
   r² = Some 0.0476263 }

Running benchmarks (workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│           380.4429 mnw/run│           7658.2207 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            15.3007 mnw/run│            102.2479 ns/run│
│  tracing/overhead(on, create span)  │             0.0000 mjw/run│           400.5094 mnw/run│           8657.5585 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            18.1333 mnw/run│            373.9794 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 373.979447 (confidence: 435.400802 to 336.056569);
   r² = Some -1.84338 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 7658.220695 (confidence: 7952.878804 to 7396.117711);
   r² = Some 0.950954 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 102.247932 (confidence: 119.364768 to 89.830417);
   r² = Some -0.607146 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 8657.558458 (confidence: 8904.596470 to 8435.216348);
   r² = Some 0.956299 }

Draft PR; this has only been unit tested so far.

@edwintorok edwintorok force-pushed the private/edvint/tracingopt0 branch from 6bde2f2 to 0a72408 on September 2, 2024 15:42

The overhead is on the order of 10ns when disabled,
but substantially higher when enabled.

Signed-off-by: Edwin Török <[email protected]>
When the max spans or max traces limit is hit, we log debug messages at a very high rate.

Limit how many failures we log, and reset the failure counter every time we've cleared the finished spans table by exporting.

Don't call the debug function at all when the counter is too high; this also avoids the overhead of formatting the string
(we should really use the Logs module, which already knows how to efficiently skip formatting when logging is turned off).

This improves performance significantly and avoids affecting benchmarks:
```
before:
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│          1195.7273 mnw/run│      130818578.6364 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

after:
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│           480.7951 mnw/run│          14724.6233 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯
```
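
A minimal sketch of the rate-limiting idea (the names below are illustrative, not the actual xapi code):

```ocaml
(* Only format and emit the first few "limit hit" messages; once the counter
   exceeds the threshold, skip the call entirely so we also avoid the cost of
   building the message string.  The exporter resets the counter after it has
   drained the finished-spans table. *)

let max_logged_failures = 10
let failures = Atomic.make 0

let log_limit_hit make_msg =
  let n = Atomic.fetch_and_add failures 1 in
  if n < max_logged_failures then
    (* the message is only formatted when we will actually log it *)
    prerr_endline (make_msg ())

(* called by the exporter once the finished spans have been exported *)
let reset_failures () = Atomic.set failures 0
```

Passing the message as a thunk means the string is only built for the first `max_logged_failures` occurrences.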

Signed-off-by: Edwin Török <[email protected]>
The 'name' field was unused, and Tracer.t would otherwise be equivalent to TracerProvider.t.
This also removes one level of indirection.

It also enables further optimizations, where get_tracer could directly return a provider.

Signed-off-by: Edwin Török <[email protected]>
Instead of fetching a list of enabled tracers from the locked hashtable every time,
look the enabled tracer up directly.
Change the 'observer' global boolean to an Atomic.t that stores the currently enabled TracerProvider
(if any; otherwise a [no_op] tracer that is disabled).
Now 'Tracer.get_tracer' can run without allocating memory and without having to take any locks.

When a tracer gets created/destroyed/enabled/disabled we walk the entire list, find the currently enabled tracer (if any),
and set the current tracer appropriately.
This is O(tracerproviders), but we only support tracerproviders=0 or tracerproviders=1, and this is a rare operation
that happens on startup or the first time the tracer gets enabled.
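
In rough terms the change has the following shape (an illustrative sketch; the field and function names are not the real TracerProvider interface):

```ocaml
(* Keep the currently enabled provider in an Atomic.t so the hot path is a
   single atomic read: no lock, no allocation. *)
type provider = { enabled : bool (* ... other fields ... *) }

let no_op = { enabled = false }

let current : provider Atomic.t = Atomic.make no_op

(* hot path *)
let get_tracer () = Atomic.get current

(* cold path, run when a provider is created/destroyed/enabled/disabled:
   O(number of providers), but there are at most one or two of those and
   this happens rarely (startup, or when tracing is first enabled). *)
let refresh_current all_providers =
  match List.find_opt (fun p -> p.enabled) all_providers with
  | Some p -> Atomic.set current p
  | None -> Atomic.set current no_op
```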

tracing/overhead(on, create span) with workloads:
```
before: 14940.958498 ns/run (confidence: 16363.887949 to 13856.836570)
after: 11975.740184 ns/run (confidence: 12710.262010 to 11463.052415)
```

Signed-off-by: Edwin Török <[email protected]>
We only need to traverse this once, in a direct order.

without workload:
before: 25172.588238 ns/run (confidence: 38289.717741 to 14519.713302)
after: 15774.959559 ns/run (confidence: 23866.336958 to 9640.244697)

with workload:
before: 14940.958498 ns/run (confidence: 16363.887949 to 13856.836570)
after: 11028.540184 ns/run (confidence: 11687.173016 to 10454.879814)

Signed-off-by: Edwin Török <[email protected]>
Instead of constructing intermediate seq elements, directly fold the new attribute list into the old set.
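
Roughly the shape of the change (an illustrative sketch, assuming the attributes end up in a Map keyed by name; the real types in xapi may differ):

```ocaml
module Attributes = Map.Make (String)

(* before (roughly): build an intermediate Seq and merge it into the map *)
let add_attributes_seq old_attrs (new_attrs : (string * string) list) =
  Attributes.add_seq (List.to_seq new_attrs) old_attrs

(* after: fold the list directly into the existing map; no intermediate Seq
   elements are allocated *)
let add_attributes_fold old_attrs new_attrs =
  List.fold_left (fun acc (k, v) -> Attributes.add k v acc) old_attrs new_attrs
```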

Signed-off-by: Edwin Török <[email protected]>
Tracing shouldn't use locks, because we use it to find hot spots in the application,
and introducing a single global lock might create contention that wouldn't otherwise be present.

Use atomic operations on immutable data structures instead.
Although a Map is O(log n) rather than O(1) like a hashtable, it doesn't require holding any locks to traverse
or update it.
We only need to perform an atomic compare-and-set once we've finished updating it, and if we raced with anyone
(unlikely on OCaml 4, unless we got interrupted by the tick thread), we try again with a backoff.

We shouldn't hand-roll atomic data structures like this, but instead use Saturn (skip lists) or Kcas (which has a
generic [update] function that implements the above method and also works on OCaml 5).
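
The update pattern described above, as a hand-rolled sketch (illustrative names; as noted, Kcas or Saturn would be preferable):

```ocaml
module SpanMap = Map.Make (String)

type span = { span_name : string } (* stand-in for the real span record *)

let finished_spans : span SpanMap.t Atomic.t = Atomic.make SpanMap.empty

(* Build the new immutable map outside any lock, then publish it with a
   compare-and-set; retry if another thread raced us.  A real implementation
   would back off between retries. *)
let rec update f =
  let old = Atomic.get finished_spans in
  let next = f old in
  if not (Atomic.compare_and_set finished_spans old next) then update f

let add_finished_span id span = update (SpanMap.add id span)
```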

before: 3827.092437 ns/run (confidence: 4275.705106 to 3550.511099);
after:  2727.247326 ns/run (confidence: 3019.854167 to 2582.316754);

Note: when benchmarking ensure that /sys/devices/system/clocksource/clocksource0/current_clocksource is set to TSC.
If set to Xen, then reading timestamps is very slow.

Signed-off-by: Edwin Török <[email protected]>
A small improvement, but the gain could be larger on deep span trees.

before: 2727.247326 ns/run (confidence: 3019.854167 to 2582.316754)
after: 2590.000000 ns/run (confidence: 3362.115942 to 2068.393072)

Signed-off-by: Edwin Török <[email protected]>
And print the failed ID.

Signed-off-by: Edwin Török <[email protected]>
max-spans=1000 is small for a VM.start, now that we have more instrumentation.
With the other optimizations in this series of commits it should now be safe to increase the number of spans,
since we no longer perform O(n^2) operations on them.

Signed-off-by: Edwin Török <[email protected]>
@edwintorok edwintorok marked this pull request as ready for review September 16, 2024 16:51
@edwintorok
Contributor Author

Tested in koji. Had to also increase max_spans because a single VM start in a pool would exceed it.
This should now be safe because we use the cardinal operation of Map/Set instead of List.length, so we no longer have O(n^2) performance issues.

@edwintorok edwintorok added this pull request to the merge queue Sep 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 18, 2024
@edwintorok edwintorok added this pull request to the merge queue Sep 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 18, 2024
@edwintorok edwintorok added this pull request to the merge queue Sep 18, 2024
Merged via the queue into xapi-project:master with commit 7e9c2a3 Sep 18, 2024
15 checks passed