Conversation

@edwintorok edwintorok commented Aug 29, 2024

We've noticed some delays when tracing is enabled, and suspected this was due to lock contention or GC activity created by the tracing module itself.

I've added some benchmarks to measure the overhead of tracing, and made some improvements (more are possible).
The benchmarks cover two situations: one with no background workload, and one with two additional threads running, one executing the same workload as the function under test and the other a very GC-intensive workload.

Improved:

  • when the max spans limit is hit, don't spam syslog: time per span drops from 202 milliseconds to 14 microseconds
  • reduce lock contention: eliminate one global lock, replacing it with atomic operations
  • reduce allocation

We are now at ~9 µs per nested span on this particular machine. Further improvements are possible by deferring more work to the exporter thread and returning quickly otherwise (for comparison, upstream ocaml-trace claims ~60 ns/span).

When tracing is disabled the overhead is very small, just like before (on the order of 20ns without a workload).
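
For context, the "with workloads" benchmark setup looks roughly like the sketch below. This is a simplified illustration rather than the actual harness; `traced_operation`, thread counts and iteration counts are illustrative names and values, not the real benchmark code.

```ocaml
(* Simplified sketch: time a span-creating function [f] while two background
   threads run, one repeating the same operation and one stressing the GC. *)

let gc_workload stop =
  while not (Atomic.get stop) do
    (* allocate aggressively to keep the GC busy *)
    ignore (Sys.opaque_identity (Array.make 1_000 0.0))
  done

let same_workload f stop =
  while not (Atomic.get stop) do
    ignore (Sys.opaque_identity (f ()))
  done

let time_with_workloads f =
  let stop = Atomic.make false in
  let t1 = Thread.create gc_workload stop in
  let t2 = Thread.create (same_workload f) stop in
  let iterations = 100_000 in
  let start = Unix.gettimeofday () in
  for _ = 1 to iterations do
    ignore (Sys.opaque_identity (f ()))
  done;
  let per_run_ns =
    (Unix.gettimeofday () -. start) *. 1e9 /. float_of_int iterations
  in
  Atomic.set stop true;
  Thread.join t1;
  Thread.join t2;
  per_run_ns
```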

Before:

Running benchmarks (no workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│          1187.0000 mnw/run│       96214066.4652 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            14.0000 mnw/run│             19.0061 ns/run│
│  tracing/overhead(on, create span)  │             2.9250 mjw/run│           496.3888 mnw/run│          25431.5210 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            14.0000 mnw/run│             18.8063 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 18.806282 (confidence: 20.597860 to 16.986898);
   r² = Some 0.63699 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 96214066.465201 (confidence: 125336685.666667 to 70136316.380165);
   r² = Some 0.518256 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 19.006133 (confidence: 20.619355 to 17.390492);
   r² = Some 0.667542 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 25431.521030 (confidence: 37607.254799 to 15151.227932);
   r² = Some 0.0329005 }

Running benchmarks (workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │         80934.7363 mjw/run│      10580588.4725 mnw/run│      202271848.5714 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            16.5479 mnw/run│            212.0958 ns/run│
│  tracing/overhead(on, create span)  │             0.0000 mjw/run│           503.4400 mnw/run│          15633.2400 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            18.3256 mnw/run│            326.9568 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 326.956811; r² = Some -9.40099 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 202271848.571429; r² = Some 0.065489 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 212.095829; r² = Some -4.46794 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 15633.240000; r² = Some 0.805726 }

After:

Running benchmarks (no workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0964 mjw/run│           379.0196 mnw/run│          14479.0371 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            14.0000 mnw/run│             18.2174 ns/run│
│  tracing/overhead(on, create span)  │             0.0000 mjw/run│           397.3440 mnw/run│          14712.1307 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            14.0000 mnw/run│             18.1771 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 18.177093 (confidence: 19.990180 to 16.374104);
   r² = Some 0.600249 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 14479.037101 (confidence: 15625.053708 to 13293.667720);
   r² = Some 0.635766 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 18.217390 (confidence: 19.845841 to 16.442527);
   r² = Some 0.642373 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 14712.130670 (confidence: 23416.444172 to 7426.891454);
   r² = Some 0.0476263 }

Running benchmarks (workloads)
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│           380.4429 mnw/run│           7658.2207 ns/run│
│  tracing/overhead(off)              │             0.0000 mjw/run│            15.3007 mnw/run│            102.2479 ns/run│
│  tracing/overhead(on, create span)  │             0.0000 mjw/run│           400.5094 mnw/run│           8657.5585 ns/run│
│  tracing/overhead(on, no span)      │             0.0000 mjw/run│            18.1333 mnw/run│            373.9794 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

tracing/overhead(on, no span) (ns):
 { monotonic-clock per run = 373.979447 (confidence: 435.400802 to 336.056569);
   r² = Some -1.84338 }
tracing/max span overflow (ns):
 { monotonic-clock per run = 7658.220695 (confidence: 7952.878804 to 7396.117711);
   r² = Some 0.950954 }
tracing/overhead(off) (ns):
 { monotonic-clock per run = 102.247932 (confidence: 119.364768 to 89.830417);
   r² = Some -0.607146 }
tracing/overhead(on, create span) (ns):
 { monotonic-clock per run = 8657.558458 (confidence: 8904.596470 to 8435.216348);
   r² = Some 0.956299 }

Draft PR; this has only been unit tested so far.

@edwintorok edwintorok force-pushed the private/edvint/tracingopt0 branch from 6bde2f2 to 0a72408 on September 2, 2024 15:42

The overhead is on the order of 10ns when disabled,
but substantially higher when enabled.

Signed-off-by: Edwin Török <[email protected]>
When the max spans or max traces limit is hit, we log debug messages at a very high rate.

Limit how many failures we log, and reset the failure counter every time we've cleared the finished spans table by exporting.

Don't call the debug function at all when the counter is too high; this also avoids the overhead of formatting the string
(we should really use the Logs module, which already knows how to efficiently skip formatting when logging is turned off).

This improves performance significantly and avoids affecting benchmarks:
```
before:
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│          1195.7273 mnw/run│      130818578.6364 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

after:
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  tracing/max span overflow          │             0.0000 mjw/run│           480.7951 mnw/run│          14724.6233 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯
```
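
A minimal sketch of the rate-limiting idea (the names below are illustrative, not the actual xapi code):

```ocaml
(* Only format and emit the first few "limit hit" messages; once the counter
   exceeds the threshold, skip the call entirely so we also avoid the cost of
   building the message string.  The exporter resets the counter after it has
   drained the finished-spans table. *)

let max_logged_failures = 10
let failures = Atomic.make 0

let log_limit_hit make_msg =
  let n = Atomic.fetch_and_add failures 1 in
  if n < max_logged_failures then
    (* the message is only formatted when we will actually log it *)
    prerr_endline (make_msg ())

(* called by the exporter once the finished spans have been exported *)
let reset_failures () = Atomic.set failures 0
```

Passing the message as a thunk means the string is only built for the first `max_logged_failures` occurrences.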

Signed-off-by: Edwin Török <[email protected]>
The 'name' field was unused, and Tracer.t would otherwise be equivalent to TracerProvider.t.
This also removes one level of indirection.

It also enables further optimizations, where get_tracer could directly return a provider.

Signed-off-by: Edwin Török <[email protected]>
Instead of fetching a list of enabled tracers from the locked hashtable every time,
look the enabled tracer up directly.
Change the 'observer' global boolean to an Atomic.t that stores the currently enabled TracerProvider
(if any; otherwise a [no_op] tracer that is disabled).
Now 'Tracer.get_tracer' can run without allocating memory and without having to take any locks.

When a tracer gets created/destroyed/enabled/disabled we walk the entire list, find the currently enabled tracer (if any),
and set the current tracer appropriately.
This is O(tracerproviders), but we only support tracerproviders=0 or tracerproviders=1, and this is a rare operation
that happens on startup or the first time the tracer gets enabled.
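
In rough terms the change has the following shape (an illustrative sketch; the field and function names are not the real TracerProvider interface):

```ocaml
(* Keep the currently enabled provider in an Atomic.t so the hot path is a
   single atomic read: no lock, no allocation. *)
type provider = { enabled : bool (* ... other fields ... *) }

let no_op = { enabled = false }

let current : provider Atomic.t = Atomic.make no_op

(* hot path *)
let get_tracer () = Atomic.get current

(* cold path, run when a provider is created/destroyed/enabled/disabled:
   O(number of providers), but there are at most one or two of those and
   this happens rarely (startup, or when tracing is first enabled). *)
let refresh_current all_providers =
  match List.find_opt (fun p -> p.enabled) all_providers with
  | Some p -> Atomic.set current p
  | None -> Atomic.set current no_op
```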

tracing/overhead(on, create span) with workloads:
```
before: 14940.958498 ns/run (confidence: 16363.887949 to 13856.836570)
after: 11975.740184 ns/run (confidence: 12710.262010 to 11463.052415)
```

Signed-off-by: Edwin Török <[email protected]>
We only need to traverse this once, in a direct order.

without workload:
before: 25172.588238 ns/run (confidence: 38289.717741 to 14519.713302)
after: 15774.959559 ns/run (confidence: 23866.336958 to 9640.244697)

with workload:
before: 14940.958498 ns/run (confidence: 16363.887949 to 13856.836570)
after: 11028.540184 ns/run (confidence: 11687.173016 to 10454.879814)

Signed-off-by: Edwin Török <[email protected]>
Instead of constructing intermediate seq elements, directly fold the new attribute list into the old set.
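
Roughly the shape of the change (an illustrative sketch, assuming the attributes end up in a Map keyed by name; the real types in xapi may differ):

```ocaml
module Attributes = Map.Make (String)

(* before (roughly): build an intermediate Seq and merge it into the map *)
let add_attributes_seq old_attrs (new_attrs : (string * string) list) =
  Attributes.add_seq (List.to_seq new_attrs) old_attrs

(* after: fold the list directly into the existing map; no intermediate Seq
   elements are allocated *)
let add_attributes_fold old_attrs new_attrs =
  List.fold_left (fun acc (k, v) -> Attributes.add k v acc) old_attrs new_attrs
```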

Signed-off-by: Edwin Török <[email protected]>
Tracing shouldn't use locks, because we use it to find hot spots in the application,
and introducing a single global lock might create contention that wouldn't otherwise be present.

Use atomic operations on immutable data structures instead.
Although a Map is O(log n) rather than O(1) like a hashtable, it doesn't require holding any locks to traverse
or update it.
We only need to perform an atomic compare-and-set once we've finished updating it, and if we raced with anyone
(unlikely on OCaml 4, unless we got interrupted by the tick thread), we try again with a backoff.

We shouldn't hand-roll atomic data structures like this, but instead use Saturn (skip lists) or Kcas (which has a
generic [update] function that implements the above method and also works on OCaml 5).
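
The update pattern described above, as a hand-rolled sketch (illustrative names; as noted, Kcas or Saturn would be preferable):

```ocaml
module SpanMap = Map.Make (String)

type span = { span_name : string } (* stand-in for the real span record *)

let finished_spans : span SpanMap.t Atomic.t = Atomic.make SpanMap.empty

(* Build the new immutable map outside any lock, then publish it with a
   compare-and-set; retry if another thread raced us.  A real implementation
   would back off between retries. *)
let rec update f =
  let old = Atomic.get finished_spans in
  let next = f old in
  if not (Atomic.compare_and_set finished_spans old next) then update f

let add_finished_span id span = update (SpanMap.add id span)
```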

before: 3827.092437 ns/run (confidence: 4275.705106 to 3550.511099);
after:  2727.247326 ns/run (confidence: 3019.854167 to 2582.316754);

Note: when benchmarking ensure that /sys/devices/system/clocksource/clocksource0/current_clocksource is set to TSC.
If set to Xen, then reading timestamps is very slow.

Signed-off-by: Edwin Török <[email protected]>
A small improvement, but the gain could be larger on deep span trees.

before: 2727.247326 ns/run (confidence: 3019.854167 to 2582.316754)
after: 2590.000000 ns/run (confidence: 3362.115942 to 2068.393072)

Signed-off-by: Edwin Török <[email protected]>
And print the failed ID.

Signed-off-by: Edwin Török <[email protected]>
max-spans=1000 is small for a VM.start, now that we have more instrumentation.
With the other optimizations in this series of commits it should now be safe to increase the number of spans,
since we no longer perform O(n^2) operations on them.

Signed-off-by: Edwin Török <[email protected]>
@edwintorok edwintorok marked this pull request as ready for review September 16, 2024 16:51
@edwintorok
Contributor Author

Tested in koji. Had to also increase max_spans because a single VM start in a pool would exceed it.
This should now be safe because we use the cardinal operation of Map/Set instead of List.length, so we no longer have O(n^2) performance issues.

@edwintorok edwintorok added this pull request to the merge queue Sep 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 18, 2024
@edwintorok edwintorok added this pull request to the merge queue Sep 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 18, 2024
@edwintorok edwintorok added this pull request to the merge queue Sep 18, 2024
Merged via the queue into xapi-project:master with commit 7e9c2a3 Sep 18, 2024
15 checks passed