-
Notifications
You must be signed in to change notification settings - Fork 10.3k
Add separate activities for Hub methods #55439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if (serviceProvider.GetService<ActivitySource>() is ActivitySource activitySource | ||
&& activitySource.HasListeners()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like SignalR should have its own activity source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Each component should have its own source, so they can be enabled/disabled on a per-component basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name should probably have "server" in it somewhere. There might be a SignalR client activity source in the future, and they'd have different names.
var tags = new ActivityTagsCollection( | ||
[new("rpc.message.type", "SENT"), new("rpc.message.id", count)]); | ||
var @event = new ActivityEvent("rpc.message", tags: tags); | ||
activity.AddEvent(@event); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a memory concern here? Can this OOM eventually? What if I have a long stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought someone mentioned it already having a limit. But looking at AddEvent
I don't see any limit. So yes, we should add one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should remove this wholesale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some debate in the OTel community about whether we should use events on traces or log messages. Now that traces and logs can be correlated, using logs is good. Maybe this is a place that should be considered? It also means you can set a lowish level and that can be configured on a per-provider basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure those events are super-helpful even if we eliminate memory concern. I remember playing with it and it was really hard to correlate these events or make sense what's going on based on them.
I'd expect users who care about the details to create custom activities on client/server for small operations. And events would be a confusing noise for those who don't care.
I'd just rely on the existing logs (I assume there are some already) for now. OTel will eventually unify span events into logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Let's remove events. Nothing prevents us adding new events in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest changing them to logs, and we can then pick an appropriate verbosity level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove anything from this code path. We already have verbose logs at other layers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will those be fired within the context of this activity, so that they get tagged correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but I don't think it's important enough to log every stream item. Though I'd be interested to know if we did something similar in grpc.
{ | ||
// https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md#server-attributes | ||
activity.AddTag("rpc.method", methodName); | ||
activity.AddTag("rpc.system", "signalr"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to update otel semconv and add signalr
constant to system enum? LMK if you need any help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rpc semconv are still experimental, so I'd expect some breaking changes there especially around events (verbosity, memory consumption, usefulness concerns).
So I'd suggest to
- remove events - memory and verbosity are a good concerns to not follow the spec here
"In the lifetime of an RPC stream, an event for each message sent/received on
client and server spans SHOULD be created."
- be prepared to make some (probably) minor breaking changes in
rpc.*
attribute names in .NET 10+
[Update] Also, it would be great to create an issue in otel semconv repo and share memory concern for events.
var requestContext = Activity.Current?.Context; | ||
// Get off the parent span. | ||
// This is likely the Http Request span and we want Hub method invocations to not be collected under a long running span. | ||
Activity.Current = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when the activity created by this method is complete? (or never created at all because no one is listening or sampling)
Is the original request activity restored? I think it should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It already should be restored, that's how async locals work. And the link check in the test(s) verifies this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like the wrong place to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do it once in HubConnectionHandler.OnConnectedAsync
and store the original activity in a field so we can access it for the linking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the linking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're removing our activities from the parent span, the whole websocket request, it'd be nice to point to the parent via a link so there is some correlation. Also, the default OnConnectedAsync span we were talking about below becomes less useful if you can't see when the request started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It already should be restored, that's how async locals work. And the link check in the test(s) verifies this.
I would expect Activity.Stop()
to set Activity.Current
to its parent when called, but if we clear out the parent before creating this activity and then call stop, how does it know to restore the old parent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does it know to restore the old parent?
It wouldn't. I'm guessing @BrennanConroy was referring to the implicit restoration of async locals that occurs when an async method returns to its caller? That does work but it feels like a subtle dependency that could easily be broken in some future refactoring. Ideally I'd clear Activity.Current somewhere up the stack to make it obvious what scope of code is executing dissociated from the long-running HTTP server activity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed where we clear the parent as mentioned in #55439 (comment)
This still relies on async local behavior to reset the parent if the user runs code after the SignalR middleware runs, but that seems fine? Unless we want to explicitly reset it when HubConnectionHandler.OnConnectedAsync
finishes.
// See https://github.com/dotnet/aspnetcore/blob/027c60168383421750f01e427e4f749d0684bc02/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs#L308 | ||
//activity.AddTag("server.address", ...); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to move this code to a shared file. Kestrel and SignalR can share it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we can't properly get this info until #43786 is done.
try | ||
{ | ||
// OnConnectedAsync won't work with client results (ISingleClientProxy.InvokeAsync) | ||
InitializeHub(hub, connection, invokeAllowed: false); | ||
|
||
activity = CreateActivity(scope.ServiceProvider, nameof(hub.OnConnectedAsync)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the OnConnectedAsync
and disconnected invocations always significant? Would it make sense to only report them if the methods are overridden?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're significant in the sense that they describe when the connection is usable. So for example if the handshake took a long time you could see the gap between the connection starting and OnConnectedAsync
starting where it's not usable yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep them
@BrennanConroy can you show an end to end trace with the dashboard with connect/disconnect and multiple invocations? |
Isn't that what the first picture in the PR description shows? |
Why are the resource names in your pictures GUIDs? That is more confusing than it should be. |
I was using the preview4 Aspire dashboard docker container. Updating to preview6 cleaned it up. |
Any other feedback? |
I looked at what the ASP.NET Core activity does and it calls aspnetcore/src/Hosting/Hosting/src/Internal/HostingApplicationDiagnostics.cs Lines 518 to 523 in e925769
|
Should We want to avoid needing extra logic in OpenTelemetry so this should probably be built into SignalR. Edit: Also, |
// This is likely the Http Request span and we want Hub method invocations to not be collected under a long running span. | ||
Activity.Current = null; | ||
|
||
var activity = signalRActivitySource.ActivitySource.CreateActivity($"{_fullHubName}/{methodName}", ActivityKind.Server, parentId: null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does every activity created by this method correspond to some distinct incoming SignalR message that is being processed? If yes then all good but just wanted to check.
Yep, it does. This is based on the DiagnosticSource exception notifications. |
I think that was to preserve an invariant that the Activity Duration and EndTime would be set prior to delivering the DiagnosticListener callback. Since this code isn't doing any diagnostic listener callbacks (which is just fine) that SetEndTime() should be unnecessary. Activity.Stop() will take care of it. |
Updated to set the |
try | ||
{ | ||
InitializeHub(hub, connection); | ||
|
||
activity = StartActivity(connection, scope.ServiceProvider, nameof(hub.OnDisconnectedAsync)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure this doesn't allocate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's allocating?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah nothing should be.
// Make sure to call Activity.Stop() once the Hub method completes, and consider calling SetActivityError on exception. | ||
private static Activity? StartActivity(HubConnectionContext connectionContext, IServiceProvider serviceProvider, string methodName) | ||
{ | ||
if (serviceProvider.GetService<SignalRActivitySource>() is SignalRActivitySource signalRActivitySource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do this once? Does this need to be per invocation? (We can do this change as a follow up).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a service provider until this point. We could do the whole _activitySource ??= serviceProvider.GetService<SignalRActivitySource>()
pattern to only do it once, but that relies on SignalRActivitySource
always being a singleton (at least with MEDI, idk how other containers work). This is true today, but if we ever make it public that would be a concern.
First part of #51557
Follows conventions from https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/rpc-spans.md
Haven't added
server.address
yet since it looks like we'd need to copy a chunk of code from Kestrel to do it properly:aspnetcore/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs
Line 308 in 027c601
Hub methods are separate spans:
Streaming has events: