Skip to content

New and improved kestrel.connection.duration tags: error.type and protocol error #56164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
JamesNK opened this issue Jun 10, 2024 · 1 comment
Labels
api-approved API was approved in API review, it can be implemented area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions

Comments

@JamesNK
Copy link
Member

JamesNK commented Jun 10, 2024

Background and Motivation

See #53358. We want error.type to provide a low-cardinality reason for why a connection was closed in error scenarios.

Proposed API

kestrel.connection.duration is a histogram counter. A timer starts when a connection starts, and the timer ends and is recorded when the connection ends. Because of this, there is the opportunity to provide information about why the connection ended.

I propose two changes:

  • Record the protocol error code for HTTP/2 and HTTP/3 on the connection duration metric.
    • Tag is named http.connection.protocol_error_code. An issue to discuss on OTEL semantic conventions: Add http.connection.protocol_error_code to connection duration metric open-telemetry/semantic-conventions#1135. If the tag likely won't be standardized, then it could be prefixed with kestrel..
    • The protocol error code is an unsigned integer.
    • Is omitted if the code is NO_ERROR (http/2) or H3_NO_ERROR (http/3).
    • Tag is present even if the server didn't physically send the error code to the client. For example, the transport for a HTTP connection closes unexpectedly and the server ends the connection with an error. The server internally says the connection ended with a specific error code, even if it doesn't have the opportunity to send to the client.
  • Kestrel keeps track of why a connection closes and sets the error.type tag to the close reason.
    • error.type isn't set for non-error reasons, e.g. the transport closing.
    • The error.type reason will mostly come from the HTTP layer, but HTTPS middleware can set a status if the connection failed the HTTPS handshake
    • If there is already an error.type value then it takes priority over the reason (basically, the first value set to error.type is what's used)
    • The connection end reasons that are set to error.type follow the standard OTEL naming standard for enums: snake case. e.g. app_shutdown.
    • Errors that don't fall into one of the known connection end reasons have an error.type value of _OTHER (similar to other OTEL enums that have a set range of values).

Usage Examples

Someone wants to monitor incoming HTTP connections to the server.

  • Enable OTEL metrics collection for the Kestrel meter name.
  • Export telemetry to telemetry store.
  • Queries kestrel.connection.duration to see error.type values.

Alternative Designs

The connection end reason could also always be set to its own tag, e.g. kestrel.connection.end_reason. In that case, it could include non-error values. However, there are few non-error connection end reasons. And error reasons would also be set to error.type. It doesn't seem valuable to me to have both tags when people are likely focused on connection errors.

kestrel.connection.end_reason could be added in the future if there is demand for it.

Risks

Ensure that Kestrel tags match future OTEL semantic conventions around connections.

@JamesNK JamesNK added the api-ready-for-review API is ready for formal API review - https://github.com/dotnet/apireviews label Jun 10, 2024
@ghost ghost added the area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions label Jun 10, 2024
@amcasey
Copy link
Member

amcasey commented Jul 3, 2024

[API Review]

  • The cardinality seems potentially large (~50)
    • TBD: we can probably collapse some
    • Seeing all of them is unlikely
    • In practice, the cardinality is both manageable and in line with precedent (e.g. HTTP status code)
  • Do other OTel providers use error.type in connection duration metrics?
    • No, in the sense that that's a non-standard metric
  • error.type seems like a good tag/attribute (exact values TBD in PR)
  • There can be multiple end reasons (e.g. because something blew up in the course of shutting down for a graceful reason)
    • Only report the first ungraceful reason
  • Protocol doesn't seem like a namespace - we can leave it in the name as protocol_error_code
  • HTTP/2 and HTTP/3 error codes occupy non-overlapping ranges
    • Not actually essential since the protocol version is in another tag
  • protocol_error_code doesn't seem essential for an MVP
  • Would end_reason be the same as error.type or would it be "error" to indicate you should check error.type?
    • The former - it would be much easier to query

error.type approved; http.connection.protocol_error_code and kestrel.connection.end_reason deferred.

@amcasey amcasey added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for formal API review - https://github.com/dotnet/apireviews labels Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-approved API was approved in API review, it can be implemented area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
Projects
None yet
Development

No branches or pull requests

2 participants