Skip to content

otel: subchannel metrics #12202

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

AgraVator
Copy link
Contributor

@AgraVator AgraVator commented Jul 7, 2025

Implements A94

@AgraVator AgraVator marked this pull request as ready for review July 9, 2025 10:05
@AgraVator AgraVator force-pushed the otel-subchannel-metrics branch from eabbf58 to 9595507 Compare July 15, 2025 10:00
@AgraVator AgraVator force-pushed the otel-subchannel-metrics branch from 9595507 to c713561 Compare July 22, 2025 17:55
@@ -415,7 +415,7 @@ void exitIdleMode() {
LbHelperImpl lbHelper = new LbHelperImpl();
lbHelper.lb = loadBalancerFactory.newLoadBalancer(lbHelper);
// Delay setting lbHelper until fully initialized, since loadBalancerFactory is user code and
// may throw. We don't want to confuse our state, even if we will enter panic mode.
// may throw. We don't want to confuse our state, even if we enter panic mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is more accurate as it is. After entering panic mode there is nothing much we can do about state, the comment implies delaying entering that panic mode and maintaining a sane state for the channel for as long as possible before bringing in potential user code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a grammar change...

addressIndex.getCurrentEagAttributes(), NameResolver.ATTR_BACKEND_SERVICE),
getAttributeOrDefault(
addressIndex.getCurrentEagAttributes(), LoadBalancer.ATTR_LOCALITY_NAME),
"Peer Pressure",
Copy link
Contributor

@kannanjgithub kannanjgithub Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not in the gRFC? Instead there is a "List of allowed values for grpc.disconnect_error".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Phase 1 we won't be plumbing disconnect_error, will raise another PR with this as the base branch for the same

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So? We can use a valid value now: unknown. Using "unknown" means we are technically implementing the gRFC with this change, and it is just an optimization to reduce the amount of unknowns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @param optionalLabelValues the optional label values for the metric.
*/
@Override
public void addLongUpDownCounter(LongUpDownCounterMetricInstrument metricInstrument, long value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just wrappers on the underlying OTel API for UpDownCounter...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but I do see logic here and there are still things to check. See that we have tests in MetricRecorderImplTest for addLongCounter(), which is very similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -117,6 +119,22 @@ public void addLongCounter(LongCounterMetricInstrument metricInstrument, long va
counter.add(value, attributes);
}

@Override
public void addLongUpDownCounter(LongUpDownCounterMetricInstrument metricInstrument, long value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unit test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added for the one above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about in OpenTelemetryMetricsSinkTest for the updown metric ? Existing tests also check for metric enabled and disabled cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

* @return the newly created LongUpDownCounterMetricInstrument
* @throws IllegalStateException if a metric with the same name already exists
*/
public LongUpDownCounterMetricInstrument registerLongUpDownCounter(String name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unit test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added for the one above

* @param optionalLabelValues the optional label values for the metric.
*/
@Override
public void addLongUpDownCounter(LongUpDownCounterMetricInstrument metricInstrument, long value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but I do see logic here and there are still things to check. See that we have tests in MetricRecorderImplTest for addLongCounter(), which is very similar.

addressIndex.getCurrentEagAttributes(), NameResolver.ATTR_BACKEND_SERVICE),
getAttributeOrDefault(
addressIndex.getCurrentEagAttributes(), LoadBalancer.ATTR_LOCALITY_NAME),
"Peer Pressure",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So? We can use a valid value now: unknown. Using "unknown" means we are technically implementing the gRFC with this change, and it is just an optimization to reduce the amount of unknowns.

Copy link
Contributor

@kannanjgithub kannanjgithub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@@ -117,6 +119,22 @@ public void addLongCounter(LongCounterMetricInstrument metricInstrument, long va
counter.add(value, attributes);
}

@Override
public void addLongUpDownCounter(LongUpDownCounterMetricInstrument metricInstrument, long value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about in OpenTelemetryMetricsSinkTest for the updown metric ? Existing tests also check for metric enabled and disabled cases.

@@ -75,7 +76,7 @@ public SubchannelMetrics(MetricRecorder metricRecorder) {
);
}

public void recordConnectionAttemptSucceeded(OtelMetricsAttributes labelSet) {
public void recordConnectionAttemptSucceeded(MetricsAttributes labelSet) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you consider inlining this class into InternalSubchannel. MetricsAttributes is awkward, as it isn't at all clear what values are used when. The caller is basically expected to know the implementation here, yet each method has at most two lines. We're not making anything simpler by adding this abstraction. Inlining these would seem to make things much more obvious. It's not like this has tests either; any change here will require a change to InternalSubchannelTest. I'd agree it is disappointing that registering the metrics is so long, but they aren't complicated lines. So complexity is low. (If we needed this API for real, I'd seriously consider having a separate class type passed into each of the methods.)

DisconnectError will end up needing to be public, in order to be used by transports. DisconnectError could be its own file, but really I'd put it in ManagedClientTransport, as it is really a part of the transportShutdown() API. But either way, this doesn't seem like a good place for it, as this class is just a helper for InternalSubchannel.

Copy link
Member

@ejona86 ejona86 Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more, having this utility with just a ton of String arguments is really little different than the metric recorder API itself. In both cases we have to line up the arguments. So we could keep this utility class and just pass in a bunch of arguments instead of trying to package it in MetricsAttributes.

Thinking even more, we can sort of have our cake and eat it too without making a bunch of builders. If use the "lots o arguments" approach, but we put comments in the calling code (InternalSubchannel) for each argument, we can see that the argument names match. And ErrorProne will check that the comments match the variable names in SubchannelMetrics. See https://errorprone.info/bugpattern/ParameterName

(There's still the question of "how much value does the utility provide" and we could still choose to ditch it. But if you do want to keep it, I think this turns out better for this very specific problem at hand.)

* The name of the locality that this EquivalentAddressGroup is in.
*/
public static final Attributes.Key<String> ATTR_LOCALITY_NAME =
Attributes.Key.create("io.grpc.lb.locality");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string here is just for debugging. We typically try to match the API where the value is accessible. See ATTR_AUTHORITY_OVERRIDE just above, for example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants