Conversation


@ulascansenturk ulascansenturk commented Nov 19, 2025

RELEASE NOTES: N/A
fixes #8700


codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.17%. Comparing base (50c6321) to head (890acb3).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
stats/opentelemetry/client_tracing.go 78.57% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8715      +/-   ##
==========================================
- Coverage   83.21%   83.17%   -0.04%     
==========================================
  Files         419      419              
  Lines       32427    32453      +26     
==========================================
+ Hits        26985    26994       +9     
- Misses       4054     4067      +13     
- Partials     1388     1392       +4     
Files with missing lines Coverage Δ
stats/opentelemetry/opentelemetry.go 73.07% <ø> (ø)
stats/opentelemetry/trace.go 90.32% <ø> (-1.99%) ⬇️
stats/opentelemetry/client_tracing.go 86.36% <78.57%> (-2.73%) ⬇️

... and 23 files with indirect coverage changes


@eshitachandwani
Member

Hi @ulascansenturk ,

Thank you for submitting this PR; we appreciate your contribution.

We already had an existing PR (#8342) that addressed this issue and was previously merged. However, we had to revert it because it introduced a flaky test: Test/TraceSpan_WithRetriesAndNameResolutionDelay. The failed run is linked here for context: [failed run link].

Our ideal path forward is to cherry-pick the original PR (#8342) and include a fix for the flaky test in a follow-up commit. This ensures the original author receives proper credit.

The main requirement now is to identify and fix the cause of the test flakiness. Could you please investigate this and submit the necessary fix? Your help in resolving this would be greatly appreciated.

vinothkumarr227 and others added 2 commits November 20, 2025 13:51
Fixes: grpc#8299

RELEASE NOTES:

- stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are
now recorded as span attributes for non-transparent client retries.
…tionDelay

This commit fixes the flaky test TestTraceSpan_WithRetriesAndNameResolutionDelay
which was introduced in the previous commit and caused PR grpc#8342 to be reverted.

Root Cause:
The test had race conditions related to timing:
1. The goroutine that updates resolver state could complete before or after
   the delayed resolution event was fully processed and recorded in spans
2. Span export timing was not synchronized with test validation, causing
   the test to sometimes check spans before they were fully exported

Fix:
1. Added 'stateUpdated' event to synchronize between the resolver state
   update completing and span validation beginning
2. Added explicit wait for the stateUpdated event before validating spans
3. Added a 50ms sleep after RPC completion to give the span exporter time
   to process and export all spans before validation

Testing:
- Test now passes consistently (10+ consecutive runs)
- Passes with race detector enabled (-race flag)
- No data races detected

Fixes grpc#8700
@ulascansenturk ulascansenturk force-pushed the fix/8700-otel-retry-per-call-tracking branch from 9b6371e to b1732c6 Compare November 20, 2025 10:54
@ulascansenturk
Contributor Author


Hi @eshitachandwani,

Thank you for the guidance! I've restructured the PR as requested:

  1. Cherry-picked the original PR #8342 (stats/opentelemetry: record retry attempts from clientStream), preserving @vinothkumarr227's authorship
  2. Fixed the flaky test in a follow-up commit

Root cause: Timing race conditions between resolver state updates and span validation.
Verified with 10+ test runs and the race detector; no flakiness detected.

// Give the span exporter a small amount of time to process and export
// all spans from the completed RPC. This reduces flakiness by ensuring
// all trace events have been fully recorded before validation.
time.Sleep(50 * time.Millisecond)
Member
I don't think we need this, since waitForTraceSpans already waits for the spans until the context expires.

@arjan-bal
Contributor

Background

The failed test logs mention the following:

e2e_test.go:298: Spans mismatch (-want +got):
                  []opentelemetry_test.traceSpanInfo(Inverse(cmpopts.SortSlices, []opentelemetry_test.traceSpanInfo{
                  	{
                  		spanKind: "internal",
                  		name:     "Attempt.grpc.testing.TestService.FullDuplexCall",
                  		events: []trace.Event{
                - 			{Name: "Delayed LB pick complete"},
                  			{Name: "Outbound message", Attributes: {{Key: "sequence-number", Value: {...}}, {Key: "message-size", Value: {...}}}},
                  		},
                  		attributes: {{Key: "previous-rpc-attempts", Value: {...}}, {Key: "transparent-retry", Value: {...}}},
                  		status:     s"Error",
                  	},
                  	{spanKind: "internal", name: "Attempt.grpc.testing.TestService.FullDuplexCall", events: {{Name: "Outbound message", Attributes: {{Key: "sequence-number", Value: {...}}, {Key: "message-size", Value: {...}}}}}, attributes: {{Key: "previous-rpc-attempts", Value: {...}}, {Key: "transparent-retry", Value: {...}}}, ...},
                  	{spanKind: "internal", name: "Attempt.grpc.testing.TestService.FullDuplexCall", events: {{Name: "Outbound message", Attributes: {{Key: "sequence-number", Value: {...}}, {Key: "message-size", Value: {...}}}}}, attributes: {{Key: "previous-rpc-attempts", Value: {...}}, {Key: "transparent-retry", Value: {...}}}, ...},
                  	... // 4 identical elements
                  }))

The event Delayed LB pick complete was expected, but is missing. This event is added by the OpenTelemetry plugin when DelayedPickComplete is reported by gRPC.

In gRPC, the logic to emit this event relies on the pickBlocked variable being set to true here:

grpc-go/picker_wrapper.go

Lines 105 to 201 in a764d3f

func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.PickInfo) (pick, error) {
	var ch chan struct{}
	var lastPickErr error
	pickBlocked := false
	for {
		pg := pw.pickerGen.Load()
		if pg == nil {
			return pick{}, ErrClientConnClosing
		}
		if pg.picker == nil {
			ch = pg.blockingCh
		}
		if ch == pg.blockingCh {
			// This could happen when either:
			// - pw.picker is nil (the previous if condition), or
			// - we have already called pick on the current picker.
			select {
			case <-ctx.Done():
				var errStr string
				if lastPickErr != nil {
					errStr = "latest balancer error: " + lastPickErr.Error()
				} else {
					errStr = fmt.Sprintf("%v while waiting for connections to become ready", ctx.Err())
				}
				switch ctx.Err() {
				case context.DeadlineExceeded:
					return pick{}, status.Error(codes.DeadlineExceeded, errStr)
				case context.Canceled:
					return pick{}, status.Error(codes.Canceled, errStr)
				}
			case <-ch:
			}
			continue
		}
		// If the channel is set, it means that the pick call had to wait for a
		// new picker at some point. Either it's the first iteration and this
		// function received the first picker, or a picker errored with
		// ErrNoSubConnAvailable or errored with failfast set to false, which
		// will trigger a continue to the next iteration. In the first case this
		// conditional will hit if this call had to block (the channel is set).
		// In the second case, the only way it will get to this conditional is
		// if there is a new picker.
		if ch != nil {
			pickBlocked = true
		}
		ch = pg.blockingCh
		p := pg.picker
		pickResult, err := p.Pick(info)
		if err != nil {
			if err == balancer.ErrNoSubConnAvailable {
				continue
			}
			if st, ok := status.FromError(err); ok {
				// Status error: end the RPC unconditionally with this status.
				// First restrict the code to the list allowed by gRFC A54.
				if istatus.IsRestrictedControlPlaneCode(st) {
					err = status.Errorf(codes.Internal, "received picker error with illegal status: %v", err)
				}
				return pick{}, dropError{error: err}
			}
			// For all other errors, wait for ready RPCs should block and other
			// RPCs should fail with unavailable.
			if !failfast {
				lastPickErr = err
				continue
			}
			return pick{}, status.Error(codes.Unavailable, err.Error())
		}
		acbw, ok := pickResult.SubConn.(*acBalancerWrapper)
		if !ok {
			logger.Errorf("subconn returned from pick is type %T, not *acBalancerWrapper", pickResult.SubConn)
			continue
		}
		if t := acbw.ac.getReadyTransport(); t != nil {
			if channelz.IsOn() {
				doneChannelzWrapper(acbw, &pickResult)
			}
			return pick{transport: t, result: pickResult, blocked: pickBlocked}, nil
		}
		if pickResult.Done != nil {
			// Calling done with nil error, no bytes sent and no bytes received.
			// DoneInfo with default value works.
			pickResult.Done(balancer.DoneInfo{})
		}
		logger.Infof("blockingPicker: the picked transport is not ready, loop back to repick")
		// If ok == false, ac.state is not READY.
		// A valid picker always returns READY subConn. This means the state of ac
		// just changed, and picker will be updated shortly.
		// continue back to the beginning of the for loop to repick.
	}
}

Repro

To reproduce the failure, you can simply add a time.Sleep(20 * time.Millisecond) at the very top of the pick method linked above.

func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.PickInfo) (pick, error) {
+	time.Sleep(20 * time.Millisecond)
	var ch chan struct{}

The Problem

When an RPC is made, there's a race happening:

  1. The Load Balancing policy (pickfirst here) gets the list of addresses from the resolver, creates subchannels and updates the picker once the subchannel is ready.
  2. At the same time, the pick method above tries to atomically load the current picker.

If 1. above wins the race, the DelayedPickComplete event will be missing and the test fails. If 2. wins the race, pickBlocked is set and the pick method waits for the next picker to be generated.

The Solution

To solve this race, the test needs to ensure that a non-blocking picker is not available the first time the pick method loads the picker. A blocking picker is one that returns the specific error balancer.ErrNoSubConnAvailable: it tells the channel to wait for the next picker to be produced rather than fail the RPC immediately.
One solution I can think of is to write a stub balancer that initially returns a blocking picker. Once the channel calls Pick on the blocking picker, the test knows that DelayedPickComplete will be emitted, and the stub balancer can then produce a non-blocking picker.

The previous fix removed time.Sleep based on review feedback that
waitForTraceSpans should be sufficient. However, waitForTraceSpans only
checked for span presence (name + kind), not span completion.

This led to test flakiness where spans were captured before all their
events and status were fully recorded.

Fix: Enhanced waitForTraceSpans to also check that spans have ended
(non-zero EndTime) before returning, ensuring all span data is complete
before validation.

Testing:
- Passes 10+ consecutive runs without flakiness
- Passes with -race flag
- All opentelemetry tests pass

Development

Successfully merging this pull request may close these issues.

otel: retries must be tracked per-call and not per-attempt

4 participants