Conversation


@ulascansenturk ulascansenturk commented Nov 19, 2025

RELEASE NOTES: N/A
fixes #8700


codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.17%. Comparing base (50c6321) to head (890acb3).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
stats/opentelemetry/client_tracing.go 78.57% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8715      +/-   ##
==========================================
- Coverage   83.21%   83.17%   -0.04%     
==========================================
  Files         419      419              
  Lines       32427    32453      +26     
==========================================
+ Hits        26985    26994       +9     
- Misses       4054     4067      +13     
- Partials     1388     1392       +4     
Files with missing lines Coverage Δ
stats/opentelemetry/opentelemetry.go 73.07% <ø> (ø)
stats/opentelemetry/trace.go 90.32% <ø> (-1.99%) ⬇️
stats/opentelemetry/client_tracing.go 86.36% <78.57%> (-2.73%) ⬇️

... and 23 files with indirect coverage changes


@eshitachandwani
Member

Hi @ulascansenturk ,

Thank you for submitting this PR; we appreciate your contribution.

We already had an existing PR (#8342) that addressed this issue and was previously merged. However, we had to revert it because it introduced a flaky test: Test/TraceSpan_WithRetriesAndNameResolutionDelay. The failed run is linked here for context: [failed run link].

Our ideal path forward is to cherry-pick the original PR (#8342) and include a fix for the flaky test in a follow-up commit. This ensures the original author receives proper credit.

The main requirement now is to identify and fix the cause of the test flakiness. Could you please investigate this and submit the necessary fix? Your help in resolving this would be greatly appreciated.

vinothkumarr227 and others added 2 commits November 20, 2025 13:51
Fixes: grpc#8299

RELEASE NOTES:

- stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are
now recorded as span attributes for non-transparent client retries.
…tionDelay

This commit fixes the flaky test TestTraceSpan_WithRetriesAndNameResolutionDelay
which was introduced in the previous commit and caused PR grpc#8342 to be reverted.

Root Cause:
The test had race conditions related to timing:
1. The goroutine that updates resolver state could complete before or after
   the delayed resolution event was fully processed and recorded in spans
2. Span export timing was not synchronized with test validation, causing
   the test to sometimes check spans before they were fully exported

Fix:
1. Added 'stateUpdated' event to synchronize between the resolver state
   update completing and span validation beginning
2. Added explicit wait for the stateUpdated event before validating spans
3. Added a 50ms sleep after RPC completion to give the span exporter time
   to process and export all spans before validation

Testing:
- Test now passes consistently (10+ consecutive runs)
- Passes with race detector enabled (-race flag)
- No data races detected

Fixes grpc#8700
@ulascansenturk ulascansenturk force-pushed the fix/8700-otel-retry-per-call-tracking branch from 9b6371e to b1732c6 Compare November 20, 2025 10:54
@ulascansenturk
Contributor Author


Hi @eshitachandwani,

Thank you for the guidance! I've restructured the PR as requested:

  1. Cherry-picked the original PR #8342 (stats/opentelemetry: record retry attempts from clientStream), preserving @vinothkumarr227's authorship
  2. Fixed the flaky test in a follow-up commit

Root cause: Timing race conditions between resolver state updates and span validation.
Verified with 10+ test runs and the race detector; no flakiness detected.

// Give the span exporter a small amount of time to process and export
// all spans from the completed RPC. This reduces flakiness by ensuring
// all trace events have been fully recorded before validation.
time.Sleep(50 * time.Millisecond)
Member
I don't think we need this, since waitForTraceSpans already waits for the spans until the context expires.

@arjan-bal
Contributor

Background

The failed test logs mention the following:

e2e_test.go:298: Spans mismatch (-want +got):
                  []opentelemetry_test.traceSpanInfo(Inverse(cmpopts.SortSlices, []opentelemetry_test.traceSpanInfo{
                  	{
                  		spanKind: "internal",
                  		name:     "Attempt.grpc.testing.TestService.FullDuplexCall",
                  		events: []trace.Event{
                - 			{Name: "Delayed LB pick complete"},
                  			{Name: "Outbound message", Attributes: {{Key: "sequence-number", Value: {...}}, {Key: "message-size", Value: {...}}}},
                  		},
                  		attributes: {{Key: "previous-rpc-attempts", Value: {...}}, {Key: "transparent-retry", Value: {...}}},
                  		status:     s"Error",
                  	},
                  	{spanKind: "internal", name: "Attempt.grpc.testing.TestService.FullDuplexCall", events: {{Name: "Outbound message", Attributes: {{Key: "sequence-number", Value: {...}}, {Key: "message-size", Value: {...}}}}}, attributes: {{Key: "previous-rpc-attempts", Value: {...}}, {Key: "transparent-retry", Value: {...}}}, ...},
                  	{spanKind: "internal", name: "Attempt.grpc.testing.TestService.FullDuplexCall", events: {{Name: "Outbound message", Attributes: {{Key: "sequence-number", Value: {...}}, {Key: "message-size", Value: {...}}}}}, attributes: {{Key: "previous-rpc-attempts", Value: {...}}, {Key: "transparent-retry", Value: {...}}}, ...},
                  	... // 4 identical elements
                  }))

The event Delayed LB pick complete was expected, but is missing. This event is added by the OpenTelemetry plugin when DelayedPickComplete is reported by gRPC.

In gRPC, the logic to emit this event relies on the pickBlocked variable being set to true here:

grpc-go/picker_wrapper.go

Lines 105 to 201 in a764d3f

func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.PickInfo) (pick, error) {
	var ch chan struct{}
	var lastPickErr error
	pickBlocked := false
	for {
		pg := pw.pickerGen.Load()
		if pg == nil {
			return pick{}, ErrClientConnClosing
		}
		if pg.picker == nil {
			ch = pg.blockingCh
		}
		if ch == pg.blockingCh {
			// This could happen when either:
			// - pw.picker is nil (the previous if condition), or
			// - we have already called pick on the current picker.
			select {
			case <-ctx.Done():
				var errStr string
				if lastPickErr != nil {
					errStr = "latest balancer error: " + lastPickErr.Error()
				} else {
					errStr = fmt.Sprintf("%v while waiting for connections to become ready", ctx.Err())
				}
				switch ctx.Err() {
				case context.DeadlineExceeded:
					return pick{}, status.Error(codes.DeadlineExceeded, errStr)
				case context.Canceled:
					return pick{}, status.Error(codes.Canceled, errStr)
				}
			case <-ch:
			}
			continue
		}
		// If the channel is set, it means that the pick call had to wait for a
		// new picker at some point. Either it's the first iteration and this
		// function received the first picker, or a picker errored with
		// ErrNoSubConnAvailable or errored with failfast set to false, which
		// will trigger a continue to the next iteration. In the first case this
		// conditional will hit if this call had to block (the channel is set).
		// In the second case, the only way it will get to this conditional is
		// if there is a new picker.
		if ch != nil {
			pickBlocked = true
		}
		ch = pg.blockingCh
		p := pg.picker
		pickResult, err := p.Pick(info)
		if err != nil {
			if err == balancer.ErrNoSubConnAvailable {
				continue
			}
			if st, ok := status.FromError(err); ok {
				// Status error: end the RPC unconditionally with this status.
				// First restrict the code to the list allowed by gRFC A54.
				if istatus.IsRestrictedControlPlaneCode(st) {
					err = status.Errorf(codes.Internal, "received picker error with illegal status: %v", err)
				}
				return pick{}, dropError{error: err}
			}
			// For all other errors, wait for ready RPCs should block and other
			// RPCs should fail with unavailable.
			if !failfast {
				lastPickErr = err
				continue
			}
			return pick{}, status.Error(codes.Unavailable, err.Error())
		}
		acbw, ok := pickResult.SubConn.(*acBalancerWrapper)
		if !ok {
			logger.Errorf("subconn returned from pick is type %T, not *acBalancerWrapper", pickResult.SubConn)
			continue
		}
		if t := acbw.ac.getReadyTransport(); t != nil {
			if channelz.IsOn() {
				doneChannelzWrapper(acbw, &pickResult)
			}
			return pick{transport: t, result: pickResult, blocked: pickBlocked}, nil
		}
		if pickResult.Done != nil {
			// Calling done with nil error, no bytes sent and no bytes received.
			// DoneInfo with default value works.
			pickResult.Done(balancer.DoneInfo{})
		}
		logger.Infof("blockingPicker: the picked transport is not ready, loop back to repick")
		// If ok == false, ac.state is not READY.
		// A valid picker always returns READY subConn. This means the state of ac
		// just changed, and picker will be updated shortly.
		// continue back to the beginning of the for loop to repick.
	}
}

Repro

To reproduce the failure, you can simply add a time.Sleep(20 * time.Millisecond) at the very top of the pick method linked above.

func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.PickInfo) (pick, error) {
+	time.Sleep(20 * time.Millisecond)
	var ch chan struct{}

The Problem

When an RPC is made, there's a race happening:

  1. The Load Balancing policy (pickfirst here) gets the list of addresses from the resolver, creates subchannels and updates the picker once the subchannel is ready.
  2. At the same time, the pick method above tries to atomically load the current picker.

If 1. above wins the race, the DelayedPickComplete event will be missing and the test fails. If 2. wins the race, pickBlocked is set and the pick method waits for the next picker to be generated.

The Solution

To solve this race, the test needs to ensure that a non-blocking picker is not available the first time the pick method loads the picker. A blocking picker is one that returns the specific error balancer.ErrNoSubConnAvailable: it tells the channel to wait for the next picker to be produced rather than fail the RPC immediately.
One solution I can think of is to write a stub balancer that initially returns a blocking picker. Once the channel calls Pick on the blocking picker, the test knows that DelayedPickComplete will be emitted, and the stub balancer can then produce a non-blocking picker.

The previous fix removed time.Sleep based on review feedback that
waitForTraceSpans should be sufficient. However, waitForTraceSpans only
checked for span presence (name + kind), not span completion.

This led to test flakiness where spans were captured before all their
events and status were fully recorded.

Fix: Enhanced waitForTraceSpans to also check that spans have ended
(non-zero EndTime) before returning, ensuring all span data is complete
before validation.

Testing:
- Passes 10+ consecutive runs without flakiness
- Passes with -race flag
- All opentelemetry tests pass

Development

Successfully merging this pull request may close these issues.

otel: retries must be tracked per-call and not per-attempt

4 participants