Skip to content

Conversation

@dfawley
Copy link
Member

@dfawley dfawley commented Sep 20, 2024

Fixes #7648 and #7507

The problem here was that the producer stream could attempt to start immediately after the channel went ready, but before ready was reported. The fix is to simply wait on the stateReadyChan even if the state is Ready at the start, which will proceed immediately if it's already closed. However, doing this requires some minor changes to managing the channel (only re-create when entering Ready and close when shutting down) and some test tweaks.

RELEASE NOTES: none

@dfawley dfawley added this to the 1.68 Release milestone Sep 20, 2024
@dfawley dfawley requested a review from arjan-bal September 20, 2024 00:20
@codecov
Copy link

codecov bot commented Sep 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.94%. Comparing base (1418e5e) to head (9860b13).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7651      +/-   ##
==========================================
+ Coverage   81.89%   81.94%   +0.04%     
==========================================
  Files         361      361              
  Lines       27818    27821       +3     
==========================================
+ Hits        22782    22798      +16     
+ Misses       3847     3837      -10     
+ Partials     1189     1186       -3     
Files with missing lines Coverage Δ
balancer_wrapper.go 89.10% <100.00%> (+1.84%) ⬆️
clientconn.go 93.36% <100.00%> (+0.63%) ⬆️

... and 18 files with indirect coverage changes

Copy link
Contributor

@arjan-bal arjan-bal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Sep 20, 2024
@arjan-bal
Copy link
Contributor

arjan-bal commented Sep 20, 2024

Couple of question:

  • For an LB policy to receive the subconn state update for READY before the producer gets a stream, we need to ensure all the ancestor LB policies on the path from the channel to the consumer LB policy handle the subconn state updates in a synchronous manner. Otherwise, some parent may buffer the subconn state update causing delays. This is not part of the StateListener doc comment

    // StateListener is called when the state of the subconn changes. If nil,
    // Balancer.UpdateSubConnState will be called instead. Will never be
    // invoked until after Connect() is called on the SubConn created with
    // these options.
    StateListener func(SubConnState)

    Is this an undocumented requirement?

  • The communication b/w the producer and the LB policy is not serialized with subchannel state updates. If a subchannel goes from READY->IDLE. The producer may report updates as it hasn't yet seen the transport failure. An LB policy which has seen the IDLE state update may think that the producer is reporting metrics before the subchannel is READY. Is this a problem? Maybe LB policies should not assume that producer updates are delivered only after the connectivity changes to READY?

@dfawley
Copy link
Member Author

dfawley commented Sep 20, 2024

Otherwise, some parent may buffer the subconn state update causing delays

Are you thinking this might be happening somewhere?

Generally, delivering events out of order would probably cause problems (another reason to prefer state-based updates? 😄). Delaying some events will cause things to be reordered, so we'd need to be really careful about this.

The communication b/w the producer and the LB policy is not serialized with subchannel state updates

I'm not sure how we could even synchronize this, unless literally everything is done in one serializer, including the producer receiving from its stream. Consider:

  1. Producer is running on a READY Subchannel
  2. Producer receives a result (e.g. an ORCA load report or a health check)
  3. Producer schedules an update through the serializer
  4. Immediately before (2), the subchannel changes to IDLE and it gets scheduled first.

I think the only way to keep things perfectly in sync would rely on the producer itself running inside the LB policy's serializer, which would require an API change for the producer. Maybe we need to think more about this long-term.

@dfawley dfawley merged commit 8ea3460 into grpc:master Sep 20, 2024
14 checks passed
@dfawley dfawley deleted the producerfix2 branch September 20, 2024 16:17
janardhanvissa pushed a commit to janardhanvissa/grpc-go that referenced this pull request Oct 1, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 20, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: Test/ProducerStreamStartsAfterReady

2 participants