Skip to content

Conversation

@mgravell
Copy link
Collaborator

@mgravell mgravell commented Mar 28, 2023

Adds automatic root/unroot; please read notes in the TimerToken class (in ConnectionMultiplexer.cs)

@mgravell
Copy link
Collaborator Author

Test failure is expecting RedisTimeoutException but getting RedisConnectionException with "The message timed out in the backlog attemtping to send because no connection became available" - not sure if this is my stuff or the stuff earlier in your branch, @NickCraver, but... I'm almost thinking we could legitimately change that RCE to a RTE; it is a timeout

@mgravell
Copy link
Collaborator Author

one test still failing (AbortOnConnectFailTests.DisconnectAndReconnectThrowsConnectionExceptionSync) - message has changed to:

UnableToConnect on 127.0.0.1:6379/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 100s, state: Connecting, mgr: 0 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.6.113.56944

@NickCraver you know more about this test than me; care to opine? You mentioned that this was one of the brittle ones; we seem to have accidentally made it non-brittle by making it reliably fail; should we handle both messages? or is this wrong?

// we can still process it to avoid making things worse/more complex,
// but: we can't reliably assume this works, so: shout now!
next.Cancel();
next.Complete();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A return might be missing here so that the cancelled message is not enqueued.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that; another version had throw. I'm not too concerned about writing it though - although nobody can ever see the reply. But if we were to do anything, I think it would need to be return false; and not write if we didn't enqueue. However, in reality, this should.never happen; the muxer should still be alive simply by the nature of being here. Another way we could fix it would be to pass the muxer upwards when doing this, and GC.KeepAlive before exiting.

Copy link
Collaborator Author

@mgravell mgravell Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to pre-empt the obvious "then why is this code here?": If I can't be absolutely sure we can't get into this scenario, I'd rather be paranoid; without the heartbeat, there are scenarios where messages might not be terminated. Emphasis: we should have a muxer: that's literally what the rest of the PR is meant to guarantee. Just if I'm wrong, I'd rather give people cancellation faults than give them nothing at all.

## Unreleased

No pending changes.
- Fix [#2392](https://github.com/StackExchange/StackExchange.Redis/issues/2392): critical (but rare) GC bug that can lead to async tasks never completing if the multiplexer is not held by the consumer ([#2413 by mgravell](https://github.com/StackExchange/StackExchange.Redis/pull/2413))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that #2392 is a different issue, the issue fixed by this PR was discovered by @NickCraver while working on PR #2408.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - we'll refine release notes here, want to test with all combined manually

@NickCraver NickCraver changed the title proposed fix for GC rooting; extensive notes on how/why are in TimerToken Proposed fix for GC rooting + extensive notes on how/why are in TimerToken Mar 28, 2023
@NickCraver
Copy link
Collaborator

NickCraver commented Mar 28, 2023

@mgravell +1 on test changes - I think that's fine, the goal of that specific test set isn't the type thrown (that's checked very explicitly in the exception tests), so being more lenient there is perfectly fine by me.

@NickCraver
Copy link
Collaborator

Status: my only concern here (hopefully unfounded) is lock perf - we're going to run some tests on that front and then should be good to merge in - I'll update release docs and all that goodness here but not imperative in this PR since it's going into the other.

@mgravell
Copy link
Collaborator Author

mgravell commented Mar 28, 2023

Oh, right; forgot I wasn't merging to main! My PR number will be confusing in the release notes; we should fix that in the final outer PR

@NickCraver NickCraver merged commit 4ce4797 into craver/timeout-error-improvements Mar 29, 2023
@NickCraver NickCraver deleted the marc/fix-rooting branch March 29, 2023 11:44
NickCraver added a commit that referenced this pull request Mar 30, 2023
…fixes for outstanding scenarios (#2408)

This combination PR is both fixing a GC issue (see below, and #2413 for details) and improves timeout exception. Basically if a timeout happens for a message that was in the backlog but was never sent, the user now gets a much more informative message like this:

> Exception: The message timed out in the backlog attempting to send because no connection became available - Last Connection Exception: InternalFailure on 127.0.0.1:6379/Interactive, Initializing/NotStarted, last: GET, origin: ConnectedAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 500s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, last-mbeat: 0s ago, global: 0s ago, v: 2.6.99.22667, command=PING, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 1, serverEndpoint: 127.0.0.1:6379, conn-sec: n/a, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: NAMISTOU-3(SE.Redis-v2.6.99.22667), IOCP: (Busy=0,Free=1000,Min=32,Max=1000), WORKER: (Busy=2,Free=32765,Min=32,Max=32767), POOL: (Threads=18,QueuedItems=0,CompletedItems=65), v: 2.6.99.22667 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Today, this isn't intuitive especially for connections with `AbortOnConnectFail` set to `false`. What happens is a multiplexer _never_ connects successfully, but the user just gets generic timeouts. This makes the error more specific and includes the inner exception (also as `.InnerException`) for more details, informing the user of a config/auth/whatever error underneath as to why things are never successfully sending.

Also adds `aoc: (0|1)` to the exception message for easier advice in issues (reflecting what `AbortOnConnectFail` is set to).

Co-authored-by: Nick Craver <[email protected]>
Co-authored-by: Marc Gravell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants