Skip to content

Cluster Connect works partially, with a full timeout. #2251

@iteplov

Description

@iteplov

AWS Redis 6.2.5 cluster 2 shards, each with replica, with a config endpoint. FWIW, the config endpoint is a DNS alias of a random cluster node, I believe.

Originally, a .NET Core app uses SE.Redis 2.2.88, configured with that single config endpoint. Everything works.

After upgrading to SE.Redis 2.6.66, random Redis commands started to return non-sensical errors that were reproducible somewhat in 50% of the cases.

E.g. a simple transaction with a single operation against a single random key would give EXECABORT in 50% of the runs. (was not the root issue but the specific reason for that was that in 50% tx would be sent to the server as an empty "multi"/"exec" pair, i.e. missing the actual op, according to redis-cli monitor).

Additionally, the time of the initial ConnectionMultiplexer.Connect would be always around ConnectTimeout, regardless of the value. Very much similar to what is described in this issue here.

All errors/delays would disappear if all 4 endpoints would be spelled out in configuration, i.e. the issue would take place only if the muxer has to discover remaining endpoints from the provided one.

The described issues were reproduced in tests, so what the app does and where it runs seems irrelevant. The configuration is mostly vanilla (dumped below) and didn't change during the upgrade. The app uses custom SSL callback that allows "*.amazonaws.com" certs, not sure if that makes a difference here.

Logs

SE.Redis 2.6.66 connection sequence (ConnectTimeout = 30s), from ConnectionMultplexer.Connect

**** Connecting...
01:32:28.2322: Connecting (sync) on .NET 6.0.7 (StackExchange.Redis: v2.6.66.47313)
01:32:28.3465: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380,keepAlive=60,syncTimeout=5000,asyncTimeout=5000,allowAdmin=False,version=5.0,connectTimeout=30000,password=*****,tiebreaker=__Booksleeve_TieBreak,ssl=True,sslHost=localhost,configChannel=__Booksleeve_MasterChanged,abortConnect=False,resolveDns=False,connectRetry=3,proxy=None,configCheckSeconds=60,defaultDatabase=0
01:32:28.3548: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Connecting...
01:32:28.3777: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: BeginConnectAsync
01:32:28.4253: 1 unique nodes specified (with tiebreaker)
01:32:28.4301: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:28.4350: Allowing 1 endpoint(s) 00:00:30 to respond...
01:32:28.4446: Awaiting 1 available task completion(s) for 30000ms, IOCP: (Busy=0,Free=1000,Min=12,Max=1000), WORKER: (Busy=1,Free=32766,Min=12,Max=32767), POOL: (Threads=5,QueuedItems=0,CompletedItems=5)
01:32:28.5670: Configuring TLS
01:32:28.7636: TLS connection established successfully using protocol: Tls12
01:32:28.7692: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Connected 
01:32:28.7768: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Server handshake
01:32:28.7811: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Authenticating (password)
01:32:28.8240: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Setting client name: W-ITEPLO-TJ5N7(SE.Redis-v2.6.66.47313)
01:32:28.8331: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configuring...
01:32:28.8421: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Requesting tie-break (Key="__Booksleeve_TieBreak")...
01:32:28.8506: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Writing: GET __Booksleeve_TieBreak
01:32:28.8588: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Sending critical tracer (handshake): ECHO
01:32:28.8620: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive: Writing: ECHO
01:32:28.8711: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Flushing outbound buffer
01:32:28.8761: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: OnEstablishingAsync complete
01:32:28.8791: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Starting read
01:32:28.9559: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configured (INFO) role: primary
01:32:28.9617: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configured (INFO) version: 6.2.5
01:32:28.9678: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Auto-configured (INFO) server-type: cluster
01:32:29.0186: Response from clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive / GET __Booksleeve_TieBreak: Error: MOVED 6202 services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380
01:32:29.0252: Response from clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380/Interactive / ECHO: BulkString: 16 bytes
01:32:29.0342: clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync completed (From command: ECHO)
01:32:29.0390: All 1 available tasks completed cleanly, IOCP: (Busy=1,Free=999,Min=12,Max=1000), WORKER: (Busy=2,Free=32765,Min=12,Max=32767), POOL: (Threads=10,QueuedItems=0,CompletedItems=20)
01:32:29.0441: Endpoint summary:
01:32:29.0482:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:29.0536:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Returned with success as Cluster primary (Source: From command: ECHO)
01:32:29.1106: services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1156: services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1202: services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1236: services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: OnConnectedAsync init (State=Connecting)
01:32:29.1273: Allowing 4 endpoint(s) 00:00:29.3080000 to respond...
01:32:29.1315: Awaiting 4 available task completion(s) for 29308ms, IOCP: (Busy=0,Free=1000,Min=12,Max=1000), WORKER: (Busy=2,Free=32765,Min=12,Max=32767), POOL: (Threads=10,QueuedItems=0,CompletedItems=26)
01:32:58.4418: Not all available tasks completed cleanly (from ReconfigureAsync#1284, timeout 29308ms), IOCP: (Busy=0,Free=1000,Min=12,Max=1000), WORKER: (Busy=2,Free=32765,Min=12,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=73)
01:32:58.4475:   Server[0] (services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 12, qs: 0, in: 0, qu: 12, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4510:   Server[1] (services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 11, qs: 0, in: 0, qu: 11, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4550:   Server[2] (services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 12, qs: 0, in: 0, qu: 12, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4591:   Server[3] (services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380) Status: WaitingForActivation (inst: 11, qs: 0, in: 0, qu: 11, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
01:32:58.4632: Endpoint summary:
01:32:58.4668:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4737:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4795:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4833:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Endpoint is ConnectedEstablished
01:32:58.4875:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.4908:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.4940:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.4969:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Did not respond
01:32:58.5004: Cluster: 16384 of 16384 slots covered
01:32:58.5055: Endpoint Summary:
01:32:58.5101:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
01:32:58.5155:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
01:32:58.5243:   clustercfg.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
01:32:58.5276:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5302:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5332:   services-0001-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub:
01:32:58.5361:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5385:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5409:   services-0002-002.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub:
01:32:58.5432:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5456:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5481:   services-0002-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub:
01:32:58.5510:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Cluster v6.2.5, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
01:32:58.5535:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
01:32:58.5560:   services-0001-001.services.rfikkv.usw2.cache.amazonaws.com:6380: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub:
01:32:58.5587: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
01:32:58.5610: Starting heartbeat...
01:32:58.5643: Total connect time: 30,332 ms
**** Connected

Theory

After debugging for some time, I believe the issue was introduced in #1947. Specifically this:

The connection completion handler will now fire when the second bridge/connection completes, this means we won't have interactive connected but subscription in an unknown state - both are connected before we fire the handler meaning the moment we come back from connect, subscriptions are in business.

Even more specifically, this line, that was later slightly updated in #2001, but not in a way to help here (all my cluster nodes use subscriptions).

Super-specifically, during activation of the discovered endpoints, the connection sequence is awaiting on these tasks. These are TCS tasks that are set completed when this happens.

And this now (well, since 2.5.*) requires a Subscription PhysicalBridge present on that endpoint. For the life of me I don't see it being created during the Reconfigure flow anywhere but here, which clearly happens only for the initially provided endpoints.

$1M Question

How come it's been out for that long and nobody noticed? Things go rather bad in this partially connected state. So I mostly hope that my theory is missing something obvious and all of that is fixable with a simple option flip.

Thanks for looking into this and all your work!

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions