Fix MOVED errors by not randomly selecting another node #1001

Closed

Conversation

thehydroimpulse commented Mar 30, 2019

After some investigation we noticed a lot of MOVED errors that always seemed to happen when we received i/o timeouts. The original request selects the correct master node (for writes) and the correct master/slave (for reads); however, if an i/o timeout or other retryable error is received, a random node is used after two attempts on the original node. This seems wrong, as a random node will never be suitable to serve the request. Redis will most likely always send back MOVED errors as the random node is either not a master, not the right master, or not the right slave.

For writes, you cannot attempt the request on any node other than the currently selected master. The best path here is probably to respect the redirect configuration and keep retrying on that node.

For reads, we can check whether ReadOnly is set and retry on another slave. Otherwise we have to behave the same as for writes.
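
Roughly, the selection on a retryable error could look something like the sketch below (made-up types, not the actual go-redis internals):

```go
package cluster

// Minimal stand-in types for the cluster client's internal state; the real
// go-redis code differs, this only sketches the proposed selection logic.
type node struct{ addr string }

type clusterState struct {
	masters map[int]*node   // slot -> master node
	slaves  map[int][]*node // slot -> replica nodes
}

// retryNode picks the node to retry on after a retryable error (e.g. an i/o
// timeout), instead of falling back to a random node.
func retryNode(state *clusterState, slot int, isRead, readOnly bool, failed *node) *node {
	if isRead && readOnly {
		// Reads with ReadOnly set: prefer another replica serving the same slot.
		for _, slave := range state.slaves[slot] {
			if slave != failed {
				return slave
			}
		}
	}
	// Writes (and reads without ReadOnly): only the slot's master can serve
	// the request, so keep retrying it rather than hopping to a random node.
	return state.masters[slot]
}
```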

Let me know if something along these lines would be acceptable and I can add some tests.

@thehydroimpulse
Author

@vmihailenco thoughts?

@vmihailenco
Collaborator

Redis will most likely always send back MOVED errors as the random node is either not a master, not the right master, or not the right slave.

True, and go-redis will follow the MOVED/ASK error and send the request to the right node on the next retry.
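
For reference, the redirect is driven by the error text Redis returns (e.g. `MOVED 3999 127.0.0.1:6381`); following it boils down to something like this sketch (not the actual go-redis code):

```go
package cluster

import (
	"fmt"
	"strings"
)

// parseRedirect extracts the slot and target address from a Redis redirect
// error such as "MOVED 3999 127.0.0.1:6381" or "ASK 3999 127.0.0.1:6381".
// On the next attempt the client sends the command to that address instead
// of the node it originally picked.
func parseRedirect(msg string) (slot, addr string, err error) {
	parts := strings.Fields(msg)
	if len(parts) != 3 || (parts[0] != "MOVED" && parts[0] != "ASK") {
		return "", "", fmt.Errorf("not a redirect error: %q", msg)
	}
	return parts[1], parts[2], nil
}
```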

What you are proposing makes sense, but:

  • It will not fix your code - requests will still time out. This PR only increases the number of retries.
  • It will break cases when a node is down and go-redis has an outdated cluster topology.
  • Overall, the code becomes slightly harder to follow/read.

@thehydroimpulse
Author

It will not fix your code - requests will still time out. This PR only increases the number of retries.

It fixes it by not creating artificial MOVED errors. Networking in AWS is pretty flaky, and we were getting consistent retryable errors that led to forced MOVED errors for no reason at all. This obviously won't fix the retryable errors (i/o timeouts), but it will prevent us from wasting more round trips to Redis.

Moreover, without this fix the end behaviour isn't any different when a node is timing out every request: the node may keep timing out, but a random hop is performed that just delays the inevitable return to the same original node.

It will break cases when a node is down and go-redis has an outdated cluster topology.

The code already seems to lazily reload the state if any error has occurred, so I don't think that's the case. Moreover, I don't think it's ideal to depend on randomly selecting nodes to detect whether a node is down and the topology has changed.
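
Something along these lines (a minimal sketch with hypothetical types, not the actual go-redis state holder):

```go
package cluster

import "sync/atomic"

// Sketch of lazy topology reloading: a command error only marks the cached
// cluster state as stale, and the next request triggers a reload. These are
// hypothetical types, not the actual go-redis state holder.
type stateHolder struct {
	stale  int32
	reload func() error // refetches CLUSTER SLOTS and rebuilds the slot map
}

// markStale is called when a command fails; it does not reload immediately.
func (h *stateHolder) markStale() { atomic.StoreInt32(&h.stale, 1) }

// get reloads the slot map only if a previous error flagged it as stale.
func (h *stateHolder) get() error {
	if atomic.CompareAndSwapInt32(&h.stale, 1, 0) {
		return h.reload()
	}
	return nil
}
```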

Thoughts?

greggjs (Contributor) commented Apr 18, 2019

I am experiencing this problem as well, and I believe this would fix the issue. @vmihailenco, this PR is not about fixing timeouts in the Redis client; it is about making retries in a clustered Redis actually hit the nodes that store the information you're trying to operate on. We might as well speed up this process instead of wasting retries asking around for where the information is.

@thehydroimpulse
Author

@vmihailenco we've hit this issue again in production; is there any way we can get this merged?

@vmihailenco
Collaborator

Just to make this super clear:

Current behavior:

  1. same node - timeout
  2. same node - timeout
  3. try random node - redirect to correct node
  4. correct node - timeout
  5. try random node - redirect to correct node
  6. correct node - timeout
  7. try random node - redirect to correct node
  8. correct node - timeout

Suggested behavior:

  1. same node - timeout
  2. same node - timeout
  3. same node - timeout
  4. same node - timeout
  5. same node - timeout
  6. same node - timeout
  7. same node - timeout
  8. same node - timeout

I don't like this change because:

  • I don't believe it fixes the real problem (or you haven't explained what the real problem is). The AWS network is good enough, and unless you are using super low read/write timeouts, all should be good. And I am not sure that using super low timeouts is such a good idea if your network latency is bad / not reliable.
  • It breaks the case when go-redis has an incorrect cluster topology (e.g. after a fail-over). The current behaviour is more intelligent (but still simple) than just hitting the same server over and over again, and it uses a Redis Cluster feature designed specifically for this case.

thehydroimpulse (Author) commented Jun 13, 2019

AWS network is good enough

At scale, cloud networks are generally unreliable. Thus, with the current behaviour, we consistently get incorrect MOVED errors, and this gets pretty bad.

The suggested behaviour is to retry on the same node, but potentially with a different connection. We currently set the retry limit to 3 because 8 is way too many attempts; essentially, the best behaviour is to fail fast. The current behaviour doesn't actually give any benefit other than causing MOVED errors.
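
For reference, that limit is the one set when constructing the cluster client, roughly like this (illustrative values; I'm assuming MaxRedirects is the knob in question):

```go
package main

import "github.com/go-redis/redis"

func main() {
	// Illustrative cluster client configuration: MaxRedirects bounds how many
	// times a command may be retried/redirected, and ReadOnly allows reads to
	// be served by replicas. The addresses and values here are examples only.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:        []string{"127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002"},
		MaxRedirects: 3,
		ReadOnly:     true,
	})
	defer rdb.Close()
}
```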

MOVED errors should never happen unless the cluster has failed over. That's when you should refresh the cluster topology.

And I am not sure that using super low timeouts is such a good idea if your network latency is bad / not reliable.

Until context support landed recently, this was the only way to ensure minimal blocking on Redis calls. A larger timeout is not acceptable for our latency requirements, which are the reason we're using Redis in the first place.
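
With context support, the call can now be bounded with a deadline instead, e.g. (a sketch assuming a go-redis version whose commands accept a context, v8-style):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"127.0.0.1:7000"},
	})
	defer rdb.Close()

	// Bound the whole call with a context deadline instead of relying solely
	// on very small read/write socket timeouts.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if val, err := rdb.Get(ctx, "some-key").Result(); err != nil {
		fmt.Println("get failed:", err)
	} else {
		fmt.Println("value:", val)
	}
}
```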

It breaks the case when go-redis has an incorrect cluster topology (e.g. after a fail-over). The current behaviour is more intelligent (but still simple) than just hitting the same server over and over again, and it uses a Redis Cluster feature designed specifically for this case.

Just because a connection has timed out doesn't mean the topology has changed. The topology should only be reloaded when a MOVED error is received (or periodically, by refetching the slots). Trying random nodes does not help with the topology case at all; it only causes artificial MOVED errors and puts more pressure on the cluster by making needless redirects.

The reason I say needless is that, for writes, no node in the cluster other than the current master can serve the request, so it's best to fail fast. If the topology has changed (the current node has indeed failed), then the next state refresh will pick up the new cluster slots. You're optimizing for a rare case (failover) by making the common case (networking blips) worse.

The same applies to reads, except there you can retry another replica instead of hitting the same node. This is more correct than hitting a random node, because another replica of the same slot has a much higher chance of being able to serve the request. Hitting random nodes for reads just causes more network hops and more MOVED errors. If a replica has failed, trying another replica serving the same slot is the best path forward.

tl;dr: depending on network timeouts to detect failure causes more issues than it solves, especially in a cloud environment at scale.

@vmihailenco
Collaborator

PTAL at #1056
