Fix MOVED errors by not randomly selecting another node #1001

Closed

Conversation

thehydroimpulse commented Mar 30, 2019

After some investigation we noticed a lot of MOVED errors that always seemed to happen when we received i/o timeouts. The original request selects the correct master node (for writes) and the correct master/slave (for reads); however, if an i/o timeout or other retryable error is received, a random node is used after two attempts on the original node. This seems wrong, as a random node will never be suitable to serve the request. Redis will most likely always send back MOVED errors as the random node is either not a master, not the right master, or not the right slave.

For writes, you cannot attempt the request on any node other than the currently selected master. The best path here is probably to respect the redirect configuration and keep retrying on that node.

For reads, we can check whether ReadOnly is set and retry on another slave. Otherwise we have to behave the same as for writes.
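
Roughly, the selection on a retryable error could look something like the sketch below (made-up types, not the actual go-redis internals):

```go
package cluster

// Minimal stand-in types for the cluster client's internal state; the real
// go-redis code differs, this only sketches the proposed selection logic.
type node struct{ addr string }

type clusterState struct {
	masters map[int]*node   // slot -> master node
	slaves  map[int][]*node // slot -> replica nodes
}

// retryNode picks the node to retry on after a retryable error (e.g. an i/o
// timeout), instead of falling back to a random node.
func retryNode(state *clusterState, slot int, isRead, readOnly bool, failed *node) *node {
	if isRead && readOnly {
		// Reads with ReadOnly set: prefer another replica serving the same slot.
		for _, slave := range state.slaves[slot] {
			if slave != failed {
				return slave
			}
		}
	}
	// Writes (and reads without ReadOnly): only the slot's master can serve
	// the request, so keep retrying it rather than hopping to a random node.
	return state.masters[slot]
}
```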

Let me know if something along these lines would be acceptable and I can add some tests.

@thehydroimpulse
Author

@vmihailenco thoughts?

@vmihailenco
Collaborator

Redis will most likely always send back MOVED errors as the random node is either not a master, not the right master, or not the right slave.

True, and go-redis will follow the MOVED/ASK error and send the request to the right node on the next retry.
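
For reference, the redirect is driven by the error text Redis returns (e.g. `MOVED 3999 127.0.0.1:6381`); following it boils down to something like this sketch (not the actual go-redis code):

```go
package cluster

import (
	"fmt"
	"strings"
)

// parseRedirect extracts the slot and target address from a Redis redirect
// error such as "MOVED 3999 127.0.0.1:6381" or "ASK 3999 127.0.0.1:6381".
// On the next attempt the client sends the command to that address instead
// of the node it originally picked.
func parseRedirect(msg string) (slot, addr string, err error) {
	parts := strings.Fields(msg)
	if len(parts) != 3 || (parts[0] != "MOVED" && parts[0] != "ASK") {
		return "", "", fmt.Errorf("not a redirect error: %q", msg)
	}
	return parts[1], parts[2], nil
}
```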

What you are proposing makes sense, but:

  • It will not fix your code - requests will still time out. This PR only increases the number of retries.
  • It will break cases when a node is down and go-redis has an outdated cluster topology.
  • Overall, the code becomes slightly harder to follow/read.

@thehydroimpulse
Author

It will not fix your code - requests will still time out. This PR only increases the number of retries.

It fixes it by not creating artificial MOVED errors. Networking in AWS is pretty flaky, and we were getting consistent retryable errors that led to forced MOVED errors for no reason at all. This obviously won't fix the retryable errors (i/o timeouts), but it will prevent us from wasting more round trips to Redis.

Moreover, without this fix the end behaviour isn't any different when a node is timing out every request: the node may keep timing out, but a random hop is performed that just delays the inevitable return to the same original node.

It will break cases when a node is down and go-redis has an outdated cluster topology.

The code already seems to lazily reload the state if any error has occurred, so I don't think that's the case. Moreover, I don't think it's ideal to depend on randomly selecting nodes to detect whether a node is down and the topology has changed.
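
Something along these lines (a minimal sketch with hypothetical types, not the actual go-redis state holder):

```go
package cluster

import "sync/atomic"

// Sketch of lazy topology reloading: a command error only marks the cached
// cluster state as stale, and the next request triggers a reload. These are
// hypothetical types, not the actual go-redis state holder.
type stateHolder struct {
	stale  int32
	reload func() error // refetches CLUSTER SLOTS and rebuilds the slot map
}

// markStale is called when a command fails; it does not reload immediately.
func (h *stateHolder) markStale() { atomic.StoreInt32(&h.stale, 1) }

// get reloads the slot map only if a previous error flagged it as stale.
func (h *stateHolder) get() error {
	if atomic.CompareAndSwapInt32(&h.stale, 1, 0) {
		return h.reload()
	}
	return nil
}
```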

Thoughts?

greggjs (Contributor) commented Apr 18, 2019

I am experiencing this problem as well, and I believe this would fix the issue. @vmihailenco, this PR is not about fixing timeouts in the Redis client; it is about making retries in a clustered Redis actually hit the nodes that store the information you're trying to operate on. We might as well speed up this process instead of wasting retries asking around for where the information is.

@thehydroimpulse
Author

@vmihailenco we've hit this issue again in production; is there any way we can get this merged?

@vmihailenco
Collaborator

Just to make this super clear:

Current behavior:

  1. same node - timeout
  2. same node - timeout
  3. try random node - redirect to correct node
  4. correct node - timeout
  5. try random node - redirect to correct node
  6. correct node - timeout
  7. try random node - redirect to correct node
  8. correct node - timeout

Suggested behavior:

  1. same node - timeout
  2. same node - timeout
  3. same node - timeout
  4. same node - timeout
  5. same node - timeout
  6. same node - timeout
  7. same node - timeout
  8. same node - timeout

I don't like this change because:

  • I don't believe it fixes the real problem (or you haven't explained what the real problem is). The AWS network is good enough, and unless you are using super low read/write timeouts, all should be good. And I am not sure that using super low timeouts is such a good idea if your network latency is bad / not reliable.
  • It breaks the case when go-redis has an incorrect cluster topology (e.g. after a fail-over). The current behaviour is more intelligent (but still simple) than just hitting the same server over and over again, and it uses a Redis Cluster feature designed specifically for this case.

thehydroimpulse (Author) commented Jun 13, 2019

AWS network is good enough

At scale, cloud networks are generally unreliable. Thus, with the current behaviour, we consistently get incorrect MOVED errors, and this gets pretty bad.

The suggested behaviour is to retry on the same node, but potentially with a different connection. We currently set the retry limit to 3 because 8 is way too many attempts; essentially, the best behaviour is to fail fast. The current behaviour doesn't actually give any benefit other than causing MOVED errors.
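
For reference, that limit is the one set when constructing the cluster client, roughly like this (illustrative values; I'm assuming MaxRedirects is the knob in question):

```go
package main

import "github.com/go-redis/redis"

func main() {
	// Illustrative cluster client configuration: MaxRedirects bounds how many
	// times a command may be retried/redirected, and ReadOnly allows reads to
	// be served by replicas. The addresses and values here are examples only.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:        []string{"127.0.0.1:7000", "127.0.0.1:7001", "127.0.0.1:7002"},
		MaxRedirects: 3,
		ReadOnly:     true,
	})
	defer rdb.Close()
}
```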

MOVED errors should never happen unless the cluster has failed over. That's when you should refresh the cluster topology.

And I am not sure that using super low timeouts is such a good idea if your network latency is bad / not reliable.

Until context support landed recently, this was the only way to ensure minimal blocking on Redis calls. A larger timeout is not acceptable for our latency requirements, which are the reason we're using Redis in the first place.
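
With context support, the call can now be bounded with a deadline instead, e.g. (a sketch assuming a go-redis version whose commands accept a context, v8-style):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"127.0.0.1:7000"},
	})
	defer rdb.Close()

	// Bound the whole call with a context deadline instead of relying solely
	// on very small read/write socket timeouts.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if val, err := rdb.Get(ctx, "some-key").Result(); err != nil {
		fmt.Println("get failed:", err)
	} else {
		fmt.Println("value:", val)
	}
}
```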

It breaks the case when go-redis has an incorrect cluster topology (e.g. after a fail-over). The current behaviour is more intelligent (but still simple) than just hitting the same server over and over again, and it uses a Redis Cluster feature designed specifically for this case.

Just because a connection has timed out doesn't mean the topology has changed. The topology should only be reloaded when a MOVED error is received (or periodically, by refetching the slots). Trying random nodes does not help with the topology case at all; it only causes artificial MOVED errors and puts more pressure on the cluster by making needless redirects.

The reason I say needless is that, for writes, no node in the cluster other than the current master can serve the request, so it's best to fail fast. If the topology has changed (the current node has indeed failed), then the next state refresh will pick up the new cluster slots. You're optimizing for a rare case (failover) by making the common case (networking blips) worse.

The same applies to reads, except there you can retry another replica instead of hitting the same node. This is more correct than hitting a random node, because another replica of the same slot has a much higher chance of being able to serve the request. Hitting random nodes for reads just causes more network hops and more MOVED errors. If a replica has failed, trying another replica serving the same slot is the best path forward.

tl;dr: depending on network timeouts to detect failure causes more issues than it solves, especially in a cloud environment at scale.

@vmihailenco
Collaborator

PTAL at #1056
