TimeoutError during ClusterPipeline makes the client unrecoverable #3130

Closed
gabrielecerami opened this issue Feb 1, 2024 · 1 comment · Fixed by #3513

@gabrielecerami
Version: redis-py 4.6.0, connecting to a Redis cluster running version 6.2.7

Platform: Python 3.8 on Ubuntu 22.10 / CentOS 7

Description: If one of the nodes in the cluster becomes unreachable in a way that raises TimeoutError, the client spirals down into an unrecoverable state.

Consider this small snippet, which builds pipelines and executes them on random keys to simulate a busy client.

from redis.cluster import RedisCluster, ClusterNode
import random
import time

startup_node = ClusterNode('mystartupnode', '6379')
client = RedisCluster(startup_nodes=[startup_node])

while True:
    try:
        for _ in range(10):

            pipeline = client.pipeline()
            for key in [f"key-{random.randint(10000,11000)}" for _ in range(50)]:
                pipeline.get(key)
            pipeline.execute()

    except Exception as error:
        print("Failure ", error)
    time.sleep(1)

While this script is running, if I take down one of the cluster nodes in such a way that connection attempts from the client raise TimeoutError, two things happen:

  • The client never recovers from the TimeoutError. Even if I replace the server with one on a different IP, the unreachable node stays in the nodes cache and is retried on every further pipeline iteration.
  • The TimeoutError is raised in the middle of the pipeline, and the connections already taken from the connection pools of all the nodes involved are not released. Further pipeline executions (which still hit the TimeoutError) eventually fill the connection pools to their maximum capacity, blocking any further connection.
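The second point can be illustrated with a toy model. This is a minimal sketch and not redis-py code: `ToyPool`, `leaky_pipeline`, `safe_pipeline` and `run` are hypothetical names, and the built-in `TimeoutError` stands in for the library's own exception. It only shows how a bounded pool fills up when an exception is raised between acquiring a connection and releasing it, and how releasing in a `finally` block avoids that.

```python
import queue


class ToyPool:
    """Bounded pool standing in for a per-node connection pool (hypothetical)."""

    def __init__(self, max_connections):
        self._free = queue.Queue()
        for i in range(max_connections):
            self._free.put(f"conn-{i}")

    def get_connection(self):
        try:
            return self._free.get_nowait()
        except queue.Empty:
            raise RuntimeError("pool exhausted") from None

    def release(self, conn):
        self._free.put(conn)

    def available(self):
        return self._free.qsize()


def leaky_pipeline(pool):
    """Takes a connection, then fails before giving it back."""
    conn = pool.get_connection()
    raise TimeoutError("node unreachable")  # conn is never released


def safe_pipeline(pool):
    """Same failure, but the connection is always returned to the pool."""
    conn = pool.get_connection()
    try:
        raise TimeoutError("node unreachable")
    finally:
        pool.release(conn)


def run(pipeline, pool, attempts):
    """Run `pipeline` repeatedly and collect the error messages seen."""
    errors = []
    for _ in range(attempts):
        try:
            pipeline(pool)
        except (TimeoutError, RuntimeError) as error:
            errors.append(str(error))
    return errors
```

With a pool of three connections and five attempts, `run(leaky_pipeline, ToyPool(3), 5)` reports three timeouts followed by two "pool exhausted" errors, while `safe_pipeline` keeps reporting only the timeout and leaves the pool full.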

I tried calling pipeline.reset() when an exception is raised, but from reading the method's code, it doesn't actually release any connection (there are a few TODOs for the WATCH case, though).

During my tests, I noticed that errors are handled slightly differently in ClusterPipeline._send_cluster_command (https://github.com/redis/redis-py/blob/v4.6.0/redis/cluster.py#L2001) than in RedisCluster._execute_command (https://github.com/redis/redis-py/blob/v4.6.0/redis/cluster.py#L1135).
The RedisCluster method also reinitializes the nodes cache on TimeoutError.

In fact, if I alter the except clause in ClusterPipeline._send_cluster_command in my virtual environment to include TimeoutError, the client recovers correctly and connections don't pile up, but I don't know whether this could lead to other side effects.
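To make the difference concrete, here is a toy model of the two behaviours. It is only a sketch under stated assumptions and does not reproduce the redis-py source: `ToyCluster`, `NodeTimeout`, `NodeConnectionError` and the `refresh_on` parameter are all hypothetical names. The only thing it models is whether a timeout triggers a refresh of the nodes cache before the error is re-raised.

```python
class NodeTimeout(Exception):
    """Stand-in for the library's TimeoutError (hypothetical)."""


class NodeConnectionError(Exception):
    """Stand-in for the library's ConnectionError (hypothetical)."""


class ToyCluster:
    def __init__(self, topology):
        # `topology` plays the role of the live cluster: slot -> address.
        self.topology = topology
        # The client's cached view of the cluster, copied at startup.
        self.nodes_cache = dict(topology)
        # Addresses that time out when contacted.
        self.dead = set()

    def reinitialize(self):
        """Refresh the cached view from the live topology."""
        self.nodes_cache = dict(self.topology)

    def _send(self, slot):
        address = self.nodes_cache[slot]
        if address in self.dead:
            raise NodeTimeout(address)
        return address

    def execute(self, slot, refresh_on):
        """Send to `slot`; refresh the cache only for errors in `refresh_on`."""
        try:
            return self._send(slot)
        except refresh_on:
            self.reinitialize()
            raise
```

If the node at `10.0.0.1` is marked dead and `topology` is updated to point the slot at `10.0.0.2`, calling `execute(0, refresh_on=(NodeConnectionError,))` keeps raising `NodeTimeout` forever because the stale address is never dropped. With `refresh_on=(NodeConnectionError, NodeTimeout)`, the first call still fails but refreshes the cache, and the next call reaches the replacement node.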


This issue is marked stale. It will be closed in 30 days if it is not updated.
