Changed the PubSub's health check command to be performed only on the… #1733

barshaul · 2021-11-21T14:18:59Z

… first command execution.

Pull Request check-list

Please make sure to review and check all of these items:

[V] Does $ tox pass with this change (including linting)?
[V] Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
[V] Is the new or changed code fully tested?
[V] Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

NOTE: these things are not required to open a PR and can be done
afterwards / while the PR is open.

Description of change

Closing #1720:

I reproduced the PubSub bug described in #1720 and was able to find the root cause.
When the PubSub's 'execute_command' method is called, it passes a 'health_check' bool that determines whether a health check needs to run. The 'health_check' value is set to not self.subscribed, which checks whether the pubsub instance has any items in the channels/patterns lists. That means we perform a health check within the execute_command function only if we are not yet subscribed. All subsequent commands after the first subscription are expected to be executed without a health check, since the channels/patterns list is no longer empty.
The pubsub's 'get_message()' method can be used to poll published messages after a pubsub instance has been created. If a poller thread is created (a thread that waits on get_message()), it listens on the same socket that the pubsub execute_command reads from when it performs a health check. Hence, we should not send a health check from the pubsub execute_command function after the poller thread has been started, since it would then race the poller thread to read the response from the socket.
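
The decision rule described above can be modelled in a few lines. This is a minimal sketch, not redis-py's code: `ToyPubSub` and its `sent` log are hypothetical stand-ins, and the only behaviour taken from the description is that `health_check` equals `not self.subscribed`.

```python
class ToyPubSub:
    """Hypothetical stand-in for the PubSub object; not redis-py code."""

    def __init__(self):
        self.channels = {}
        self.patterns = {}
        self.sent = []  # log of what would be written to the socket

    @property
    def subscribed(self):
        # True once any channel or pattern is registered.
        return bool(self.channels or self.patterns)

    def execute_command(self, *args):
        # Rule from the description: health-check only while unsubscribed.
        health_check = not self.subscribed
        if health_check:
            self.sent.append("PING")
        self.sent.append(" ".join(args))

    def subscribe(self, name):
        self.execute_command("SUBSCRIBE", name)
        self.channels[name] = None


ps = ToyPubSub()
ps.subscribe("foo")  # nothing subscribed yet, so a PING precedes it
ps.subscribe("bar")  # 'foo' is registered, so no PING this time
print(ps.sent)
```

In this toy model only the first subscription is preceded by a PING, which matches the behaviour the description attributes to the current code.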

In the example in #1720 we see the following flow:

1. A PubSub instance is created.
2. Channel 'foo' is subscribed. A health check is performed, since self.channels is empty.
3. 'PONG' is received.
4. A poller thread is started, looping on 'get_message()'.
5. The poller thread polls the 'subscribe' response.
6. Channel 'foo' is unsubscribed. No health check is performed, since self.channels still contains 'foo'. 'foo' is then removed from self.channels.
7. The poller thread polls the 'unsubscribe' response.
8. Channel 'baz' is subscribed. A health check is performed, since self.channels is empty again.
9. The poller thread tries to poll the 'subscribe' response and gets the 'PONG' response instead.
10. The health check waits for a PONG response that has already been consumed by the poller thread, and therefore times out.
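
The flow above can be replayed deterministically without threads or a server. The sketch below is an illustration under stated assumptions: a `deque` stands in for the shared socket, and `needs_health_check`, `poll` and `read_pong` are hypothetical helpers, not redis-py functions. A real run would interleave nondeterministically; here the poller is simply made to win the race.

```python
from collections import deque

socket = deque()   # replies waiting to be read, in arrival order
channels = set()   # stands in for self.channels
poller_read = []   # replies consumed by the poller thread


def needs_health_check():
    # Rule from the description: check only while nothing is subscribed.
    return not channels


def poll():
    # One get_message()-style read performed by the poller thread.
    poller_read.append(socket.popleft())


def read_pong():
    # Main thread reading the health-check reply; None models a timeout.
    return socket.popleft() if socket else None


# Subscribe to 'foo': channels is empty, so PING is sent and PONG read back.
assert needs_health_check()
socket.append("PONG")
assert read_pong() == "PONG"
channels.add("foo")
socket.append("subscribe foo")

# The poller starts and picks up the 'subscribe' reply.
poll()

# Unsubscribe from 'foo': no health check, because 'foo' is still listed.
assert not needs_health_check()
channels.discard("foo")
socket.append("unsubscribe foo")
poll()

# Subscribe to 'baz': channels is empty again, so another PING goes out.
assert needs_health_check()
socket.append("PONG")
poll()                           # the poller wins the race for the PONG
main_thread_reply = read_pong()  # nothing left for the health check

print(poller_read)
print(main_thread_reply)
```

The last two lines show the symptom: the poller ends up holding the 'PONG', and the main thread's health check finds nothing to read.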

Therefore, we shouldn't use self.channels and self.patterns to determine whether a health check needs to be executed, but we should have another variable to indicate whether this is the first command execution, and if so, to run a health check.

However, a poller thread may be started before subscribing to a channel, e.g.:

ps = r.pubsub()
poller = threading.Thread(target=poll, args=(ps,))
poller.start()  # the poller is already reading before any subscription exists
ps.subscribe('foo')

In this case, the health check will be performed and we will still get a race reading from the socket with the poller thread.
So, my suggestion is to add a new 'cmd_execution_health_check' variable, initialized to 'True', to the pubsub class and to set it to False at:

The end of the execute_command method, so the health check will be performed only on the first execution, or
The get_message() function, so the health check will not be performed from the execute_command function at all.
Health checks are already being done by the get_message() method, so there is no need to execute them from the main command execution as well.
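
A minimal sketch of the first option, assuming a toy class rather than the real one: only the `cmd_execution_health_check` name comes from the description above, while `ToyPubSub` and its `sent` log are hypothetical and exist only to make the effect visible.

```python
class ToyPubSub:
    """Hypothetical model of the proposed flag; not the actual patch."""

    def __init__(self):
        self.channels = {}
        self.cmd_execution_health_check = True  # flag named in the description
        self.sent = []  # log of what would be written to the socket

    def execute_command(self, *args):
        # Health-check on the first execution only, regardless of channels.
        if self.cmd_execution_health_check:
            self.sent.append("PING")
        self.sent.append(" ".join(args))
        self.cmd_execution_health_check = False

    def subscribe(self, name):
        self.execute_command("SUBSCRIBE", name)
        self.channels[name] = None

    def unsubscribe(self, name):
        self.execute_command("UNSUBSCRIBE", name)
        self.channels.pop(name, None)


ps = ToyPubSub()
ps.subscribe("foo")
ps.unsubscribe("foo")
ps.subscribe("baz")  # channels is empty again, but no second PING is sent
print(ps.sent)
```

With the flag in place, emptying the channel list no longer triggers a second PING, so the poller has no stray 'PONG' to consume in the subscribe, unsubscribe, subscribe sequence.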

This change fixes the reported bug.


codecov-commenter · 2021-11-21T14:21:14Z

Codecov Report

Merging #1733 (7b5b25b) into master (64791a5) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1733   +/-   ##
=======================================
  Coverage   89.04%   89.04%           
=======================================
  Files          53       53           
  Lines       11051    11054    +3     
=======================================
+ Hits         9840     9843    +3     
  Misses       1211     1211

Impacted Files	Coverage Δ
redis/client.py	`82.15% <100.00%> (+0.06%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64791a5...7b5b25b. Read the comment docs.

chayim · 2021-11-22T07:04:01Z

@barshaul Thanks for contributing a great fix! I ran this overnight and there were zero failures after ~7500 runs. This feels fixed, but you may want to see @bmerry's comment here

bmerry · 2021-11-22T07:08:35Z

but you may want to see @bmerry comment here

FYI my comment was based on the description of the fix - I'd missed that there was a PR implementing the description.

barshaul · 2021-11-22T09:35:28Z

@bmerry @chayim
I was wrong: calling get_message() without subscribing first ends with a RuntimeError (see test_pubsub::test_get_message_without_subscribe).
So the second case I brought up in the description isn't relevant, and I removed the check I added for cmd_execution_health_check in get_message() since it's unnecessary.
Before I realized that the second case could not happen, I implemented another solution for the race condition, which can be found in PR #1737. Both work, so you can decide which one you prefer.
#1737:
Added an option to call the pubsub's get_message() method without subscribing first.
If get_message() is called and no channel/pattern is subscribed, the method returns None without trying to read from the connection.
When a timeout is passed and no channels are subscribed yet, get_message() waits for whichever comes first: a subscription being made or the timeout expiring.
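
The behaviour described for #1737 can be sketched with a `threading.Event`. This is an illustration only: `ToyPubSub`, the `subscribed_event` attribute and the `pending` list are assumptions made for this example and are not claimed to match the code in that PR.

```python
import threading


class ToyPubSub:
    """Hypothetical model of the behaviour described for #1737."""

    def __init__(self):
        self.channels = {}
        self.subscribed_event = threading.Event()  # assumed name
        self.pending = []  # stands in for messages readable from the socket

    def subscribe(self, name):
        self.channels[name] = None
        self.pending.append(("subscribe", name))
        self.subscribed_event.set()  # wake up any waiting get_message()

    def get_message(self, timeout=0.0):
        # Not subscribed: wait up to `timeout` for a subscription, and
        # return None without touching the connection if none arrives.
        if not self.subscribed_event.wait(timeout):
            return None
        return self.pending.pop(0) if self.pending else None


ps = ToyPubSub()
first = ps.get_message()  # no subscription yet: returns None immediately

# Subscribe from another thread shortly after the poller starts waiting.
timer = threading.Timer(0.05, ps.subscribe, args=("foo",))
timer.start()
second = ps.get_message(timeout=2.0)  # returns once the subscription lands
timer.join()

print(first, second)
```

Because an unsubscribed `get_message()` never reads from the connection in this model, a poller started before the first `subscribe` cannot steal the health-check reply.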

bmerry · 2021-11-22T14:15:29Z

Arguably there is a race condition where a thread might get_message immediately after self.connection is initialised in execute_command but before the command is actually sent. But such a program already has a race condition, since it can't guarantee that get_message won't run before self.connection is initialised and raise an error. So this seems safe.

It potentially makes the health check less useful though. If the pubsub is used for a bit, then all the subscriptions are removed, then it is used again 10 minutes later, there will be no health check at that point. Maybe that's a reasonable trade-off for correctness.

barshaul · 2021-12-22T14:52:19Z

See #1737 for the closing PR.

Changed the PubSub's health check command to be performed only on the…

018196d

… first command execution.

barshaul marked this pull request as ready for review November 21, 2021 14:34

chayim added the bug Bug label Nov 22, 2021

chayim added the 4.1.0 label Nov 22, 2021

Removed unnecessary check from get_message()

7b5b25b

This was referenced Nov 22, 2021

Resolving read race condition between pubsub's get_message() and execute_command() #1736

Closed

Fixing read race condition during pubsub #1737

Merged

bmerry mentioned this pull request Nov 23, 2021

Another race condition in health checks and pubsub #1740

Closed

barshaul closed this Dec 22, 2021
