Skip to content

Changed the PubSub's health check command to be performed only on the… #1733

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

barshaul
Copy link
Contributor

@barshaul barshaul commented Nov 21, 2021

… first command execution.

Pull Request check-list

Please make sure to review and check all of these items:

  • [V ] Does $ tox pass with this change (including linting)?
  • [V ] Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • [ V] Is the new or changed code fully tested?
  • [ V] Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

NOTE: these things are not required to open a PR and can be done
afterwards / while the PR is open.

Description of change

Closing #1720:

I reproduced the PubSub bug described in #1720 and was able to find the RC.
When the PubSub's 'execute_command' method is being called it passes a 'health_check' bool to determine if it needs to run a health check. The 'health_check' value is set to not self.subscribed, which checks if the pubsub instance has any items in the channels/patterns lists. That means, that we perform a health check within the execute_command function only if we are not yet subscribed. All subsequent commands, after the first subscription, should be executed without performing a health check, since the channels/patterns list is no longer empty.
The pubsub's 'get_messages()' method can be used to poll published messages after a pubsub instance has been created. If a poller thread is created (thread that waits on get_message()) it will listen on the same socket as the pubsub execute_command is listening to when it performs a health check. Hence, we should not send a healthcheck using the pubsub execute_command function after the poller thread is initiated, since then it will be racing the poller thread to read the response from the socket.

In the example in #1720 we see the following flow:

  1. PubSub instance is being created
  2. Channel 'foo' is being subscribed - health check is performed since self.channels is empty
  3. 'PONG' is received
  4. A poller thread is being started, looping on 'get_messages()'
  5. The poller thread polls the 'subscribe' response
  6. Channel 'foo' is being unsubscribed, health check is not being performed since self.channels still contains 'foo'. 'foo' is removed from self.channels
  7. The poller thread polls the 'unsubscribe' response
  8. Channel 'baz' is being subscribed - health check is performed since self.channels is empty again
  9. The poller thread tries to poll the 'subscribe' response and gets the 'PONG' response instead
  10. The health check waits for a PONG response that has already been received by the poller thread, and is therefore timed out

Therefore, we shouldn't use self.channels and self.patterns to determine whether a health check needs to be executed, but we should have another variable to indicate whether this is the first command execution, and if so, to run a health check.

However, a poller thread may be started before subscribing to a channel, e.g. :

  1. ps = r.pubsub()
  2. poller = threading.Thread(target=poll, args=(ps,))
  3. ps.subscribe('foo')

In this case, the health check will be performed and we will still get a race reading from the socket with the poller thread.
So, my suggestion is to add a new 'cmd_execution_health_check' variable initiated with 'True' to the pubsub class and to set it to False on: 

  1. The end of execute_command method, so the health check will be performed only on the first execution), or
  2. get_message() function, so the health check will not be performed from the execute_command function at all.
    health checks are being done by the get_message() method, so no need to execute it also from the main command execution.

This change fixes the reported bug.

@codecov-commenter
Copy link

codecov-commenter commented Nov 21, 2021

Codecov Report

Merging #1733 (7b5b25b) into master (64791a5) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1733   +/-   ##
=======================================
  Coverage   89.04%   89.04%           
=======================================
  Files          53       53           
  Lines       11051    11054    +3     
=======================================
+ Hits         9840     9843    +3     
  Misses       1211     1211           
Impacted Files Coverage Δ
redis/client.py 82.15% <100.00%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64791a5...7b5b25b. Read the comment docs.

@barshaul barshaul marked this pull request as ready for review November 21, 2021 14:34
@chayim chayim added the bug Bug label Nov 22, 2021
@chayim
Copy link
Contributor

chayim commented Nov 22, 2021

@barshaul Thanks for contributing a great fix! I ran this overnight and there were zero failures after ~7500 runs. This feels fixed - but you may want to see @bmerry comment here

@chayim chayim added the 4.1.0 label Nov 22, 2021
@bmerry
Copy link
Contributor

bmerry commented Nov 22, 2021

but you may want to see @bmerry comment here

FYI my comment was based on the description of the fix - I'd missed that there was a PR implementing the description.

@barshaul
Copy link
Contributor Author

barshaul commented Nov 22, 2021

@bmerry @chayim
I was wrong, trying to call get_message() without subscribing first will end up with RuntimeError (see test_pubsub::test_get_message_without_subscribe).
So the second case I brought up in the descroption isn't relevant and I removed the check I added for cmd_exection_health_check in the get_message since it's unnecessary.
Before I realized that the second case could not happen, I implemented another solution for the race condition which can be found in PR #1737, both are working, but you can see if you prefer this one better.
1737:
Added an option to call pubsub's method get_message() without subscribing first.
If get_message is called and no channel/pattern is subscribed, the method will return None without trying to read from the connection.
When timeout is passed and no channels are yet subscribed, the get_message() function will wait for the first to arrive - either a subscription has been made or the time has expired.

@bmerry
Copy link
Contributor

bmerry commented Nov 22, 2021

Arguably there is a race condition where a thread might get_message immediately after self.connection is initialised in execute_command but before the command is actually sent. But such a program already has a race condition, since it can't guarantee that get_message won't run before self.connection is initialised and raise an error. So this seems safe.

It potentially makes the health check less useful though. If the pubsub is used for a bit, then all the subscriptions are removed, then it is used again 10 minutes later, there will be no health check at that point. Maybe that's a reasonable trade-off for correctness.

@barshaul
Copy link
Contributor Author

See #1737 for the closing PR/

@barshaul barshaul closed this Dec 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants