Keep exclusive queues with Khepri + network partition #14573
Why
With Mnesia, when the network partition strategy is set to `pause_minority`, nodes on the "minority side" are stopped. Thus, the exclusive queues that were hosted by nodes on that minority side are lost.
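This was acceptable with Mnesia, given how this network partition handling strategy is implemented. However, it does not work with Khepri because the nodes on the "minority side" continue to run and serve clients. Therefore the cluster ends up in an inconsistent state: the records of their exclusive queues are deleted even though those nodes, and the clients connected to them, are still running.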
How
With Khepri, we no longer delete transient queue records merely because a node goes down. Thanks to this, an exclusive queue and its consumer are not affected by a network partition: they continue to work.
However, if a node is really lost, we need to clean up dead queue records. This was already done for durable queues with both Mnesia and Khepri. But with Khepri, transient queue records persist in the store like durable queue records (unlike with Mnesia).
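The difference in node-down handling can be illustrated with a toy model. This is only a sketch in Python, not the Erlang code from this PR: `QueueRecord`, `on_node_down` and the `store` argument are hypothetical names, and the Mnesia branch is a simplification of the behaviour described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueRecord:
    name: str
    node: str      # node hosting the queue
    durable: bool  # False means transient, e.g. an exclusive queue


def on_node_down(records, down_node, store):
    """Return the queue records kept after `down_node` becomes unreachable.

    Toy model: `store` is either "mnesia" or "khepri" (hypothetical values).
    """
    if store == "khepri":
        # The node may only be partitioned away and still serving clients,
        # so no record is deleted on a node-down event.
        return list(records)
    # Simplified Mnesia behaviour: transient records hosted on the
    # unreachable node are dropped, durable ones are kept.
    return [r for r in records if r.durable or r.node != down_node]


records = [
    QueueRecord("exclusive-q", "rabbit@b", durable=False),
    QueueRecord("durable-q", "rabbit@b", durable=True),
]
kept_khepri = on_node_down(records, "rabbit@b", "khepri")
kept_mnesia = on_node_down(records, "rabbit@b", "mnesia")
```

In this model, `kept_khepri` still contains the exclusive queue, while `kept_mnesia` only keeps the durable one.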
That's why this commit changes the clean-up function, `rabbit_amqqueue:forget_all_durable/1`, into `rabbit_amqqueue:forget_all/1`, which deletes all queue records of queues that were hosted on the given node, regardless of whether they are transient or durable.

Fixes #12949, #12597.
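To make the change to the clean-up function concrete, here is a minimal Python sketch of the two filters. It is an illustration only: the functions below share their names with the Erlang functions for readability, but the record shape (a `(name, node, durable)` tuple) and the list-based store are assumptions, not the actual implementation.

```python
def forget_all_durable(records, node):
    """Old clean-up (toy model): drop only durable records hosted on `node`."""
    return [
        (name, host, durable)
        for (name, host, durable) in records
        if not (host == node and durable)
    ]


def forget_all(records, node):
    """New clean-up (toy model): drop every record hosted on `node`."""
    return [
        (name, host, durable)
        for (name, host, durable) in records
        if host != node
    ]


store = [
    ("exclusive-q", "rabbit@lost", False),  # transient
    ("durable-q", "rabbit@lost", True),
    ("other-q", "rabbit@alive", False),
]
after_old = forget_all_durable(store, "rabbit@lost")
after_new = forget_all(store, "rabbit@lost")
```

With the old filter, the transient record of the lost node survives in `after_old`, which is the leftover this PR removes; `after_new` only contains the queue hosted on the remaining node.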