m2k_cluster_sync: Handle the case where a clustered node lost its state #16
Why
If a node loses its state, i.e. all Khepri and Mnesia data on disk, calling mnesia_to_khepri:sync_cluster_membership/1 wouldn't repair the cluster before this patch. At best, it would leave the nodes untouched: all nodes but one would still think the lost node is clustered, while the lost node would believe it is unclustered.
How
We can't count on the state of Mnesia because it might have been lost too.
Instead we look at connected nodes: they may have reconnected to the lost node because Khepri on these nodes will want to send Ra messages.
From this list of connected nodes, we keep only those that run the store. This eliminates nodes that are connected for other reasons, like a remote shell.
This new list is used to find the largest Khepri cluster as before. However, in this process, we discard nodes that think they are standalone but are part of a cluster according to some other nodes. In the end, the returned largest Khepri cluster is also cleaned of these lost nodes.
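As an illustration only, the selection described above could be sketched as pure functions over a map of per-node cluster views. The module name, function names and the shape of the Views map are hypothetical and are not taken from the patch; the sketch assumes the list was already reduced to connected nodes that run the store, and it breaks ties between equally sized clusters arbitrarily.

```erlang
%% Hypothetical sketch, not the code from this patch.
-module(largest_cluster_sketch).
-export([lost_nodes/1, largest_cluster/1]).

%% Views maps each connected node running the store to the members
%% that node believes are part of its Khepri cluster.
-type views() :: #{node() => [node()]}.

%% A node is considered lost if it thinks it is standalone while at
%% least one other node still lists it as a cluster member.
-spec lost_nodes(views()) -> [node()].
lost_nodes(Views) ->
    [Node || {Node, Members} <- maps:to_list(Views),
             Members =:= [Node],
             claimed_by_another_node(Node, Views)].

claimed_by_another_node(Node, Views) ->
    lists:any(
      fun({Other, Members}) ->
              Other =/= Node andalso lists:member(Node, Members)
      end, maps:to_list(Views)).

%% Ignore the views reported by lost nodes, remove lost nodes from the
%% remaining views, then keep the largest resulting member list.
-spec largest_cluster(views()) -> [node()].
largest_cluster(Views) ->
    Lost = lost_nodes(Views),
    Clusters = [lists:usort(Members -- Lost)
                || {Node, Members} <- maps:to_list(Views),
                   not lists:member(Node, Lost)],
    pick_largest(Clusters).

%% Returns [] when there is no candidate cluster.
pick_largest(Clusters) ->
    lists:foldl(
      fun(Cluster, Best) when length(Cluster) > length(Best) -> Cluster;
         (_Cluster, Best) -> Best
      end, [], Clusters).
```

For example, with Views = #{a => [a, b, c], b => [a, b, c], c => [c]}, node c is reported as lost and the largest cluster is [a, b]; c is then the node that would be re-added afterwards.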
The rest of the logic is unmodified. When Khepri is asked to re-add the lost node to the Khepri cluster, it will do the right thing to repair the cluster.