
Conversation

dumbbell
Collaborator

Why

If a node loses its state, i.e. all its Khepri and Mnesia data on disk, calling `mnesia_to_khepri:sync_cluster_membership/1` wouldn't repair the cluster before this patch.

At best, it would leave the nodes untouched: every node but the lost one would still think the lost node is clustered, while the lost node itself would believe it is unclustered.

How

We can't count on the state of Mnesia because it might have been lost too.

Instead, we look at connected nodes: the surviving nodes may have reconnected to the lost node, because Khepri on those nodes keeps trying to send Ra messages to it.

From this list of connected nodes, we keep only those that run the store. This eliminates nodes that are connected for other reasons, like a remote shell; see the sketch below.
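As an illustration, here is a minimal sketch of that filtering step. The module name is made up, and the `khepri_cluster:is_store_running/1` predicate is an assumption; the actual implementation lives in `src/m2k_cluster_sync.erl` and may use a different check.

```erlang
%% Sketch only: the actual code lives in src/m2k_cluster_sync.erl.
-module(m2k_filter_sketch).
-export([nodes_running_store/1]).

%% Return the connected nodes on which the given Khepri store runs.
%% `khepri_cluster:is_store_running/1' is assumed here; the real code
%% may use a different predicate.
nodes_running_store(StoreId) ->
    %% `nodes()' lists every connected node, including nodes connected
    %% for unrelated reasons, such as a remote shell.
    Connected = nodes(),
    [Node || Node <- Connected,
             erpc:call(Node, khepri_cluster, is_store_running, [StoreId])
                 =:= true].
```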

This new list is used to find the largest Khepri cluster, as before. However, in the process, we discard nodes that claim to be standalone while some other node still reports them as part of its cluster. The returned largest Khepri cluster is also cleaned of these lost nodes, as sketched below.
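The discarding step can be pictured as follows. This is a hedged sketch with hypothetical names (`khepri_view/2` is invented for the example, and `khepri_cluster:nodes/1` is assumed); it only illustrates the idea of dropping nodes that claim to be standalone while another node still lists them as members.

```erlang
%% Sketch only: hypothetical helpers, not the code merged in this PR.
largest_khepri_cluster(StoreId, NodesRunningStore) ->
    %% Ask each node which Khepri members it believes it has.
    Views = [{Node, khepri_view(StoreId, Node)} || Node <- NodesRunningStore],
    %% A node that sees only itself, while some other node still lists
    %% it as a member, has lost its state: treat it as lost.
    Lost = [Node || {Node, [Node]} <- Views,
                    lists:any(fun({Other, Members}) ->
                                      Other =/= Node andalso
                                      lists:member(Node, Members)
                              end, Views)],
    %% Clean every view of the lost nodes, then pick the largest
    %% remaining cluster, as the code did before this patch.
    Clusters = [Members -- Lost || {_Node, Members} <- Views],
    hd(lists:sort(fun(A, B) -> length(A) >= length(B) end, Clusters)).

%% Hypothetical helper: the Khepri members as seen from `Node'.
khepri_view(StoreId, Node) ->
    erpc:call(Node, khepri_cluster, nodes, [StoreId]).
```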

The rest of the logic is unmodified. When Khepri is asked to re-add the lost node to the Khepri cluster, it will do the right thing to repair the cluster.
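For completeness, a repair session might look like the snippet below. The `my_store` argument is a placeholder: the actual value passed to `sync_cluster_membership/1` depends on how the store was set up.

```erlang
%% On any node, after the lost node is reachable again over Erlang
%% distribution. `my_store' is a placeholder store id.
mnesia_to_khepri:sync_cluster_membership(my_store).
```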

@dumbbell dumbbell added the bug Something isn't working label Aug 26, 2024
@dumbbell dumbbell added this to the v0.6.0 milestone Aug 26, 2024
@dumbbell dumbbell requested a review from the-mikedavis August 26, 2024 17:38
@dumbbell dumbbell self-assigned this Aug 26, 2024
Copy link

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 75.50%. Comparing base (f17f13b) to head (52d3e56).
Report is 2 commits behind head on main.

Files                      Patch %   Lines
src/m2k_cluster_sync.erl   93.75%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #16      +/-   ##
==========================================
+ Coverage   74.73%   75.50%   +0.77%     
==========================================
  Files          13       13              
  Lines         661      690      +29     
==========================================
+ Hits          494      521      +27     
- Misses        167      169       +2     
Flag                 Coverage Δ
erlang-25            75.50% <93.75%> (+1.22%) ⬆️
erlang-26            75.50% <93.75%> (+0.77%) ⬆️
os-ubuntu-latest     75.50% <93.75%> (+0.77%) ⬆️
os-windows-latest    75.50% <93.75%> (+0.77%) ⬆️

Flags with carried forward coverage won't be shown.


@dumbbell dumbbell marked this pull request as ready for review August 27, 2024 08:58
@dumbbell dumbbell merged commit 13a8a18 into main Aug 27, 2024
@dumbbell dumbbell deleted the handle-lost-node branch August 27, 2024 08:58
