
Conversation

dumbbell
Collaborator

Why

If a node loses its state, i.e. all its Khepri and Mnesia data on disk, calling `mnesia_to_khepri:sync_cluster_membership/1` wouldn't repair the cluster before this patch.

At best, it would leave the nodes untouched: every node but the lost one would still think the lost node is clustered, while the lost node itself would believe it is unclustered.

How

We can't count on the state of Mnesia because it might have been lost too.

Instead, we look at connected nodes: the surviving nodes may have reconnected to the lost node, because Khepri on those nodes keeps trying to send Ra messages to it.

From this list of connected nodes, we keep only those that run the store. This eliminates nodes that are connected for other reasons, like a remote shell; see the sketch below.
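As an illustration, here is a minimal sketch of that filtering step. The module name is made up, and the `khepri_cluster:is_store_running/1` predicate is an assumption; the actual implementation lives in `src/m2k_cluster_sync.erl` and may use a different check.

```erlang
%% Sketch only: the actual code lives in src/m2k_cluster_sync.erl.
-module(m2k_filter_sketch).
-export([nodes_running_store/1]).

%% Return the connected nodes on which the given Khepri store runs.
%% `khepri_cluster:is_store_running/1' is assumed here; the real code
%% may use a different predicate.
nodes_running_store(StoreId) ->
    %% `nodes()' lists every connected node, including nodes connected
    %% for unrelated reasons, such as a remote shell.
    Connected = nodes(),
    [Node || Node <- Connected,
             erpc:call(Node, khepri_cluster, is_store_running, [StoreId])
                 =:= true].
```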

This new list is used to find the largest Khepri cluster, as before. However, in the process, we discard nodes that claim to be standalone while some other node still reports them as part of its cluster. The returned largest Khepri cluster is also cleaned of these lost nodes, as sketched below.
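The discarding step can be pictured as follows. This is a hedged sketch with hypothetical names (`khepri_view/2` is invented for the example, and `khepri_cluster:nodes/1` is assumed); it only illustrates the idea of dropping nodes that claim to be standalone while another node still lists them as members.

```erlang
%% Sketch only: hypothetical helpers, not the code merged in this PR.
largest_khepri_cluster(StoreId, NodesRunningStore) ->
    %% Ask each node which Khepri members it believes it has.
    Views = [{Node, khepri_view(StoreId, Node)} || Node <- NodesRunningStore],
    %% A node that sees only itself, while some other node still lists
    %% it as a member, has lost its state: treat it as lost.
    Lost = [Node || {Node, [Node]} <- Views,
                    lists:any(fun({Other, Members}) ->
                                      Other =/= Node andalso
                                      lists:member(Node, Members)
                              end, Views)],
    %% Clean every view of the lost nodes, then pick the largest
    %% remaining cluster, as the code did before this patch.
    Clusters = [Members -- Lost || {_Node, Members} <- Views],
    hd(lists:sort(fun(A, B) -> length(A) >= length(B) end, Clusters)).

%% Hypothetical helper: the Khepri members as seen from `Node'.
khepri_view(StoreId, Node) ->
    erpc:call(Node, khepri_cluster, nodes, [StoreId]).
```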

The rest of the logic is unmodified. When Khepri is asked to re-add the lost node to the Khepri cluster, it will do the right thing to repair the cluster.
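For completeness, a repair session might look like the snippet below. The `my_store` argument is a placeholder: the actual value passed to `sync_cluster_membership/1` depends on how the store was set up.

```erlang
%% On any node, after the lost node is reachable again over Erlang
%% distribution. `my_store' is a placeholder store id.
mnesia_to_khepri:sync_cluster_membership(my_store).
```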

@dumbbell dumbbell added the bug Something isn't working label Aug 26, 2024
@dumbbell dumbbell added this to the v0.6.0 milestone Aug 26, 2024
@dumbbell dumbbell requested a review from the-mikedavis August 26, 2024 17:38
@dumbbell dumbbell self-assigned this Aug 26, 2024
Copy link

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 75.50%. Comparing base (f17f13b) to head (52d3e56).
Report is 2 commits behind head on main.

Files                      Patch %   Lines
src/m2k_cluster_sync.erl   93.75%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #16      +/-   ##
==========================================
+ Coverage   74.73%   75.50%   +0.77%     
==========================================
  Files          13       13              
  Lines         661      690      +29     
==========================================
+ Hits          494      521      +27     
- Misses        167      169       +2     
Flag                 Coverage Δ
erlang-25            75.50% <93.75%> (+1.22%) ⬆️
erlang-26            75.50% <93.75%> (+0.77%) ⬆️
os-ubuntu-latest     75.50% <93.75%> (+0.77%) ⬆️
os-windows-latest    75.50% <93.75%> (+0.77%) ⬆️

Flags with carried forward coverage won't be shown.


@dumbbell dumbbell marked this pull request as ready for review August 27, 2024 08:58
@dumbbell dumbbell merged commit 13a8a18 into main Aug 27, 2024
@dumbbell dumbbell deleted the handle-lost-node branch August 27, 2024 08:58
