Skip to content

Conversation

@dumbbell
Copy link
Collaborator

@dumbbell dumbbell commented Jan 26, 2023

So far, we had the following functions to list nodes in a RabbitMQ cluster:

  • rabbit_mnesia:cluster_nodes/1 to get members of the Mnesia cluster; the argument was used to select members (all members or only those running Mnesia and participating in the cluster)
  • rabbit_nodes:all/0 to get all members of the Mnesia cluster
  • rabbit_nodes:all_running/0 to get all members who currently run Mnesia

Basically:

  • rabbit_nodes:all/0 calls rabbit_mnesia:cluster_nodes(all)
  • rabbit_nodes:all_running/0 calls rabbit_mnesia:cluster_nodes(running)

We also have:

  • rabbit_node_monitor:alive_nodes/1 which filters the given list of nodes to only select those currently running Mnesia
  • rabbit_node_monitor:alive_rabbit_nodes/1 which filters the given list of nodes to only select those currently running RabbitMQ

Most of the code uses rabbit_mnesia:cluster_nodes/1 or the rabbit_nodes:all*/0 functions. rabbit_mnesia:cluster_nodes(running) or rabbit_nodes:all_running/0 is often used as a close approximation of "all cluster members running RabbitMQ". This list might be incorrect in times where a node is joining the clustered or is being worked on (i.e. Mnesia is running but not RabbitMQ).

With Khepri, there won't be the same possible approximation because we will try to keep Khepri/Ra running even if RabbitMQ is stopped to expand/shrink the cluster.

So in order to clarify what we want when we query a list of nodes, this patch introduces the following functions:

  • rabbit_nodes:list_members/0 to get all cluster members, regardless of their state
  • rabbit_nodes:list_reachable/0 to get all cluster members we can reach using Erlang distribution, regardless of the state of RabbitMQ
  • rabbit_nodes:list_running/0 to get all cluster members who run RabbitMQ, regardless of the maintenance state
  • rabbit_nodes:list_serving/0 to get all cluster members who run RabbitMQ and are accepting clients

In addition to the list functions, there are the corresponding rabbit_nodes:is_*(Node) checks and rabbit_nodes:filter_*(Nodes) filtering functions.

The code is modified to use these new functions. One possible significant change is that the new list functions will perform RPC calls to query the nodes' state, unlike rabbit_mnesia:cluster_nodes(running).

@dumbbell dumbbell added this to the 3.12.0 milestone Jan 26, 2023
@dumbbell dumbbell self-assigned this Jan 26, 2023
@dumbbell dumbbell force-pushed the add-node-lists-functions-to-clarify-intent branch 2 times, most recently from 18156b3 to 7c19e23 Compare January 27, 2023 16:12
@dumbbell dumbbell marked this pull request as ready for review January 30, 2023 09:46
@dumbbell dumbbell force-pushed the add-node-lists-functions-to-clarify-intent branch from 7c19e23 to 5ca86d9 Compare January 30, 2023 09:48
michaelklishin
michaelklishin previously approved these changes Feb 2, 2023
@dumbbell dumbbell force-pushed the add-node-lists-functions-to-clarify-intent branch 2 times, most recently from c87d10b to ff6a4ac Compare February 2, 2023 10:19
@dumbbell
Copy link
Collaborator Author

dumbbell commented Feb 2, 2023

I will wait for #6821 to be merged before in order to not interfere with the review and rebase of that branch.

@dumbbell dumbbell force-pushed the add-node-lists-functions-to-clarify-intent branch 2 times, most recently from e2ab72b to 1c2c156 Compare February 6, 2023 11:07
@dumbbell dumbbell marked this pull request as draft February 7, 2023 13:27
@dumbbell dumbbell force-pushed the add-node-lists-functions-to-clarify-intent branch 4 times, most recently from ee94e5e to 1d7db96 Compare February 8, 2023 10:59
@dumbbell dumbbell marked this pull request as ready for review February 8, 2023 13:42
@dumbbell
Copy link
Collaborator Author

dumbbell commented Feb 8, 2023

This patch is ready for review again.

I hope it doesn't introduce any regression, it's difficult to test.

If you think of a better naming, please share!

@dumbbell dumbbell force-pushed the add-node-lists-functions-to-clarify-intent branch 5 times, most recently from 77a0574 to 6a9e057 Compare February 13, 2023 10:41
…sted in

So far, we had the following functions to list nodes in a RabbitMQ
cluster:
* `rabbit_mnesia:cluster_nodes/1` to get members of the Mnesia cluster;
  the argument was used to select members (all members or only those
  running Mnesia and participating in the cluster)
* `rabbit_nodes:all/0` to get all members of the Mnesia cluster
* `rabbit_nodes:all_running/0` to get all members who currently run
  Mnesia

Basically:
* `rabbit_nodes:all/0` calls `rabbit_mnesia:cluster_nodes(all)`
* `rabbit_nodes:all_running/0` calls `rabbit_mnesia:cluster_nodes(running)`

We also have:
* `rabbit_node_monitor:alive_nodes/1` which filters the given list of
  nodes to only select those currently running Mnesia
* `rabbit_node_monitor:alive_rabbit_nodes/1` which filters the given
  list of nodes to only select those currently running RabbitMQ

Most of the code uses `rabbit_mnesia:cluster_nodes/1` or the
`rabbit_nodes:all*/0` functions. `rabbit_mnesia:cluster_nodes(running)`
or `rabbit_nodes:all_running/0` is often used as a close approximation
of "all cluster members running RabbitMQ". This list might be incorrect
in times where a node is joining the clustered or is being worked on
(i.e. Mnesia is running but not RabbitMQ).

With Khepri, there won't be the same possible approximation because we
will try to keep Khepri/Ra running even if RabbitMQ is stopped to
expand/shrink the cluster.

So in order to clarify what we want when we query a list of nodes, this
patch introduces the following functions:
* `rabbit_nodes:list_members/0` to get all cluster members, regardless
  of their state
* `rabbit_nodes:list_reachable/0` to get all cluster members we can
  reach using Erlang distribution, regardless of the state of RabbitMQ
* `rabbit_nodes:list_running/0` to get all cluster members who run
  RabbitMQ, regardless of the maintenance state
* `rabbit_nodes:list_serving/0` to get all cluster members who run
  RabbitMQ and are accepting clients

In addition to the list functions, there are the corresponding
`rabbit_nodes:is_*(Node)` checks and `rabbit_nodes:filter_*(Nodes)`
filtering functions.

The code is modified to use these new functions. One possible
significant change is that the new list functions will perform RPC calls
to query the nodes' state, unlike `rabbit_mnesia:cluster_nodes(running)`.
@dumbbell dumbbell force-pushed the add-node-lists-functions-to-clarify-intent branch from 6a9e057 to d656371 Compare February 13, 2023 11:59
@michaelklishin
Copy link
Collaborator

It would greatly simplify backports if key functions of this API that can be aliases to the ones they replace in main, were added to v3.11.x and v3.10.x.

@michaelklishin michaelklishin merged commit d0dc951 into main Feb 14, 2023
@michaelklishin michaelklishin deleted the add-node-lists-functions-to-clarify-intent branch February 14, 2023 02:06
@michaelklishin
Copy link
Collaborator

#7282 is an example of a small change that failed to backport because of rabbit_db_*

@dumbbell
Copy link
Collaborator Author

Perhaps it should not be backported in the first place? It doesn't look like a bug fix to me.

dumbbell added a commit that referenced this pull request Feb 21, 2023
…e/0`

... instead of using an internal implementation.

References #7058.
dumbbell added a commit that referenced this pull request Feb 21, 2023
…e/0`

... instead of using an internal implementation.

References #7058.
dumbbell added a commit that referenced this pull request Feb 21, 2023
... instead of using an internal implementation.

The plugins are `rabbitmq_peer_discovery_common` and
`rabbitmq_sharding`.

References #7058.
dumbbell added a commit that referenced this pull request Feb 21, 2023
... instead of using an internal implementation.

The plugins are `rabbitmq_peer_discovery_common` and
`rabbitmq_sharding`.

References #7058.
dumbbell added a commit that referenced this pull request Feb 21, 2023
... instead of using an internal implementation.

The plugins are `rabbitmq_peer_discovery_common` and
`rabbitmq_sharding`.

References #7058.
dumbbell added a commit that referenced this pull request Feb 22, 2023
... instead of using an internal implementation.

The plugins are `rabbitmq_peer_discovery_common` and
`rabbitmq_sharding`.

References #7058.
dumbbell added a commit that referenced this pull request Mar 21, 2023
…ic code

[Why]
The partition detection code defines a partitioned node as an Erlang
node running RabbitMQ but which is not among the Mnesia running nodes.

Since  #7058, `rabbit_node_monitor` uses the list functions exported by
`rabbit_nodes` for everything, except the partition detection code which
is Mnesia-specific and relies on `rabbit_mnesia:cluster_nodes/1`.

Unfortunately, we saw regressions in the Jepsen testsuite during the
3.12.0 release cycle only because that testsuite is not executed on
`main`.

It happens that the partition detection code is using `rabbit_nodes`
list functions in two places where it should have continued to use
`rabbit_mnesia`.

[How]
The fix bug fix simply consists of reverting the two calls to
`rabbit_nodes` back to calls to `rabbit_mnesia` as it used to do. This
seems to improve the situation a lot in the manual testing.

This code will go away with our use of Mnesia in the future, so it's not
a problem to call `rabbit_mnesia` here.
dumbbell added a commit that referenced this pull request Mar 21, 2023
…ic code

[Why]
The partition detection code defines a partitioned node as an Erlang
node running RabbitMQ but which is not among the Mnesia running nodes.

Since  #7058, `rabbit_node_monitor` uses the list functions exported by
`rabbit_nodes` for everything, except the partition detection code which
is Mnesia-specific and relies on `rabbit_mnesia:cluster_nodes/1`.

Unfortunately, we saw regressions in the Jepsen testsuite during the
3.12.0 release cycle only because that testsuite is not executed on
`main`.

It happens that the partition detection code is using `rabbit_nodes`
list functions in two places where it should have continued to use
`rabbit_mnesia`.

[How]
The fix bug fix simply consists of reverting the two calls to
`rabbit_nodes` back to calls to `rabbit_mnesia` as it used to do. This
seems to improve the situation a lot in the manual testing.

This code will go away with our use of Mnesia in the future, so it's not
a problem to call `rabbit_mnesia` here.
mergify bot pushed a commit that referenced this pull request Mar 21, 2023
…ic code

[Why]
The partition detection code defines a partitioned node as an Erlang
node running RabbitMQ but which is not among the Mnesia running nodes.

Since  #7058, `rabbit_node_monitor` uses the list functions exported by
`rabbit_nodes` for everything, except the partition detection code which
is Mnesia-specific and relies on `rabbit_mnesia:cluster_nodes/1`.

Unfortunately, we saw regressions in the Jepsen testsuite during the
3.12.0 release cycle only because that testsuite is not executed on
`main`.

It happens that the partition detection code is using `rabbit_nodes`
list functions in two places where it should have continued to use
`rabbit_mnesia`.

[How]
The fix bug fix simply consists of reverting the two calls to
`rabbit_nodes` back to calls to `rabbit_mnesia` as it used to do. This
seems to improve the situation a lot in the manual testing.

This code will go away with our use of Mnesia in the future, so it's not
a problem to call `rabbit_mnesia` here.

(cherry picked from commit 22516aa)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants