rabbit_nodes: Add list functions to clarify which nodes we are interested in #7058
Conversation
I will wait for #6821 to be merged first, so as not to interfere with the review and rebase of that branch.
This patch is ready for review again. I hope it doesn't introduce any regressions; it's difficult to test. If you can think of better naming, please share!
rabbit_nodes: Add list functions to clarify which nodes we are interested in

So far, we had the following functions to list nodes in a RabbitMQ cluster:

* `rabbit_mnesia:cluster_nodes/1` to get members of the Mnesia cluster; the argument is used to select members (all members, or only those running Mnesia and participating in the cluster)
* `rabbit_nodes:all/0` to get all members of the Mnesia cluster
* `rabbit_nodes:all_running/0` to get all members currently running Mnesia

Basically:

* `rabbit_nodes:all/0` calls `rabbit_mnesia:cluster_nodes(all)`
* `rabbit_nodes:all_running/0` calls `rabbit_mnesia:cluster_nodes(running)`

We also have:

* `rabbit_node_monitor:alive_nodes/1`, which filters the given list of nodes to keep only those currently running Mnesia
* `rabbit_node_monitor:alive_rabbit_nodes/1`, which filters the given list of nodes to keep only those currently running RabbitMQ

Most of the code uses `rabbit_mnesia:cluster_nodes/1` or the `rabbit_nodes:all*/0` functions. `rabbit_mnesia:cluster_nodes(running)` or `rabbit_nodes:all_running/0` is often used as a close approximation of "all cluster members running RabbitMQ". This list can be incorrect while a node is joining the cluster or is being worked on (i.e. Mnesia is running but RabbitMQ is not).

With Khepri, this approximation will no longer hold, because we will try to keep Khepri/Ra running even when RabbitMQ is stopped, in order to expand/shrink the cluster.

So, to clarify what we want when we query a list of nodes, this patch introduces the following functions:

* `rabbit_nodes:list_members/0` to get all cluster members, regardless of their state
* `rabbit_nodes:list_reachable/0` to get all cluster members we can reach over Erlang distribution, regardless of the state of RabbitMQ
* `rabbit_nodes:list_running/0` to get all cluster members running RabbitMQ, regardless of their maintenance state
* `rabbit_nodes:list_serving/0` to get all cluster members running RabbitMQ and accepting clients

In addition to the list functions, there are the corresponding `rabbit_nodes:is_*(Node)` checks and `rabbit_nodes:filter_*(Nodes)` filtering functions; see the sketch after this description.

The code is modified to use these new functions. One possibly significant change is that the new list functions perform RPC calls to query the nodes' state, unlike `rabbit_mnesia:cluster_nodes(running)`.
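As a minimal sketch of the distinctions above (not code from the PR; only the `rabbit_nodes` list functions named in this description are assumed to exist, everything else is illustrative), a caller on a running broker could compare the four lists:

```erlang
%% Minimal sketch, assuming it runs on a RabbitMQ node with this PR
%% applied. Module and function names here are made up; only the
%% rabbit_nodes API described above is relied upon.
-module(node_lists_sketch).
-export([summary/0]).

summary() ->
    Members   = rabbit_nodes:list_members(),   %% all cluster members, any state
    Reachable = rabbit_nodes:list_reachable(), %% reachable over Erlang distribution
    Running   = rabbit_nodes:list_running(),   %% running RabbitMQ
    Serving   = rabbit_nodes:list_serving(),   %% running RabbitMQ and accepting clients
    %% Per the description above, each list is a subset of the previous
    %% one, so e.g. members running RabbitMQ but under maintenance are:
    UnderMaintenance = Running -- Serving,
    #{members => Members,
      reachable => Reachable,
      running => Running,
      serving => Serving,
      under_maintenance => UnderMaintenance}.
```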
It would greatly simplify backports if key functions of this API could be aliases to the ones they replace.
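A minimal sketch of what such aliases could look like (hypothetical, not code from the PR; the mapping of `all_running/0` to `list_running/0` is an assumption for illustration):

```erlang
-module(rabbit_nodes_compat_sketch).
-export([all/0, all_running/0]).

%% Hypothetical aliases: forward the pre-7058 entry points to the new
%% list functions so that backported patches calling all/0 or
%% all_running/0 keep compiling unchanged.
all()         -> rabbit_nodes:list_members().
all_running() -> rabbit_nodes:list_running().
```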
#7282 is an example of a small change that failed to backport because of this API change.
Perhaps it should not be backported in the first place? It doesn't look like a bug fix to me.
…e/0` ... instead of using an internal implementation. References #7058.
... instead of using an internal implementation. The plugins are `rabbitmq_peer_discovery_common` and `rabbitmq_sharding`. References #7058.
…ic code

[Why]

The partition detection code defines a partitioned node as an Erlang node running RabbitMQ but not among the Mnesia running nodes. Since #7058, `rabbit_node_monitor` uses the list functions exported by `rabbit_nodes` for everything except the partition detection code, which is Mnesia-specific and relies on `rabbit_mnesia:cluster_nodes/1`.

Unfortunately, we only saw regressions in the Jepsen testsuite during the 3.12.0 release cycle, because that testsuite is not executed on `main`. It turns out that the partition detection code was using the `rabbit_nodes` list functions in two places where it should have continued to use `rabbit_mnesia`.

[How]

The bug fix simply consists of reverting the two calls to `rabbit_nodes` back to calls to `rabbit_mnesia`, as the code used to do. This seems to improve the situation a lot in manual testing.

This code will go away along with our use of Mnesia in the future, so it's not a problem to call `rabbit_mnesia` here.
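A hedged sketch of the kind of revert described (hypothetical module, function, and variable names; the real code lives in `rabbit_node_monitor` and differs):

```erlang
-module(partition_detection_sketch).
-export([partitioned_nodes/1]).

%% Illustration only: a node counts as partitioned if it runs RabbitMQ
%% but is not among the nodes Mnesia sees as running.
partitioned_nodes(AliveRabbitNodes) ->
    %% Regressed version used the generic RabbitMQ-level view:
    %%   AliveRabbitNodes -- rabbit_nodes:list_running()
    %% Fixed version goes back to the Mnesia-specific view, as before #7058:
    AliveRabbitNodes -- rabbit_mnesia:cluster_nodes(running).
```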