Skip to content

Conversation

dumbbell
Copy link
Collaborator

Why

The feature flag controller that is responsible for enabling a feature flag may be on a node that doesn't know this feature flag. This is supported by there is a bug when it queries the callback definition for that feature flag: it uses its own registry which does not have anything about this feature flag.

This leads to a crash because the run_callback/5 funtion tries to use the undefined atom returned by the registry as a map:

crasher:
  initial call: rabbit_ff_controller:init/1
  pid: <0.374.0>
  registered_name: rabbit_ff_controller
  exception error: bad map: undefined
    in function  rabbit_ff_controller:run_callback/5
    in call from rabbit_ff_controller:do_enable/3 (rabbit_ff_controller.erl, line 1244)
    in call from rabbit_ff_controller:update_feature_state_and_enable/2 (rabbit_ff_controller.erl, line 1180)
    in call from rabbit_ff_controller:enable_with_registry_locked/2 (rabbit_ff_controller.erl, line 1050)
    in call from rabbit_ff_controller:enable_many_locked/2 (rabbit_ff_controller.erl, line 991)
    in call from rabbit_ff_controller:enable_many/2 (rabbit_ff_controller.erl, line 979)
    in call from rabbit_ff_controller:updating_feature_flag_states/3 (rabbit_ff_controller.erl, line 307)
    in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 3735)

How

The callback definition is now queried from the first node in the list given as argument. For the common use case where all nodes know about a feature flag, the first node is the local one, so there should be no latency caused by the RPC.

See #12963.

@dumbbell dumbbell added the bug label Dec 17, 2024
@dumbbell dumbbell self-assigned this Dec 17, 2024
@dumbbell dumbbell force-pushed the take-feature-flag-callback-definition-from-correct-node branch 3 times, most recently from ceb0e89 to 3eba0f5 Compare December 18, 2024 14:13
[Why]
The feature flag controller that is responsible for enabling a feature
flag may be on a node that doesn't know this feature flag. This is
supported by there is a bug when it queries the callback definition for
that feature flag: it uses its own registry which does not have anything
about this feature flag.

This leads to a crash because the `run_callback/5` funtion tries to use
the `undefined` atom returned by the registry as a map:

    crasher:
      initial call: rabbit_ff_controller:init/1
      pid: <0.374.0>
      registered_name: rabbit_ff_controller
      exception error: bad map: undefined
        in function  rabbit_ff_controller:run_callback/5
        in call from rabbit_ff_controller:do_enable/3 (rabbit_ff_controller.erl, line 1244)
        in call from rabbit_ff_controller:update_feature_state_and_enable/2 (rabbit_ff_controller.erl, line 1180)
        in call from rabbit_ff_controller:enable_with_registry_locked/2 (rabbit_ff_controller.erl, line 1050)
        in call from rabbit_ff_controller:enable_many_locked/2 (rabbit_ff_controller.erl, line 991)
        in call from rabbit_ff_controller:enable_many/2 (rabbit_ff_controller.erl, line 979)
        in call from rabbit_ff_controller:updating_feature_flag_states/3 (rabbit_ff_controller.erl, line 307)
        in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 3735)

[How]
The callback definition is now queried from the first node in the list
given as argument. For the common use case where all nodes know about a
feature flag, the first node is the local one, so there should be no
latency caused by the RPC.

See #12963.
@dumbbell dumbbell force-pushed the take-feature-flag-callback-definition-from-correct-node branch from 3eba0f5 to 3325def Compare December 19, 2024 12:45
@dumbbell dumbbell marked this pull request as ready for review December 19, 2024 15:27
@dumbbell dumbbell merged commit 3bf8c81 into main Dec 19, 2024
273 checks passed
@dumbbell dumbbell deleted the take-feature-flag-callback-definition-from-correct-node branch December 19, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant