QQ: crash after snapshot installation #12635

@kjnilsson

Describe the bug

A quorum queue member occasionally crashes after a snapshot installation because it attempts to read a command for a message that has already been consumed. The reproduction steps below are highly artificial, but this crash has been seen in the wild a couple of times. It could happen when a follower member runs slowly on a node whose consumers come and go.

2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> **  [{lists,zipwith,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>             [#Fun<rabbit_fifo.60.126061837>,[],
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>              [{1,[7352901|4]},{2,[7352904|4]}],
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>              fail],
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>             [{file,"lists.erl"},{line,844}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>      {lists,zipwith,4,[{file,"lists.erl"},{line,845}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>      {rabbit_fifo,'-delivery_effect/3-anonymous-5-',4,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>                   [{file,"rabbit_fifo.erl"},{line,2062}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>      {ra_server_proc,handle_effect,5,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>                      [{file,"src/ra_server_proc.erl"},{line,1385}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>      {lists,foldl,3,[{file,"lists.erl"},{line,2146}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>      {ra_server_proc,handle_effects,5,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>                      [{file,"src/ra_server_proc.erl"},{line,1301}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>      {lists,foldl_1,3,[{file,"lists.erl"},{line,2151}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>      {ra_server_proc,handle_effects,5,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0>                      [{file,"src/ra_server_proc.erl"},{line,1301}]}]

Reproduction steps

This is easiest to reproduce on 4.0.x but can also happen on 3.13.x.

  1. Create a quorum queue "q1" in a 3-node cluster with the leader on rabbit-1 and the quorum_min_checkpoint_interval application config set to 1.
  2. Stop the member on rabbit-3, e.g. ra:stop_server(quorum_queues, {'%2F_q1', node()}).
  3. Publish 2 messages.
  4. Trigger a checkpoint for the leader member: ra:cast_aux_command({'%2F_q1', 'rabbit-1@HOST'}, force_checkpoint).
  5. Publish 1 more message.
  6. Attach and then detach a consumer for the queue over a connection to rabbit-3 (no messages should be delivered, but they will show as unacked).
  7. Purge the queue.
  8. Restart the member on rabbit-3: ra:restart_server(quorum_queues, {'%2F_q1', node()}).
  9. Observe a member crash on rabbit-3.

The member may recover after step 9 - this is also, in fact, a bug.
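
The Erlang calls in steps 2, 4 and 8, plus the purge in step 7, can be driven from a shell with rabbitmqctl eval and rabbitmqctl purge_queue. The following is only a sketch under assumptions: the function names, the CTL and RMQ_HOST variables, and the node names rabbit-1@HOST / rabbit-3@HOST are illustrative, and the publish and consumer steps (3, 5, 6) are left to whatever client is at hand.

```shell
#!/bin/sh
# Sketch only: CTL and RMQ_HOST are hypothetical variables, not part of RabbitMQ.
CTL="${CTL:-rabbitmqctl}"
RMQ_HOST="${RMQ_HOST:-$(hostname -s 2>/dev/null || echo localhost)}"

# Evaluate an Erlang expression on a given node.
# $1 = node short name (e.g. rabbit-3), $2 = Erlang expression
ra_eval() {
    "$CTL" --node "$1@$RMQ_HOST" eval "$2"
}

# Step 2: stop the q1 member on rabbit-3.
stop_member() {
    ra_eval rabbit-3 "ra:stop_server(quorum_queues, {'%2F_q1', node()})."
}

# Step 4: ask the leader member on rabbit-1 to take a checkpoint.
force_checkpoint() {
    ra_eval rabbit-1 "ra:cast_aux_command({'%2F_q1', 'rabbit-1@$RMQ_HOST'}, force_checkpoint)."
}

# Step 7: purge q1 (default virtual host assumed).
purge_queue() {
    "$CTL" --node "rabbit-1@$RMQ_HOST" purge_queue q1
}

# Step 8: restart the q1 member on rabbit-3.
restart_member() {
    ra_eval rabbit-3 "ra:restart_server(quorum_queues, {'%2F_q1', node()})."
}
```

The functions would be called in step order (stop_member, publish, force_checkpoint, publish, attach/detach a consumer, purge_queue, restart_member). Setting CTL=echo prints the commands instead of running them.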

Expected behavior

No crash

Additional context

Currently, a queue that experiences this error can be fixed by removing the faulty member from the quorum queue cluster, waiting a bit, and then re-adding it, using rabbitmq-queues delete_member and rabbitmq-queues add_member.
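
As a rough sketch of that workaround, the two commands could be wrapped as below. The function name readd_member, the QUEUES_CMD and WAIT_SECS variables, and the 10 second default wait are assumptions for illustration; the issue only says to "wait a bit".

```shell
#!/bin/sh
# Sketch only: QUEUES_CMD and WAIT_SECS are hypothetical variables.
QUEUES_CMD="${QUEUES_CMD:-rabbitmq-queues}"
WAIT_SECS="${WAIT_SECS:-10}"   # assumed pause; the right value is not known

# Remove the faulty member, pause, then add it back.
# $1 = virtual host, $2 = queue name, $3 = node hosting the faulty member
readd_member() {
    "$QUEUES_CMD" delete_member --vhost "$1" "$2" "$3" || return 1
    sleep "$WAIT_SECS"
    "$QUEUES_CMD" add_member --vhost "$1" "$2" "$3"
}
```

For the queue in the reproduction steps this might be invoked as readd_member / q1 rabbit-3@HOST, run on a host where the CLI tools can reach a running cluster node.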
