Describe the bug
Occasional crash of a member after a snapshot installation, caused by an attempt to read the command for an already consumed message. The reproduction steps are highly artificial, but this crash has been seen in the wild a couple of times and could happen if a follower member runs slowly on a node with consumers that come and go.
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> ** [{lists,zipwith,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> [#Fun<rabbit_fifo.60.126061837>,[],
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> [{1,[7352901|4]},{2,[7352904|4]}],
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> fail],
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> [{file,"lists.erl"},{line,844}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> {lists,zipwith,4,[{file,"lists.erl"},{line,845}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> {rabbit_fifo,'-delivery_effect/3-anonymous-5-',4,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> [{file,"rabbit_fifo.erl"},{line,2062}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> {ra_server_proc,handle_effect,5,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> [{file,"src/ra_server_proc.erl"},{line,1385}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> {lists,foldl,3,[{file,"lists.erl"},{line,2146}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> {ra_server_proc,handle_effects,5,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> [{file,"src/ra_server_proc.erl"},{line,1301}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> {lists,foldl_1,3,[{file,"lists.erl"},{line,2151}]},
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> {ra_server_proc,handle_effects,5,
2024-11-01 08:46:29.102566+00:00 [error] <0.23583.0> [{file,"src/ra_server_proc.erl"},{line,1301}]}]
Reproduction steps
This is easiest to re-create on 4.0.x but can also happen on 3.13.x.
1. Create a quorum queue "q1" in a 3 node cluster with the leader on rabbit-1 and the quorum_min_checkpoint_interval application config set to 1 (one possible way to force this is sketched after this list).
2. Stop the member on rabbit-3, e.g. ra:stop_server(quorum_queues, {'%2F_q1', node()}).
3. Publish 2 messages.
4. Trigger a checkpoint for the leader member: ra:cast_aux_command({'%2F_q1', 'rabbit-1@HOST'}, force_checkpoint).
5. Publish 1 more message.
6. Attach then detach a consumer for the queue connected to rabbit-3 (no message should be delivered but they will show as unacked).
7. Purge the queue.
8. Restart the member on rabbit-3: ra:restart_server(quorum_queues, {'%2F_q1', node()}).
9. Observe a member crash on rabbit-3.
The member may recover after step 9 - this is, in fact, also a bug.
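For step 1, a minimal sketch of forcing the very small checkpoint interval is to set the application environment value on each node before declaring the queue. The placement of the key under the rabbit application is an assumption here and should be verified against the RabbitMQ version under test:

```shell
# Assumed sketch: set a checkpoint interval of 1 on a node.
# The 'rabbit' application and the quorum_min_checkpoint_interval key are
# taken from step 1 above; verify both for your RabbitMQ version.
rabbitmqctl eval 'application:set_env(rabbit, quorum_min_checkpoint_interval, 1).'
```

Run this on every node (or set the equivalent entry in advanced.config) before declaring "q1" so the leader checkpoints after every applied command.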
Expected behavior
No crash
Additional context
Currently a queue that experiences this error can be fixed by removing the faulty member from the quorum queue cluster, waiting a bit, and then adding it back using rabbitmq-queues delete_member and rabbitmq-queues add_member (see the sketch below).
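A rough sketch of that workaround, where the vhost, queue name and node are placeholders for the affected queue and the node hosting the crashed member:

```shell
# Remove the faulty member of "q1" hosted on rabbit-3 from the quorum queue cluster
rabbitmq-queues delete_member --vhost / q1 rabbit-3@HOST

# Wait for the removal to complete and the remaining members to settle, then
# add a fresh member back on the same node
rabbitmq-queues add_member --vhost / q1 rabbit-3@HOST
```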