
Mirrored queue crash with out of sync ACKs #749

@dcorbacho

With the patch for #714 applied, in a 3-node cluster configured to test #545, the GM (guaranteed multicast) process can eventually crash while processing an activity message:

** {{case_clause,
        {{{value,
              {33059,
               {publish,<9042.1151.1>,flow,
                   {message_properties,undefined,false,2048},
                   {basic_message,
                       {resource,<<"/">>,exchange,<<"testExchange">>},
                       [<<>>],
                       {content,60,
....
    [{gm,find_common,3,[{file,"src/gm.erl"},{line,1369}]},
     {gm,'-handle_msg/2-fun-2-',7,[{file,"src/gm.erl"},{line,881}]},
     {gm,with_member_acc,3,[{file,"src/gm.erl"},{line,1386}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1262}]},
     {gm,handle_msg,2,[{file,"src/gm.erl"},{line,871}]},
     {gm,handle_cast,2,[{file,"src/gm.erl"},{line,661}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1049}]},
     {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,250}]}]}

which I believe leads to the following errors on the other nodes:

=ERROR REPORT==== 12-Apr-2016::16:18:05 ===
** Generic server <0.19461.2> terminating
** Last message in was {'$gen_cast',join}
** When Server state == {state,{9,<0.19461.2>},
                               {{9,<0.19461.2>},undefined},
                               {{9,<0.19461.2>},undefined},
                               {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                               rabbit_mirror_queue_slave,undefined,-1,
                               undefined,
                               [<0.19460.2>],
                               {[],[]},
                               [],0,undefined,
                               #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                               false}
** Reason for termination == 
** {{bad_return_value,
        {bad_flying_ets_update,1,2,
            {<<212,124,127,183,143,75,237,208,132,9,251,34,112,92,244,166>>,
             <<202,95,0,178,134,57,152,103,126,177,128,73,15,248,54,106>>}}},
    {gen_server2,call,
        [<5629.28766.2>,{add_on_right,{9,<0.19461.2>}},infinity]}}

and

=ERROR REPORT==== 12-Apr-2016::16:16:38 ===
** Generic server <0.27327.2> terminating
** Last message in was go
** When Server state == {not_started,
                            {amqqueue,
                                {resource,<<"/">>,queue,<<"myQuueue_a_2">>},
                                true,false,none,[],<0.26808.2>,[],[],[],
                                [{vhost,<<"/">>},
                                 {name,<<"all">>},
                                 {pattern,<<>>},
                                 {'apply-to',<<"all">>},
                                 {definition,
                                     [{<<"ha-mode">>,<<"all">>},
                                      {<<"ha-sync-mode">>,<<"automatic">>}]},
                                 {priority,0}],
                                [{<32227.15062.2>,<32227.14697.2>},
                                 {<0.26809.2>,<0.26808.2>}],
                                [],live}}
** Reason for termination == 
** {duplicate_live_master,'rabbit@t-srv-rabbit04'}

I do not suspect that this was introduced by #714; rather, it is a consequence of the deadlock being resolved. With the deadlock gone, the system keeps running through partial partitions with pause_minority and eventually reaches an inconsistent state.
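
For anyone trying to reproduce this, the partition handling mode mentioned above is set in the classic rabbitmq.config file. The fragment below is only a minimal sketch of that one setting; the rest of the test cluster's configuration is not given in this report and is not assumed here. The mirroring policy visible in the second error report (ha-mode all, ha-sync-mode automatic) is defined separately as a policy, not in this file.

```erlang
%% rabbitmq.config (classic Erlang-term format).
%% Sketch only: shows the pause_minority setting referred to in this issue.
%% Any other settings used by the actual test cluster are unknown.
[
  {rabbit, [
    %% Nodes that find themselves in a minority partition pause themselves.
    {cluster_partition_handling, pause_minority}
  ]}
].
```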
