Labels
enhancement (New feature or request)
Description
Issue
In the takeover flow, the replica sends DFLY TAKEOVER to the master. On the master side, a checkpoint message is then placed on all active connections (internally, we iterate over all the connections and place a Checkpoint msg in each dispatch queue). We then wait for every connection to process its Checkpoint msg, or for the wait to time out. In the latter case, the takeover fails.
This works well most of the time, but the main issue is that if a pipeline is blocked, the checkpoint message is never processed and Dragonfly fails to take over.
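To make the failure mode concrete, here is a minimal toy model of the checkpoint wait in Python. It is only a sketch, not Dragonfly's implementation: `Connection`, `try_takeover`, and the `blocked` flag are hypothetical names, and a thread stands in for the connection's fiber. Each connection drains its dispatch queue in order, so a connection that is stuck before reaching the queue never acknowledges the checkpoint.

```python
import queue
import threading
import time


class Connection:
    """Toy connection (hypothetical): one worker drains a dispatch queue in order."""

    def __init__(self, blocked=False):
        self.dispatch_q = queue.Queue()
        # While this event is unset, the worker is stalled and never reaches
        # the queue, mimicking a pipeline stuck on a send that cannot complete.
        self.unblock = threading.Event()
        if not blocked:
            self.unblock.set()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        self.unblock.wait()
        while True:
            ack = self.dispatch_q.get()
            if ack is None:  # shutdown sentinel
                return
            ack.set()  # "process" the checkpoint message


def try_takeover(connections, timeout):
    """Place a checkpoint on every connection and wait for all acks or a timeout."""
    acks = []
    for conn in connections:
        ack = threading.Event()
        conn.dispatch_q.put(ack)
        acks.append(ack)

    deadline = time.monotonic() + timeout
    for ack in acks:
        remaining = max(0.0, deadline - time.monotonic())
        if not ack.wait(remaining):
            return False  # at least one connection never processed its checkpoint
    return True


healthy = [Connection(), Connection()]
stuck = [Connection(), Connection(blocked=True)]
print(try_takeover(healthy, timeout=0.5))
print(try_takeover(stuck, timeout=0.2))
```

In this model a single blocked connection is enough to fail the whole takeover, regardless of how many healthy connections acknowledge in time.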
Reproduce
Use test_send_timeout and remove the timeout (so the connection remains idle but is not killed). Then initiate a takeover. It should fail because the async fiber is blocked.
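For context on why the fiber can stay blocked, the sketch below assumes the stall comes from a peer that stops reading replies, so the sender's socket buffer fills up (this is my reading of the scenario, not something confirmed above). It uses a local `socket.socketpair()` rather than a real Dragonfly instance, and `fill_send_buffer` is a hypothetical helper name.

```python
import socket


def fill_send_buffer(chunk=b"x" * 4096, max_chunks=100_000):
    """Write to a peer that never reads until the kernel refuses more data."""
    server_side, client_side = socket.socketpair()
    server_side.setblocking(False)  # surface "would block" as an exception
    sent = 0
    blocked = False
    try:
        for _ in range(max_chunks):
            try:
                sent += server_side.send(chunk)
            except BlockingIOError:
                # The send buffer is full. A blocking writer would now wait
                # here until the peer reads or the connection is closed.
                blocked = True
                break
    finally:
        server_side.close()
        client_side.close()
    return sent, blocked


sent, blocked = fill_send_buffer()
print(f"bytes accepted before the send would block: {sent}, blocked: {blocked}")
```

Without a send timeout, nothing forces such a writer to give up, which matches the "idle but not killed" state described in the reproduction steps.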
Remark
- At any point, if the async fiber blocks for an extended period of time, dispatch_q_ processing stops as well, so any flow that relies on and waits for connection dispatch_q_ processing will time out.