Conversation

@kostasrim
Contributor

@kostasrim kostasrim commented Dec 1, 2025

TakeOver algorithm as is

Imagine a simple setup: node 1 is the master, node 2 is a replica, and node 2 attempts a takeover on node 1.

For node 1:

Steps that affect the TakeOver result/state:

  1. Switch state to TAKEN_OVER -> connections will get their commands rejected.
  2. Wait for all connections to finish their dispatches -> send a Checkpoint message on each connection and wait for it to be processed.
  • New step here from this PR, between 2 and 3: optionally force-shutdown all open connections. This does not break data parity. A connection stuck on send will break with an error, but the transaction data gets replicated anyway (we already wrote the change to the journal).
  3. Wait for the replica to catch up (reach the same LSN as the master) -> data parity.
  4. Reply with OK. At this point the TAKEOVER is considered complete and node 2 is the master.

Optional steps that improve other flows:

  1. Optionally: if more nodes are registered as replicas, wait for those to catch up as well. This does not affect the takeover itself but allows all of the replicas to fully sync. (Also needed for partial-sync data integrity.)
  2. Optionally: take a snapshot.
  3. Optionally: in cluster mode, do not shut down, so that the node can redirect requests.

Notes

  • Dragonfly is still accepting new connections during the takeover. However, once dragonfly switches to the TAKEN_OVER state, connections won't be able to execute commands and will get a “busy loading” error.

  • Checkpoints are needed for data integrity. Once all the connections have processed their checkpoints, any other command in their dispatch queues will fail with busy loading. Hence, after step (2) there won't be any further change in the state (storage), and after step (3) the taking-over node has data parity.
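For readers skimming the algorithm, here is a minimal C++-style sketch of node 1's side of the flow. All names and signatures are illustrative stand-ins (except ForceShutdownStuckConnections, which this PR introduces); only the ordering matters.

#include <cstdint>

// Illustrative stand-ins, not the real Dragonfly APIs; bodies are stubs.
enum class GlobalState { ACTIVE, TAKEN_OVER };
static void SwitchState(GlobalState) {}                                        // step 1
static bool AwaitCheckpointsOnAllConnections(uint64_t /*ms*/) { return true; } // step 2
static void ForceShutdownStuckConnections(uint64_t /*ms*/) {}                  // new step (this PR)
static void WaitReplicaReachesMasterLsn() {}                                   // step 3
static void ReplyOk() {}                                                       // step 4

// `force` corresponds to the experimental_force_takeover flag added in this PR.
void TakeOverOnMaster(uint64_t timeout_ms, bool force) {
  SwitchState(GlobalState::TAKEN_OVER);                         // 1. new commands get rejected
  bool drained = AwaitCheckpointsOnAllConnections(timeout_ms);  // 2. drain in-flight dispatches
  if (!drained && force) {
    ForceShutdownStuckConnections(timeout_ms);                  // between 2 and 3: break stuck connections
  }
  WaitReplicaReachesMasterLsn();                                // 3. data parity (same LSN)
  ReplyOk();                                                    // 4. takeover complete, node 2 is master
}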

This PR:

Adds a step between 2 and 3 to handle non-responding connections and force the takeover by forcefully shutting them down.

Follows up #6135. Adds a flag, experimental_force_takeover. If the takeover times out and this flag is set, dragonfly will force-shutdown open connections so that the takeover does not fail.

Resolves: #6114

@kostasrim kostasrim self-assigned this Dec 1, 2025
#include <utility>

#include "absl/cleanup/cleanup.h"
#include "absl/debugging/stacktrace.h"
Contributor Author

What the hell 😮‍💨

@kostasrim kostasrim changed the base branch from main to kpr33 December 1, 2025 13:06
@augmentcode augmentcode bot left a comment

Review completed. 1 suggestion posted.


<< ", is_sending=" << dfly_conn->IsSending() << ", idle_read=" << idle_read
<< ", stuck_sending=" << stuck_sending;
}
conn_refs.push_back(dfly_conn->Borrow());
Contributor Author

I shut down the connections anyway, regardless of whether they are stuck on send or not. The rationale is that a connection can become active at any point if the client starts reading from the socket. If that happens, we will fail again because the connection won't show up as stuck. To mitigate this, I force-shutdown all connections that were not closed previously.

ForceShutdownStuckConnections(uint64_t(timeout));

// Safety net.
facade::DispatchTracker tracker{sf_->GetNonPriviligedListeners(), cntx->conn(), false, false};
Contributor Author

Not really needed to wait again, tbh. We forcefully kill connections, and the ones that were in a good state were already shut down (because of the previous dispatch tracker).

Contributor

Connections stuck on socket writes are not the only kind of bad connection.

Contributor Author

+1, but I kill them all anyway. Even for connections stuck on send, it could be that the client reads, so the send gets unstuck. We should shut down the connection regardless, as we can't know when the client will read.

@kostasrim kostasrim requested review from dranikpg and romange December 1, 2025 14:15
for (auto& ref : conn_refs) {
facade::Connection* conn = ref.Get();
if (conn) {
conn->ShutdownSelfBlocking();
Collaborator

I think you should do it from the connection thread.

Contributor

Yes, it's unsafe to do so from foreign threads; WeakRef has it in the comments.

Contributor Author

+1. I will also send a separate PR, as we have this bug in another place as well.

vector<facade::Connection::WeakRef> conn_refs;
auto cb = [&](unsigned thread_index, util::Connection* conn) {
facade::Connection* dcon = static_cast<facade::Connection*>(conn);
LOG(INFO) << dcon->DebugInfo();
Collaborator

Maybe VLOG? And a more meaningful message.

Contributor

@dranikpg dranikpg left a comment

I thought about it today. I haven't edited the code in years, but it seems that we don't really have clearly defined boundaries anymore for what the process should be in theory.

The rationale a long time ago was really simple: once we set TAKEN_OVER state, not all connections might have finished running transactions. So to keep data integrity/operation atomicity, we have to wait for all of them to finish. Nothing more.

It was easiest to base this mechanism just on a new message type, because we already had the queue set up with different types.
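For context, a minimal sketch of that checkpoint idea using std primitives only; this is not Dragonfly's actual facade::DispatchTracker, just the shape of the mechanism.

#include <chrono>
#include <condition_variable>
#include <mutex>

// Each connection gets a checkpoint message pushed into its dispatch queue; the
// dispatch fiber acknowledges it when it pops it. Because the queue is FIFO, an
// ack proves every command queued before the takeover has already finished.
class CheckpointBarrier {
 public:
  explicit CheckpointBarrier(int pending) : pending_(pending) {}

  // Called by a connection's dispatch loop when it reaches its checkpoint.
  void Ack() {
    std::lock_guard lk(mu_);
    if (--pending_ == 0) cv_.notify_all();
  }

  // Called by the takeover fiber; false means some connection never processed
  // its checkpoint in time - the "stuck" case this PR force-handles.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock lk(mu_);
    return cv_.wait_for(lk, timeout, [&] { return pending_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int pending_;
};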

return;
}

bool idle_read = phase == Phase::READ_SOCKET && dfly_conn->idle_time() > timeout;
Contributor

Is idle_read really an interesting state?

Contributor

I see, being blocked on pipeline overflow

Contributor Author

Yes, but that's for the logs; we kill all the connections.

Collaborator

I do not understand, actually. How can an idle read happen for stuck connections?

Contributor Author

@romange because in DispatchSingle we block when the pipeline is over its limits, but we set last_interaction_ = time(nullptr); only after we block. idle_read should reflect that case.

Either way, I kill all the connections unconditionally. I added the logs because I wanted us to be aware of those stuck cases.
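A small sketch of the ordering being described; member and helper names only approximate the real connection code, they are not copied from it.

#include <ctime>

class ConnSketch {
 public:
  // The dispatch path may block on pipeline backpressure *before* it refreshes
  // last_interaction_, so idle_time() keeps growing for a connection that is
  // merely waiting - which is what the idle_read log surfaces.
  void DispatchSingle(bool pipeline_over_limit) {
    if (pipeline_over_limit) {
      WaitForPipelineDrain();                // may block here for a long time
    }
    last_interaction_ = std::time(nullptr);  // refreshed only after unblocking
  }

  // Seconds since the last refresh; large while blocked above.
  std::time_t idle_time() const { return std::time(nullptr) - last_interaction_; }

 private:
  void WaitForPipelineDrain() {}             // stub for the real backpressure wait
  std::time_t last_interaction_ = std::time(nullptr);
};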

Comment on lines 570 to 573
shard_set->pool()->AwaitFiberOnAll([&](unsigned index, auto* pb) {
sf_->CancelBlockingOnThread();
tracker.TrackOnThread();
});
Contributor

You can shut them down from this callback. The most important detail is that we only care about connections that were running before TAKEN_OVER was set. We can't track them anymore here, but if, for example, a connection connected in between ForceShutdownStuckConnections and this callback and caused you trouble, we don't care about it: it can't do any mutable operations.

Contributor Author

We can't track them anymore here, but if, for example, a connection connected in between ForceShutdownStuckConnections and this callback and caused you trouble, we don't care about it

That's true, but is there a point in "filtering/selecting" these? Just kill them all. We are on the force path anyway.
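Putting the two suggestions together, a hedged sketch of one way this could look: collect the refs per owning thread during the traversal, then shut them down from inside the AwaitFiberOnAll callback so ShutdownSelfBlocking() always runs on the connection's own thread. per_thread, the traversal wiring, and the pool size lookup are illustrative glue assumed for the example, not code from this PR.

#include <vector>

std::vector<std::vector<facade::Connection::WeakRef>> per_thread(
    shard_set->pool()->size());

auto collect = [&](unsigned thread_index, util::Connection* conn) {
  auto* dfly_conn = static_cast<facade::Connection*>(conn);
  per_thread[thread_index].push_back(dfly_conn->Borrow());
};
// ... traverse connections with `collect`, as the existing code already does ...

shard_set->pool()->AwaitFiberOnAll([&](unsigned index, auto* pb) {
  for (auto& ref : per_thread[index]) {
    if (facade::Connection* conn = ref.Get(); conn != nullptr) {
      conn->ShutdownSelfBlocking();  // safe: we are on the owning thread
    }
  }
});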

Comment on lines +186 to +189
// Helper for takeover flow. Sometimes connections get stuck on send (because of pipelines)
// and this causes the takeover flow to fail because checkpoint messages are not processed.
// This function force shuts down those connection and allows the node to complete the takeover.
void ForceShutdownStuckConnections(uint64_t timeout);
Contributor

It isn't necessarily a pipeline; it can also be a connection stuck on a large send.

@dranikpg
Contributor

dranikpg commented Dec 1, 2025

It will work and it will be more reliable than what we currently have; I just have this feeling that we overcomplicated it by not changing the message approach. Please see the small comments.

However, this case will be fixed once we switch to asynchronous execution of commands

Maybe this will be a good moment to transition to a better approach

ABSL_DECLARE_FLAG(bool, info_replication_valkey_compatible);
ABSL_DECLARE_FLAG(uint32_t, replication_timeout);
ABSL_DECLARE_FLAG(uint32_t, shard_repl_backlog_len);
ABSL_FLAG(bool, experimental_force_takeover, false,
Contributor Author

@romange @dranikpg

Maybe instead introduce a subcommand to takeover, like force, instead of a flag? That way, if we have a failed TAKEOVER, we can follow up with a forced one.

Collaborator

I prefer we have the flag and improve TAKEOVER rather than adding those micro-behaviors.
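For illustration, a sketch of how the flag-based path might gate the forced shutdown. The control flow and the WaitForDispatchesToFinish / ReplyTakeoverTimedOut helpers are hypothetical; only the flag and ForceShutdownStuckConnections come from this PR.

const bool force = absl::GetFlag(FLAGS_experimental_force_takeover);
if (!WaitForDispatchesToFinish(timeout)) {           // step 2 timed out
  if (!force) {
    return ReplyTakeoverTimedOut();                  // old behavior: fail the takeover
  }
  ForceShutdownStuckConnections(uint64_t(timeout));  // new behavior: break stuck connections
}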

@kostasrim
Contributor Author

It was easiest to base this mechanism just on a new message type, because we already had the queue set up with different types.

Are you suggesting this or was it informational? I mean, we have Checkpoint, which we don't process because we are stuck. Am I missing something?

@romange
Collaborator

romange commented Dec 2, 2025

@kostasrim can you please write in the PR description the existing algorithm behind takeover? Specifically, is it a stateful flow operation where we tell the listener to stop accepting new connections, mark all connections as shutting down, and try to break the existing flows, or do we just try to break the existing connections while new ones may be added during this time? Maybe it's time to revisit the algorithm and to implement something sound?

@romange
Collaborator

romange commented Dec 2, 2025

The rationale a long time ago was really simple: once we set TAKEN_OVER state, not all connections might have finished running transactions. So to keep data integrity/operation atomicity, we have to wait for all of them to finish. Nothing more.

I read it after I wrote my comment and I approve this message :)
What is the process now and what should it actually be? Let's start with high-level English.

@kostasrim
Contributor Author

kostasrim commented Dec 3, 2025

@kostasrim can you please write in the PR description the existing algorithm behind takeover? Specifically, is it a stateful flow operation where we tell the listener to stop accepting new connections, mark all connections as shutting down, and try to break the existing flows, or do we just try to break the existing connections while new ones may be added during this time?

I updated the description with the algorithm + some small explanations.

Maybe it's time to revisit the algorithm and to implement something sound?

Yes, but please read what I wrote in "thinking out loud" #6114 (comment). I would like to revisit this and I am happy to write a small doc internally on what to do, but I would first wait for the changes in the connection fiber/pipelining we are implementing. Once that's in place, we will adjust/redesign accordingly. When we become fully asynchronous, for example, a Checkpoint message will be processed immediately, since the connection fiber will then be a multiplexer of different events. However, there might be multiple in-flight commands. Obviously the interactions there are slightly different, and I don't want to rush any decisions without having the asynchronicity first set in stone.

My rationale for this PR is to have, for now, a temporary solution that allows dragonfly to complete the takeover in an intrusive/forceful way until we gradually transition to a new approach.

@romange
Collaborator

romange commented Dec 3, 2025

I added additional logs which I hope will shed more light on why takeover fails BESIDES the stuck-send explanation.
Btw, stuck send is a relatively rare phenomenon for us, and we have a solution for it which should decrease the impact on takeover drastically. All in all, I suggest we take it offline as it's not super urgent to solve now.

Signed-off-by: Kostas Kyrimis <[email protected]>
@kostasrim kostasrim requested review from dranikpg and romange December 3, 2025 12:54
@kostasrim
Contributor Author

@dranikpg hope I didn't forget anything
