Conversation

@kostasrim
Contributor

@kostasrim kostasrim commented Dec 1, 2025

TakeOver algorithm as is

Imagine a simple setup: node 1 is the master, node 2 is a replica, and node 2 attempts a takeover on node 1.

For node 1:

Steps that affect the TakeOver result/state:

  1. Switch state to TAKEN_OVER -> connections will get their commands rejected.
  2. Wait for all connections to finish their dispatches -> send a Checkpoint message on each connection and wait for it to be processed.
  • New step here from this PR, between 2 and 3: optionally force-shutdown all open connections. This does not break data parity. A connection stuck on send will break with an error, but the transaction data gets replicated anyway (we already wrote the change to the journal).
  3. Wait for the replica to catch up (reach the same LSN as the master) -> data parity.
  4. Reply with OK. At this point the TAKEOVER is considered complete and node 2 is the master.

Optional steps that improve other flows:

  1. Optionally: if more nodes are registered as replicas, wait for those to catch up as well. This does not affect the takeover itself but allows all of the replicas to fully sync. (Also needed for partial-sync data integrity.)
  2. Optionally: take a snapshot.
  3. Optionally: in cluster mode, do not shut down, so that the node can redirect requests.

Notes

  • Dragonfly is still accepting new connections during the takeover. However, once dragonfly switches to the TAKEN_OVER state, connections won't be able to execute commands and will get a “busy loading” error.

  • Checkpoints are needed for data integrity. Once all the connections have processed their checkpoints, any other command in their dispatch queues will fail with busy loading. Hence, after step (2) there won't be any further change in the state (storage), and after step (3) the taking-over node has data parity.
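For readers skimming the algorithm, here is a minimal C++-style sketch of node 1's side of the flow. All names and signatures are illustrative stand-ins (except ForceShutdownStuckConnections, which this PR introduces); only the ordering matters.

#include <cstdint>

// Illustrative stand-ins, not the real Dragonfly APIs; bodies are stubs.
enum class GlobalState { ACTIVE, TAKEN_OVER };
static void SwitchState(GlobalState) {}                                        // step 1
static bool AwaitCheckpointsOnAllConnections(uint64_t /*ms*/) { return true; } // step 2
static void ForceShutdownStuckConnections(uint64_t /*ms*/) {}                  // new step (this PR)
static void WaitReplicaReachesMasterLsn() {}                                   // step 3
static void ReplyOk() {}                                                       // step 4

// `force` corresponds to the experimental_force_takeover flag added in this PR.
void TakeOverOnMaster(uint64_t timeout_ms, bool force) {
  SwitchState(GlobalState::TAKEN_OVER);                         // 1. new commands get rejected
  bool drained = AwaitCheckpointsOnAllConnections(timeout_ms);  // 2. drain in-flight dispatches
  if (!drained && force) {
    ForceShutdownStuckConnections(timeout_ms);                  // between 2 and 3: break stuck connections
  }
  WaitReplicaReachesMasterLsn();                                // 3. data parity (same LSN)
  ReplyOk();                                                    // 4. takeover complete, node 2 is master
}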

This PR:

Adds a step between 2 and 3 to handle non-responding connections and force the takeover by forcefully shutting them down.

Follows up #6135. Adds a flag, experimental_force_takeover. If the takeover times out and this flag is set, dragonfly will force-shutdown open connections so that the takeover does not fail.

Resolves: #6114

@kostasrim kostasrim self-assigned this Dec 1, 2025
#include <utility>

#include "absl/cleanup/cleanup.h"
#include "absl/debugging/stacktrace.h"
Contributor Author

What the hell 😮‍💨

@kostasrim kostasrim changed the base branch from main to kpr33 December 1, 2025 13:06
@augmentcode augmentcode bot left a comment

Review completed. 1 suggestion posted.


<< ", is_sending=" << dfly_conn->IsSending() << ", idle_read=" << idle_read
<< ", stuck_sending=" << stuck_sending;
}
conn_refs.push_back(dfly_conn->Borrow());
Contributor Author

I shut down the connections anyway, regardless of whether they are stuck on send or not. The rationale is that a connection can become active at any point if the client starts reading from the socket. If that happens, we will fail again because the connection won't show up as stuck. To mitigate this, I force-shutdown all connections that were not closed previously.

ForceShutdownStuckConnections(uint64_t(timeout));

// Safety net.
facade::DispatchTracker tracker{sf_->GetNonPriviligedListeners(), cntx->conn(), false, false};
Contributor Author

Not really needed to wait again, tbh. We forcefully kill connections, and the ones that were in a good state were already shut down (because of the previous dispatch tracker).

Contributor

Connections stuck on socket writes are not the only kind of bad connection.

Contributor Author

+1, but I kill them all anyway. Even for connections stuck on send, it could be that the client reads, so the send gets unstuck. We should shut down the connection regardless, as we can't know when the client will read.

@kostasrim kostasrim requested review from dranikpg and romange December 1, 2025 14:15
for (auto& ref : conn_refs) {
facade::Connection* conn = ref.Get();
if (conn) {
conn->ShutdownSelfBlocking();
Collaborator

I think you should do it from the connection thread.

Contributor

Yes, it's unsafe to do so from foreign threads; WeakRef has it in the comments.

Contributor Author

+1. I will also send a separate PR, as we have this bug in another place as well.

vector<facade::Connection::WeakRef> conn_refs;
auto cb = [&](unsigned thread_index, util::Connection* conn) {
facade::Connection* dcon = static_cast<facade::Connection*>(conn);
LOG(INFO) << dcon->DebugInfo();
Collaborator

Maybe VLOG? And a more meaningful message.

Contributor

@dranikpg dranikpg left a comment

I thought about it today. I haven't edited the code in years, but it seems that we don't really have clearly defined boundaries anymore for what the process should be in theory.

The rationale a long time ago was really simple: once we set TAKEN_OVER state, not all connections might have finished running transactions. So to keep data integrity/operation atomicity, we have to wait for all of them to finish. Nothing more.

It was easiest to base this mechanism just on a new message type, because we already had the queue set up with different types.
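For context, a minimal sketch of that checkpoint idea using std primitives only; this is not Dragonfly's actual facade::DispatchTracker, just the shape of the mechanism.

#include <chrono>
#include <condition_variable>
#include <mutex>

// Each connection gets a checkpoint message pushed into its dispatch queue; the
// dispatch fiber acknowledges it when it pops it. Because the queue is FIFO, an
// ack proves every command queued before the takeover has already finished.
class CheckpointBarrier {
 public:
  explicit CheckpointBarrier(int pending) : pending_(pending) {}

  // Called by a connection's dispatch loop when it reaches its checkpoint.
  void Ack() {
    std::lock_guard lk(mu_);
    if (--pending_ == 0) cv_.notify_all();
  }

  // Called by the takeover fiber; false means some connection never processed
  // its checkpoint in time - the "stuck" case this PR force-handles.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock lk(mu_);
    return cv_.wait_for(lk, timeout, [&] { return pending_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int pending_;
};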

return;
}

bool idle_read = phase == Phase::READ_SOCKET && dfly_conn->idle_time() > timeout;
Contributor

Is idle_read really an interesting state?

Contributor

I see, being blocked on pipeline overflow

Contributor Author

Yes, but that's for the logs; we kill all the connections.

Collaborator

I do not understand, actually. How can an idle read happen for stuck connections?

Contributor Author

@romange because in DispatchSingle we block when the pipeline is over its limits, but we set last_interaction_ = time(nullptr); only after we block. idle_read should reflect that case.

Either way, I kill all the connections unconditionally. I added the logs because I wanted us to be aware of those stuck cases.
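A small sketch of the ordering being described; member and helper names only approximate the real connection code, they are not copied from it.

#include <ctime>

class ConnSketch {
 public:
  // The dispatch path may block on pipeline backpressure *before* it refreshes
  // last_interaction_, so idle_time() keeps growing for a connection that is
  // merely waiting - which is what the idle_read log surfaces.
  void DispatchSingle(bool pipeline_over_limit) {
    if (pipeline_over_limit) {
      WaitForPipelineDrain();                // may block here for a long time
    }
    last_interaction_ = std::time(nullptr);  // refreshed only after unblocking
  }

  // Seconds since the last refresh; large while blocked above.
  std::time_t idle_time() const { return std::time(nullptr) - last_interaction_; }

 private:
  void WaitForPipelineDrain() {}             // stub for the real backpressure wait
  std::time_t last_interaction_ = std::time(nullptr);
};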

Comment on lines 570 to 573
shard_set->pool()->AwaitFiberOnAll([&](unsigned index, auto* pb) {
sf_->CancelBlockingOnThread();
tracker.TrackOnThread();
});
Contributor

You can shut them down from this callback. The most important detail is that we only care about connections that were running before TAKEN_OVER was set. We can't track them anymore here, but if, for example, a connection connected in between ForceShutdownStuckConnections and this callback and caused you trouble, we don't care about it: it can't do any mutable operations.

Contributor Author

We can't track them anymore here, but if, for example, a connection connected in between ForceShutdownStuckConnections and this callback and caused you trouble, we don't care about it

That's true, but is there a point in "filtering/selecting" these? Just kill them all. We are on the force path anyway.
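Putting the two suggestions together, a hedged sketch of one way this could look: collect the refs per owning thread during the traversal, then shut them down from inside the AwaitFiberOnAll callback so ShutdownSelfBlocking() always runs on the connection's own thread. per_thread, the traversal wiring, and the pool size lookup are illustrative glue assumed for the example, not code from this PR.

#include <vector>

std::vector<std::vector<facade::Connection::WeakRef>> per_thread(
    shard_set->pool()->size());

auto collect = [&](unsigned thread_index, util::Connection* conn) {
  auto* dfly_conn = static_cast<facade::Connection*>(conn);
  per_thread[thread_index].push_back(dfly_conn->Borrow());
};
// ... traverse connections with `collect`, as the existing code already does ...

shard_set->pool()->AwaitFiberOnAll([&](unsigned index, auto* pb) {
  for (auto& ref : per_thread[index]) {
    if (facade::Connection* conn = ref.Get(); conn != nullptr) {
      conn->ShutdownSelfBlocking();  // safe: we are on the owning thread
    }
  }
});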

Comment on lines +186 to +189
// Helper for takeover flow. Sometimes connections get stuck on send (because of pipelines)
// and this causes the takeover flow to fail because checkpoint messages are not processed.
// This function force shuts down those connection and allows the node to complete the takeover.
void ForceShutdownStuckConnections(uint64_t timeout);
Contributor

It isn't necessarily a pipeline; it can also be a connection stuck on a large send.

@dranikpg
Contributor

dranikpg commented Dec 1, 2025

It will work and it will be more reliable than what we currently have; I just have this feeling that we overcomplicated it by not changing the message approach. Please see the small comments.

However, this case will be fixed once we switch to asynchronous execution of commands

Maybe this will be a good moment to transition to a better approach

ABSL_DECLARE_FLAG(bool, info_replication_valkey_compatible);
ABSL_DECLARE_FLAG(uint32_t, replication_timeout);
ABSL_DECLARE_FLAG(uint32_t, shard_repl_backlog_len);
ABSL_FLAG(bool, experimental_force_takeover, false,
Contributor Author

@romange @dranikpg

Maybe instead introduce a subcommand to takeover, like force, instead of a flag? That way, if we have a failed TAKEOVER, we can follow up with a forced one.

Collaborator

I prefer we have the flag and improve TAKEOVER rather than adding those micro-behaviors.
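For illustration, a sketch of how the flag-based path might gate the forced shutdown. The control flow and the WaitForDispatchesToFinish / ReplyTakeoverTimedOut helpers are hypothetical; only the flag and ForceShutdownStuckConnections come from this PR.

const bool force = absl::GetFlag(FLAGS_experimental_force_takeover);
if (!WaitForDispatchesToFinish(timeout)) {           // step 2 timed out
  if (!force) {
    return ReplyTakeoverTimedOut();                  // old behavior: fail the takeover
  }
  ForceShutdownStuckConnections(uint64_t(timeout));  // new behavior: break stuck connections
}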

@kostasrim
Contributor Author

It was easiest to base this mechanism just on a new message type, because we already had the queue set up with different types.

Are you suggesting this or was it informational? I mean, we have Checkpoint, which we don't process because we are stuck. Am I missing something?

@romange
Collaborator

romange commented Dec 2, 2025

@kostasrim can you please write in the PR description the existing algorithm behind takeover? Specifically, is it a stateful flow operation where we tell the listener to stop accepting new connections, mark all connections as shutting down, and try to break the existing flows, or do we just try to break the existing connections while new ones may be added during this time? Maybe it's time to revisit the algorithm and to implement something sound?

@romange
Collaborator

romange commented Dec 2, 2025

The rationale a long time ago was really simple: once we set TAKEN_OVER state, not all connections might have finished running transactions. So to keep data integrity/operation atomicity, we have to wait for all of them to finish. Nothing more.

I read it after I wrote my comment and I approve this message :)
What is the process now and what should it actually be? Let's start with high-level English.

@kostasrim
Contributor Author

kostasrim commented Dec 3, 2025

@kostasrim can you please write in the PR description the existing algorithm behind takeover? Specifically, is it a stateful flow operation where we tell the listener to stop accepting new connections, mark all connections as shutting down, and try to break the existing flows, or do we just try to break the existing connections while new ones may be added during this time?

I updated the description with the algorithm + some small explanations.

Maybe it's time to revisit the algorithm and to implement something sound?

Yes, but please read what I wrote in "thinking out loud" #6114 (comment). I would like to revisit this and I am happy to write a small doc internally on what to do, but I would first wait for the changes in the connection fiber/pipelining we are implementing. Once that's in place, we will adjust/redesign accordingly. When we become fully asynchronous, for example, a Checkpoint message will be processed immediately, since the connection fiber will then be a multiplexer of different events. However, there might be multiple in-flight commands. Obviously the interactions there are slightly different, and I don't want to rush any decisions without having the asynchronicity first set in stone.

My rationale for this PR is to have, for now, a temporary solution that allows dragonfly to complete the takeover in an intrusive/forceful way until we gradually transition to a new approach.

@romange
Collaborator

romange commented Dec 3, 2025

I added additional logs which I hope will shed more light on why takeover fails BESIDES the stuck-send explanation.
Btw, stuck send is a relatively rare phenomenon for us, and we have a solution for it which should decrease the impact on takeover drastically. All in all, I suggest we take it offline as it's not super urgent to solve now.

Signed-off-by: Kostas Kyrimis <[email protected]>
@kostasrim kostasrim requested review from dranikpg and romange December 3, 2025 12:54
@kostasrim
Contributor Author

@dranikpg hope I didn't forget anything
