
Conversation

@alpe
Contributor

@alpe alpe commented Jan 16, 2026

Replaces #2954

alpe added 30 commits November 12, 2025 15:16
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)
* main:
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
  build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
  ...
alpe added 16 commits January 7, 2026 16:11
* main:
  feat(tracing): add tracing to EngineClient (#2959)
  chore: inject W3C headers into engine client and eth client (#2958)
  feat: adding tracing for Executor and added initial configuration (#2957)
* main:
  fix: Fail fast when executor ahead (#2966)
  feat(block): async epoch fetching (#2952)
  perf: tune badger defaults and add db bench (#2950)
* main:
  chore: adding syncing tracing (#2981)
  feat(tracing): adding block production tracing (#2980)
  feat(tracing): Add Store, P2P and Config tracing (#2972)
  chore: fix upgrade test (#2979)
  build(deps): Bump github.com/ethereum/go-ethereum from 1.16.7 to 1.16.8 in /execution/evm/test in the go_modules group across 1 directory (#2974)
  feat(tracing): adding tracing to DA client (#2968)
  chore: create onboarding skill  (#2971)
  test: add e2e tests for force inclusion (part 2) (#2970)
  feat(tracing): adding eth client tracing (#2960)
  test: add e2e tests for force inclusion (#2964)
  build(deps): Bump the all-go group across 4 directories with 10 updates (#2969)
* main:
  fix: inconsistent state detection and rollback (#2983)
  chore: improve graceful shutdown restarts (#2985)
  feat(submitting): add posting strategies (#2973)
@claude
Contributor

claude bot commented Jan 16, 2026

Claude finished @alpe's task in 4m 56s


Code Review: High Availability via RAFT

I've completed a comprehensive review of this RAFT high availability implementation. This is a significant and well-structured feature adding consensus-based failover capabilities. Below is my detailed feedback organized by focus area.


Summary

This PR introduces RAFT consensus to enable high availability with automatic leader election and failover. The implementation is generally solid with good separation of concerns, comprehensive testing, and careful attention to safety. I've identified several areas for improvement around error handling, configuration validation, and edge case handling.

Key Strengths:

  • Clean architecture with proper interface boundaries
  • Comprehensive E2E testing for failover scenarios
  • Safety checks to prevent double-signing and split-brain scenarios
  • Good use of atomic operations and proper synchronization

Areas for Improvement:

  • Some error handling gaps in critical paths
  • Configuration validation could be more robust
  • Missing nil checks in a few places
  • Documentation gaps on RAFT-specific behavior

1. Code Quality & Architecture

Strengths

Clean Package Structure (pkg/raft/)

  • Well-organized with clear separation: node.go, election.go, types.go
  • Proper use of interfaces (Runnable, sourceNode) for testability
  • Zero-dependency FSM implementation following Hashicorp RAFT patterns

Modular Design (node/failover.go)

  • Factory pattern for leader/follower mode creation is excellent
  • failoverState properly encapsulates mode-specific components
  • Good separation between static (singleRoleElector) and dynamic (DynamicLeaderElection) election strategies

Issues & Recommendations

🟡 MEDIUM: Missing error propagation in FSM Apply (pkg/raft/node.go:309-337)

func (f *FSM) Apply(log *raft.Log) interface{} {
    // ...
    if err := assertValid(f.state.Load(), &state); err != nil {
        return err  // ✅ Good - errors returned
    }
    // ...
    if f.applyCh != nil {
        select {
        case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
        default:
            f.logger.Warn().Msg("apply channel full, dropping message")  // ⚠️ Silent drop
        }
    }
    return nil
}

Recommendation: While non-blocking is correct for RAFT performance, consider:

  • Adding metrics to track dropped messages (see the sketch after this list)
  • Implementing backpressure or alerting when channel is consistently full
  • Document this behavior clearly as it could cause followers to fall behind
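
For the metrics point above, a minimal sketch using the Prometheus client; the metric name and where it is registered are assumptions, not code from this PR:

package raft

import "github.com/prometheus/client_golang/prometheus"

// raftApplyDropped counts apply notifications dropped because the channel was
// full; incrementing it in the default branch of Apply (next to the warning
// log) makes drops visible to dashboards and alerting.
var raftApplyDropped = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "evnode_raft_apply_msgs_dropped_total",
	Help: "RaftApplyMsg notifications dropped due to a full apply channel.",
})

func init() { prometheus.MustRegister(raftApplyDropped) }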

🟡 MEDIUM: Bootstrap mode is hardcoded (pkg/raft/node.go:108-115)

func (n *Node) Start(_ context.Context) error {
    if !n.config.Bootstrap {
        return fmt.Errorf("raft cluster requires bootstrap mode")  // ⚠️ Non-bootstrap not supported
    }

Recommendation: Either:

  • Implement non-bootstrap mode for joining existing clusters dynamically
  • Remove the Bootstrap config field if only bootstrap mode is supported
  • Document why bootstrap-only is acceptable for the current use case

🟢 MINOR: Typo in log message (pkg/raft/node.go:122)

n.logger.Info().Msg("Boostrap raft cluster")  // ⚠️ Typo: "Boostrap" -> "Bootstrap"

🟡 MEDIUM: Leadership transfer errors not handled (pkg/raft/election.go:78,100,123)
Multiple locations call d.node.leadershipTransfer() and ignore errors:

_ = d.node.leadershipTransfer()  // ⚠️ Error ignored

Recommendation: Log errors consistently. Leadership transfer failures should be visible for debugging cluster issues.

🟢 MINOR: Magic number for ticker interval (pkg/raft/election.go:86)

ticker := time.NewTicker(300 * time.Millisecond)  // ⚠️ Magic number

Recommendation: Extract to a named constant or configuration parameter.


2. Error Handling & Resilience

Strengths

Excellent Safety Checks (pkg/raft/election.go:96-105)

// Critical Safety Check: Wait for FSM to apply all committed logs up to LastIndex.
// If we start leader operations with stale FSM, we risk Double Signing
if err := d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
    d.logger.Error().Err(err).Msg("failed to wait for messages to land - FSM lagging, abdicating to prevent safety violation")
    if tErr := d.node.leadershipTransfer(); tErr != nil {
        d.logger.Error().Err(tErr).Msg("failed to transfer leadership")
    }
    continue
}

This is excellent defensive programming to prevent double-signing attacks.

Proper Rollback Detection (block/internal/common/replay.go:71-88)
Automatic rollback with Rollbackable interface is well-designed.

Issues & Recommendations

🔴 HIGH: Missing context cancellation check in broadcast (pkg/raft/node.go:234-252)

func (n *Node) Broadcast(ctx context.Context, state *RaftBlockState) error {
    if !n.IsLeader() {
        return fmt.Errorf("not leader")
    }
    // ...
    if ctx.Err() != nil {
        return ctx.Err()  // ✅ Check before Apply
    }
    future := n.raft.Apply(data, n.config.SendTimeout)
    if err := future.Error(); err != nil {
        return fmt.Errorf("apply log: %w", err)
    }
    return nil
}

Issue: After raft.Apply() is called, there's no check if context was cancelled during the operation. The RAFT apply will complete even if the caller's context is cancelled.

Recommendation: Either:

  • Document this behavior (RAFT operations are not cancellable mid-flight)
  • Add context monitoring: wait for future.Error() in a separate goroutine and select on ctx.Done() (hashicorp/raft futures expose a blocking Error() method rather than a Done() channel); a sketch follows after this list
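
A minimal sketch of that pattern, assuming the hashicorp/raft ApplyFuture used in the quoted Broadcast; note that cancelling the context only unblocks the caller, the replicated apply still completes on the cluster:

package raft

import (
	"context"

	hraft "github.com/hashicorp/raft"
)

// waitApply blocks until the ApplyFuture resolves or ctx is cancelled.
// Cancellation does not abort the RAFT apply; it only releases the caller.
func waitApply(ctx context.Context, future hraft.ApplyFuture) error {
	done := make(chan error, 1)
	go func() { done <- future.Error() }() // Error() blocks until the entry is applied or fails
	select {
	case <-ctx.Done():
		return ctx.Err()
	case err := <-done:
		return err
	}
}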

🟡 MEDIUM: Snapshot restore doesn't validate state (pkg/raft/node.go:345-361)

func (f *FSM) Restore(rc io.ReadCloser) error {
    // ...
    if err := proto.Unmarshal(data, &state); err != nil {
        return fmt.Errorf("decode snapshot: %w", err)
    }
    f.state.Store(&state)  // ⚠️ No validation
    return nil
}

Recommendation: Add validation before restoring:

  • Check height is monotonic
  • Validate timestamp
  • Consider calling assertValid or similar validation (a sketch follows after this list)
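
A sketch of how Restore could reuse the existing check; whether assertValid's semantics are appropriate for snapshot restore (for example after log compaction) needs verifying against the PR, so treat this as illustrative only:

func (f *FSM) Restore(rc io.ReadCloser) error {
	data, err := io.ReadAll(rc)
	if err != nil {
		return fmt.Errorf("read snapshot: %w", err)
	}
	var state RaftBlockState
	if err := proto.Unmarshal(data, &state); err != nil {
		return fmt.Errorf("decode snapshot: %w", err)
	}
	// Reject snapshots that would move the FSM backwards relative to the
	// state it has already applied (monotonic height, sane timestamp).
	if err := assertValid(f.state.Load(), &state); err != nil {
		return fmt.Errorf("invalid snapshot state: %w", err)
	}
	f.state.Store(&state)
	return nil
}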

🟡 MEDIUM: Error handling in syncer rollback (block/internal/syncing/syncer.go:480-520)
The RecoverFromRaft method has complex error handling that might not properly clean up on partial failures. Consider transaction-like semantics or more explicit rollback on error.


3. Security

Strengths

Quorum Check Before Block Production (block/internal/executing/executor.go:411-413)

if e.raftNode != nil && !e.raftNode.HasQuorum() {
    return errors.New("raft cluster does not have quorum")
}

Excellent - prevents producing blocks without consensus availability.

Hash Verification After Replay (block/internal/executing/executor.go:305-321)
Double-checking content matches RAFT state prevents "Dual-Store Conflict" scenarios.

Issues & Recommendations

🟡 MEDIUM: JWT secret handling in tests (test/e2e/failover_e2e_test.go:57-63)
Tests handle JWT secrets but file paths aren't validated. Ensure test secrets don't leak to logs or error messages.

🟢 MINOR: Consider TLS for RAFT transport (pkg/raft/node.go:88-91)

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

Currently uses unencrypted TCP. For production, consider:

  • Adding TLS support for RAFT inter-node communication (a transport sketch follows after this list)
  • Mutual TLS for cluster authentication
  • Document this as a deployment consideration
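
If TLS is added later, hashicorp/raft supports it by passing a custom StreamLayer to raft.NewNetworkTransport instead of using NewTCPTransport. A rough sketch, with the package name and all certificate wiring assumed rather than taken from this PR:

package raftls

import (
	"crypto/tls"
	"io"
	"net"
	"time"

	"github.com/hashicorp/raft"
)

// tlsStreamLayer accepts inbound RAFT connections on a TLS listener and
// dials peers over TLS, satisfying raft.StreamLayer.
type tlsStreamLayer struct {
	net.Listener
	cfg *tls.Config
}

func (t *tlsStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	d := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(d, "tcp", string(addr), t.cfg)
}

// newTLSTransport mirrors the quoted NewTCPTransport call (pool size 3, 10s
// timeout) but runs inter-node traffic over TLS.
func newTLSTransport(bindAddr string, cfg *tls.Config, logOutput io.Writer) (*raft.NetworkTransport, error) {
	ln, err := tls.Listen("tcp", bindAddr, cfg)
	if err != nil {
		return nil, err
	}
	return raft.NewNetworkTransport(&tlsStreamLayer{Listener: ln, cfg: cfg}, 3, 10*time.Second, logOutput), nil
}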

🟢 MINOR: Error messages could leak sensitive info (pkg/raft/node.go:132)
Ensure peer addresses in error messages don't expose internal network topology in logs accessible to untrusted parties.


4. Performance & Resource Efficiency

Strengths

Non-blocking FSM Apply (pkg/raft/node.go:328-333)
Using select with default case prevents RAFT cluster from blocking on slow consumers.

Proper Connection Pooling (block/internal/syncing/raft_retriever.go:62)
Buffered channel (100) provides reasonable buffering without excessive memory.

Issues & Recommendations

🟡 MEDIUM: Unbounded goroutine in election loop (pkg/raft/election.go:75-84)

go func(childCtx context.Context) {
    defer wg.Done()
    if err := workerFunc(childCtx); err != nil && !errors.Is(err, context.Canceled) {
        _ = d.node.leadershipTransfer()
        select {
        case errCh <- fmt.Errorf(name+" worker exited unexpectedly: %s", err):
        default: // do not block
        }
    }
}(workerCtx)

Recommendation: While current implementation is safe, consider:

  • Adding a timeout for worker shutdown in startWorker (a bounded-wait sketch follows after this list)
  • Metrics for worker lifecycle (started, stopped, errors)
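
A bounded wait is small to add; waitTimeout below is a hypothetical helper (imports: sync, time), not part of the PR:

// waitTimeout waits for the worker WaitGroup for at most d and reports whether
// the worker actually stopped, so leadership hand-off can log or alert on a
// stuck worker instead of hanging.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}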

🟢 MINOR: Consider connection limits (pkg/raft/node.go:88)

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

The hardcoded 3 is the max pool size. Consider making this configurable for larger clusters.

🟢 MINOR: Snapshot retention (pkg/raft/node.go:78)

snapshotStore, err := raft.NewFileSnapshotStore(cfg.RaftDir, int(cfg.SnapCount), os.Stderr)

cfg.SnapCount determines retention. Document disk space implications for users.


5. Testing & Reliability

Strengths

Excellent E2E Coverage (test/e2e/failover_e2e_test.go)

  • Comprehensive failover test with 3-node cluster
  • Tests leader shutdown, election, transaction propagation, and recovery
  • Good use of require.Eventually for asynchronous assertions

Good Unit Test Coverage (pkg/raft/election_test.go, pkg/raft/node_test.go)

  • Table-driven tests
  • Mock usage is appropriate
  • Edge cases covered (empty strings, duplicates, etc.)

Issues & Recommendations

🟡 MEDIUM: Missing RAFT split-brain test
The current E2E tests don't verify behavior when network partitions occur. Consider adding:

  • Test for network partition healing
  • Test for minority partition behavior
  • Test for simultaneous leader crashes

🟡 MEDIUM: E2E test cleanup (test/e2e/failover_e2e_test.go:116-125)

var leaderNode string
require.EventuallyWithT(t, func(collect *assert.CollectT) {
    leaderNode = clusterNodes.Leader(collect)
}, 5*time.Second, 200*time.Millisecond)

Recommendation: Add explicit cleanup/shutdown of nodes in test failure paths to prevent resource leaks in CI.
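
One way to make teardown failure-proof with the standard testing package; startNode and the Stop signature are placeholders for whatever the e2e harness actually exposes:

node := startNode(t, cfg) // placeholder for the harness constructor used in the test
t.Cleanup(func() {
	// Runs even if a later require/assert fails, so CI does not leak node processes.
	if err := node.Stop(context.Background()); err != nil {
		t.Logf("node shutdown: %v", err)
	}
})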

🟢 MINOR: Test has unused context (pkg/raft/node_test.go:50-51)

ctx := t.Context()
_ = ctx // keep to follow guideline

Consider using t.Context() for timeout enforcement in tests.

🟢 MINOR: Consider chaos testing
For a consensus system, consider:

  • Jepsen-style testing (if feasible)
  • Random failure injection
  • Clock skew simulation

6. Documentation & Developer Experience

Strengths

Good inline comments (pkg/raft/election.go:96-98)
Safety-critical code sections have excellent explanatory comments.

Clear interface contracts (pkg/raft/election.go:16-24)
Runnable interface is well-defined with clear method signatures.

Issues & Recommendations

🟡 MEDIUM: Missing package-level documentation
pkg/raft/ lacks a package doc explaining:

  • RAFT's role in the system
  • When to enable RAFT
  • Cluster size recommendations
  • Disaster recovery procedures

🟡 MEDIUM: Configuration documentation (pkg/config/config.go)
New RAFT config fields lack detailed comments:

type RaftConfig struct {
    Enable             bool
    NodeID             string
    RaftAddr           string
    RaftDir            string
    Bootstrap          bool
    Peers              string
    SnapCount          uint64
    SendTimeout        time.Duration
    HeartbeatTimeout   time.Duration
    LeaderLeaseTimeout time.Duration
}

Recommendation: Add godoc comments explaining:

  • NodeID: must be unique across cluster
  • Peers: format is "nodeID@address" comma-separated
  • Bootstrap: currently only true is supported
  • Timeout implications and tuning guidance (one possible set of godoc comments is sketched after this list)
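
One possible shape for those comments, mirroring the fields quoted above; the wording is a suggestion and should be checked against the actual semantics in the PR:

// RaftConfig configures RAFT-based sequencer failover.
type RaftConfig struct {
	// Enable turns on RAFT high availability for this node.
	Enable bool
	// NodeID must be unique across the cluster.
	NodeID string
	// RaftAddr is the TCP address the RAFT transport binds to and advertises to peers.
	RaftAddr string
	// RaftDir holds the RAFT log, stable store and snapshots on disk.
	RaftDir string
	// Bootstrap must currently be true; joining an existing cluster is not yet supported.
	Bootstrap bool
	// Peers lists all cluster members as comma-separated "nodeID@address" entries.
	Peers string
	// SnapCount controls how many file snapshots are retained (disk usage grows with it).
	SnapCount uint64
	// SendTimeout bounds Apply/broadcast calls against the RAFT log.
	SendTimeout time.Duration
	// HeartbeatTimeout and LeaderLeaseTimeout map onto the hashicorp/raft
	// timeouts of the same name; see that library's ValidateConfig for the
	// relationships it enforces between them.
	HeartbeatTimeout   time.Duration
	LeaderLeaseTimeout time.Duration
}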

🟢 MINOR: Add migration guide
For operators upgrading to RAFT-enabled nodes, provide:

  • Step-by-step cluster setup
  • Rolling upgrade procedure (if supported)
  • Rollback strategy
  • Monitoring recommendations

🟢 MINOR: Proto file documentation (proto/evnode/v1/state.proto:23-32)
RaftBlockState message could benefit from field-level comments explaining the consensus state tracking.


7. Configuration & Validation

Issues & Recommendations

🟡 MEDIUM: Peer configuration validation is weak (pkg/raft/node.go:129-137)

for _, peer := range n.config.Peers {
    addr, err := splitPeerAddr(peer)
    if err != nil {
        return fmt.Errorf("peer %q : %w", peer, err)
    }
    if addr != thisNode {
        cfg.Servers = append(cfg.Servers, addr)
    }
}

Recommendation: Add validation for:

  • Minimum cluster size (3 for RAFT quorum); see the validation sketch after this list
  • Maximum recommended cluster size
  • Duplicate detection earlier (before BootstrapCluster)
  • Reachability checks (optional, for better UX)
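
A hypothetical pre-flight check covering the first three points; the function name and the floor of 3 are assumptions, not code from this PR:

// validatePeers runs before BootstrapCluster so misconfiguration fails fast.
func validatePeers(peers []string) error {
	if len(peers) < 3 {
		return fmt.Errorf("raft failover needs at least 3 peers for a meaningful quorum, got %d", len(peers))
	}
	seen := make(map[string]struct{}, len(peers))
	for _, p := range peers {
		if _, dup := seen[p]; dup {
			return fmt.Errorf("duplicate peer %q in config", p)
		}
		seen[p] = struct{}{}
	}
	return nil
}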

🟡 MEDIUM: RAFT config defaults (pkg/config/defaults.go)
RAFT config has defaults but critical values like HeartbeatTimeout and LeaderLeaseTimeout should be validated against each other:

if cfg.Raft.LeaderLeaseTimeout < cfg.Raft.HeartbeatTimeout {
    return fmt.Errorf("leader lease timeout must be >= heartbeat timeout")
}

🟢 MINOR: RaftDir creation (pkg/raft/node.go:51)

if err := os.MkdirAll(cfg.RaftDir, 0755); err != nil {

Consider checking for:

  • Disk space availability
  • Write permissions
  • Existing data compatibility

8. Concurrency & Synchronization

Strengths

Proper atomic usage (pkg/raft/node.go:45,62)

state: &atomic.Pointer[RaftBlockState]{}

Correct use of atomic.Pointer for lock-free reads.

Good WaitGroup pattern (node/failover.go:58-64,176-231)
Proper goroutine lifecycle management with cleanup.

Issues & Recommendations

🟢 MINOR: Race in IsRunning (pkg/raft/election.go:227-229)

func (d *DynamicLeaderElection) IsRunning() bool {
    return d.running.Load()
}

This is fine, but document that it's eventually consistent (may return stale value briefly during state transitions).

🟢 MINOR: Channel buffer sizes (block/internal/syncing/raft_retriever.go:62)

applyCh := make(chan raft.RaftApplyMsg, 100)

100 is reasonable, but consider:

  • Making it configurable for high-throughput scenarios
  • Adding metrics for channel depth

9. Specific File Comments

pkg/raft/node.go

🟡 Line 183: Leadership transfer during shutdown is best-effort. Consider adding timeout and warning if transfer fails.

🟡 Line 256: proto.Clone creates a defensive copy - good. Consider documenting thread-safety guarantees of GetState().

🟡 Line 268-279: AddPeer logic for removing then re-adding is subtle. Add comment explaining why this is necessary.

block/internal/syncing/raft_retriever.go

🟡 Line 109: raftBlockPreProcessor is called but errors might not stop processing. Review error handling strategy.

🟡 Line 133-135: Validation errors are logged but processing continues (return nil). Clarify if this is intentional or should be return err.

node/failover.go

🟡 Line 247-249: Recovery error returns immediately, but comment says "we are stuck". Consider retry logic or more graceful degradation.

test/e2e/failover_e2e_test.go

🟢 Line 138: Good practice killing the leader. Consider also testing voluntary leadership transfer (graceful) vs forced (crash).


10. Final Recommendations

Before Merge (High Priority)

  1. Fix critical error handling gaps (RED issues above)
  2. Add package-level documentation for pkg/raft/
  3. Validate RAFT config relationships (timeouts, cluster size)
  4. Add metrics for RAFT operations (elections, applies, drops)

Follow-up (Can be separate PRs)

  1. Implement non-bootstrap mode or remove the config option
  2. Add TLS support for RAFT transport
  3. Create operator documentation (setup, upgrade, monitoring)
  4. Add split-brain and partition E2E tests
  5. Implement RAFT membership changes (dynamic add/remove nodes); a rough sketch follows below
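
For the membership follow-up, hashicorp/raft already exposes the needed primitives; a rough sketch assuming the PR's Node wrapper and SendTimeout config (method names and the import alias are placeholders):

// addVoterDynamic asks the cluster to add a new voting member.
func (n *Node) addVoterDynamic(id, addr string) error {
	f := n.raft.AddVoter(raft.ServerID(id), raft.ServerAddress(addr), 0, n.config.SendTimeout)
	return f.Error()
}

// removePeerDynamic removes a member, e.g. before decommissioning a node.
func (n *Node) removePeerDynamic(id string) error {
	f := n.raft.RemoveServer(raft.ServerID(id), 0, n.config.SendTimeout)
	return f.Error()
}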

Conclusion

This is a well-engineered implementation of RAFT-based high availability. The architecture is sound, testing is comprehensive, and safety considerations are thoughtfully addressed. The identified issues are mostly medium-priority improvements that would increase robustness and operational visibility.

Recommendation: Approve with minor changes

The core functionality is solid. Address the HIGH priority issues (error handling gaps) and consider the MEDIUM issues (config validation, documentation) before merging.

Great work on this complex feature! The attention to safety (double-signing prevention, quorum checks, rollback detection) demonstrates strong distributed systems engineering.


@github-actions
Contributor

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Jan 16, 2026, 8:37 PM

@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally changes how Rollkit nodes handle high availability by integrating the RAFT consensus algorithm. The primary goal is to ensure continuous operation and data consistency across a cluster of sequencer nodes, even in the event of node failures. The changes involve a significant architectural shift to support dynamic role management (leader/follower) and robust state synchronization, allowing nodes to automatically recover and maintain a consistent view of the blockchain state. This feature is crucial for building resilient and fault-tolerant Rollkit deployments.

Highlights

  • RAFT Consensus Integration: Introduces the Raft consensus algorithm for high availability, enabling dynamic leader election and state replication among sequencer nodes using the hashicorp/raft library.
  • Dynamic Leader Election & Failover: Implements a DynamicLeaderElection mechanism that allows nodes to seamlessly transition between leader and follower roles. This includes logic for handling leader loss, ensuring FSM synchronization before leadership, and attempting recovery if a new leader is not synced.
  • Enhanced State Synchronization & Recovery: The Executor and Syncer components are updated to verify local state against the Raft replicated state. This includes automatic rollbacks if the execution layer is ahead, checks for block existence in the store before syncing, and robust recovery mechanisms from Raft state during failover scenarios.
  • Refactored Node Architecture: The node's P2P client creation and management are refactored, and a new failoverState structure is introduced to encapsulate components that need to be reset during mode switches (aggregator/sync). This simplifies the node's lifecycle management in a HA setup.
  • New RPC Endpoint for RAFT Status: Adds a /raft/node HTTP endpoint to query the Raft node's leadership status and ID, providing visibility into the consensus state.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-executed feature: high availability for the sequencer using the Raft consensus algorithm. The changes are extensive, refactoring core components to support dynamic leader election and failover between aggregator (leader) and sync (follower) roles. The new architecture appears robust, with strong consistency checks to prevent issues like state divergence or stale leaders. The addition of comprehensive end-to-end tests for failover scenarios is particularly commendable. My review includes a few suggestions to improve configuration handling and error messages, aiming to enhance the operator experience and debuggability of this critical new functionality.

Comment on lines +163 to 166
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
	return fmt.Errorf("failed to get previous state: %w", err)
}

medium

For better debuggability, it would be helpful to include the specific height that was being queried in the error message when getting the previous state fails.

Suggested change
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
	return fmt.Errorf("failed to get previous state: %w", err)
}
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
	return fmt.Errorf("failed to get previous state at height %d: %w", height-1, err)
}

Comment on lines +121 to 122
return nil, fmt.Errorf("raft config must be used in sequencer setup only")
}

medium

The error message "raft config must be used in sequencer setup only" could be more explicit. It seems the design requires a node to be configured as a potential aggregator to participate in a Raft cluster. A clearer message would improve user experience when configuring nodes.

Suggested change
return nil, fmt.Errorf("raft config must be used in sequencer setup only")
}
return nil, fmt.Errorf("raft can only be enabled for aggregator nodes (aggregator flag must be true)")

Comment on lines +139 to +141
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
	return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}

medium

Instead of returning an error for duplicate peers in the configuration, it would be more user-friendly to log a warning and proceed with the deduplicated list. The deduplicateServers function already handles the removal of duplicates, so failing fast might be overly strict for what is likely a minor configuration mistake.

Suggested change
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
	return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
	n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v", svrs)
	cfg.Servers = svrs
}

