
Conversation

@alpe
Contributor

@alpe alpe commented Jan 16, 2026

Replaces #2954

alpe added 30 commits November 12, 2025 15:16
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)
* main:
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
  build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
  ...
alpe added 16 commits January 7, 2026 16:11
* main:
  feat(tracing): add tracing to EngineClient (#2959)
  chore: inject W3C headers into engine client and eth client (#2958)
  feat: adding tracing for Executor and added initial configuration (#2957)
* main:
  fix: Fail fast when executor ahead (#2966)
  feat(block): async epoch fetching (#2952)
  perf: tune badger defaults and add db bench (#2950)
* main:
  chore: adding syncing tracing (#2981)
  feat(tracing): adding block production tracing (#2980)
  feat(tracing): Add Store, P2P and Config tracing (#2972)
  chore: fix upgrade test (#2979)
  build(deps): Bump github.com/ethereum/go-ethereum from 1.16.7 to 1.16.8 in /execution/evm/test in the go_modules group across 1 directory (#2974)
  feat(tracing): adding tracing to DA client (#2968)
  chore: create onboarding skill  (#2971)
  test: add e2e tests for force inclusion (part 2) (#2970)
  feat(tracing): adding eth client tracing (#2960)
  test: add e2e tests for force inclusion (#2964)
  build(deps): Bump the all-go group across 4 directories with 10 updates (#2969)
* main:
  fix: inconsistent state detection and rollback (#2983)
  chore: improve graceful shutdown restarts (#2985)
  feat(submitting): add posting strategies (#2973)
@claude
Contributor

claude bot commented Jan 16, 2026

Claude finished @alpe's task in 4m 56s


Code Review: High Availability via RAFT

I've completed a comprehensive review of this RAFT high availability implementation. This is a significant and well-structured feature adding consensus-based failover capabilities. Below is my detailed feedback organized by focus area.


Summary

This PR introduces RAFT consensus to enable high availability with automatic leader election and failover. The implementation is generally solid with good separation of concerns, comprehensive testing, and careful attention to safety. I've identified several areas for improvement around error handling, configuration validation, and edge case handling.

Key Strengths:

  • Clean architecture with proper interface boundaries
  • Comprehensive E2E testing for failover scenarios
  • Safety checks to prevent double-signing and split-brain scenarios
  • Good use of atomic operations and proper synchronization

Areas for Improvement:

  • Some error handling gaps in critical paths
  • Configuration validation could be more robust
  • Missing nil checks in a few places
  • Documentation gaps on RAFT-specific behavior

1. Code Quality & Architecture

Strengths

Clean Package Structure (pkg/raft/)

  • Well-organized with clear separation: node.go, election.go, types.go
  • Proper use of interfaces (Runnable, sourceNode) for testability
  • Zero-dependency FSM implementation following Hashicorp RAFT patterns

Modular Design (node/failover.go)

  • Factory pattern for leader/follower mode creation is excellent
  • failoverState properly encapsulates mode-specific components
  • Good separation between static (singleRoleElector) and dynamic (DynamicLeaderElection) election strategies

Issues & Recommendations

🟡 MEDIUM: Missing error propagation in FSM Apply (pkg/raft/node.go:309-337)

func (f *FSM) Apply(log *raft.Log) interface{} {
    // ...
    if err := assertValid(f.state.Load(), &state); err != nil {
        return err  // ✅ Good - errors returned
    }
    // ...
    if f.applyCh != nil {
        select {
        case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
        default:
            f.logger.Warn().Msg("apply channel full, dropping message")  // ⚠️ Silent drop
        }
    }
    return nil
}

Recommendation: While non-blocking is correct for RAFT performance, consider:

  • Adding metrics to track dropped messages (see the sketch after this list)
  • Implementing backpressure or alerting when channel is consistently full
  • Document this behavior clearly as it could cause followers to fall behind
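
For the metrics point above, a minimal sketch using the Prometheus client; the metric name and where it is registered are assumptions, not code from this PR:

package raft

import "github.com/prometheus/client_golang/prometheus"

// raftApplyDropped counts apply notifications dropped because the channel was
// full; incrementing it in the default branch of Apply (next to the warning
// log) makes drops visible to dashboards and alerting.
var raftApplyDropped = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "evnode_raft_apply_msgs_dropped_total",
	Help: "RaftApplyMsg notifications dropped due to a full apply channel.",
})

func init() { prometheus.MustRegister(raftApplyDropped) }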

🟡 MEDIUM: Bootstrap mode is hardcoded (pkg/raft/node.go:108-115)

func (n *Node) Start(_ context.Context) error {
    if !n.config.Bootstrap {
        return fmt.Errorf("raft cluster requires bootstrap mode")  // ⚠️ Non-bootstrap not supported
    }

Recommendation: Either:

  • Implement non-bootstrap mode for joining existing clusters dynamically
  • Remove the Bootstrap config field if only bootstrap mode is supported
  • Document why bootstrap-only is acceptable for the current use case

🟢 MINOR: Typo in log message (pkg/raft/node.go:122)

n.logger.Info().Msg("Boostrap raft cluster")  // ⚠️ Typo: "Boostrap" -> "Bootstrap"

🟡 MEDIUM: Leadership transfer errors not handled (pkg/raft/election.go:78,100,123)
Multiple locations call d.node.leadershipTransfer() and ignore errors:

_ = d.node.leadershipTransfer()  // ⚠️ Error ignored

Recommendation: Log errors consistently. Leadership transfer failures should be visible for debugging cluster issues.

🟢 MINOR: Magic number for ticker interval (pkg/raft/election.go:86)

ticker := time.NewTicker(300 * time.Millisecond)  // ⚠️ Magic number

Recommendation: Extract to a named constant or configuration parameter.


2. Error Handling & Resilience

Strengths

Excellent Safety Checks (pkg/raft/election.go:96-105)

// Critical Safety Check: Wait for FSM to apply all committed logs up to LastIndex.
// If we start leader operations with stale FSM, we risk Double Signing
if err := d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
    d.logger.Error().Err(err).Msg("failed to wait for messages to land - FSM lagging, abdicating to prevent safety violation")
    if tErr := d.node.leadershipTransfer(); tErr != nil {
        d.logger.Error().Err(tErr).Msg("failed to transfer leadership")
    }
    continue
}

This is excellent defensive programming to prevent double-signing attacks.

Proper Rollback Detection (block/internal/common/replay.go:71-88)
Automatic rollback with Rollbackable interface is well-designed.

Issues & Recommendations

🔴 HIGH: Missing context cancellation check in broadcast (pkg/raft/node.go:234-252)

func (n *Node) Broadcast(ctx context.Context, state *RaftBlockState) error {
    if !n.IsLeader() {
        return fmt.Errorf("not leader")
    }
    // ...
    if ctx.Err() != nil {
        return ctx.Err()  // ✅ Check before Apply
    }
    future := n.raft.Apply(data, n.config.SendTimeout)
    if err := future.Error(); err != nil {
        return fmt.Errorf("apply log: %w", err)
    }
    return nil
}

Issue: After raft.Apply() is called, there's no check if context was cancelled during the operation. The RAFT apply will complete even if the caller's context is cancelled.

Recommendation: Either:

  • Document this behavior (RAFT operations are not cancellable mid-flight)
  • Add context monitoring: wait for future.Error() in a separate goroutine and select on ctx.Done() (hashicorp/raft futures expose a blocking Error() method rather than a Done() channel); a sketch follows after this list
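
A minimal sketch of that pattern, assuming the hashicorp/raft ApplyFuture used in the quoted Broadcast; note that cancelling the context only unblocks the caller, the replicated apply still completes on the cluster:

package raft

import (
	"context"

	hraft "github.com/hashicorp/raft"
)

// waitApply blocks until the ApplyFuture resolves or ctx is cancelled.
// Cancellation does not abort the RAFT apply; it only releases the caller.
func waitApply(ctx context.Context, future hraft.ApplyFuture) error {
	done := make(chan error, 1)
	go func() { done <- future.Error() }() // Error() blocks until the entry is applied or fails
	select {
	case <-ctx.Done():
		return ctx.Err()
	case err := <-done:
		return err
	}
}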

🟡 MEDIUM: Snapshot restore doesn't validate state (pkg/raft/node.go:345-361)

func (f *FSM) Restore(rc io.ReadCloser) error {
    // ...
    if err := proto.Unmarshal(data, &state); err != nil {
        return fmt.Errorf("decode snapshot: %w", err)
    }
    f.state.Store(&state)  // ⚠️ No validation
    return nil
}

Recommendation: Add validation before restoring:

  • Check height is monotonic
  • Validate timestamp
  • Consider calling assertValid or similar validation (a sketch follows after this list)
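
A sketch of how Restore could reuse the existing check; whether assertValid's semantics are appropriate for snapshot restore (for example after log compaction) needs verifying against the PR, so treat this as illustrative only:

func (f *FSM) Restore(rc io.ReadCloser) error {
	data, err := io.ReadAll(rc)
	if err != nil {
		return fmt.Errorf("read snapshot: %w", err)
	}
	var state RaftBlockState
	if err := proto.Unmarshal(data, &state); err != nil {
		return fmt.Errorf("decode snapshot: %w", err)
	}
	// Reject snapshots that would move the FSM backwards relative to the
	// state it has already applied (monotonic height, sane timestamp).
	if err := assertValid(f.state.Load(), &state); err != nil {
		return fmt.Errorf("invalid snapshot state: %w", err)
	}
	f.state.Store(&state)
	return nil
}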

🟡 MEDIUM: Error handling in syncer rollback (block/internal/syncing/syncer.go:480-520)
The RecoverFromRaft method has complex error handling that might not properly clean up on partial failures. Consider transaction-like semantics or more explicit rollback on error.


3. Security

Strengths

Quorum Check Before Block Production (block/internal/executing/executor.go:411-413)

if e.raftNode != nil && !e.raftNode.HasQuorum() {
    return errors.New("raft cluster does not have quorum")
}

Excellent - prevents producing blocks without consensus availability.

Hash Verification After Replay (block/internal/executing/executor.go:305-321)
Double-checking content matches RAFT state prevents "Dual-Store Conflict" scenarios.

Issues & Recommendations

🟡 MEDIUM: JWT secret handling in tests (test/e2e/failover_e2e_test.go:57-63)
Tests handle JWT secrets but file paths aren't validated. Ensure test secrets don't leak to logs or error messages.

🟢 MINOR: Consider TLS for RAFT transport (pkg/raft/node.go:88-91)

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

Currently uses unencrypted TCP. For production, consider:

  • Adding TLS support for RAFT inter-node communication (a transport sketch follows after this list)
  • Mutual TLS for cluster authentication
  • Document this as a deployment consideration
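
If TLS is added later, hashicorp/raft supports it by passing a custom StreamLayer to raft.NewNetworkTransport instead of using NewTCPTransport. A rough sketch, with the package name and all certificate wiring assumed rather than taken from this PR:

package raftls

import (
	"crypto/tls"
	"io"
	"net"
	"time"

	"github.com/hashicorp/raft"
)

// tlsStreamLayer accepts inbound RAFT connections on a TLS listener and
// dials peers over TLS, satisfying raft.StreamLayer.
type tlsStreamLayer struct {
	net.Listener
	cfg *tls.Config
}

func (t *tlsStreamLayer) Dial(addr raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	d := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(d, "tcp", string(addr), t.cfg)
}

// newTLSTransport mirrors the quoted NewTCPTransport call (pool size 3, 10s
// timeout) but runs inter-node traffic over TLS.
func newTLSTransport(bindAddr string, cfg *tls.Config, logOutput io.Writer) (*raft.NetworkTransport, error) {
	ln, err := tls.Listen("tcp", bindAddr, cfg)
	if err != nil {
		return nil, err
	}
	return raft.NewNetworkTransport(&tlsStreamLayer{Listener: ln, cfg: cfg}, 3, 10*time.Second, logOutput), nil
}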

🟢 MINOR: Error messages could leak sensitive info (pkg/raft/node.go:132)
Ensure peer addresses in error messages don't expose internal network topology in logs accessible to untrusted parties.


4. Performance & Resource Efficiency

Strengths

Non-blocking FSM Apply (pkg/raft/node.go:328-333)
Using select with default case prevents RAFT cluster from blocking on slow consumers.

Proper Connection Pooling (block/internal/syncing/raft_retriever.go:62)
Buffered channel (100) provides reasonable buffering without excessive memory.

Issues & Recommendations

🟡 MEDIUM: Unbounded goroutine in election loop (pkg/raft/election.go:75-84)

go func(childCtx context.Context) {
    defer wg.Done()
    if err := workerFunc(childCtx); err != nil && !errors.Is(err, context.Canceled) {
        _ = d.node.leadershipTransfer()
        select {
        case errCh <- fmt.Errorf(name+" worker exited unexpectedly: %s", err):
        default: // do not block
        }
    }
}(workerCtx)

Recommendation: While current implementation is safe, consider:

  • Adding a timeout for worker shutdown in startWorker (a bounded-wait sketch follows after this list)
  • Metrics for worker lifecycle (started, stopped, errors)
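
A bounded wait is small to add; waitTimeout below is a hypothetical helper (imports: sync, time), not part of the PR:

// waitTimeout waits for the worker WaitGroup for at most d and reports whether
// the worker actually stopped, so leadership hand-off can log or alert on a
// stuck worker instead of hanging.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}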

🟢 MINOR: Consider connection limits (pkg/raft/node.go:88)

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

The hardcoded 3 is the max pool size. Consider making this configurable for larger clusters.

🟢 MINOR: Snapshot retention (pkg/raft/node.go:78)

snapshotStore, err := raft.NewFileSnapshotStore(cfg.RaftDir, int(cfg.SnapCount), os.Stderr)

cfg.SnapCount determines retention. Document disk space implications for users.


5. Testing & Reliability

Strengths

Excellent E2E Coverage (test/e2e/failover_e2e_test.go)

  • Comprehensive failover test with 3-node cluster
  • Tests leader shutdown, election, transaction propagation, and recovery
  • Good use of require.Eventually for asynchronous assertions

Good Unit Test Coverage (pkg/raft/election_test.go, pkg/raft/node_test.go)

  • Table-driven tests
  • Mock usage is appropriate
  • Edge cases covered (empty strings, duplicates, etc.)

Issues & Recommendations

🟡 MEDIUM: Missing RAFT split-brain test
The current E2E tests don't verify behavior when network partitions occur. Consider adding:

  • Test for network partition healing
  • Test for minority partition behavior
  • Test for simultaneous leader crashes

🟡 MEDIUM: E2E test cleanup (test/e2e/failover_e2e_test.go:116-125)

var leaderNode string
require.EventuallyWithT(t, func(collect *assert.CollectT) {
    leaderNode = clusterNodes.Leader(collect)
}, 5*time.Second, 200*time.Millisecond)

Recommendation: Add explicit cleanup/shutdown of nodes in test failure paths to prevent resource leaks in CI.
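
One way to make teardown failure-proof with the standard testing package; startNode and the Stop signature are placeholders for whatever the e2e harness actually exposes:

node := startNode(t, cfg) // placeholder for the harness constructor used in the test
t.Cleanup(func() {
	// Runs even if a later require/assert fails, so CI does not leak node processes.
	if err := node.Stop(context.Background()); err != nil {
		t.Logf("node shutdown: %v", err)
	}
})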

🟢 MINOR: Test has unused context (pkg/raft/node_test.go:50-51)

ctx := t.Context()
_ = ctx // keep to follow guideline

Consider using t.Context() for timeout enforcement in tests.

🟢 MINOR: Consider chaos testing
For a consensus system, consider:

  • Jepsen-style testing (if feasible)
  • Random failure injection
  • Clock skew simulation

6. Documentation & Developer Experience

Strengths

Good inline comments (pkg/raft/election.go:96-98)
Safety-critical code sections have excellent explanatory comments.

Clear interface contracts (pkg/raft/election.go:16-24)
Runnable interface is well-defined with clear method signatures.

Issues & Recommendations

🟡 MEDIUM: Missing package-level documentation
pkg/raft/ lacks a package doc explaining:

  • RAFT's role in the system
  • When to enable RAFT
  • Cluster size recommendations
  • Disaster recovery procedures

🟡 MEDIUM: Configuration documentation (pkg/config/config.go)
New RAFT config fields lack detailed comments:

type RaftConfig struct {
    Enable             bool
    NodeID             string
    RaftAddr           string
    RaftDir            string
    Bootstrap          bool
    Peers              string
    SnapCount          uint64
    SendTimeout        time.Duration
    HeartbeatTimeout   time.Duration
    LeaderLeaseTimeout time.Duration
}

Recommendation: Add godoc comments explaining:

  • NodeID: must be unique across cluster
  • Peers: format is "nodeID@address" comma-separated
  • Bootstrap: currently only true is supported
  • Timeout implications and tuning guidance (one possible set of godoc comments is sketched after this list)
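
One possible shape for those comments, mirroring the fields quoted above; the wording is a suggestion and should be checked against the actual semantics in the PR:

// RaftConfig configures RAFT-based sequencer failover.
type RaftConfig struct {
	// Enable turns on RAFT high availability for this node.
	Enable bool
	// NodeID must be unique across the cluster.
	NodeID string
	// RaftAddr is the TCP address the RAFT transport binds to and advertises to peers.
	RaftAddr string
	// RaftDir holds the RAFT log, stable store and snapshots on disk.
	RaftDir string
	// Bootstrap must currently be true; joining an existing cluster is not yet supported.
	Bootstrap bool
	// Peers lists all cluster members as comma-separated "nodeID@address" entries.
	Peers string
	// SnapCount controls how many file snapshots are retained (disk usage grows with it).
	SnapCount uint64
	// SendTimeout bounds Apply/broadcast calls against the RAFT log.
	SendTimeout time.Duration
	// HeartbeatTimeout and LeaderLeaseTimeout map onto the hashicorp/raft
	// timeouts of the same name; see that library's ValidateConfig for the
	// relationships it enforces between them.
	HeartbeatTimeout   time.Duration
	LeaderLeaseTimeout time.Duration
}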

🟢 MINOR: Add migration guide
For operators upgrading to RAFT-enabled nodes, provide:

  • Step-by-step cluster setup
  • Rolling upgrade procedure (if supported)
  • Rollback strategy
  • Monitoring recommendations

🟢 MINOR: Proto file documentation (proto/evnode/v1/state.proto:23-32)
RaftBlockState message could benefit from field-level comments explaining the consensus state tracking.


7. Configuration & Validation

Issues & Recommendations

🟡 MEDIUM: Peer configuration validation is weak (pkg/raft/node.go:129-137)

for _, peer := range n.config.Peers {
    addr, err := splitPeerAddr(peer)
    if err != nil {
        return fmt.Errorf("peer %q : %w", peer, err)
    }
    if addr != thisNode {
        cfg.Servers = append(cfg.Servers, addr)
    }
}

Recommendation: Add validation for:

  • Minimum cluster size (3 for RAFT quorum); see the validation sketch after this list
  • Maximum recommended cluster size
  • Duplicate detection earlier (before BootstrapCluster)
  • Reachability checks (optional, for better UX)
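
A hypothetical pre-flight check covering the first three points; the function name and the floor of 3 are assumptions, not code from this PR:

// validatePeers runs before BootstrapCluster so misconfiguration fails fast.
func validatePeers(peers []string) error {
	if len(peers) < 3 {
		return fmt.Errorf("raft failover needs at least 3 peers for a meaningful quorum, got %d", len(peers))
	}
	seen := make(map[string]struct{}, len(peers))
	for _, p := range peers {
		if _, dup := seen[p]; dup {
			return fmt.Errorf("duplicate peer %q in config", p)
		}
		seen[p] = struct{}{}
	}
	return nil
}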

🟡 MEDIUM: RAFT config defaults (pkg/config/defaults.go)
RAFT config has defaults but critical values like HeartbeatTimeout and LeaderLeaseTimeout should be validated against each other:

if cfg.Raft.LeaderLeaseTimeout < cfg.Raft.HeartbeatTimeout {
    return fmt.Errorf("leader lease timeout must be >= heartbeat timeout")
}

🟢 MINOR: RaftDir creation (pkg/raft/node.go:51)

if err := os.MkdirAll(cfg.RaftDir, 0755); err != nil {

Consider checking for:

  • Disk space availability
  • Write permissions
  • Existing data compatibility

8. Concurrency & Synchronization

Strengths

Proper atomic usage (pkg/raft/node.go:45,62)

state: &atomic.Pointer[RaftBlockState]{}

Correct use of atomic.Pointer for lock-free reads.

Good WaitGroup pattern (node/failover.go:58-64,176-231)
Proper goroutine lifecycle management with cleanup.

Issues & Recommendations

🟢 MINOR: Race in IsRunning (pkg/raft/election.go:227-229)

func (d *DynamicLeaderElection) IsRunning() bool {
    return d.running.Load()
}

This is fine, but document that it's eventually consistent (may return stale value briefly during state transitions).

🟢 MINOR: Channel buffer sizes (block/internal/syncing/raft_retriever.go:62)

applyCh := make(chan raft.RaftApplyMsg, 100)

100 is reasonable, but consider:

  • Making it configurable for high-throughput scenarios
  • Adding metrics for channel depth

9. Specific File Comments

pkg/raft/node.go

🟡 Line 183: Leadership transfer during shutdown is best-effort. Consider adding timeout and warning if transfer fails.

🟡 Line 256: proto.Clone creates a defensive copy - good. Consider documenting thread-safety guarantees of GetState().

🟡 Line 268-279: AddPeer logic for removing then re-adding is subtle. Add comment explaining why this is necessary.

block/internal/syncing/raft_retriever.go

🟡 Line 109: raftBlockPreProcessor is called but errors might not stop processing. Review error handling strategy.

🟡 Line 133-135: Validation errors are logged but processing continues (return nil). Clarify if this is intentional or should be return err.

node/failover.go

🟡 Line 247-249: Recovery error returns immediately, but comment says "we are stuck". Consider retry logic or more graceful degradation.

test/e2e/failover_e2e_test.go

🟢 Line 138: Good practice killing the leader. Consider also testing voluntary leadership transfer (graceful) vs forced (crash).


10. Final Recommendations

Before Merge (High Priority)

  1. Fix critical error handling gaps (RED issues above)
  2. Add package-level documentation for pkg/raft/
  3. Validate RAFT config relationships (timeouts, cluster size)
  4. Add metrics for RAFT operations (elections, applies, drops)

Follow-up (Can be separate PRs)

  1. Implement non-bootstrap mode or remove the config option
  2. Add TLS support for RAFT transport
  3. Create operator documentation (setup, upgrade, monitoring)
  4. Add split-brain and partition E2E tests
  5. Implement RAFT membership changes (dynamic add/remove nodes); a rough sketch follows below
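
For the membership follow-up, hashicorp/raft already exposes the needed primitives; a rough sketch assuming the PR's Node wrapper and SendTimeout config (method names and the import alias are placeholders):

// addVoterDynamic asks the cluster to add a new voting member.
func (n *Node) addVoterDynamic(id, addr string) error {
	f := n.raft.AddVoter(raft.ServerID(id), raft.ServerAddress(addr), 0, n.config.SendTimeout)
	return f.Error()
}

// removePeerDynamic removes a member, e.g. before decommissioning a node.
func (n *Node) removePeerDynamic(id string) error {
	f := n.raft.RemoveServer(raft.ServerID(id), 0, n.config.SendTimeout)
	return f.Error()
}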

Conclusion

This is a well-engineered implementation of RAFT-based high availability. The architecture is sound, testing is comprehensive, and safety considerations are thoughtfully addressed. The identified issues are mostly medium-priority improvements that would increase robustness and operational visibility.

Recommendation: Approve with minor changes

The core functionality is solid. Address the HIGH priority issues (error handling gaps) and consider the MEDIUM issues (config validation, documentation) before merging.

Great work on this complex feature! The attention to safety (double-signing prevention, quorum checks, rollback detection) demonstrates strong distributed systems engineering.


@github-actions
Contributor

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Jan 16, 2026, 8:37 PM

@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally changes how Rollkit nodes handle high availability by integrating the RAFT consensus algorithm. The primary goal is to ensure continuous operation and data consistency across a cluster of sequencer nodes, even in the event of node failures. The changes involve a significant architectural shift to support dynamic role management (leader/follower) and robust state synchronization, allowing nodes to automatically recover and maintain a consistent view of the blockchain state. This feature is crucial for building resilient and fault-tolerant Rollkit deployments.

Highlights

  • RAFT Consensus Integration: Introduces the Raft consensus algorithm for high availability, enabling dynamic leader election and state replication among sequencer nodes using the hashicorp/raft library.
  • Dynamic Leader Election & Failover: Implements a DynamicLeaderElection mechanism that allows nodes to seamlessly transition between leader and follower roles. This includes logic for handling leader loss, ensuring FSM synchronization before leadership, and attempting recovery if a new leader is not synced.
  • Enhanced State Synchronization & Recovery: The Executor and Syncer components are updated to verify local state against the Raft replicated state. This includes automatic rollbacks if the execution layer is ahead, checks for block existence in the store before syncing, and robust recovery mechanisms from Raft state during failover scenarios.
  • Refactored Node Architecture: The node's P2P client creation and management are refactored, and a new failoverState structure is introduced to encapsulate components that need to be reset during mode switches (aggregator/sync). This simplifies the node's lifecycle management in a HA setup.
  • New RPC Endpoint for RAFT Status: Adds a /raft/node HTTP endpoint to query the Raft node's leadership status and ID, providing visibility into the consensus state.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-executed feature: high availability for the sequencer using the Raft consensus algorithm. The changes are extensive, refactoring core components to support dynamic leader election and failover between aggregator (leader) and sync (follower) roles. The new architecture appears robust, with strong consistency checks to prevent issues like state divergence or stale leaders. The addition of comprehensive end-to-end tests for failover scenarios is particularly commendable. My review includes a few suggestions to improve configuration handling and error messages, aiming to enhance the operator experience and debuggability of this critical new functionality.

Comment on lines +163 to 166
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
	return fmt.Errorf("failed to get previous state: %w", err)
}

medium

For better debuggability, it would be helpful to include the specific height that was being queried in the error message when getting the previous state fails.

Suggested change
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
	return fmt.Errorf("failed to get previous state: %w", err)
}
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
	return fmt.Errorf("failed to get previous state at height %d: %w", height-1, err)
}

Comment on lines +121 to 122
return nil, fmt.Errorf("raft config must be used in sequencer setup only")
}

medium

The error message "raft config must be used in sequencer setup only" could be more explicit. It seems the design requires a node to be configured as a potential aggregator to participate in a Raft cluster. A clearer message would improve user experience when configuring nodes.

Suggested change
return nil, fmt.Errorf("raft config must be used in sequencer setup only")
}
return nil, fmt.Errorf("raft can only be enabled for aggregator nodes (aggregator flag must be true)")

Comment on lines +139 to +141
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
	return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}

medium

Instead of returning an error for duplicate peers in the configuration, it would be more user-friendly to log a warning and proceed with the deduplicated list. The deduplicateServers function already handles the removal of duplicates, so failing fast might be overly strict for what is likely a minor configuration mistake.

Suggested change
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
	return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
	n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v", svrs)
	cfg.Servers = svrs
}

