@cheatfate (Contributor) commented Oct 7, 2025

This is a high-level description of the new syncing algorithm.

First of all, let's define some terms.

  1. peerStatusCheckpoint - Peer's latest finalized Checkpoint reported via status request.
  2. peerStatusHead - Peer's latest head BlockId reported via status request.
  3. lastSeenCheckpoint - The latest finalized Checkpoint reported by our current set of peers, i.e. the one with max(peerStatusCheckpoint.epoch).
  4. lastSeenHead - The latest head BlockId reported by our current set of peers, i.e. the one with max(peerStatusHead.slot).
  5. finalizedDistance = lastSeenCheckpoint.epoch - dag.headState.finalizedCheckpoint.epoch.
  6. wallSyncDistance = beaconClock.now().slotOrZero - dag.head.slot.
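
To make the two distance metrics concrete, here is a minimal sketch in Nim; Epoch and Slot are simplified stand-ins for illustration, not the actual nimbus-eth2 types:

```nim
type
  Epoch = uint64   # simplified stand-in
  Slot = uint64    # simplified stand-in

proc finalizedDistance(lastSeenEpoch, localFinalizedEpoch: Epoch): uint64 =
  ## Epochs by which our finalized checkpoint lags behind the best
  ## finalized checkpoint reported by the current set of peers.
  if lastSeenEpoch > localFinalizedEpoch:
    lastSeenEpoch - localFinalizedEpoch
  else:
    0'u64

proc wallSyncDistance(wallSlot, headSlot: Slot): uint64 =
  ## Slots by which dag.head lags behind the wall clock.
  if wallSlot > headSlot:
    wallSlot - headSlot
  else:
    0'u64
```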

Every peer we obtain from the PeerPool starts its own loop:

  1. Update the peer's status information if it is too "old"; what counts as "old" depends on the current situation (see the sketch after this list):
    1.1. When forward syncing is active, update status information every 10 * SECONDS_PER_SLOT seconds.
    1.2. When peerStatusHead.slot.epoch - peerStatusCheckpoint.epoch >= 3 (which means there is a period of non-finality), update status information every SECONDS_PER_SLOT seconds.
    1.3. In all other cases, update status information every 5 * SECONDS_PER_SLOT seconds.
  2. Perform by-root requests, where the roots are received from the sync_dag module. If finalizedDistance() < 4 epochs, it will:
    2.1. Request blocks by root in the range [peerStatusCheckpoint.epoch.start_slot, peerStatusHead.slot].
    2.2. Request sidecars by root in the range [getForwardSidecarSlot(), peerStatusHead.slot].
  3. If finalizedDistance() > 1 epoch, it will:
    3.1. Request blocks by range in [dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot].
    3.2. Request sidecars by range in [dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot].
  4. If the node needs to backfill and wallSyncDistance() < 1 (the backfill process should not affect syncing status, so we pause backfill if the node has lost its synced status), it will:
    4.1. Request blocks by range in [dag.backfill.slot, getFrontfillSlot()].
    4.2. Request sidecars by range in [dag.backfill.slot, getBackfillSidecarSlot()].
  5. Pause (to avoid busy loops); the length of the pause depends on the situation (see the sketch after this list):
    5.1. When the peer provided us with some information: no pause.
    5.2. When an endless loop is detected (for some unknown reason the peer did not provide any information): a 1.seconds pause.
    5.3. When we have finished syncing: N seconds, up to the start of the next slot.
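
As a rough illustration of steps 1 and 5, the status-refresh interval and the per-iteration pause could be chosen along these lines. This is a sketch only: SECONDS_PER_SLOT is the mainnet preset value of 12, and the proc names are made up for illustration rather than taken from the branch.

```nim
const SECONDS_PER_SLOT = 12   # mainnet preset value

proc statusRefreshInterval(forwardSyncActive: bool,
                           headEpoch, finalizedEpoch: uint64): int =
  ## Seconds to wait before refreshing a peer's status (step 1).
  if forwardSyncActive:
    10 * SECONDS_PER_SLOT            # 1.1 forward sync in progress
  elif headEpoch >= finalizedEpoch + 3:
    SECONDS_PER_SLOT                 # 1.2 period of non-finality
  else:
    5 * SECONDS_PER_SLOT             # 1.3 default case

proc iterationPause(peerWasUseful, syncFinished: bool,
                    secondsToNextSlot: int): int =
  ## Seconds to pause at the end of one loop iteration (step 5).
  if peerWasUseful:
    0                                # 5.1 peer was useful, keep going
  elif not syncFinished:
    1                                # 5.2 avoid a busy loop
  else:
    secondsToNextSlot                # 5.3 synced, wait until the next slot
```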

The new SyncOverseer also subscribes to a number of EventBus events, so it can maintain the sync_dag structures.

  1. Block-from-gossip monitoring loop. This event is fired only when a block arrives via gossip.
  2. Block monitoring loop. This event is fired for any block added to the processor (blocks from gossip, from the proposer, and from sync).
  3. Finalization monitoring loop.
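
A hypothetical sketch of how these three loops could feed the sync_dag; the event kinds and the handler below are invented for illustration and do not reflect the real EventBus identifiers.

```nim
type
  SyncEventKind = enum
    GossipBlockReceived   # block arrived via gossip only
    BlockAdded            # any block added to the block processor
    Finalized             # finalization advanced

proc handleSyncEvent(kind: SyncEventKind) =
  ## Dispatch an event to the sync_dag maintenance logic (placeholders only).
  case kind
  of GossipBlockReceived:
    discard   # e.g. mark the block root as already seen
  of BlockAdded:
    discard   # e.g. resolve pending parent roots in sync_dag
  of Finalized:
    discard   # e.g. prune sync_dag entries below the new finalized checkpoint
```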

SyncManager and RequestManager have been deprecated and removed from the codebase.
The core problem with SyncManager is that it could handle BlobSidecars but not DataColumnSidecars: because not all columns are available immediately, it is impossible to download blocks and columns in one step, the way SyncManager used to do.

The same problem exists in RequestManager: currently, when it encounters a missing parent, it simply selects 2 peers at random (without any filtering) and tries to download blocks and sidecars from them. In the BlobSidecar era this works in most cases; in the DataColumnSidecar era the probability of success is much lower.

github-actions bot commented Oct 7, 2025

Unit Test Results

Files: 15 (±0)   Suites: 3 035 (+5)   Duration: 1h 33m 58s ⏱️ (-47s)
Tests: 12 066 (+5)   Passed: 11 496 ✔️ (+5)   Skipped: 570 💤 (±0)   Failed: 0 (±0)
Runs: 76 506 (+25)   Passed: 75 654 ✔️ (+25)   Skipped: 852 💤 (±0)   Failed: 0 (±0)

Results for commit f8a79ea. Comparison against base commit d07084a.

♻️ This comment has been updated with latest results.

@cheatfate marked this pull request as draft October 8, 2025 11:59
@etan-status (Contributor) left a comment

Would it help to change (parts of) holesky/sepolia/hoodi over to this branch?

Back in the goerli/prater days, I found this very helpful for testing, as merging into unstable (even with a subsequent revert) was sketchy, but not having it deployed anywhere was also not very fruitful.

The status-im/infra-nimbus repo controls the branch that is used, and it is automatically rebuilt daily. One can also pick the branch for a subset of nodes (in roughly 25% increments), and there is a command to resync those nodes.

My scratchpad from the goerli/holesky times, with instructions on how to connect to those servers, view the logs, restart them, and monitor their metrics:

FLEET:

Hostnames: https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&var-instance=geth-03.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-testing&from=now-24h&to=now&refresh=15m

look at the instance/container dropdowns
the pattern should be fairly clear
then, to SSH to them, add .status.im

Get SSH access from jakub, tell him your SSH key (the correct half, i.e. the public key), and connect using -i the_other_half (the private key) to etan@unstable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net

> geth-01.ih-eu-mda1.nimbus.holesky.statusim.net   (was renamed to status.im)
  geth-01.ih-eu-mda1.nimbus.holesky.status.im

https://github.com/status-im/infra-nimbus/blob/0814b659654bb77f50aac7d456767b1794145a63/ansible/group_vars/all.yml#L23
sudo systemctl --no-block start build-beacon-node-holesky-unstable && journalctl -fu build-beacon-node-holesky-unstable

restart fleet

for a in {erigon,neth,geth}-{01..10}.ih-eu-mda1.nimbus.holesky.statusim.net; do ssh -o StrictHostKeychecking=no $a 'sudo systemctl --no-block start build-beacon-node-holesky-unstable'; done


tail -f /data/beacon-node-prater-unstable/logs/service.log


@jakubgs (Member) commented Oct 14, 2025

I've opened an issue for testing of this branch:

Please comment in it when you think the branch is ready for that.
