BN: New syncing algorithm #7578
base: unstable
Conversation
Would it help to change (parts of) holesky/sepolia/hoodi over to this branch?
Back in goerli/prater times, I found this very helpful for testing, as merging to unstable (even with a subsequent revert) was sketchy, but not having it deployed anywhere was also not very fruitful.
The status-im/infra-nimbus repo controls the branch that is used, and it is automatically rebuilt daily. One can also pick the branch for just a subset of nodes (in ~25% increments), and there is a command to resync those nodes.
My scratchpad from goerli/holesky times, with instructions on how to connect to those servers, how to view the logs, how to restart them, and how to monitor their metrics:
FLEET:
Hostnames: https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&var-instance=geth-03.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-testing&from=now-24h&to=now&refresh=15m
look at the instance/container dropdowns
the pattern should be fairly clear
then, to SSH to them, add .status.im
get SSH access from jakub: tell him your SSH public key (the correct half), and connect using -i the_other_half (your private key) to etan@unstable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
geth-01.ih-eu-mda1.nimbus.holesky.statusim.net (was renamed to status.im):
geth-01.ih-eu-mda1.nimbus.holesky.status.im
https://github.com/status-im/infra-nimbus/blob/0814b659654bb77f50aac7d456767b1794145a63/ansible/group_vars/all.yml#L23
sudo systemctl --no-block start build-beacon-node-holesky-unstable && journalctl -fu build-beacon-node-holesky-unstable
restart fleet
for a in {erigon,neth,geth}-{01..10}.ih-eu-mda1.nimbus.holesky.statusim.net; do ssh -o StrictHostKeyChecking=no $a 'sudo systemctl --no-block start build-beacon-node-holesky-unstable'; done
tail -f /data/beacon-node-prater-unstable/logs/service.log
Remove any changes to callbacks.
Remove some debugging log statements.
Remove some debugging log statements.
I've opened an issue for testing of this branch: please comment in it when you think the branch is ready for that.
…sing PeerEntry's column map.
…p() to blob_quarantine. Add incl()/excl() functions to ColumnMap. Fix peer columns detection logic in doRangeSidecarStep().
Refactoring doPeerUpdateRootsSidecars().
Log when anonymous gossip messages are incoming. Log blocks and sidecars by root differently.
This is a high-level description of the new syncing algorithm.

First of all, let's define some terms:

- `peerStatusCheckpoint` - peer's latest `finalizedCheckpoint` reported via `status` request.
- `peerStatusHead` - peer's latest `headBlockId` reported via `status` request.
- `lastSeenCheckpoint` - the latest `finalizedCheckpoint` reported by our current set of peers, i.e. `max(peerStatusCheckpoint.epoch)`.
- `lastSeenHead` - the latest `headBlockId` reported by our current set of peers, i.e. `max(peerStatusHead.slot)`.
- `finalizedDistance` = `lastSeenCheckpoint.epoch - dag.headState.finalizedCheckpoint.epoch`.
- `wallSyncDistance` = `beaconClock.now().slotOrZero - dag.head.slot`.
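For illustration, a minimal Nim sketch of the two distance metrics, using simplified stand-in types rather than the actual `Epoch`/`Slot`/`BeaconClock` machinery (the clamping to zero is only a convenience of the sketch; the definitions above are plain differences):

```nim
# Minimal sketch of the two distance metrics defined above.
# Epoch/Slot are simplified stand-ins, not the real nimbus-eth2 types.
type
  Epoch = uint64
  Slot = uint64

proc finalizedDistance(lastSeenCheckpointEpoch, localFinalizedEpoch: Epoch): Epoch =
  ## How far the finality reported by our peers is ahead of our own.
  if lastSeenCheckpointEpoch > localFinalizedEpoch:
    lastSeenCheckpointEpoch - localFinalizedEpoch
  else:
    0'u64  # clamped for the sketch; avoids unsigned underflow

proc wallSyncDistance(wallSlot, headSlot: Slot): Slot =
  ## How far our head block lags behind the wall clock.
  if wallSlot > headSlot:
    wallSlot - headSlot
  else:
    0'u64

echo finalizedDistance(100, 97)   # 3 epochs behind the network's finality
echo wallSyncDistance(3205, 3205) # 0 -> in sync with the wall clock
```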
Every peer we get from `PeerPool` will start its loop:

1. Peer updates its `status` information if it is too "old", and "old" depends on the current situation (see the refresh-interval sketch after this list):
   1.1. Update `status` information when forward syncing is active - every `10 * SECONDS_PER_SLOT` seconds.
   1.2. Update `status` information every `SECONDS_PER_SLOT` period when `peerStatusHead.slot.epoch - peerStatusCheckpoint.epoch >= 3` (which means that there is some period of non-finality).
   1.3. In all other cases the node updates `status` information every `5 * SECONDS_PER_SLOT` seconds.
2. Peer performs `by root` requests, where roots are received from the `sync_dag` module. If `finalizedDistance() < 4` epochs it will do (steps 2-4 are illustrated in the work-planning sketch after this list):
   2.1. Request by root blocks in the range `[PeerStatusCheckpoint.epoch.start_slot, PeerStatusHead.slot]`.
   2.2. Request by root sidecars in the range `[getForwardSidecarSlot(), PeerStatusHead.slot]`.
3. If `finalizedDistance() > 1` epochs it will do:
   3.1. Request by range blocks in the range `[dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot]`.
   3.2. Request by range sidecars in the range `[dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot]`.
4. If `wallSyncDistance() < 1` (the backfill process should not affect syncing status, so we pause backfill if the node lost its synced status) it will do:
   4.1. Request by range blocks in the range `[dag.backfill.slot, getFrontfillSlot()]`.
   4.2. Request by range sidecars in the range `[dag.backfill.slot, getBackfillSidecarSlot()]`.
5. At the end of each iteration the loop pauses, depending on the outcome (see the pause sketch after this list):
   5.1. In case the peer provided us with some information - no pause.
   5.2. In case an endless loop is detected (for some unknown reason the peer did not provide any information) - 1 second pause.
   5.3. In case we finished syncing - N seconds, up to the next slot.
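To make step 1 concrete, here is a hedged sketch of the refresh-interval selection. `statusRefreshInterval` and its inputs are illustrative names rather than the actual implementation, and `SECONDS_PER_SLOT` is hard-coded to the mainnet value:

```nim
# Hypothetical helper illustrating step 1: how often to refresh a peer's
# `status`, depending on the node's current situation.
const SECONDS_PER_SLOT = 12  # mainnet value; network-dependent in reality

proc statusRefreshInterval(forwardSyncActive: bool,
                           headEpoch, checkpointEpoch: uint64): int =
  if forwardSyncActive:
    # 1.1. Forward syncing is active.
    10 * SECONDS_PER_SLOT
  elif headEpoch - checkpointEpoch >= 3:
    # 1.2. Peer reports a period of non-finality.
    SECONDS_PER_SLOT
  else:
    # 1.3. All other cases.
    5 * SECONDS_PER_SLOT

echo statusRefreshInterval(true, 105, 104)   # 120 seconds
echo statusRefreshInterval(false, 105, 100)  # 12 seconds
echo statusRefreshInterval(false, 105, 104)  # 60 seconds
```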
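A sketch of how the distances gate the work done in steps 2-4. The `SyncWork` structure and `planSyncWork` are invented for illustration; the sidecar-specific bounds (`getForwardSidecarSlot()` / `getBackfillSidecarSlot()`) are folded into the block ranges here for brevity:

```nim
# Sketch of steps 2-4: which requests a peer loop issues, based on the
# distances defined earlier. Types and structure are simplified stand-ins.
type
  Slot = uint64
  SyncWork = object
    doByRoot, doForward, doBackfill: bool
    byRootRange: (Slot, Slot)    # step 2: blocks/sidecars by root
    forwardRange: (Slot, Slot)   # step 3: blocks/sidecars by range
    backfillRange: (Slot, Slot)  # step 4: backfill by range

proc planSyncWork(finalizedDistance, wallSyncDistance: uint64,
                  peerCheckpointStartSlot, peerHeadSlot: Slot,
                  dagFinalizedSlot, lastSeenCheckpointStartSlot: Slot,
                  backfillSlot, frontfillSlot: Slot): SyncWork =
  if finalizedDistance < 4:
    # Step 2: close to the network's finality -> request missing blocks and
    # sidecars by root (roots come from the sync_dag module).
    result.doByRoot = true
    result.byRootRange = (peerCheckpointStartSlot, peerHeadSlot)
  if finalizedDistance > 1:
    # Step 3: behind finality -> forward sync blocks and sidecars by range.
    result.doForward = true
    result.forwardRange = (dagFinalizedSlot, lastSeenCheckpointStartSlot)
  if wallSyncDistance < 1:
    # Step 4: backfill only while synced, so backfill never affects
    # syncing status.
    result.doBackfill = true
    result.backfillRange = (backfillSlot, frontfillSlot)

let work = planSyncWork(finalizedDistance = 0, wallSyncDistance = 0,
                        peerCheckpointStartSlot = 3168, peerHeadSlot = 3205,
                        dagFinalizedSlot = 3168, lastSeenCheckpointStartSlot = 3168,
                        backfillSlot = 1024, frontfillSlot = 0)
echo (work.doByRoot, work.doForward, work.doBackfill)  # (true, false, true)
```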
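And a sketch of the pause selection in step 5; the precedence of the three cases and the `timeToNextSlot` input are assumptions of the sketch:

```nim
# Sketch of step 5: how long the peer loop sleeps between iterations.
import std/times

proc loopPause(gotInformation, endlessLoopDetected, syncingFinished: bool,
               timeToNextSlot: Duration): Duration =
  if gotInformation:
    DurationZero                 # 5.1. peer provided information: no pause
  elif endlessLoopDetected:
    initDuration(seconds = 1)    # 5.2. peer provided nothing: 1 second pause
  elif syncingFinished:
    timeToNextSlot               # 5.3. synced: wait up to the next slot
  else:
    DurationZero

echo loopPause(false, true, false, initDuration(seconds = 4))  # 1 second
```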
Also, the new `SyncOverseer` catches a number of EventBus events, so it can maintain the `sync_dag` structures. `SyncManager` and `RequestManager` are deprecated and removed from the codebase.

The core problem of `SyncManager` is that it could work with `BlobSidecar`s, but could not work with `DataColumnSidecar`s: because not all columns are available immediately, it is impossible to download blocks and columns in one step, like `SyncManager` did.

The same problem exists in `RequestManager`: right now, when it has a missing parent, `RequestManager` just randomly selects 2 peers (without any filtering) and tries to download blocks and sidecars from these peers. In the `BlobSidecar` age this works in most cases, but in the `DataColumnSidecar` age the probability of success is much lower...
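To illustrate the `DataColumnSidecar` point: a request can only succeed against peers that actually custody the needed columns, so peer selection has to be column-aware rather than random. A rough sketch under that assumption (simplified types; the PR appears to track this via per-peer column maps such as `ColumnMap`/`PeerEntry`, but this is not its actual code):

```nim
# Why random peer selection breaks down for columns: a peer is only useful
# for a DataColumnSidecar request if it custodies the needed column indices.
import std/[sets, sequtils]

type
  ColumnIndex = uint64
  Peer = object
    id: string
    custodyColumns: HashSet[ColumnIndex]

proc peersForColumns(peers: seq[Peer],
                     needed: HashSet[ColumnIndex]): seq[Peer] =
  ## Keep only peers that custody at least one needed column, instead of
  ## picking 2 peers at random like the old RequestManager did.
  peers.filterIt(card(it.custodyColumns * needed) > 0)

let peers = @[
  Peer(id: "a", custodyColumns: toHashSet([0'u64, 1, 2, 3])),
  Peer(id: "b", custodyColumns: toHashSet([60'u64, 61, 62, 63]))
]
echo peersForColumns(peers, toHashSet([61'u64])).mapIt(it.id)  # @["b"]
```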