BN: New syncing algorithm #7578
base: unstable
Conversation
Would it help to change (parts of) holesky/sepolia/hoodi over to this branch?
Back in goerli/prater times, I found this very helpful for testing, as merging to unstable (even with a subsequent revert) was sketchy, but not having it deployed anywhere was also not very fruitful.
The status-im/infra-nimbus repo controls the branch that is used, and it is automatically rebuilt daily. One can also pick the branch for just a subset of nodes (in ~25% increments), and there is a command to resync those nodes.
My scratchpad from goerli/holesky times, with instructions on how to connect to those servers, how to view the logs, how to restart them, and how to monitor their metrics:
FLEET:
Hostnames: https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&var-instance=geth-03.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-testing&from=now-24h&to=now&refresh=15m
look at the instance/container dropdowns
the pattern should be fairly clear
then, to SSH to them, add .status.im
get SSH access from jakub: tell him your SSH public key (the correct half), and connect using -i the_other_half (your private key) to etan@unstable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
geth-01.ih-eu-mda1.nimbus.holesky.statusim.net (was renamed to status.im):
geth-01.ih-eu-mda1.nimbus.holesky.status.im
https://github.com/status-im/infra-nimbus/blob/0814b659654bb77f50aac7d456767b1794145a63/ansible/group_vars/all.yml#L23
sudo systemctl --no-block start build-beacon-node-holesky-unstable && journalctl -fu build-beacon-node-holesky-unstable
restart fleet
for a in {erigon,neth,geth}-{01..10}.ih-eu-mda1.nimbus.holesky.statusim.net; do ssh -o StrictHostKeyChecking=no $a 'sudo systemctl --no-block start build-beacon-node-holesky-unstable'; done
tail -f /data/beacon-node-prater-unstable/logs/service.log
Remove any changes to callbacks.
Remove some debugging log statements.
Remove some debugging log statements.
I've opened an issue for testing of this branch: please comment in it when you think the branch is ready for that.
…sing PeerEntry's column map.
…p() to blob_quarantine. Add incl()/excl() functions to ColumnMap. Fix peer columns detection logic in doRangeSidecarStep().
Refactoring doPeerUpdateRootsSidecars().
Log when anonymous gossip messages are incoming. Log blocks and sidecars by root differently.
This is a high-level description of the new syncing algorithm.

First of all, let's define some terms:

- `peerStatusCheckpoint` - peer's latest `finalizedCheckpoint` reported via `status` request.
- `peerStatusHead` - peer's latest `headBlockId` reported via `status` request.
- `lastSeenCheckpoint` - the latest `finalizedCheckpoint` reported by our current set of peers, i.e. `max(peerStatusCheckpoint.epoch)`.
- `lastSeenHead` - the latest `headBlockId` reported by our current set of peers, i.e. `max(peerStatusHead.slot)`.
- `finalizedDistance` = `lastSeenCheckpoint.epoch - dag.headState.finalizedCheckpoint.epoch`.
- `wallSyncDistance` = `beaconClock.now().slotOrZero - dag.head.slot`.
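For illustration, a minimal Nim sketch of the two distance metrics, using simplified stand-in types rather than the actual `Epoch`/`Slot`/`BeaconClock` machinery (the clamping to zero is only a convenience of the sketch; the definitions above are plain differences):

```nim
# Minimal sketch of the two distance metrics defined above.
# Epoch/Slot are simplified stand-ins, not the real nimbus-eth2 types.
type
  Epoch = uint64
  Slot = uint64

proc finalizedDistance(lastSeenCheckpointEpoch, localFinalizedEpoch: Epoch): Epoch =
  ## How far the finality reported by our peers is ahead of our own.
  if lastSeenCheckpointEpoch > localFinalizedEpoch:
    lastSeenCheckpointEpoch - localFinalizedEpoch
  else:
    0'u64  # clamped for the sketch; avoids unsigned underflow

proc wallSyncDistance(wallSlot, headSlot: Slot): Slot =
  ## How far our head block lags behind the wall clock.
  if wallSlot > headSlot:
    wallSlot - headSlot
  else:
    0'u64

echo finalizedDistance(100, 97)   # 3 epochs behind the network's finality
echo wallSyncDistance(3205, 3205) # 0 -> in sync with the wall clock
```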
Every peer we get from `PeerPool` will start its loop:

1. Peer updates its `status` information if it is too "old", and "old" depends on the current situation (see the refresh-interval sketch after this list):
   1.1. Update `status` information when forward syncing is active - every `10 * SECONDS_PER_SLOT` seconds.
   1.2. Update `status` information every `SECONDS_PER_SLOT` period when `peerStatusHead.slot.epoch - peerStatusCheckpoint.epoch >= 3` (which means that there is some period of non-finality).
   1.3. In all other cases the node updates `status` information every `5 * SECONDS_PER_SLOT` seconds.
2. Peer performs `by root` requests, where roots are received from the `sync_dag` module. If `finalizedDistance() < 4` epochs it will do (steps 2-4 are illustrated in the work-planning sketch after this list):
   2.1. Request by root blocks in the range `[PeerStatusCheckpoint.epoch.start_slot, PeerStatusHead.slot]`.
   2.2. Request by root sidecars in the range `[getForwardSidecarSlot(), PeerStatusHead.slot]`.
3. If `finalizedDistance() > 1` epochs it will do:
   3.1. Request by range blocks in the range `[dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot]`.
   3.2. Request by range sidecars in the range `[dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot]`.
4. If `wallSyncDistance() < 1` (the backfill process should not affect syncing status, so we pause backfill if the node lost its synced status) it will do:
   4.1. Request by range blocks in the range `[dag.backfill.slot, getFrontfillSlot()]`.
   4.2. Request by range sidecars in the range `[dag.backfill.slot, getBackfillSidecarSlot()]`.
5. At the end of each iteration the loop pauses, depending on the outcome (see the pause sketch after this list):
   5.1. In case the peer provided us with some information - no pause.
   5.2. In case an endless loop is detected (for some unknown reason the peer did not provide any information) - 1 second pause.
   5.3. In case we finished syncing - N seconds, up to the next slot.
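To make step 1 concrete, here is a hedged sketch of the refresh-interval selection. `statusRefreshInterval` and its inputs are illustrative names rather than the actual implementation, and `SECONDS_PER_SLOT` is hard-coded to the mainnet value:

```nim
# Hypothetical helper illustrating step 1: how often to refresh a peer's
# `status`, depending on the node's current situation.
const SECONDS_PER_SLOT = 12  # mainnet value; network-dependent in reality

proc statusRefreshInterval(forwardSyncActive: bool,
                           headEpoch, checkpointEpoch: uint64): int =
  if forwardSyncActive:
    # 1.1. Forward syncing is active.
    10 * SECONDS_PER_SLOT
  elif headEpoch - checkpointEpoch >= 3:
    # 1.2. Peer reports a period of non-finality.
    SECONDS_PER_SLOT
  else:
    # 1.3. All other cases.
    5 * SECONDS_PER_SLOT

echo statusRefreshInterval(true, 105, 104)   # 120 seconds
echo statusRefreshInterval(false, 105, 100)  # 12 seconds
echo statusRefreshInterval(false, 105, 104)  # 60 seconds
```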
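A sketch of how the distances gate the work done in steps 2-4. The `SyncWork` structure and `planSyncWork` are invented for illustration; the sidecar-specific bounds (`getForwardSidecarSlot()` / `getBackfillSidecarSlot()`) are folded into the block ranges here for brevity:

```nim
# Sketch of steps 2-4: which requests a peer loop issues, based on the
# distances defined earlier. Types and structure are simplified stand-ins.
type
  Slot = uint64
  SyncWork = object
    doByRoot, doForward, doBackfill: bool
    byRootRange: (Slot, Slot)    # step 2: blocks/sidecars by root
    forwardRange: (Slot, Slot)   # step 3: blocks/sidecars by range
    backfillRange: (Slot, Slot)  # step 4: backfill by range

proc planSyncWork(finalizedDistance, wallSyncDistance: uint64,
                  peerCheckpointStartSlot, peerHeadSlot: Slot,
                  dagFinalizedSlot, lastSeenCheckpointStartSlot: Slot,
                  backfillSlot, frontfillSlot: Slot): SyncWork =
  if finalizedDistance < 4:
    # Step 2: close to the network's finality -> request missing blocks and
    # sidecars by root (roots come from the sync_dag module).
    result.doByRoot = true
    result.byRootRange = (peerCheckpointStartSlot, peerHeadSlot)
  if finalizedDistance > 1:
    # Step 3: behind finality -> forward sync blocks and sidecars by range.
    result.doForward = true
    result.forwardRange = (dagFinalizedSlot, lastSeenCheckpointStartSlot)
  if wallSyncDistance < 1:
    # Step 4: backfill only while synced, so backfill never affects
    # syncing status.
    result.doBackfill = true
    result.backfillRange = (backfillSlot, frontfillSlot)

let work = planSyncWork(finalizedDistance = 0, wallSyncDistance = 0,
                        peerCheckpointStartSlot = 3168, peerHeadSlot = 3205,
                        dagFinalizedSlot = 3168, lastSeenCheckpointStartSlot = 3168,
                        backfillSlot = 1024, frontfillSlot = 0)
echo (work.doByRoot, work.doForward, work.doBackfill)  # (true, false, true)
```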
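And a sketch of the pause selection in step 5; the precedence of the three cases and the `timeToNextSlot` input are assumptions of the sketch:

```nim
# Sketch of step 5: how long the peer loop sleeps between iterations.
import std/times

proc loopPause(gotInformation, endlessLoopDetected, syncingFinished: bool,
               timeToNextSlot: Duration): Duration =
  if gotInformation:
    DurationZero                 # 5.1. peer provided information: no pause
  elif endlessLoopDetected:
    initDuration(seconds = 1)    # 5.2. peer provided nothing: 1 second pause
  elif syncingFinished:
    timeToNextSlot               # 5.3. synced: wait up to the next slot
  else:
    DurationZero

echo loopPause(false, true, false, initDuration(seconds = 4))  # 1 second
```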
Also, the new `SyncOverseer` catches a number of EventBus events, so it can maintain the `sync_dag` structures. `SyncManager` and `RequestManager` are deprecated and removed from the codebase.

The core problem of `SyncManager` is that it could work with `BlobSidecar`s, but could not work with `DataColumnSidecar`s: because not all columns are available immediately, it is impossible to download blocks and columns in one step, like `SyncManager` did.

The same problem exists in `RequestManager`: right now, when it has a missing parent, `RequestManager` just randomly selects 2 peers (without any filtering) and tries to download blocks and sidecars from these peers. In the `BlobSidecar` age this works in most cases, but in the `DataColumnSidecar` age the probability of success is much lower...
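To illustrate the `DataColumnSidecar` point: a request can only succeed against peers that actually custody the needed columns, so peer selection has to be column-aware rather than random. A rough sketch under that assumption (simplified types; the PR appears to track this via per-peer column maps such as `ColumnMap`/`PeerEntry`, but this is not its actual code):

```nim
# Why random peer selection breaks down for columns: a peer is only useful
# for a DataColumnSidecar request if it custodies the needed column indices.
import std/[sets, sequtils]

type
  ColumnIndex = uint64
  Peer = object
    id: string
    custodyColumns: HashSet[ColumnIndex]

proc peersForColumns(peers: seq[Peer],
                     needed: HashSet[ColumnIndex]): seq[Peer] =
  ## Keep only peers that custody at least one needed column, instead of
  ## picking 2 peers at random like the old RequestManager did.
  peers.filterIt(card(it.custodyColumns * needed) > 0)

let peers = @[
  Peer(id: "a", custodyColumns: toHashSet([0'u64, 1, 2, 3])),
  Peer(id: "b", custodyColumns: toHashSet([60'u64, 61, 62, 63]))
]
echo peersForColumns(peers, toHashSet([61'u64])).mapIt(it.id)  # @["b"]
```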