Skip to content

Conversation

jwasinger
Copy link
Contributor

@jwasinger jwasinger commented Mar 27, 2024

Lifting this test-case from #29281 to provide some visibility on what I am stuck on and trying to debug.

The test syncs a chain, then syncs to a fork of the chain. Expect that the sync origin after this is at the fork block.

Logs indicate that the first sync succeeds. The sync to the fork succeeds in retrieving the skeleton but later fails after trying to start the backfiller:

=== RUN   TestBeaconForkedSyncProgress68Full
INFO [03-26|17:47:14.592] Syncing beacon headers                   downloaded=512 left=902 eta=5.757ms
INFO [03-26|17:47:14.597] Syncing beacon headers                   downloaded=1414 left=0   eta=0s
ERROR[03-26|17:47:14.597] Latest filled block is not available
INFO [03-26|17:47:14.597] Block synchronisation started
INFO [03-26|17:47:14.603] Imported new chain segment               number=5 hash=15e2c1..49705f blocks=5 txs=1 mgas=0.021 elapsed="454.291µs" mgasps=46.226 age=55y1w2d snapdiffs=269.00B triedirty=2.54KiB
INFO [03-26|17:47:14.650] Imported new chain segment               number=545 hash=9b529f..9ddd47 blocks=540 txs=24 mgas=0.504 elapsed=43.664ms    mgasps=11.543 age=55y1w1d snapdiffs=25.46KiB triedirty=60.58KiB
INFO [03-26|17:47:14.724] Imported new chain segment               number=1415 hash=8d10d7..7c444b blocks=870 txs=31 mgas=0.651 elapsed=72.514ms    mgasps=8.978  age=55y1w1d snapdiffs=67.57KiB triedirty=69.76KiB
syncing to fork B
INFO [03-26|17:47:14.724] Restarting sync cycle                    reason="chain forked, ancestor: 1414, hash: 0x2111945c4e6dd0c253a680dafef776c51bf9f2774f49d26d3e61bd4292175e6a, want: 0x7f84d7eb6d375f86682bfa635bf6b52b7147ff64b61246bf3af8baa90a3c870c"
INFO [03-26|17:47:14.727] Syncing beacon headers                   downloaded=1605 left=0   eta=0s
ERROR[03-26|17:47:14.727] Pivot header is not found                number=1351
ERROR[03-26|17:47:14.727] Beacon backfilling failed                err="pivot header is not found"
    downloader_test.go:1390: Failed to sync chain in three seconds

@rjl493456442
Copy link
Member

It detects a bug in skeleton sync code.

Specifically cleanStales will be called at the end of a skeleton sync cycle. However, if the filled header passed into function is the CurrentSnapBlock of last sync cycle, and the new skeleton head has the same height or even below the filled header, the entire skeleton header batch will be discarded, https://github.com/ethereum/go-ethereum/blob/master/eth/downloader/skeleton.go#L1150 Therefore the pivot block header is not accessible as the corresponding skeleton header is already removed.

I have to think how to properly fix it (or if you want to fix it by yourself, go ahead!). It's a nice uni test!

@jwasinger
Copy link
Contributor Author

jwasinger commented Mar 27, 2024

Great insight! I will attempt a fix.

@jwasinger jwasinger force-pushed the downloader-fork-beacon-test branch from 96f95fa to 70be2dd Compare April 7, 2024 18:54
@jwasinger
Copy link
Contributor Author

I have pushed a fix for this test-case:

When a skeleton sync has just linked, we assume that if the filled block is in the range of the skeleton it means there is a fork. Rewind the blockchain to the common ancestor of the skeleton and the filled block before proceeding with state/block backfilling.

I've noticed that this case triggers for any fork that occurs where the fork is larger than 1 block.

@jwasinger jwasinger changed the title eth/downloader: Add Test case which Beacon Syncs two forks eth/downloader: Fix Case where Skeleton Reorgs Below the Filled Block Apr 8, 2024
@jwasinger jwasinger marked this pull request as ready for review April 8, 2024 14:57
@karalabe
Copy link
Member

This seems to fix test and it seems to be the correct fix to me:

diff --git a/eth/downloader/skeleton.go b/eth/downloader/skeleton.go
index 873ee950b6..9f8953f4db 100644
--- a/eth/downloader/skeleton.go
+++ b/eth/downloader/skeleton.go
@@ -1132,6 +1132,15 @@ func (s *skeleton) cleanStales(filled *types.Header) error {
        if number+1 == s.progress.Subchains[0].Tail {
                return nil
        }
+       // If the latest fill was on a different subchain, it means the backfiller
+       // was interrupted before it got to do any meaningful work, no cleanup
+       header := rawdb.ReadSkeletonHeader(s.db, filled.Number.Uint64())
+       if header == nil {
+               return fmt.Errorf("filled header outside of skeleton range")
+       } else if header.Hash() != filled.Hash() {
+               log.Debug("Filled header on different sidechain", "number", number, "filled", filled.Hash(), "skeleton", header.Hash())
+               return nil
+       }
        var (
                start uint64
                end   uint64

I.e. When doing the skeleton cleanup, we check if the filled header is acually within the range of what we were meant to backfill. If not, it means the backfill was a noop (possibly because we started and stopped it so quickly that it didn't have time to do any meaningful work). In that case, just don't clean up anything.

An alternative would be for the backfiller itself to mandatorily do a chain reorg (in such cases), but that leaks a bit of chain management into the backfiller, which for now is completely handled by blcokchain.ImportXXX triggered by the downloader. So whilst triggering a chain reorg in the backfiller should also work just as fine, it introduces a second chain mutation point and I'd rather keep it to one.

What's unclear a bit is whether it can happen that the backfiller returns a header above the skeleton chain. In theory not (because that would mean it filled something it didn't know about), but that would error out with the new code so we should make sure.

@rjl493456442
Copy link
Member

What's unclear a bit is whether it can happen that the backfiller returns a header above the skeleton chain. In theory not (because that would mean it filled something it didn't know about), but that would error out with the new code so we should make sure.

It's totally possible.

If I understand correctly, there are four potential scenarios could happen for stale header clean.

(a) filled.number + 1 < skeleton.tail.
In this case the skeleton is above the local chain and for sure it's not linked with the local chain at all. It's an abnormal case in which we shouldn't clean up anything and bail out an error log

(b) filled.number + 1 == skeleton.tail
In this case a new skeleton segment has been retrieved but not yet consumed by the backfiller. Nothing to clean.

(c) skeleton.tail <= filled.number <= skeleton.head
In this case a reorg happens and skeleton connects to an ancestor of last filled header. In this case nothing should be cleaned as the chain split.

(d) filled.number > skeleton.head
In this case a reorg happens and skeleton connects to an ancestor of last filled header, but the new fork is shorter than local one . In this case nothing should be cleaned as the chain split.


The fix looks good to me, although i would vote a tiny change. If the filled.number > skeleton.head (Filled header outside of skeleton range), we should print out a log instead of returning the error.

diff --git a/eth/downloader/skeleton.go b/eth/downloader/skeleton.go
index 873ee950b6..04421a2bf5 100644
--- a/eth/downloader/skeleton.go
+++ b/eth/downloader/skeleton.go
@@ -1132,6 +1132,16 @@ func (s *skeleton) cleanStales(filled *types.Header) error {
 	if number+1 == s.progress.Subchains[0].Tail {
 		return nil
 	}
+	// If the latest fill was on a different subchain, it means the backfiller
+	// was interrupted before it got to do any meaningful work, no cleanup
+	header := rawdb.ReadSkeletonHeader(s.db, filled.Number.Uint64())
+	if header == nil {
+		log.Debug("Filled header outside of skeleton range", "number", number, "head", s.progress.Subchains[0].Head, "tail", s.progress.Subchains[0].Tail)
+		return nil
+	} else if header.Hash() != filled.Hash() {
+		log.Debug("Filled header on different sidechain", "number", number, "filled", filled.Hash(), "skeleton", header.Hash())
+		return nil
+	}
 	var (
 		start uint64
 		end   uint64

@jwasinger jwasinger force-pushed the downloader-fork-beacon-test branch from 4ad005f to 6ab5719 Compare April 12, 2024 14:49
@jwasinger jwasinger force-pushed the downloader-fork-beacon-test branch from 6ab5719 to d1aa652 Compare April 12, 2024 14:50
…nis non-existent or different than the corresponding skeleton header. add test case that beacon syncs a chain, beacon syncs to a separate fork

Co-authored-by: Péter Szilágyi <[email protected]>
@jwasinger jwasinger force-pushed the downloader-fork-beacon-test branch from d1aa652 to 9f6de6d Compare April 12, 2024 14:55
@jwasinger
Copy link
Contributor Author

Okay this should be good for merge.

Copy link
Contributor

@holiman holiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@holiman holiman changed the title eth/downloader: Fix Case where Skeleton Reorgs Below the Filled Block eth/downloader: fix case where skeleton reorgs below the filled block Apr 23, 2024
@holiman holiman added this to the 1.14.0 milestone Apr 24, 2024
@holiman holiman merged commit 5f3c58f into ethereum:master Apr 24, 2024
jorgemmsilva pushed a commit to iotaledger/go-ethereum that referenced this pull request Jun 17, 2024
…ethereum#29358)

This change adds a testcase and fixes a corner-case in the skeleton sync.

With this change, when doing the skeleton cleanup, we check if the filled header is acually within the range of what we were meant to backfill. If not, it means the backfill was a noop (possibly because we started and stopped it so quickly that it didn't have time to do any meaningful work). In that case, just don't clean up anything.

---------

Co-authored-by: Péter Szilágyi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants