Implement Paimon Source Incremental Sync #780
base: main
return false;
}

// Check 3: Verify a snapshot exists at or before the instant
Can this check simply verify that earliestSnapshot.timeMillis <= timeInMillis?
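A minimal sketch of the simplification suggested in this comment. It does not use Paimon's classes: `SnapshotAvailabilityCheck`, `snapshotExistsAtOrBefore`, and the `OptionalLong` input are hypothetical stand-ins, and the sketch assumes snapshot commit times never decrease as snapshot ids increase, so that comparing only the earliest retained snapshot is enough.

```java
import java.time.Instant;
import java.util.OptionalLong;

class SnapshotAvailabilityCheck {
  /**
   * Returns true when the earliest retained snapshot was committed at or before the instant.
   * earliestSnapshotTimeMillis is a hypothetical input; empty means the table has no snapshots.
   */
  static boolean snapshotExistsAtOrBefore(OptionalLong earliestSnapshotTimeMillis, Instant instant) {
    long timeInMillis = instant.toEpochMilli();
    return earliestSnapshotTimeMillis.isPresent()
        && earliestSnapshotTimeMillis.getAsLong() <= timeInMillis;
  }

  public static void main(String[] args) {
    Instant instant = Instant.ofEpochMilli(2_000L);
    // Earliest snapshot is older than the instant: the check passes.
    System.out.println(snapshotExistsAtOrBefore(OptionalLong.of(1_000L), instant));
    // Earliest snapshot is newer than the instant: the check fails.
    System.out.println(snapshotExistsAtOrBefore(OptionalLong.of(3_000L), instant));
    // No snapshots at all: the check fails.
    System.out.println(snapshotExistsAtOrBefore(OptionalLong.empty(), instant));
  }
}
```

If the ordering assumption does not hold for the table, a scan for any snapshot at or before the instant would still be needed.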
assertTrue(filesDiff.getFilesAdded().size() > 0);

// Verify removed files collection exists (size may vary based on compaction behavior)
assertNotNull(filesDiff.getFilesRemoved());
Is there any way to configure the test so that the sizes are predictable and we can assert on them?
Also, should we assert that the number of files removed is non-zero?
CommitsBacklog<Snapshot> backlog = conversionSource.getCommitsBacklog(instantsForSync);

// Verify we get at least the second snapshot (may get more if insertRows creates multiple)
I'm not familiar with Paimon. What would cause a single round of inserts to create multiple snapshots?
// Verify we get at least the second snapshot (may get more if insertRows creates multiple)
assertNotNull(backlog);
assertTrue(backlog.getCommitsToProcess().size() >= 1);
If we cannot know the size for certain upfront, we may want to assert that the first snapshot is not in the list of commits to process.
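One way to express the assertion proposed here, sketched in plain Java without JUnit or Paimon on the classpath. `BacklogChecks` and `excludesSnapshot` are hypothetical helpers, and snapshots are represented only by their ids; in the actual test the ids would come from the entries of `backlog.getCommitsToProcess()` and from the first snapshot, assuming the snapshot type exposes its id.

```java
import java.util.Arrays;
import java.util.List;

class BacklogChecks {
  /** Returns true if none of the backlog's snapshot ids equals the given id. */
  static boolean excludesSnapshot(List<Long> backlogSnapshotIds, long snapshotId) {
    return backlogSnapshotIds.stream().noneMatch(id -> id == snapshotId);
  }

  public static void main(String[] args) {
    long firstSnapshotId = 1L;
    // Backlog holding only later snapshots: the first snapshot is correctly absent.
    List<Long> backlogIds = Arrays.asList(2L, 3L);
    System.out.println(excludesSnapshot(backlogIds, firstSnapshotId));
  }
}
```

This check holds however many snapshots a single round of inserts produces, so it can sit alongside the existing `size() >= 1` assertion.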
@Test
void testIsIncrementalSyncSafeFromReturnsFalse() {
Instant testInstant = Instant.now();
void testIsIncrementalSyncSafeFromReturnsTrueForValidInstant() {
Can we add a test where isIncrementalSyncSafeFrom returns false because the instant is before the first snapshot?
// Insert more data to create a second snapshot
testTable.insertRows(3);
org.apache.paimon.Snapshot secondSnapshot = paimonTable.snapshotManager().latestSnapshot();
Can we import org.apache.paimon.Snapshot here?
Implements #754
What is the purpose of the pull request
Implement Paimon Source Incremental Sync
Brief change log
• Implemented getTableChangeForCommit() to extract file changes (added/removed) from delta manifests for incremental sync
• Implemented getCommitsBacklog() to identify snapshots that need to be processed since the last sync instant
• Implemented isIncrementalSyncSafeFrom() to validate if incremental sync is safe from a given instant by checking snapshot availability
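The backlog step in the change log can be illustrated with a small sketch. This is not the PR's implementation: `BacklogSketch`, `Snap`, and `snapshotIdsAfter` are hypothetical names, snapshots are reduced to an id and a commit time, and the strict "after the last sync instant" comparison is an assumption about the intended boundary behavior.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BacklogSketch {
  /** Hypothetical stand-in for a snapshot: an id plus a commit time in epoch millis. */
  static final class Snap {
    final long id;
    final long timeMillis;

    Snap(long id, long timeMillis) {
      this.id = id;
      this.timeMillis = timeMillis;
    }
  }

  /** Returns ids of snapshots committed strictly after the last sync instant, in input order. */
  static List<Long> snapshotIdsAfter(List<Snap> snapshots, Instant lastSyncInstant) {
    long lastSyncMillis = lastSyncInstant.toEpochMilli();
    List<Long> pending = new ArrayList<>();
    for (Snap s : snapshots) {
      if (s.timeMillis > lastSyncMillis) {
        pending.add(s.id);
      }
    }
    return pending;
  }

  public static void main(String[] args) {
    List<Snap> snapshots =
        Arrays.asList(new Snap(1, 1_000L), new Snap(2, 2_000L), new Snap(3, 3_000L));
    // Last sync happened at the first snapshot's commit time, so snapshots 2 and 3 remain.
    System.out.println(snapshotIdsAfter(snapshots, Instant.ofEpochMilli(1_000L)));
  }
}
```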
Verify this pull request
• Added tests in TestPaimonDataFileExtractor to cover the extractFilesDiff() logic.
• Added tests in TestPaimonConversionSource to cover the incremental sync methods.
• Existing integration tests in ITConversionController verify end-to-end incremental sync behavior.