Description
Search before asking
- I searched in the issues and found nothing similar.
Paimon version
Master
Compute Engine
Spark
Minimal reproduce step
// first part
spark.sql("CREATE TABLE t (id INT, data INT) TBLPROPERTIES ('row-tracking.enabled' = 'true')")
spark.sql("INSERT INTO t SELECT /*+ REPARTITION(1) */ id, id AS data FROM range(1, 4)")
// second part
spark.sql("UPDATE t SET data = 22 WHERE id = 2")
// third part
spark.sql("INSERT INTO t VALUES (4, 4), (5, 5)")
spark.sql("SELECT *, _ROW_ID, _SEQUENCE_NUMBER FROM t").show()
/* the result of select
+---+----+-------+----------------+
| id|data|_ROW_ID|_SEQUENCE_NUMBER|
+---+----+-------+----------------+
|  1|   1|      0|               1|
|  2|  22|      1|               2|
|  3|   3|      2|               1|
|  4|   4|      6|               3|
|  5|   5|      7|               3|
+---+----+-------+----------------+
*/
What doesn't meet your expectations?
When the second part of the code above (the UPDATE) runs, the original data is read from the old file and written to a new file together with its _ROW_ID and _SEQUENCE_NUMBER values. At this point the new file contains both _ROW_ID and _SEQUENCE_NUMBER, but firstRowId in its file metadata is null. Later, in the commit phase, firstRowId in the file metadata is assigned from the snapshot's nextRowId. The row IDs stored in the file therefore no longer match the metadata. As a result, a query by row ID may miss records, because Paimon core uses firstRowId in the metadata to skip files when it generates the scan plan.
In addition, when the third part of the code (the INSERT) runs, the same issue gives the newly inserted rows unexpected _ROW_ID values (6 and 7 in the result above).
A visualization of this issue is provided below.

The same issue exists for MERGE INTO (when only 'row-tracking.enabled' = 'true' is set). One possible fix is to assign firstRowId in the file metadata during the write phase for UPDATE and MERGE INTO, rather than deferring it to the commit phase.
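To make the mismatch concrete, here is a small toy model in plain Scala. It is only a sketch of the behavior described above: `DataFile`, `commit`, and `plan` are hypothetical names and do not correspond to Paimon classes or APIs. It assumes the commit gives any file without a firstRowId the snapshot's nextRowId, and that planning keeps a file only when the requested row ID falls inside the range implied by firstRowId and the row count.

```scala
// Toy model only: these types and functions are hypothetical, not Paimon's API.
final case class DataFile(name: String, storedRowIds: Vector[Long], firstRowId: Option[Long]) {
  def rowCount: Long = storedRowIds.size.toLong
}

object RowIdMismatchSketch {
  // Commit phase (assumed behavior): a file whose firstRowId is still empty
  // takes the snapshot's nextRowId, and nextRowId advances by the file's row count.
  def commit(files: Vector[DataFile], nextRowId: Long): (Vector[DataFile], Long) =
    files.foldLeft((Vector.empty[DataFile], nextRowId)) {
      case ((done, next), f) if f.firstRowId.isEmpty =>
        (done :+ f.copy(firstRowId = Some(next)), next + f.rowCount)
      case ((done, next), f) =>
        (done :+ f, next)
    }

  // Scan planning by metadata (assumed behavior): keep a file only if rowId
  // lies in [firstRowId, firstRowId + rowCount).
  def plan(files: Vector[DataFile], rowId: Long): Vector[DataFile] =
    files.filter(f => f.firstRowId.exists(s => rowId >= s && rowId < s + f.rowCount))

  def main(args: Array[String]): Unit = {
    // File rewritten by the UPDATE: it stores row IDs 0, 1, 2 but has no firstRowId yet.
    val rewritten = DataFile("update-output", Vector(0L, 1L, 2L), None)
    val (committed, next) = commit(Vector(rewritten), nextRowId = 3L)
    println(committed.head.firstRowId)   // Some(3), although the file stores 0, 1, 2
    println(next)                        // 6, so the next INSERT starts at row ID 6
    println(plan(committed, 1L).isEmpty) // true: the file holding row ID 1 is skipped

    // Suggested direction: set firstRowId during the write phase instead.
    val assignedAtWrite = rewritten.copy(firstRowId = Some(0L))
    val (fixed, nextFixed) = commit(Vector(assignedAtWrite), nextRowId = 3L)
    println(nextFixed)                   // 3: nextRowId is not consumed by the rewrite
    println(plan(fixed, 1L).map(_.name)) // Vector(update-output)
  }
}
```

In this model the rewritten file ends up with firstRowId 3 and nextRowId moves to 6, which lines up with rows 4 and 5 receiving _ROW_ID 6 and 7 in the reproduce output. Whether a write-phase assignment should leave nextRowId untouched in the real commit path is an assumption of the sketch.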
Anything else?
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!