[Bug] RowId mismatch in file and metadata #6747

@Kkkaneki-k

Description

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

Master

Compute Engine

Spark

Minimal reproduce step

// first part
spark.sql("CREATE TABLE t (id INT, data INT) TBLPROPERTIES ('row-tracking.enabled' = 'true')")
spark.sql("INSERT INTO t SELECT /*+ REPARTITION(1) */ id, id AS data FROM range(1, 4)")

// second part
spark.sql("UPDATE t SET data = 22 WHERE id = 2")

// third part
spark.sql("INSERT INTO t VALUES (4, 4), (5, 5)")
spark.sql("SELECT *, _ROW_ID, _SEQUENCE_NUMBER FROM t").show
/* the result of select
+---+----+-------+----------------+
| id|data|_ROW_ID|_SEQUENCE_NUMBER|
+---+----+-------+----------------+
|  1|   1|      0|               1|
|  2|  22|      1|               2|
|  3|   3|      2|               1|
|  4|   4|      6|               3|
|  5|   5|      7|               3|
+---+----+-------+----------------+
*/

What doesn't meet your expectations?

When the second part of the code above (the update) runs, the original rows are read from the old file and written to a new file together with their _ROW_ID and _SEQUENCE_NUMBER values. At this point the new file stores both columns, but firstRowId in its file metadata is null. Later, during the commit phase, firstRowId in the metadata is assigned from the snapshot's nextRowId. The row IDs stored in the file therefore no longer match the row ID range recorded in the metadata. As a result, a query by row ID may miss records, because Paimon core uses firstRowId in the metadata to skip files when generating the scan plan.
Additionally, when the third part of the code (the insert) runs, this issue also causes the newly inserted rows to get unexpected _ROW_ID values (6 and 7 in the result above).
A visualization of this issue is provided below.
[Image: visualization of the rowId mismatch]
The same issue affects the merge into operation (when only 'row-tracking.enabled' = 'true' is set). To resolve it, firstRowId in the metadata may need to be assigned during the write phase for update and merge into, rather than deferred to the commit phase.
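
To make the sequence easier to follow, here is a small standalone Scala model of the behaviour described above. It is only a sketch, not Paimon's code: `DataFile`, `Snapshot`, `commit`, `plan`, and `replay` are hypothetical names, and it assumes the update rewrites the whole file with all three rows, which is what the _ROW_ID values in the result suggest. The `assignAtWrite` flag models one possible shape of the suggested fix, where the rewritten file carries over the original file's firstRowId at write time.

```scala
// Simplified model of the behaviour described in this issue.
// All names are hypothetical; this is not Paimon's actual code.
object RowIdMismatchSketch {

  // storedRowIds: _ROW_ID values physically written into the file, if any.
  // firstRowId:   the value recorded in the file metadata (None = null).
  final case class DataFile(
      name: String,
      rowCount: Long,
      storedRowIds: Option[Seq[Long]],
      firstRowId: Option[Long])

  final case class Snapshot(nextRowId: Long, files: Vector[DataFile])

  // Commit phase: a file without firstRowId gets the snapshot's nextRowId,
  // and nextRowId advances by the file's row count.
  def commit(snap: Snapshot, removed: Set[String], added: Seq[DataFile]): Snapshot = {
    val kept = snap.files.filterNot(f => removed.contains(f.name))
    added.foldLeft(snap.copy(files = kept)) { (s, f) =>
      f.firstRowId match {
        case Some(_) => s.copy(files = s.files :+ f)
        case None =>
          Snapshot(s.nextRowId + f.rowCount, s.files :+ f.copy(firstRowId = Some(s.nextRowId)))
      }
    }
  }

  // Row ids a reader returns: stored values win, otherwise derived from metadata.
  def visibleRowIds(f: DataFile): Seq[Long] =
    f.storedRowIds.getOrElse {
      val first = f.firstRowId.getOrElse(sys.error(s"${f.name} is not committed"))
      (0L until f.rowCount).map(first + _)
    }

  // Scan planning: keep a file only if its metadata range covers the row id.
  def plan(snap: Snapshot, rowId: Long): Vector[DataFile] =
    snap.files.filter(f =>
      f.firstRowId.exists(first => rowId >= first && rowId < first + f.rowCount))

  // Replays the three parts of the reproduce step.
  // assignAtWrite = true models the suggested fix (an assumption about its shape).
  def replay(assignAtWrite: Boolean): Snapshot = {
    // first part: insert three rows
    val s1 = commit(Snapshot(0L, Vector.empty), Set.empty, Seq(DataFile("f1", 3L, None, None)))
    // second part: the update rewrites f1 into f2, keeping the stored row ids
    val old = s1.files.head
    val rewritten = DataFile(
      "f2", old.rowCount, Some(visibleRowIds(old)),
      if (assignAtWrite) old.firstRowId else None)
    val s2 = commit(s1, Set(old.name), Seq(rewritten))
    // third part: insert two more rows
    commit(s2, Set.empty, Seq(DataFile("f3", 2L, None, None)))
  }

  def main(args: Array[String]): Unit = {
    for (fix <- Seq(false, true)) {
      val s = replay(fix)
      println(s"assignAtWrite=$fix nextRowId=${s.nextRowId}")
      s.files.foreach { f =>
        println(s"  ${f.name}: metadata firstRowId=${f.firstRowId.get}, " +
          s"row ids=${visibleRowIds(f).mkString(",")}")
      }
      println(s"  files planned for _ROW_ID = 1: ${plan(s, 1L).map(_.name).mkString("[", ",", "]")}")
    }
  }
}
```

With `assignAtWrite = false` the model reproduces the result above: the rewritten file stores row IDs 0, 1, 2 while its metadata says firstRowId = 3, the inserted rows get 6 and 7, and no file is planned for _ROW_ID = 1. With `assignAtWrite = true` the metadata and stored row IDs agree, and in this model the inserted rows get 3 and 4.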

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Labels

bug (Something isn't working)
