[spark] paimon-spark supports row id push down #6697
Conversation
It seems your modification breaks some tests.

Yes, I am working on it.

I have discovered some issues in Paimon core and opened issue #6747 to describe them. These issues will affect Spark's ability to read data based on rowIds, so I will temporarily move this PR to draft status.

Thanks @Kkkaneki-k

Hello, row id pushdown now only works for data evolution tables. Can you continue to finish this?

OK, I will continue to complete this PR.
Force-pushed 75220e5 to 44bfadf
@JingsongLi I've finished the modifications. PTAL if you have some time, thanks!

@Kkkaneki-k Can you rebase onto the latest master?
| + "By default is the number of processors available to the Java virtual machine."); | ||
|
|
||
| public static final ConfigOption<Boolean> ROW_ID_PUSH_DOWN_ENABLED = | ||
| key("row-id-push-down.enabled") |
Remove this option, it is useless.
I think this option is necessary: it disables rowId pushdown for non-data-evolution tables. Currently, due to the issue in #6747, rowId pushdown should not be enabled for non-data-evolution tables.
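For reference, a declaration of this shape is typically completed as sketched below; the default value and description are assumptions, not the PR's actual definition.

```java
// A sketch of how such an option is usually declared in Paimon's options classes.
// The default value and description below are assumptions, not this PR's exact code.
import static org.apache.paimon.options.ConfigOptions.key;

import org.apache.paimon.options.ConfigOption;

public class RowIdPushDownOptions {

    public static final ConfigOption<Boolean> ROW_ID_PUSH_DOWN_ENABLED =
            key("row-id-push-down.enabled")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription(
                            "Whether to push filters on _ROW_ID down to the table scan. "
                                    + "Kept disabled here because of the non-data-evolution "
                                    + "issue described in #6747.");
}
```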
```scala
    table: InnerTable,
    requiredSchema: StructType,
    filters: Seq[Predicate],
    override val rowIds: Seq[JLong],
```
rowIds -> pushedRowIds
done
```java
 * AND _ROW_ID IN (1, 2)}).
 * </ul>
 */
public class RowIdPredicateVisitor implements PredicateVisitor<Set<Long>> {
```
Can we return Set<Range>? In this way, we can support min max pushdown.
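A minimal, self-contained sketch of the idea in plain Java; `Range` here is a local illustrative type, not Paimon's actual class. It collapses the row ids extracted from an IN/equality predicate into contiguous ranges, which is the kind of `Set<Range>` result that min/max pruning could then work on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: "Range" is a local type for this sketch, not Paimon's class.
final class Range {
    final long min; // inclusive
    final long max; // inclusive

    Range(long min, long max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public String toString() {
        return "[" + min + ", " + max + "]";
    }
}

final class RowIdRanges {

    /** Collapse a set of row ids into sorted, contiguous ranges. */
    static List<Range> toRanges(Set<Long> rowIds) {
        List<Range> ranges = new ArrayList<>();
        Long start = null;
        Long prev = null;
        for (long id : new TreeSet<>(rowIds)) {
            if (start == null) {
                start = id;
            } else if (id != prev + 1) {
                // gap found: close the current range and start a new one
                ranges.add(new Range(start, prev));
                start = id;
            }
            prev = id;
        }
        if (start != null) {
            ranges.add(new Range(start, prev));
        }
        return ranges; // e.g. {1, 2, 5} -> [1, 2], [5, 5]
    }
}
```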
done
Force-pushed 44bfadf to e58cf42

Force-pushed 2971128 to 92d285e
```scala
        false,
        Collections.singletonList(new DataField(-1, ROW_ID.name(), DataTypes.BIGINT())))
      val converterWithRowId = new SparkFilterConverter(rowTypeWithRowId)
      val newPredicate = converterWithRowId.convertIgnoreFailure(filter)
```
Can we pass the filter containing RowId into PaimonScanBuilder, and let Paimon Core parse it and generate the corresponding range itself? It feels redundant to reimplement this logic in every engine.
Thanks for your review! I've carefully considered your suggestion and think that this change might introduce the following two problems:

- When a filter containing `_ROW_ID` cannot be consumed, we need to return it to the engine as a post-scan filter. This may be difficult to achieve if Paimon Core itself consumes it and generates the corresponding range.
- Currently, ReadBuilder requires separate inputs of filters containing `_ROW_ID` and filters without `_ROW_ID` during the build process. This means we need to differentiate between these two types of filters in the engine and input them separately (unless we modify ReadBuilder to handle this differentiation automatically during the build process); see the sketch below.
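To make the second point concrete, here is an illustrative sketch (not code from this PR; names are hypothetical) of the split an engine currently has to perform on Spark source filters before handing them to the ReadBuilder.

```java
// Illustrative only: separate Spark source filters that reference _ROW_ID
// from ordinary data filters before they are pushed down.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.sources.Filter;

final class SplitFilters {

    final List<Filter> rowIdFilters = new ArrayList<>();
    final List<Filter> dataFilters = new ArrayList<>();

    static SplitFilters of(List<Filter> filters) {
        SplitFilters split = new SplitFilters();
        for (Filter filter : filters) {
            // Filter#references() lists the columns a Spark filter touches.
            if (Arrays.asList(filter.references()).contains("_ROW_ID")) {
                split.rowIdFilters.add(filter);
            } else {
                split.dataFilters.add(filter);
            }
        }
        return split;
    }
}
```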
- It should pass the filter containing `_ROW_ID` as `pushedDataFilters` to Paimon, while still adding it to the postFilter for Spark to handle. Therefore, almost no changes are needed in Paimon's Spark connector, except possibly updating `new SparkFilterConverter(rowType)` with requiredSchema or a rowType with row id.
- Yes, that's exactly where I intend to put it. CC @JingsongLi

ReadBuilderImpl:

```java
private InnerTableScan configureScan(InnerTableScan scan) {
    return scan.withFilter(filter)
            .withReadType(readType)
            .withPartitionFilter(partitionFilter)
            // calculate rowRanges from filter
            .withRowRanges(rowRanges);
}
```

I believe users would much prefer writing filters that include `_ROW_ID` when using the ReadBuilder API, rather than having to understand what `List rowRanges` is and convert to it.
Thanks for your reply! I think your point is correct, and I will modify my code based on your suggestions.
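For readers following the thread, a minimal sketch of the user-facing flow this converges on, assuming the existing Table/ReadBuilder/PredicateBuilder APIs; the field lookup and literal values are purely illustrative, and this is not code from the PR.

```java
// Illustrative sketch only: pass a predicate on _ROW_ID and let Paimon derive row ranges.
// Assumes _ROW_ID is exposed in the row type used for filtering; verify before relying on it.
import java.util.Arrays;

import org.apache.paimon.predicate.Predicate;
import org.apache.paimon.predicate.PredicateBuilder;
import org.apache.paimon.table.Table;
import org.apache.paimon.table.source.ReadBuilder;
import org.apache.paimon.table.source.TableScan;

public class RowIdScanSketch {

    public static TableScan.Plan planByRowIds(Table table) {
        PredicateBuilder builder = new PredicateBuilder(table.rowType());

        // Look up the _ROW_ID field; this assumes the field is present in the row type.
        int rowIdIndex = table.rowType().getFieldNames().indexOf("_ROW_ID");

        // Equivalent of `_ROW_ID IN (1, 2)`.
        Predicate rowIdIn = builder.in(rowIdIndex, Arrays.<Object>asList(1L, 2L));

        // No explicit row ranges here: the idea is that ReadBuilderImpl derives them from the filter.
        ReadBuilder readBuilder = table.newReadBuilder().withFilter(rowIdIn);
        return readBuilder.newScan().plan();
    }
}
```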
Force-pushed 7268230 to 976a4df
@Zouxxyy I've modified my code based on your suggestions. PTAL if you have some time, thanks!
update format code
Force-pushed 976a4df to 4b34e36
JingsongLi left a comment:
+1
* upstream/master: (51 commits)
  [test] Fix unstable test: handle MiniCluster shutdown gracefully in collect method (apache#6913)
  [python] fix ray dataset not lazy loading issue when parallelism = 1 (apache#6916)
  [core] Refactor ExternalPathProviders abstraction
  [spark] fix Merge Into unstable tests (apache#6912)
  [core] Enable Entropy Inject for data file path to prevent being throttled by object storage (apache#6832)
  [iceberg] support millisecond timestamps in iceberg compatibility mode (apache#6352)
  [spark] Handle NPE for pushdown aggregate when a datasplit has a null max/min value (apache#6611)
  [test] Fix unstable case testLimitPushDown
  [core] Refactor row id pushdown to DataEvolutionFileStoreScan
  [spark] paimon-spark supports row id push down (apache#6697)
  [spark] Support compact_database procedure (apache#6328) (apache#6910)
  [lucene] Fix row count in IndexManifestEntry
  [test] Remove unstable test: AppendTableITCase.testFlinkMemoryPool
  [core] Refactor Global index writer and reader for Btree
  [core] Minor refactor to magic number into footer
  [core] Support btree global index in paimon-common (apache#6869)
  [spark] Optimize compact for data-evolution table, commit multiple times to avoid out of memory (apache#6907)
  [rest] Add fromSnapshot to rollback (apache#6905)
  [test] Fix unstable RowTrackingTestBase test
  [core] Simplify FileStoreCommitImpl to extract some classes (apache#6904)
  ...
Purpose
This PR adds row id push down support for Spark, following #6483
Linked issue: None
Tests
org.apache.paimon.spark.sql.RowIdPushDownTestBase
API and Format
Documentation