[SPARK-48030][SQL] SPJ: cache rowOrdering and structType for InternalRowComparableWrapper #46265
Before applying changes in this PR, the benchmark code (in this PR) took: After this PR:
@sunchao, @szehon-ho and @yabola, would you mind taking a look at this and helping to review it?
1024 should be sufficient, it could be a SQL configuration though.
Makes sense to me to have a config, though not familiar with spark preference for these things.
After a quick look, this looks good to me. In addition, is there a better expireAfterAccess configuration for NonFateSharingCache?
Hmm, maybe. However, the memory usage of the cache should be relatively low. Let's wait for other people's opinions.
@advancedxy It appears that this PR can enhance performance when there are a large number of partitions. Could you please share the test results from a real DatasourceV2 table, such as Iceberg?
I cannot share the exact numbers. However, I have described the estimated number of partitions (~N00_000, where N <= 2) and the time to plan in the JIRA. After this patch, the planning time should drop to seconds (from tens of minutes). Hope that helps.
szehon-ho
left a comment
Sounds like a good find and promising results on real table.
sunchao
left a comment
LGTM. It would be nice to have something to configure this but I don't think it is super important. I feel the default value should be more than enough for most use cases? Similarly for expireAfterAccess.
Thanks! Merged to master.
Thanks for reviewing this.
### What changes were proposed in this pull request?

This PR aims to regenerate benchmark results (except `ExternalAppendOnlyUnsafeRowArrayBenchmark`) as a preparation for Apache Spark 4.0.0-preview2.

- During the testing, it's observed that `ExternalAppendOnlyUnsafeRowArrayBenchmark` hangs in both CI and local environment. SPARK-49228 is filed for its investigation.
- In addition, `Storage Partition Join`-related benchmark are generated for the following commits.
  - #46265
  - #47426

### Why are the changes needed?

To check the performance regression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is generated by
- https://github.com/dongjoon-hyun/spark/actions/runs/10364365815 (Java 17)
- https://github.com/dongjoon-hyun/spark/actions/runs/10364368441 (Java 21)

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47743 from dongjoon-hyun/SPARK-49224.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…RowComparableWrapper

### What changes were proposed in this pull request?

Cache rowOrdering and structType for InternalRowComparableWrapper

### Why are the changes needed?

For performance improvement

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Added a new benchmark to verify the performance improvement

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes apache#46265 from advancedxy/SPARK-48030.

Authored-by: Xianjin <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
What changes were proposed in this pull request?
Cache rowOrdering and structType for InternalRowComparableWrapper
Why are the changes needed?
For performance improvement
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Added a new benchmark to verify the performance improvement
Was this patch authored or co-authored using generative AI tooling?
NO
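For readers unfamiliar with the idea, here is a minimal sketch of the caching pattern the description refers to. It is not the code from this PR and does not use Spark's classes. `OrderingCacheSketch`, `RowWrapper`, `orderingFor`, and the use of type-name strings as the cache key are all hypothetical names chosen for illustration. It assumes that the expensive step is building a per-schema comparator and that many wrappers share the same schema. The bound of 1024 entries mirrors the size mentioned in the review. An access-ordered `LinkedHashMap` stands in for `NonFateSharingCache`, and time-based expiry (`expireAfterAccess`) is not modelled.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not Spark's implementation. */
final class OrderingCacheSketch {
    // Upper bound on cached orderings (1024 is the size discussed in the review).
    static final int MAX_ENTRIES = 1024;

    // Counts how many times an ordering was actually constructed.
    static int buildCount = 0;

    // Access-ordered LinkedHashMap used as a small LRU cache keyed by the schema.
    private static final Map<List<String>, Comparator<Object[]>> CACHE =
        new LinkedHashMap<List<String>, Comparator<Object[]>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(
                    Map.Entry<List<String>, Comparator<Object[]>> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    // Returns the cached ordering for a schema, building it only on a cache miss.
    static synchronized Comparator<Object[]> orderingFor(List<String> dataTypes) {
        return CACHE.computeIfAbsent(dataTypes, OrderingCacheSketch::buildOrdering);
    }

    // Stand-in for the costly step: a field-by-field comparator over row values.
    @SuppressWarnings("unchecked")
    private static Comparator<Object[]> buildOrdering(List<String> dataTypes) {
        buildCount++;
        final int width = dataTypes.size();
        return (a, b) -> {
            for (int i = 0; i < width; i++) {
                int c = ((Comparable<Object>) a[i]).compareTo(b[i]);
                if (c != 0) return c;
            }
            return 0;
        };
    }

    /** Wrapper that makes a row usable as a hash-map key or set element. */
    static final class RowWrapper {
        final Object[] row;
        final Comparator<Object[]> ordering;

        RowWrapper(Object[] row, List<String> dataTypes) {
            this.row = row;
            // Looked up from the shared cache instead of rebuilt per wrapper.
            this.ordering = orderingFor(dataTypes);
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof RowWrapper
                && ordering.compare(row, ((RowWrapper) other).row) == 0;
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(row);
        }
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("int", "string");
        RowWrapper[] wrappers = new RowWrapper[100_000];
        for (int i = 0; i < wrappers.length; i++) {
            wrappers[i] = new RowWrapper(new Object[] {i % 10, "p"}, schema);
        }
        System.out.println("wrappers created: " + wrappers.length);
        System.out.println("orderings built:  " + buildCount);
    }
}
```

With a cache like this, the number of orderings built grows with the number of distinct schemas rather than with the number of wrapped rows. That matches the scenario discussed in the conversation, where planning involves a very large number of partition values.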