Skip to content

Conversation

@advancedxy
Copy link
Contributor

@advancedxy advancedxy commented Apr 28, 2024

What changes were proposed in this pull request?

Cache rowOrdering and structType for InternalRowComparableWrapper

Why are the changes needed?

For performance improvement

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Added a new benchmark to verify the performance improvement

Was this patch authored or co-authored using generative AI tooling?

NO

@github-actions github-actions bot added the SQL label Apr 28, 2024
@advancedxy
Copy link
Contributor Author

Before applying changes in this PR, the benchmark code(in this PR) took:

[info] OpenJDK 64-Bit Server VM 21.0.2 on Mac OS X 14.4.1
[info] Apple M1
[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                             10785          11369         825          0.0       53925.3       1.0X
[info] mergePartitions                                   21073          21447         528          0.0      105364.5       0.5X

After this PR:

[info] OpenJDK 64-Bit Server VM 21.0.2 on Mac OS X 14.4.1
[info] Apple M1
[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                                85            103          25          2.4         425.0       1.0X
[info] mergePartitions                                     163            167           4          1.2         816.3       0.5X

@advancedxy
Copy link
Contributor Author

@sunchao, @szehon-ho and @yabola would you mind to take a look at this and help review this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1024 should be sufficient, it could be a SQL configuration though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me to have a config, though not familiar with spark preference for these things.

Copy link
Contributor

@yabola yabola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a quick look, this looks good to me. In addition, is there a better expireAfterAccess configuration for NonFateSharingCache?

@advancedxy
Copy link
Contributor Author

In addition, is there a better expireAfterAccess configuration for NonFateSharingCache?

hmmm, maybe. However, the memory usage of cache should be relatively low. Let's wait for other people's opinions

@yabola
Copy link
Contributor

yabola commented Apr 29, 2024

@advancedxy It appears that this PR can enhance performance when there are a large number of partitions. Could you please share the test results from a real DatasourceV2 table, such as Iceberg?

@advancedxy
Copy link
Contributor Author

It appears that this PR can enhance performance when there are a large number of partitions. Could you please share the test results from a real DatasourceV2 table, such as Iceberg?

I cannot share the exact numbers. However, I have described the estimated number of partitions(~N00_000, where N <=2)and time to plan in the jira. After this patch, the planning time should be dropped to seconds(from tens of minutes). Hope that helps.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good find and promising results on real table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me to have a config, though not familiar with spark preference for these things.

@sunchao sunchao changed the title [SPARK-48030][SQL]SPJ: cache rowOrdering and structType for InternalRowComparableWrapper [SPARK-48030][SQL] SPJ: cache rowOrdering and structType for InternalRowComparableWrapper Apr 30, 2024
Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It would be nice to have something to configure this but I don't think it is super important. I feel the default value should be more than enough for most use cases? Similarly for expireAfterAccess.

@sunchao sunchao closed this in e6217c1 Apr 30, 2024
@sunchao
Copy link
Member

sunchao commented Apr 30, 2024

Thanks! merged to master.

@advancedxy advancedxy deleted the SPARK-48030 branch April 30, 2024 06:28
@advancedxy
Copy link
Contributor Author

LGTM. It would be nice to have something to configure this but I don't think it is super important. I feel the default value should be more than enough for most use cases? Similarly for expireAfterAccess.

Thanks for reviewing this.

dongjoon-hyun added a commit that referenced this pull request Aug 13, 2024
### What changes were proposed in this pull request?

This PR aims to regenerate benchmark results (except `ExternalAppendOnlyUnsafeRowArrayBenchmark`) as a preparation for Apache Spark 4.0.0-preview2.

- During the testing, it's observed that `ExternalAppendOnlyUnsafeRowArrayBenchmark` hangs in both CI and local environment. SPARK-49228 is filed for its investigation.

- In addition, `Storage Partition Join`-related benchmark are generated for the following commits.
  - #46265
  - #47426

### Why are the changes needed?

To check the performance regression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is generated by
- https://github.com/dongjoon-hyun/spark/actions/runs/10364365815 (Java 17)
- https://github.com/dongjoon-hyun/spark/actions/runs/10364368441 (Java 21)

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47743 from dongjoon-hyun/SPARK-49224.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…RowComparableWrapper

### What changes were proposed in this pull request?
Cache rowOrdering and structType for InternalRowComparableWrapper

### Why are the changes needed?
For performance improvement

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added a new benchmark to verify the performance improvement

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#46265 from advancedxy/SPARK-48030.

Authored-by: Xianjin <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants