
Conversation

Madhukar525722 commented Sep 29, 2024

…all partitions from metastore

What changes were proposed in this pull request?

When getPartitionsByFilter is called without a usable predicate and falls back to fetching all partitions, the request is broken into smaller chunks (see the sketch after this list):

  1. Retrieve the names of all partitions using getPartitionNames.
  2. Divide the partition-name list into smaller batches.
  3. Fetch the partitions by name using getPartitionsByNames.
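As a rough illustration of the three steps above, here is a minimal, hypothetical sketch (not the PR's actual implementation) of batched partition fetching with per-batch retries. It assumes a Hive IMetaStoreClient handle plus batchSize and maxRetries values coming from the configurations described further down; the helper name fetchAllPartitionsBatched is made up.

```scala
// Hypothetical sketch of the batching + retry idea; not the code in this PR.
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.IMetaStoreClient
import org.apache.hadoop.hive.metastore.api.Partition

def fetchAllPartitionsBatched(
    client: IMetaStoreClient,
    db: String,
    table: String,
    batchSize: Int,
    maxRetries: Int): Seq[Partition] = {
  // Step 1: fetch only the partition names (a tiny payload compared to full metadata).
  val names = client.listPartitionNames(db, table, (-1).toShort).asScala.toSeq
  // Steps 2 and 3: split the names into batches and fetch each batch by name,
  // retrying a failed batch up to maxRetries times before giving up.
  names.grouped(batchSize).toSeq.flatMap { batch =>
    var attempt = 0
    var partitions: Seq[Partition] = null
    while (partitions == null) {
      try {
        partitions = client.getPartitionsByNames(db, table, batch.asJava).asScala.toSeq
      } catch {
        case _: Exception if attempt < maxRetries =>
          attempt += 1 // transient failure: retry this batch
      }
    }
    partitions
  }
}
```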

Why are the changes needed?

The change addresses heavy load on the HMS: when a table has a huge number of partitions (~600,000), the metadata size exceeds the 2 GB limit on the Thrift server buffer, so the client hits a socket timeout and the HMS can also crash with an OOM. This replicates the same behaviour as HIVE-27505.

Does this PR introduce any user-facing change?

Yes. To enable batching, users should set the following parameters (batching is disabled by default):

spark.sql.hive.metastore.batchSize = 1000
spark.sql.metastore.partition.batch.retry.count = 3
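A hedged usage sketch: these settings could be supplied when the SparkSession is built, since spark.sql.hive.metastore.* options generally need to be in place before the Hive client is created. The config names are taken from this description; the values and the app name are only illustrative.

```scala
// Illustrative only: enable the proposed batching via session configs.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-batch-fetch-demo")                           // placeholder name
  .config("spark.sql.hive.metastore.batchSize", "1000")            // proposed: enable batching
  .config("spark.sql.metastore.partition.batch.retry.count", "3")  // proposed: retries per batch
  .enableHiveSupport()
  .getOrCreate()
```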

How was this patch tested?

Tested in a local environment with the following performance:
With batch size = 1
24/09/28 18:11:21 INFO Shim_v2_3: Fetching all partitions completed in 718 ms

With batch size = -1
24/09/28 18:14:16 INFO Shim_v2_3: Fetching all partitions completed in 51 ms.

With batch size = 10
24/09/28 18:16:20 INFO Shim_v2_3: Fetching all partitions completed in 127 ms.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Sep 29, 2024
Madhukar525722 changed the title [SPARK-49827][CORE] Adding batches with retry mechanism for fetching … [SPARK-49827][SQL] Adding batches with retry mechanism for fetching … Sep 30, 2024
Madhukar525722 (Author) commented Sep 30, 2024

Please review @pan3793 @cloud-fan @HyukjinKwon

pan3793 (Member) commented Oct 1, 2024

The idea makes sense to me; we also have cases of accessing tables with millions of partitions, which puts high pressure on HMS.

Given this is a new feature, please open the PR targeting the master branch, and the new configurations' version should be 4.0.0.

HyukjinKwon (Member) commented
Yeah, let's target the master branch.

Madhukar525722 (Author) commented Oct 4, 2024

Hi @pan3793 @HyukjinKwon, I have raised the request for master in #48337. Please review.
In master, the get-all-partitions request has been migrated to getAllPartitionsOf, which already includes the implementation from HIVE-27505. Once Hive is upgraded from 2.x to 3, this change will no longer be required. Therefore, I believe this fix is more relevant to the lower versions of Spark as well.
Thank you

github-actions bot commented

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Jan 27, 2025
github-actions bot closed this Jan 28, 2025