Add support for hive partition style reads and writes #76802
Conversation
I'll add tests soon |
This is good, and it could replace a problematic feature: #23051. Let's make sure names and values are URL-encoded. One potential problem is memory usage when writing to many partitions at the same time. Let's define a limit, so we create only up to that number of buffers for writing to S3 at the same time. Do we control where exactly the path fragment with the partition goes in the URL? |
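A minimal sketch of the layout this comment is asking about, assuming the s3 function ends up accepting the same partition_strategy argument that appears later in this thread for AzureBlobStorage, and that partition names and values are percent-encoded (bucket and values are illustrative):

INSERT INTO FUNCTION s3('https://bucket.s3.amazonaws.com/tbl', format = 'Parquet', partition_strategy = 'hive')
PARTITION BY country
SELECT 1 AS id, 'São Paulo/BR' AS country;

-- Expected object key, with the value percent-encoded:
-- tbl/country=S%C3%A3o%20Paulo%2FBR/<generated-file>.parquet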
Done
Ack |
As of now, it is up to the user to choose where in the path the partition goes by using the {_partition_id} placeholder. Maybe we can make it simpler: the user is responsible only for defining the table root. The rest (partition key location, filename, and file extension) ClickHouse will generate.
If the user specifies the {_partition_id} placeholder together with use_hive=1, we throw an exception. What do you think? @alexey-milovidov |
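A hedged sketch of the two styles under discussion (paths are illustrative; partition_strategy = 'hive' is the spelling the engine arguments use later in this thread, where the comment above writes use_hive=1):

-- Current style: the user places the partition fragment via the placeholder.
INSERT INTO FUNCTION s3('https://bucket.s3.amazonaws.com/tbl/{_partition_id}/data.parquet', format = 'Parquet')
PARTITION BY year
SELECT * FROM events;

-- Proposed style: the user gives only the table root; ClickHouse generates
-- the key=value fragment, the filename, and the file extension.
INSERT INTO FUNCTION s3('https://bucket.s3.amazonaws.com/tbl', format = 'Parquet', partition_strategy = 'hive')
PARTITION BY year
SELECT * FROM events;

-- Mixing {_partition_id} in the path with the hive strategy would throw.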
I suppose that could be implemented, but perhaps we should leave it for a follow-up PR? |
Yes, this is a great idea! This PR is good, but what I'd like to see in addition, before merging it, is fixing the memory consumption problem with PARTITION BY. It's an old flaw of the current mechanism. Having this new feature will make it more frequently used, and users will bump into this problem more often. |
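To make the flaw concrete, a sketch under the current mechanism (table and column names are illustrative): every distinct partition value reached by a single INSERT keeps its own open write buffer, so a high-cardinality partition key multiplies memory usage.

-- One in-flight S3 write buffer per distinct user_id value: an INSERT that
-- covers a million users tries to hold a million buffers at once.
INSERT INTO FUNCTION s3('https://bucket.s3.amazonaws.com/tbl/{_partition_id}.parquet', format = 'Parquet')
PARTITION BY user_id
SELECT user_id, payload FROM events;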
Is there an issue that describes this problem in depth? I could look into it. |
src/Storages/PartitionedSink.cpp (outdated)

return exp_analyzer->getRequiredColumns();
}

static std::string formatToFileExtension(const std::string & format)
needs to be completed
Messed up; I need to re-think a couple of things. |
The shitty part about this is keeping backwards compatibility with |
@alexey-milovidov, a couple of questions:
|
Once I am done with this PR, I'll look into the max threads/streams thing |
Force-pushed from 4578805 to f6a46a8 (compare)
@arthurpassos from now on, please avoid force pushes in this PR: after I fixed conflicts in the private synchronization, each subsequent force push requires me to fix them again. |
Ouch, I did not know that. I am sorry. Do you want me to revert it (if possible) and redo it without force pushes? |
I understand, no problem.
I am not sure it would work; I will just fix it again then. |
@kssenii btw, once CI/CD finishes, I would appreciate it if we could merge this. The object storage code is a hot spot in ClickHouse nowadays, and I am having to fix merge conflicts almost daily. |
Integration test_dictionaries_all_layouts_separate_sources - #81246
Stateless tests (amd_binary, ParallelReplicas, s3 storage, sequential) - 00002_log_and_exception_messages_formatting - failed in one of your PRs as well: #84463
Stateless tests (amd_asan, distributed plan, sequential) - 00002_log_and_exception_messages_formatting - failed in one of your PRs as well: #84463
@kssenii can we merge it? |
I guess the status just did not auto-update for some reason; the sync is fixed, all commits are present, and everything is fully green. |
7af2979
* [GLUTEN-1632][CH] Daily Update ClickHouse Version (20250729)
* Fix build due to ClickHouse/ClickHouse#76802
* Fix build due to ClickHouse/ClickHouse#81837
* Fix build due to ClickHouse/ClickHouse#84011
* Fix gtest due to ClickHouse/ClickHouse#83599
Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
Dumb question. I don't understand this error: CREATE TABLE t0 (c0 Int) ENGINE = AzureBlobStorage(azure, blob_path = 'f0', format = 'CSV', compression = 'none', partition_strategy = 'hive', partition_columns_in_data_file = 1);
/*
DB::Exception: Received from localhost:9000. DB::Exception: Unexpected key `partition_columns_in_data_file` in
named collection. Required keys: container, blob_path, optional keys: storage_account_url, compression_method,
account_name, connection_string, structure, partition_strategy, account_key, compression, format. (BAD_ARGUMENTS)
*/ |
Not dumb, thanks for finding this. I forgot to specify it here and test it as well: https://github.com/ClickHouse/ClickHouse/pull/76802/files#diff-62aeafbb0f3222329276b3f14e00adcba1d258d3f44a71b9bf93e79091c0f06aR59 Fix: #85373 |
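For reference, a sketch of the statement once the key is accepted (my reading of the flag, to be confirmed against the fix: partition_columns_in_data_file = 1 keeps the partition columns inside the written files instead of only in the key=value path; the added year column and PARTITION BY clause are illustrative):

CREATE TABLE t0 (c0 Int, year Int)
ENGINE = AzureBlobStorage(azure, blob_path = 'f0', format = 'CSV',
    compression = 'none', partition_strategy = 'hive',
    partition_columns_in_data_file = 1)
PARTITION BY year;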
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add support for hive partition style writes and refactor reads implementation (hive partition columns are no longer virtual).
Documentation entry for user-facing changes
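A short sketch of what the read-side change means in practice (path and schema are illustrative, assuming the table function accepts the table root plus partition_strategy for reads as well): partition columns parsed from the key=value path fragments are now ordinary schema columns rather than virtuals.

-- year is read from .../year=2024/... and selected like a regular column.
SELECT year, count()
FROM s3('https://bucket.s3.amazonaws.com/tbl', format = 'Parquet', partition_strategy = 'hive')
GROUP BY year;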