
Conversation

@arthurpassos (Contributor) commented Feb 26, 2025

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add support for hive partition style writes and refactor the reads implementation (hive partition columns are no longer virtual).

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos (Contributor Author)

I'll add tests soon

@alexey-milovidov (Member) commented Feb 26, 2025

This is good, and it could replace a problematic feature: #23051
Change to "New Feature".

Let's make sure names and values are URL-encoded.
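For illustration, a hive-style partition path with an encoded value might look roughly like this (the table and values below are made-up examples, not output from this PR):

```sql
-- Hypothetical example: a partition value with a space and a non-ASCII character
-- should end up percent-encoded in the object key.
INSERT INTO hive_partitioned_table (year, city, amount) VALUES (2025, 'São Paulo', 10);
-- expected object key, roughly:
--   table_root/year=2025/city=S%C3%A3o%20Paulo/<generated file name>
```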

One potential problem is memory usage when writing to many partitions at the same time. Let's define some limit, so we will create only up to this limit number of buffers for writing to S3 at the same time.
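As a rough conceptual sketch of such a limit (standalone C++, not ClickHouse's actual classes or settings): keep at most N per-partition buffers open and flush the oldest one before opening a new one.

```cpp
// Conceptual illustration only: cap the number of concurrently open
// per-partition write buffers; flush the oldest when the cap is reached.
#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <vector>

class PartitionedWriter
{
public:
    explicit PartitionedWriter(size_t max_open_) : max_open(max_open_) {}

    void write(const std::string & partition, const std::string & row)
    {
        if (!buffers.count(partition))
        {
            if (buffers.size() >= max_open)
                flushOldest();
            open_order.push_back(partition);
        }
        buffers[partition].push_back(row);
    }

    void finish()
    {
        while (!buffers.empty())
            flushOldest();
    }

private:
    void flushOldest()
    {
        std::string victim = open_order.front();
        open_order.pop_front();
        /// In a real implementation this would finalize the upload to object storage.
        std::cout << "flushing " << buffers[victim].size() << " rows for partition " << victim << "\n";
        buffers.erase(victim);
    }

    size_t max_open;
    std::deque<std::string> open_order;
    std::map<std::string, std::vector<std::string>> buffers;
};

int main()
{
    PartitionedWriter writer(/*max_open=*/ 2);
    writer.write("year=2024", "row A");
    writer.write("year=2025", "row B");
    writer.write("year=2026", "row C"); /// exceeds the cap, so the oldest buffer is flushed first
    writer.finish();
}
```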

Do we control where exactly the path fragment with partition goes in the URL?
Do we control if the partition columns will be written or omitted from the files?

@alexey-milovidov added the "can be tested" (Allows running workflows for external contributors) label on Feb 26, 2025
clickhouse-gh bot commented Feb 26, 2025

Workflow [PR], commit [27e26d9]

clickhouse-gh bot added the "pr-improvement" (Pull request with some product improvements) label on Feb 26, 2025
@bharatnc changed the title from "Add suport for hive partition style writes" to "Add support for hive partition style writes" on Feb 26, 2025
@arthurpassos (Contributor Author) commented Feb 27, 2025

Change to "New Feature".

Done

Let's make sure names and values are URL-encoded.

Ack

@arthurpassos (Contributor Author) commented Feb 27, 2025

Do we control where exactly the path fragment with partition goes in the URL?

As of now, it is up to the user to choose where in the path the partition goes, by using the {_partition_id} macro/placeholder when creating the table.

Maybe we can make it simpler: the user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) ClickHouse will generate.

engine = s3('bucket/table_root') partition by (year, country) -> 'bucket/table_root/year=2025/country=spain/<generated_uuid>.<format from table>'.

If the user specifies the partition_id placeholder and use_hive=1, we throw an exception.
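A rough sketch of how that could look (illustrative only; credentials are omitted and the exact argument spelling is assumed, this is not the merged syntax):

```sql
-- Hypothetical sketch: the user supplies only the table root; ClickHouse generates
-- the hive-style partition path, the file name and the extension.
CREATE TABLE sales (year UInt16, country String, amount UInt32)
ENGINE = S3('https://bucket.s3.amazonaws.com/table_root/', 'Parquet')
PARTITION BY (year, country);

INSERT INTO sales VALUES (2025, 'spain', 42);
-- expected object key, roughly:
--   table_root/year=2025/country=spain/<generated_uuid>.parquet
```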

What do you think? @alexey-milovidov

@arthurpassos (Contributor Author)

Do we control if the partition columns will be written or omitted from the files?

I suppose that could be implemented, but perhaps we should leave it for a follow-up PR?

@alexey-milovidov (Member)

Maybe we can make it simpler: the user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) ClickHouse will generate.

Yes, this is a great idea!

This PR is good, but what I'd like to see in addition, before merging it, is fixing the memory consumption problem with PARTITION BY. It's an old flaw of the current mechanism. Having this new feature will make it more frequently used, and users will run into this problem more often.

@arthurpassos (Contributor Author) commented Feb 28, 2025

Maybe we can make it simpler: the user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) ClickHouse will generate.

Yes, this is a great idea!

This PR is good, but what I'd like to see in addition, before merging it, is fixing the memory consumption problem with PARTITION BY. It's an old flaw of the current mechanism. Having this new feature will make it more frequently used, and users will run into this problem more often.

Is there an issue that describes this problem in depth? I could look into that.

clickhouse-gh bot added the "pr-feature" (Pull request with new product feature) label and removed the "pr-improvement" (Pull request with some product improvements) label on Mar 3, 2025
```cpp
    return exp_analyzer->getRequiredColumns();
}

static std::string formatToFileExtension(const std::string & format)
```
@arthurpassos (Contributor Author)

needs to be completed
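For context, a minimal sketch of what such a helper might do (assumed mapping for illustration, not the implementation from this PR):

```cpp
// Hypothetical sketch: map a format name to a file extension,
// falling back to the lowercased format name for unknown formats.
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

static std::string formatToFileExtension(const std::string & format)
{
    static const std::map<std::string, std::string> known
    {
        {"parquet", "parquet"},
        {"csv", "csv"},
        {"jsoneachrow", "jsonl"},
        {"native", "native"},
    };

    std::string lowered = format;
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    auto it = known.find(lowered);
    return it != known.end() ? it->second : lowered;
}
```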

@arthurpassos (Contributor Author)

Messed up, need to re-think a couple of things

@arthurpassos (Contributor Author)

The shitty part about this is keeping backwards compatibility with {_partition_id}

@arthurpassos (Contributor Author)

@alexey-milovidov, a couple of questions:

  1. The existing use_hive_partitioning setting is used for something else, and it can be tweaked in between table creation and data insertion. We need a new variable to control the partitioning style at the table level. Should it be a new setting or a new argument to table engines, e.g. partitioning_strategy=['hive' | 'simple', others in the future] (see the sketch after this list)? If we vote for an argument, it would need to be implemented for the S3, File and URL table engines.
  2. We have settled on asking the user to specify only the table root, and we generate the rest (partition-style path, filename and file extension). That being said, should we forbid creating a table with the {_partition_id} macro when the partition strategy is hive style?
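For question 1, the engine-argument option could look roughly like this (sketch only: s3_conn is a hypothetical named collection and the key names are illustrative, mirroring the AzureBlobStorage example that appears later in this thread):

```sql
-- Hypothetical sketch: partitioning style chosen per table via an engine argument.
CREATE TABLE hive_writes (year UInt16, country String, amount UInt32)
ENGINE = S3(s3_conn, filename = 'table_root', format = 'Parquet', partition_strategy = 'hive')
PARTITION BY (year, country);
```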

@arthurpassos (Contributor Author)

Once I am done with this PR, I'll look into the max threads/streams thing

@arthurpassos force-pushed the s3_hive_style_partitioned_writes branch from 4578805 to f6a46a8 on July 28, 2025 11:26
@kssenii (Member) commented Jul 28, 2025

@arthurpassos from now on, please avoid force pushes in this PR: after I fixed conflicts in the private synchronization, each later force push requires me to fix them again.

@arthurpassos (Contributor Author) commented Jul 28, 2025

@arthurpassos from now on, please avoid force pushes in this PR: after I fixed conflicts in the private synchronization, each later force push requires me to fix them again.

Ouch, I did not know that. I am sorry. Do you want me to revert it (if possible) and do it without force pushes?

@kssenii (Member) commented Jul 28, 2025

Ouch, I did not know that. I am sorry.

I understand, no problem.

Do you want me to revert it (if possible)

I am not sure it will work, I will just fix it again then.

@arthurpassos (Contributor Author)

@kssenii btw, once CI/CD finishes, I would appreciate it if we could merge this. The object storage code is one of the most frequently modified parts of ClickHouse nowadays, and I am having to fix merge conflicts almost daily.

@arthurpassos (Contributor Author)

Integration test_dictionaries_all_layouts_separate_sources - #81246

Stateless tests (amd_binary, ParallelReplicas, s3 storage, sequential) - 00002_log_and_exception_messages_formatting - Failed in one of your PRs as well: #84463

Stateless tests (amd_asan, distributed plan, sequential) - 00002_log_and_exception_messages_formatting - Failed in one of your PRs as well: #84463

@kssenii can we merge it?

@kssenii (Member) commented Jul 28, 2025

CH Inc sync — Failed. Needs manual intervention. See job ID for details: 46853436543

I guess the status just did not auto-update for some reason, the sync is fixed, all commits are present and fully green.

@kssenii enabled auto-merge July 28, 2025 16:34
@kssenii added this pull request to the merge queue Jul 28, 2025
Merged via the queue into ClickHouse:master with commit 7af2979 Jul 28, 2025
122 of 125 checks passed
robot-ch-test-poll4 added the "pr-synced-to-cloud" (The PR is synced to the cloud repo) label on Jul 28, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Jul 29, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Jul 29, 2025
baibaichen pushed a commit to apache/incubator-gluten that referenced this pull request Jul 29, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250729)

* Fix build due to ClickHouse/ClickHouse#76802

* Fix build due to ClickHouse/ClickHouse#81837

* Fix build due to ClickHouse/ClickHouse#84011

* Fix gtest due to ClickHouse/ClickHouse#83599

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
@PedroTadim (Member)

Dumb question. I don't understand this error:

```sql
CREATE TABLE t0 (c0 Int) ENGINE = AzureBlobStorage(azure, blob_path = 'f0', format = 'CSV', compression = 'none', partition_strategy = 'hive', partition_columns_in_data_file = 1);
/*
DB::Exception: Received from localhost:9000. DB::Exception: Unexpected key `partition_columns_in_data_file` in
named collection. Required keys: container, blob_path, optional keys: storage_account_url, compression_method,
account_name, connection_string, structure, partition_strategy, account_key, compression, format. (BAD_ARGUMENTS)
*/
```

@arthurpassos (Contributor Author)

Dumb question. I don't understand this error:

```sql
CREATE TABLE t0 (c0 Int) ENGINE = AzureBlobStorage(azure, blob_path = 'f0', format = 'CSV', compression = 'none', partition_strategy = 'hive', partition_columns_in_data_file = 1);
/*
DB::Exception: Received from localhost:9000. DB::Exception: Unexpected key `partition_columns_in_data_file` in
named collection. Required keys: container, blob_path, optional keys: storage_account_url, compression_method,
account_name, connection_string, structure, partition_strategy, account_key, compression, format. (BAD_ARGUMENTS)
*/
```

Not dumb, thanks for finding this. I forgot to specify it here and test it as well: https://github.com/ClickHouse/ClickHouse/pull/76802/files#diff-62aeafbb0f3222329276b3f14e00adcba1d258d3f44a71b9bf93e79091c0f06aR59

Fix: #85373
