
Conversation

@arthurpassos (Contributor) commented Feb 26, 2025

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add support for hive partition style writes and refactor the reads implementation (hive partition columns are no longer virtual).

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@arthurpassos (Contributor Author)

I'll add tests soon

@alexey-milovidov (Member) commented Feb 26, 2025

This is good, and it could replace a problematic feature: #23051
Change to "New Feature".

Let's make sure names and values are URL-encoded.
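For illustration, a hive-style partition path with an encoded value might look roughly like this (the table and values below are made-up examples, not output from this PR):

```sql
-- Hypothetical example: a partition value with a space and a non-ASCII character
-- should end up percent-encoded in the object key.
INSERT INTO hive_partitioned_table (year, city, amount) VALUES (2025, 'São Paulo', 10);
-- expected object key, roughly:
--   table_root/year=2025/city=S%C3%A3o%20Paulo/<generated file name>
```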

One potential problem is memory usage when writing to many partitions at the same time. Let's define some limit, so we will create only up to this limit number of buffers for writing to S3 at the same time.
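As a rough conceptual sketch of such a limit (standalone C++, not ClickHouse's actual classes or settings): keep at most N per-partition buffers open and flush the oldest one before opening a new one.

```cpp
// Conceptual illustration only: cap the number of concurrently open
// per-partition write buffers; flush the oldest when the cap is reached.
#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <vector>

class PartitionedWriter
{
public:
    explicit PartitionedWriter(size_t max_open_) : max_open(max_open_) {}

    void write(const std::string & partition, const std::string & row)
    {
        if (!buffers.count(partition))
        {
            if (buffers.size() >= max_open)
                flushOldest();
            open_order.push_back(partition);
        }
        buffers[partition].push_back(row);
    }

    void finish()
    {
        while (!buffers.empty())
            flushOldest();
    }

private:
    void flushOldest()
    {
        std::string victim = open_order.front();
        open_order.pop_front();
        /// In a real implementation this would finalize the upload to object storage.
        std::cout << "flushing " << buffers[victim].size() << " rows for partition " << victim << "\n";
        buffers.erase(victim);
    }

    size_t max_open;
    std::deque<std::string> open_order;
    std::map<std::string, std::vector<std::string>> buffers;
};

int main()
{
    PartitionedWriter writer(/*max_open=*/ 2);
    writer.write("year=2024", "row A");
    writer.write("year=2025", "row B");
    writer.write("year=2026", "row C"); /// exceeds the cap, so the oldest buffer is flushed first
    writer.finish();
}
```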

Do we control where exactly the path fragment with partition goes in the URL?
Do we control if the partition columns will be written or omitted from the files?

@alexey-milovidov added the "can be tested" (Allows running workflows for external contributors) label on Feb 26, 2025
clickhouse-gh bot commented Feb 26, 2025

Workflow [PR], commit [27e26d9]

clickhouse-gh bot added the "pr-improvement" (Pull request with some product improvements) label on Feb 26, 2025
@bharatnc changed the title from "Add suport for hive partition style writes" to "Add support for hive partition style writes" on Feb 26, 2025
@arthurpassos (Contributor Author) commented Feb 27, 2025

Change to "New Feature".

Done

Let's make sure names and values are URL-encoded.

Ack

@arthurpassos (Contributor Author) commented Feb 27, 2025

Do we control where exactly the path fragment with partition goes in the URL?

As of now, it is up to the user to choose where in the path the partition goes, by using the {_partition_id} macro/placeholder when creating the table.

Maybe we can make it simpler: the user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) ClickHouse will generate.

engine = s3('bucket/table_root') partition by (year, country) -> 'bucket/table_root/year=2025/country=spain/<generated_uuid>.<format from table>'.

If the user specifies the partition_id placeholder and use_hive=1, we throw an exception.
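A rough sketch of how that could look (illustrative only; credentials are omitted and the exact argument spelling is assumed, this is not the merged syntax):

```sql
-- Hypothetical sketch: the user supplies only the table root; ClickHouse generates
-- the hive-style partition path, the file name and the extension.
CREATE TABLE sales (year UInt16, country String, amount UInt32)
ENGINE = S3('https://bucket.s3.amazonaws.com/table_root/', 'Parquet')
PARTITION BY (year, country);

INSERT INTO sales VALUES (2025, 'spain', 42);
-- expected object key, roughly:
--   table_root/year=2025/country=spain/<generated_uuid>.parquet
```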

What do you think? @alexey-milovidov

@arthurpassos (Contributor Author)

Do we control if the partition columns will be written or omitted from the files?

I suppose that could be implemented, but perhaps we should leave it for a follow-up PR?

@alexey-milovidov (Member)

Maybe we can make it simpler: the user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) ClickHouse will generate.

Yes, this is a great idea!

This PR is good, but what I'd like to see in addition, before merging it, is fixing the memory consumption problem with PARTITION BY. It's an old flaw of the current mechanism. Having this new feature will make it more frequently used, and users will run into this problem more often.

@arthurpassos (Contributor Author) commented Feb 28, 2025

Maybe we can make it simpler: the user is responsible for defining the table root, that's all. The rest (partition key location, filename and file extension) ClickHouse will generate.

Yes, this is a great idea!

This PR is good, but what I'd like to see in addition, before merging it, is fixing the memory consumption problem with PARTITION BY. It's an old flaw of the current mechanism. Having this new feature will make it more frequently used, and users will run into this problem more often.

Is there an issue that describes this problem in depth? I could look into that.

clickhouse-gh bot added the "pr-feature" (Pull request with new product feature) label and removed the "pr-improvement" (Pull request with some product improvements) label on Mar 3, 2025
```cpp
    return exp_analyzer->getRequiredColumns();
}

static std::string formatToFileExtension(const std::string & format)
```
@arthurpassos (Contributor Author)

needs to be completed
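For context, a minimal sketch of what such a helper might do (assumed mapping for illustration, not the implementation from this PR):

```cpp
// Hypothetical sketch: map a format name to a file extension,
// falling back to the lowercased format name for unknown formats.
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

static std::string formatToFileExtension(const std::string & format)
{
    static const std::map<std::string, std::string> known
    {
        {"parquet", "parquet"},
        {"csv", "csv"},
        {"jsoneachrow", "jsonl"},
        {"native", "native"},
    };

    std::string lowered = format;
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    auto it = known.find(lowered);
    return it != known.end() ? it->second : lowered;
}
```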

@arthurpassos (Contributor Author)

Messed up, need to re-think a couple of things

@arthurpassos (Contributor Author)

The shitty part about this is keeping backwards compatibility with {_partition_id}

@arthurpassos (Contributor Author)

@alexey-milovidov, a couple of questions:

  1. The existing use_hive_partitioning setting is used for something else, and it can be tweaked in between table creation and data insertion. We need a new variable to control the partitioning style at the table level. Should it be a new setting or a new argument to table engines, e.g. partitioning_strategy=['hive' | 'simple', others in the future] (see the sketch after this list)? If we vote for an argument, it would need to be implemented for the S3, File and URL table engines.
  2. We have settled on asking the user to specify only the table root, and we generate the rest (partition-style path, filename and file extension). That being said, should we forbid creating a table with the {_partition_id} macro when the partition strategy is hive style?
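For question 1, the engine-argument option could look roughly like this (sketch only: s3_conn is a hypothetical named collection and the key names are illustrative, mirroring the AzureBlobStorage example that appears later in this thread):

```sql
-- Hypothetical sketch: partitioning style chosen per table via an engine argument.
CREATE TABLE hive_writes (year UInt16, country String, amount UInt32)
ENGINE = S3(s3_conn, filename = 'table_root', format = 'Parquet', partition_strategy = 'hive')
PARTITION BY (year, country);
```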

@arthurpassos (Contributor Author)

Once I am done with this PR, I'll look into the max threads/streams thing

@arthurpassos force-pushed the s3_hive_style_partitioned_writes branch from 4578805 to f6a46a8 on July 28, 2025 11:26
@kssenii (Member) commented Jul 28, 2025

@arthurpassos from now on, please avoid force pushes in this PR: after I fixed conflicts in the private synchronization, each later force push requires me to fix them again.

@arthurpassos (Contributor Author) commented Jul 28, 2025

@arthurpassos from now on, please avoid force pushes in this PR: after I fixed conflicts in the private synchronization, each later force push requires me to fix them again.

Ouch, I did not know that. I am sorry. Do you want me to revert it (if possible) and do it without force pushes?

@kssenii (Member) commented Jul 28, 2025

Ouch, I did not know that. I am sorry.

I understand, no problem.

Do you want me to revert it (if possible)

I am not sure it will work, I will just fix it again then.

@arthurpassos (Contributor Author)

@kssenii btw, once CI/CD finishes, I would appreciate it if we could merge this. The object storage code is one of the most frequently modified parts of ClickHouse nowadays, and I am having to fix merge conflicts almost daily.

@arthurpassos (Contributor Author)

Integration test_dictionaries_all_layouts_separate_sources - #81246

Stateless tests (amd_binary, ParallelReplicas, s3 storage, sequential) - 00002_log_and_exception_messages_formatting - Failed in one of your PRs as well: #84463

Stateless tests (amd_asan, distributed plan, sequential) - 00002_log_and_exception_messages_formatting - Failed in one of your PRs as well: #84463

@kssenii can we merge it?

@kssenii (Member) commented Jul 28, 2025

CH Inc sync — Failed. Needs manual intervention. See job ID for details: 46853436543

I guess the status just did not auto-update for some reason, the sync is fixed, all commits are present and fully green.

@kssenii enabled auto-merge July 28, 2025 16:34
@kssenii added this pull request to the merge queue Jul 28, 2025
Merged via the queue into ClickHouse:master with commit 7af2979 Jul 28, 2025
122 of 125 checks passed
robot-ch-test-poll4 added the "pr-synced-to-cloud" (The PR is synced to the cloud repo) label on Jul 28, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Jul 29, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Jul 29, 2025
baibaichen pushed a commit to apache/incubator-gluten that referenced this pull request Jul 29, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250729)

* Fix build due to ClickHouse/ClickHouse#76802

* Fix build due to ClickHouse/ClickHouse#81837

* Fix build due to ClickHouse/ClickHouse#84011

* Fix gtest due to ClickHouse/ClickHouse#83599

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
@PedroTadim (Member)

Dumb question. I don't understand this error:

```sql
CREATE TABLE t0 (c0 Int) ENGINE = AzureBlobStorage(azure, blob_path = 'f0', format = 'CSV', compression = 'none', partition_strategy = 'hive', partition_columns_in_data_file = 1);
/*
DB::Exception: Received from localhost:9000. DB::Exception: Unexpected key `partition_columns_in_data_file` in
named collection. Required keys: container, blob_path, optional keys: storage_account_url, compression_method,
account_name, connection_string, structure, partition_strategy, account_key, compression, format. (BAD_ARGUMENTS)
*/
```

@arthurpassos (Contributor Author)

Dumb question. I don't understand this error:

```sql
CREATE TABLE t0 (c0 Int) ENGINE = AzureBlobStorage(azure, blob_path = 'f0', format = 'CSV', compression = 'none', partition_strategy = 'hive', partition_columns_in_data_file = 1);
/*
DB::Exception: Received from localhost:9000. DB::Exception: Unexpected key `partition_columns_in_data_file` in
named collection. Required keys: container, blob_path, optional keys: storage_account_url, compression_method,
account_name, connection_string, structure, partition_strategy, account_key, compression, format. (BAD_ARGUMENTS)
*/
```

Not dumb, thanks for finding this. I forgot to specify it here and test it as well: https://github.com/ClickHouse/ClickHouse/pull/76802/files#diff-62aeafbb0f3222329276b3f14e00adcba1d258d3f44a71b9bf93e79091c0f06aR59

Fix: #85373
