
Object storage hive partitioned writes with snowflake id #820

@arthurpassos

Description


Is your feature request related to a problem? Please describe.
We need to be able to write multiple files to object storage in an efficient and concurrent way, which requires uniquely naming and organizing them.

Hive-style partitioning + snowflake ID file names were picked.

Describe the solution you'd like

This feature works exactly the way I described a few months ago. There are two new function arguments: partition_strategy='wildcard|hive' and partition_columns_in_data_file=0|1.

The user is responsible for setting only the table root: no wildcard or macro for the partition id, and not even the filename.

ClickHouse should produce filepaths in the following form: <prefix>/<key1=val1/key2=val2...>/<snowflakeid>.<toLower(file_format)>.

The partition_strategy='hive' argument requires both the schema and a PARTITION BY expression to be specified.
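
For contrast, if I read the naming right, partition_strategy='wildcard' covers the existing behavior where the user embeds the {_partition_id} macro in the key themselves. A rough sketch, with placeholder table and collection names:

-- Hypothetical contrast: today's {_partition_id}-based writes, labeled 'wildcard'
CREATE TABLE t_wildcard (year UInt16, country String, counter UInt8)
ENGINE = S3(s3_conn, filename = 't_wildcard/{_partition_id}.parquet', format = Parquet, partition_strategy = 'wildcard')
PARTITION BY year;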

Examples can be found in the tests, but here is a quick recap:

arthur :) CREATE TABLE t_03363_parquet (year UInt16, country String, counter UInt8)
ENGINE = S3(s3_conn, filename = 't_03363_parquet', format = Parquet, partition_strategy='hive')
PARTITION BY (year, country);

INSERT INTO t_03363_parquet VALUES
    (2022, 'USA', 1),
    (2022, 'Canada', 2),
    (2023, 'USA', 3),
    (2023, 'Mexico', 4),
    (2024, 'France', 5),
    (2024, 'Germany', 6),
    (2024, 'Germany', 7),
    (1999, 'Brazil', 8),
    (2100, 'Japan', 9),
    (2024, 'CN', 10),
    (2025, '', 11);

CREATE TABLE t_03363_parquet
(
    `year` UInt16,
    `country` String,
    `counter` UInt8
)
ENGINE = S3(s3_conn, filename = 't_03363_parquet', format = Parquet, partition_strategy = 'hive')
PARTITION BY (year, country)

Query id: 3f5b959d-978c-42bf-9c13-c9cff7590372

Ok.

0 rows in set. Elapsed: 0.028 sec. 


INSERT INTO t_03363_parquet FORMAT Values

Query id: 3ccbf8d1-bb3c-49e3-9b64-813d9b03e64f

Ok.

11 rows in set. Elapsed: 0.204 sec. 

arthur :) select _path, * from t_03363_parquet;

SELECT
    _path,
    *
FROM t_03363_parquet

Query id: c211dca1-2dff-4aae-a5e8-2583d2428c8b

    ┌─_path──────────────────────────────────────────────────────────────────────┬─year─┬─country─┬─counter─┐
 1. │ test/t_03363_parquet/year=2100/country=Japan/7329604473272971264.parquet   │ 2100 │ Japan   │       9 │
 2. │ test/t_03363_parquet/year=2024/country=France/7329604473323302912.parquet  │ 2024 │ France  │       5 │
 3. │ test/t_03363_parquet/year=2022/country=Canada/7329604473314914304.parquet  │ 2022 │ Canada  │       2 │
 4. │ test/t_03363_parquet/year=1999/country=Brazil/7329604473289748480.parquet  │ 1999 │ Brazil  │       8 │
 5. │ test/t_03363_parquet/year=2023/country=Mexico/7329604473293942784.parquet  │ 2023 │ Mexico  │       4 │
 6. │ test/t_03363_parquet/year=2023/country=USA/7329604473319108608.parquet     │ 2023 │ USA     │       3 │
 7. │ test/t_03363_parquet/year=2025/country=/7329604473327497216.parquet        │ 2025 │         │      11 │
 8. │ test/t_03363_parquet/year=2024/country=CN/7329604473310720000.parquet      │ 2024 │ CN      │      10 │
 9. │ test/t_03363_parquet/year=2022/country=USA/7329604473298137088.parquet     │ 2022 │ USA     │       1 │
10. │ test/t_03363_parquet/year=2024/country=Germany/7329604473306525696.parquet │ 2024 │ Germany │       6 │
11. │ test/t_03363_parquet/year=2024/country=Germany/7329604473306525696.parquet │ 2024 │ Germany │       7 │
    └────────────────────────────────────────────────────────────────────────────┴──────┴─────────┴─────────┘

11 rows in set. Elapsed: 0.015 sec. 

The partition_columns_in_data_file argument controls whether ClickHouse writes the partition columns into the data files. It also tells ClickHouse whether to expect those columns in the files upon reading. It can only be used with partition_strategy='hive'.
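
For illustration, a minimal sketch of a table whose data files omit the partition columns (the table name is made up; s3_conn is the named collection from the example above):

CREATE TABLE t_03363_no_partition_cols (year UInt16, country String, counter UInt8)
ENGINE = S3(s3_conn, filename = 't_03363_no_partition_cols', format = Parquet,
            partition_strategy = 'hive', partition_columns_in_data_file = 0)
PARTITION BY (year, country);

-- year and country are encoded only in the path (year=.../country=...),
-- so the Parquet files are expected to contain just counter.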

I used function arguments instead of settings because these are not ephemeral and are tied to the table itself. It makes more sense imho.

Refactoring: Treating hive columns as actual columns instead of virtual

One of the main problems I faced when implementing native reads and writes was that use_hive_partitioning was implemented using virtual columns. This is problematic if we want S3 tables associated with hive-style partitioning.

In the existing use_hive_partitioning implementation, whenever a key-value pair is found in the filepath, that column is removed from the table schema. Now consider a table with three columns (year, country and counter) and a PARTITION BY expression of (year, country).

The following query results in clickhouse-client asking for the table metadata (i.e., StorageMetadataPtr) so that it knows the schema and what to expect from the input.

clickhouse-client -q "insert into s3_hive_table values (2025, 'Japan', 9)"

With the existing use_hive_partitioning implementation, neither the year nor the country column is part of the schema, so clickhouse-client will only see counter. You can see where I am going, right?

Therefore, I refactored the hive partition columns to live in the table metadata instead of being virtual (for now, this is just a PoC that seems to work OK). I think it also makes more sense, as they are actually part of the schema, especially if we have partition_columns_in_data_file=1.
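
To make the difference concrete, here is how I would expect the schema to look in both cases. This is a sketch of the expectation, not actual output:

-- Virtual-column approach: partition keys stripped from the schema
DESCRIBE TABLE s3_hive_table;
-- counter   UInt8

-- Partition columns kept in the table metadata: the full schema is visible,
-- so clickhouse-client knows what to expect for the INSERT above
DESCRIBE TABLE s3_hive_table;
-- year      UInt16
-- country   String
-- counter   UInt8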

BUT if it is OK to have the partition columns exist both as virtual columns and in storage, then I guess this refactoring is not needed?

Backwards compatibility with use_hive_partitioning

This setting is not optimal imo, and the name might be confusing. Still, I believe it has to remain backwards compatible. So, here's how use_hive_partitioning and partition_strategy interact:

partition_strategy='hive' is the dominant control. It is an explicit request for hive style: it allows native reads and writes in hive-style format, and the use_hive_partitioning setting has no effect here.

On the other hand, use_hive_partitioning is more of a hint for reads and only works if the path is globbed. It should keep working just like it does today.
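
For reference, a hedged sketch of the kind of globbed read that use_hive_partitioning targets (the URL is a placeholder):

SELECT _path, year, country, counter
FROM s3('https://<bucket>.s3.amazonaws.com/test/t_03363_parquet/**.parquet', Parquet)
SETTINGS use_hive_partitioning = 1;

-- year and country are extracted from the key=value pairs in the path
-- and exposed to the query, just as the existing implementation does today.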
