[Variant] Harmonize primitive conversions variant unshredding and casting #8521

scovich · 2025-09-30T23:28:51Z

NOTE: Stacked on [Variant] Define and use unshred_variant function #8481, ignore the first four commits

Which issue does this PR close?

Related to [Variant] [Shredding] Support typed_access for Struct #8336

Rationale for this change

See #8481 (comment): cast_to_variant and unshred_variant have significant overlap in the code that handles conversion of primitive types to variant.

What changes are included in this PR?

Define a new ArrowToVariant trait that does what it says, and update both sets of code to use it.
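To make the idea concrete, here is a minimal, self-contained sketch of what a shared conversion trait plus a one-line-per-type macro could look like. It is not the code in this PR: `Variant`, `Column`, `variant_at`, `impl_arrow_to_variant!` and `column_to_variants` are hypothetical stand-ins for the real `parquet_variant::Variant`, Arrow arrays, and the trait and macro this PR defines.

```rust
// Hypothetical stand-in for parquet_variant::Variant (the real enum has many
// more variants and borrows its data).
#[derive(Debug, Clone, PartialEq)]
pub enum Variant {
    Null,
    Int32(i32),
    Int64(i64),
    Double(f64),
    String(String),
}

/// Assumed shape of a shared trait: produce the Variant for one row.
pub trait ArrowToVariant {
    fn variant_at(&self, index: usize) -> Variant;
}

/// Toy nullable column standing in for an Arrow array.
pub struct Column<T>(pub Vec<Option<T>>);

// In the spirit of a "define implementations with optional value
// transformation" macro: one invocation per supported type.
macro_rules! impl_arrow_to_variant {
    ($native:ty, $variant:ident) => {
        impl_arrow_to_variant!($native, $variant, |v| v);
    };
    ($native:ty, $variant:ident, |$v:ident| $transform:expr) => {
        impl ArrowToVariant for Column<$native> {
            fn variant_at(&self, index: usize) -> Variant {
                match self.0[index].clone() {
                    Some($v) => Variant::$variant($transform),
                    None => Variant::Null,
                }
            }
        }
    };
}

impl_arrow_to_variant!(i32, Int32);
impl_arrow_to_variant!(i64, Int64);
impl_arrow_to_variant!(f32, Double, |v| f64::from(v)); // widening transform
impl_arrow_to_variant!(String, String);

/// Any caller (a cast kernel or an unshred kernel) can be generic over the trait.
pub fn column_to_variants<A: ArrowToVariant>(array: &A, len: usize) -> Vec<Variant> {
    (0..len).map(|i| array.variant_at(i)).collect()
}
```

With this shape, both conversion paths would only need to know that a column implements the trait; the per-type logic lives in one place.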

Are these changes tested?

All existing test coverage of the two conversion functions continues to pass.

Are there any user-facing changes?

No.


scovich · 2025-09-30T23:30:51Z

Attn @alamb -- I'm on the fence. This change actually increased code size a bit, but it does eliminate some duplication of logic. Maybe still worthwhile?

scovich · 2025-09-30T23:42:03Z

parquet-variant-compute/src/arrow_to_variant.rs

+}
+
+/// Macro to define ArrowToVariant implementations with optional value transformation
+macro_rules! define_arrow_to_variant {


One thing I don't love is we now have two complex macros that feel kind of similar. This new macro simplified the unshred_variant code but has ~no effect on the cast_to_variant code in this file. In fact, the builder definition didn't even change (L397 on the left side of the diff):

define_row_builder!(
    struct PrimitiveArrowToVariantBuilder<'a, T: ArrowPrimitiveType>
    where T::Native: Into<Variant<'a, 'a>>,
    |array| -> PrimitiveArray<T> { array.as_primitive() }
);

So in retrospect, I'm not sure what duplication this PR actually eliminated?
Seems like it just moved some definitions around.

Any ideas on a better approach?

I agree this PR doesn't seem to have eliminated much yet.

There seem to be two major pieces of functionality:

1. Converting elements of Arrow Arrays to the corresponding Variant elements
2. Converting Variant arrays back to Arrow Arrays

It seems to me like arrow_to_variant.rs and unshred_variant.rs still have nontrivial overlap, as you mention. What I was hoping was that we could somehow use one set of traits / structs for both of those operations, which would mean we would get unshred support "for free" for other types like UnionArrays.

That was the dream, anyway.

After I spent some time reviewing this PR, I think I would like to propose some naming consolidation, as we have two functions that are inverses of each other and are confusingly named:

* `cast_to_variant`, which casts arrow arrays to variant
* `variant_to_arrow`, which casts variants to arrow arrays

I will make a PR to propose cast_arrow_to_variant and cast_variant_to_arrow for the kernel names, and we can keep the modules variant_to_arrow and arrow_to_variant with the actual code.

> After I spent some time reviewing this PR I think I would like to propose some naming consolidation as we have two functions that are inverses of each other that are confusingly named:
>
> * `cast_to_variant` which casts arrow arrays to variant
> * `variant_to_arrow` which casts variants to arrow arrays
>
> I will make a PR to propose cast_arrow_to_variant and cast_variant_to_arrow for the kernel names and we can keep the modules variant_to_arrow and arrow_to_variant with the actual code

cast_to_variant is a function that leverages the arrow_to_variant module.

variant_to_arrow is a module that the variant_get function relies on.

I guess variant_get could be seen as a superset of a hypothetical cast_variant_to_arrow function that is an inverse of cast_[arrow_]to_variant?

That said, I do think it's a good idea to step back and take a hard look at naming conventions as the code matures and the interactions (or lack thereof) become clearer:

* cast_to_variant - converts fully strongly-typed data to binary variant
  * uses the row builders defined in the arrow_to_variant module
* shred_variant - shreds a binary variant input according to the requested shredding schema
  * uses its own row builders defined in the same module
* unshred_variant - converts shredded variant back to binary variant
  * uses its own row builders defined in the same module
* variant_get - can be used to extract fully strongly-typed data from variant (shredded or not)
  * tries to do columnar operations when possible, but falls back to row builders defined in the variant_to_arrow module when necessary

And, for good measure, some of the conversions probably should move to type_conversions module, and I don't know where the ListLikeArray trait should live?

Here is the PR [Variant] Improve documentation and includes for casts #8532

> I guess variant_get could be seen as a superset of a hypothetical cast_variant_to_arrow function that is an inverse of cast_[arrow_]to_variant?

Yes, I think that is a good way to think about it. I don't think we should add cast_variant_to_arrow at this time (I was very confused).

> And, for good measure, some of the conversions probably should move to type_conversions module, and I don't know where the ListLikeArray trait should live?

Or maybe we could consolidate the conversions into arrow_to_variant or variant_to_arrow 🤔

> There seems to be two major pieces of functionality
>
> 1. Converting elements of Arrow Arrays to the corresponding Variant elements
> 2. Converting Variant arrays back to Arrow Arrays

I agree this is the intuitive view, and arrow_to_variant and unshred_variant both fall loosely in category 1.

> It seems to me like arrow_to_variant.rs and unshred_variant.rs still have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would get unshred support "for free" for other types like UnionArrays
>
> That was the dream at anyways.

So this is tricky -- shredded variant typed_value columns must be one of the supported variant shredding types defined by the shredding spec.

Unsigned integer types are not on that list, so we'll never need to unshred them (even tho we can shred them, and can even variant_get them by converting back).

Complex types like Union and Map are also not on that list and so we'll never need to unshred them. But we can still convert them to variant: Whatever union branch is active for each row gets converted to variant, which works fine; maps are trickier -- I think our current code forcibly converts the map key column to string (with a cast 🙀) and then converts the result to variant object.
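As a rough illustration of the map case described above (non-string keys coerced to strings, the row emitted as a variant object), here is a hedged sketch. `Variant` is a toy stand-in and `map_row_to_object` is a hypothetical helper, not a function from the crate; the real code operates on Arrow `MapArray` columns and casts the key column rather than calling `to_string` per row.

```rust
use std::collections::BTreeMap;

// Toy stand-in for the real Variant type.
#[derive(Debug, Clone, PartialEq)]
pub enum Variant {
    Null,
    Int64(i64),
    String(String),
    Object(BTreeMap<String, Variant>),
}

/// Convert one map row into a variant object by stringifying every key.
/// If two keys stringify to the same text, the later entry wins in this sketch.
pub fn map_row_to_object<K: ToString>(entries: &[(K, Variant)]) -> Variant {
    let fields: BTreeMap<String, Variant> = entries
        .iter()
        .map(|(key, value)| (key.to_string(), value.clone()))
        .collect();
    Variant::Object(fields)
}
```

The key coercion is lossy in the sense that the original key type cannot be recovered from the object alone, which is one reason maps do not round-trip the way shredded primitives do.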

FixedSizeBinary is also tricky, because it's not a valid shredded variant type, but UUID uses FixedSizeBinary(16) as its physical type. So when converting to variant, all binary types (including fixed size) convert to Variant::Binary, except that FixedSizeBinary(16) with the UUID extension type would convert to Variant::Uuid.
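The two interpretations of the same physical type can be sketched side by side. This is only an illustration of the semantics described in this comment: `Variant` is a toy stand-in (the real `Variant::Uuid` wraps a UUID type, not a byte array), and both function names and the `is_uuid_extension` flag are hypothetical.

```rust
// Toy stand-in for the real Variant type.
#[derive(Debug, Clone, PartialEq)]
pub enum Variant {
    Binary(Vec<u8>),
    Uuid([u8; 16]),
}

/// Cast semantics: any width is accepted and becomes Binary, unless the
/// value is 16 bytes wide AND the column carries the UUID extension type.
pub fn cast_fixed_size_binary(bytes: &[u8], is_uuid_extension: bool) -> Variant {
    match <[u8; 16]>::try_from(bytes) {
        Ok(raw) if is_uuid_extension => Variant::Uuid(raw),
        _ => Variant::Binary(bytes.to_vec()),
    }
}

/// Unshred semantics: a fixed-size binary typed_value can only mean UUID,
/// so any width other than 16 is treated as invalid data.
pub fn unshred_fixed_size_binary(bytes: &[u8]) -> Result<Variant, String> {
    <[u8; 16]>::try_from(bytes)
        .map(Variant::Uuid)
        .map_err(|_| format!("expected 16 bytes for a shredded UUID, got {}", bytes.len()))
}
```

The same 16 bytes therefore produce different variants depending on which operation is asked, which is why a single shared builder for this type is awkward.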

So one immediate problem is that the two operations have an overlapping but not equivalent set of types. And some physical types that seem the same have different interpretation/semantics. Even the simplest -- the NULL builder -- has different semantics between the two operations.

Another problem is that the definition of "primitive type" differs between the two modules:

unshred_variant takes the variant perspective, so string, binary, and boolean arrays can all implement the generic AppendToVariantBuilder trait that UnshredPrimitiveRowBuilder relies on. But timestamps, which need extra state (timezone info), are not primitive and need their own builder.

arrow_to_variant takes the arrow perspective, so string, binary and boolean are not primitive types and thus need their own builder implementations. Additionally, all decimal and temporal types need special treatment and they get custom builders as well.

Another problem is the need to handle value column when unshredding, which is not needed when converting strongly typed data to variant.

Overall... I couldn't find a way to slice this better, in spite of my intuition screaming it should be possible.

Oh, and casting failures also... converting an arrow value to variant can fail, and the cast options decide whether the failure produces Variant::Null or an error. In theory, unshredding is infallible and any failure there is due to invalid data (so should always produce an error).
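A small sketch of that difference in failure policy, under stated assumptions: `Variant` and `CastOptions` are toy stand-ins, and `narrow_to_int8`, `finish_cast` and `finish_unshred` are hypothetical helpers chosen only to show a fallible conversion being resolved two different ways.

```rust
// Toy stand-ins; the real types live in parquet-variant / parquet-variant-compute.
#[derive(Debug, Clone, PartialEq)]
pub enum Variant {
    Null,
    Int8(i8),
}

pub struct CastOptions {
    /// When true, a failed conversion is an error; otherwise it becomes Variant::Null.
    pub strict: bool,
}

/// Example of a fallible per-value conversion: None means "does not fit".
pub fn narrow_to_int8(value: i64) -> Option<Variant> {
    i8::try_from(value).ok().map(Variant::Int8)
}

/// Cast semantics: the options decide what a failed conversion turns into.
pub fn finish_cast(converted: Option<Variant>, options: &CastOptions) -> Result<Variant, String> {
    match converted {
        Some(variant) => Ok(variant),
        None if options.strict => Err("value cannot be represented as a variant".to_string()),
        None => Ok(Variant::Null),
    }
}

/// Unshred semantics: a failed conversion means the shredded data is invalid,
/// so it is always reported as an error.
pub fn finish_unshred(converted: Option<Variant>) -> Result<Variant, String> {
    converted.ok_or_else(|| "invalid shredded data".to_string())
}
```

A shared conversion layer would have to return something like `Option<Variant>` and leave the policy to each caller, which is roughly the split shown here.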

Here's an LLM-generated compare/contrast, in case that's helpful:

Analysis of Current Consolidation Effort

Current State

The diff shows an attempt to consolidate primitive type handling by:

* Introducing shared ArrowToVariant trait - A zero-cost trait that both modules can use
* Adding define_arrow_to_variant macro - A simpler macro for basic type conversions
* Sharing timestamp conversion logic - shared_timestamp_to_variant function
* Partial adoption in unshred_variant.rs - Using the shared trait for most primitive types

Key Differences Between Approaches

arrow_to_variant.rs (Cast semantics)

* Philosophy: Convert any Arrow type to Variant, with flexible error handling
* Macro: define_row_builder! - Complex, feature-rich macro supporting:
  * Optional extra fields (like CastOptions, scale)
  * Fallible transformations with Option<T> return types
  * Strict vs non-strict error handling modes
  * Generic type parameters and where clauses
* Primitive definition: T::Native: Into<Variant> constraint
* Decimal support: Full support for all decimal types with overflow handling
* FixedSizeBinary: Accepts any size, converts to Variant::Binary

unshred_variant.rs (Unshred semantics)

* Philosophy: Reconstruct original Variant from shredded representation
* Macro: define_arrow_to_variant! - Simpler macro supporting:
  * Basic value transformations
  * Simple error propagation
  * No extra configuration fields
* Primitive definition: Any type implementing ArrowToVariant trait
* Decimal support: Missing (not implemented)
* FixedSizeBinary: Only size 16 allowed, converts to Variant::Uuid

Areas of Redundancy

1. Primitive Type Implementations

Both modules have nearly identical logic for:

* Basic integer types (i8, i16, i32, i64, u8, u16, u32, u64)
* Floating point types (f32, f64, f16)
* Boolean, String, BinaryView
* Date32, Time64Microsecond
* Timestamp types (with timezone handling)

2. Enum Variants and Match Arms

Both have large enums with similar variants:

// arrow_to_variant.rs
PrimitiveInt8(PrimitiveArrowToVariantBuilder<'a, Int8Type>),
PrimitiveInt16(PrimitiveArrowToVariantBuilder<'a, Int16Type>),
// ... 12+ more primitive variants

// unshred_variant.rs
PrimitiveInt8(UnshredPrimitiveRowBuilder<'a, PrimitiveArray<Int8Type>>),
PrimitiveInt16(UnshredPrimitiveRowBuilder<'a, PrimitiveArray<Int16Type>>),
// ... 10+ more primitive variants

3. Factory Pattern Logic

Both have similar DataType matching logic to create appropriate builders.

Semantic Differences That Prevent Full Sharing

* FixedSizeBinary Handling:
  * Cast: Any size → Variant::Binary
  * Unshred: Only size 16 → Variant::Uuid
* Error Handling Philosophy:
  * Cast: Configurable strict/non-strict modes, overflow becomes Variant::Null
  * Unshred: Strict validation, errors propagate up
* Decimal Support:
  * Cast: Full decimal support with scale handling and overflow detection
  * Unshred: No decimal support (missing data point)
* Method Signatures:
  * Cast: append_row(builder, index)
  * Unshred: append_row(builder, metadata, index) - needs metadata for unshredded values
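The signature difference can be sketched as follows. Everything here is a simplified stand-in: `Variant::Raw` represents an undecoded binary variant, the builders work on slices rather than Arrow arrays, and the handling of the "both null" and "both present" cases is an assumption about primitive columns rather than a statement of what the crate does.

```rust
// Toy stand-in; Raw models a binary variant that still needs its metadata.
#[derive(Debug, Clone, PartialEq)]
pub enum Variant {
    Null,
    Int32(i32),
    Raw { metadata: Vec<u8>, value: Vec<u8> },
}

/// Cast side: strongly typed input, so a row index is all that is needed.
pub struct CastInt32Builder<'a> {
    pub array: &'a [Option<i32>],
}

impl CastInt32Builder<'_> {
    pub fn append_row(&self, out: &mut Vec<Variant>, index: usize) {
        out.push(self.array[index].map_or(Variant::Null, Variant::Int32));
    }
}

/// Unshred side: each row may live in typed_value or in the binary value
/// column, and the binary form is meaningless without the metadata.
pub struct UnshredInt32Builder<'a> {
    pub value: &'a [Option<Vec<u8>>],
    pub typed_value: &'a [Option<i32>],
}

impl UnshredInt32Builder<'_> {
    pub fn append_row(
        &self,
        out: &mut Vec<Variant>,
        metadata: &[u8],
        index: usize,
    ) -> Result<(), String> {
        match (&self.value[index], self.typed_value[index]) {
            (None, Some(v)) => out.push(Variant::Int32(v)),
            (Some(bytes), None) => out.push(Variant::Raw {
                metadata: metadata.to_vec(),
                value: bytes.clone(),
            }),
            // Assumption: both columns null is emitted as a variant null here.
            (None, None) => out.push(Variant::Null),
            // Assumption: both columns set is invalid for a primitive column.
            (Some(_), Some(_)) => {
                return Err(format!("row {index}: both value and typed_value are set"))
            }
        }
        Ok(())
    }
}
```

The extra `metadata` parameter and the fallback to the `value` column are the parts that have no counterpart on the cast side.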

Here is my updated suggestion (basically just update comments)

[Variant] Improve documentation and make kernels consistent #8536

alamb

FWIW I think this is an improvement over what we have on main, even if we can still improve it further

alamb · 2025-10-01T21:19:47Z

Thank you @scovich -- I think it just needs to have conflicts resolved and it will be good to go

scovich added 5 commits September 30, 2025 05:45

[Variant] Define and use unshred_variant function

0662e12

address reviews

62b6299

Merge remote-tracking branch 'oss/main' into unshred-variant

29cdbad

remove stale TODO

2831e34

[Variant] Harmonize primitive conversions in cast_to_variant and unsh…

509d432

…red_variant

github-actions bot added parquet Changes to the parquet crate parquet-variant parquet-variant* crates labels Sep 30, 2025

scovich changed the title ~~Harmonize unshred and cast~~ [Variant] Harmonize primitive conversions variant unshredding and casting Sep 30, 2025

alamb approved these changes Oct 1, 2025
