Skip to content

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Sep 30, 2025

Which issue does this PR close?

Rationale for this change

See #8481 (comment) --cast_to_variant and unshred_variant have significant overlap in the code that handles conversion of primitive types to variant.

What changes are included in this PR?

Define a new ArrowToVariant trait that does what it says, and update both sets of code to use it.

Are these changes tested?

All existing test coverage of the two conversion functions continues to pass.

Are there any user-facing changes?

No.

@github-actions github-actions bot added parquet Changes to the parquet crate parquet-variant parquet-variant* crates labels Sep 30, 2025
@scovich
Copy link
Contributor Author

scovich commented Sep 30, 2025

Attn @alamb -- I'm on the fence. This change actually increased code size a bit, but it does eliminate some duplication of logic. Maybe still be worthwhile?

@scovich scovich changed the title Harmonize unshred and cast [Variant] Harmonize primitive conversions variant unshredding and casting Sep 30, 2025
}

/// Macro to define ArrowToVariant implementations with optional value transformation
macro_rules! define_arrow_to_variant {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I don't love is we now have two complex macros that feel kind of similar. This new macro simplified the unshred_variant code but has ~no effect on the cast_to_variant code in this file. In fact, the builder definition didn't even change (L397 on the left side of the diff):

define_row_builder!(
    struct PrimitiveArrowToVariantBuilder<'a, T: ArrowPrimitiveType>
    where T::Native: Into<Variant<'a, 'a>>,
    |array| -> PrimitiveArray<T> { array.as_primitive() }
);

So in retrospect, I'm not sure what duplication this PR actually eliminated?
Seems like it just moved some definitions around.

Any ideas on a better approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this PR doesn't seem to have eliminated much yet

There seems to be two major pieces of functionality

  1. Converting elements of Arrow Arrays to the corresponding Variant elements
  2. Converting Variant arrays back to Arrow Arrays

It seems to me like arrow_to_variant.rs and unshred_variant.rs still have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would get unshred support "for free" for other types like UnionArrays

That was the dream at anyways.

After I spent some time reviewing this PR I think I would like to propose some naming consolidation as we have two functions that are inverses of each other that are confusingly named;

  • cast_to_variant which casts arrow arrays to variant
  • variant_to_arrow which casts variants to arrow arrays

I will make a PR to propose cast_arrow_to_variant and cast_variant_to_arrow for the kernel names and we can keep the modules variant_to_arrow and arrow_to_variant with the actual code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I spent some time reviewing this PR I think I would like to propose some naming consolidation as we have two functions that are inverses of each other that are confusingly named;

* `cast_to_variant` which casts arrow arrays to variant

* `variant_to_arrow` which casts variants to arrow arrays

I will make a PR to propose cast_arrow_to_variant and cast_variant_to_arrow for the kernel names and we can keep the modules variant_to_arrow and arrow_to_variant with the actual code

cast_to_variant is a function that leverages the arrow_to_variant module.

variant_to_arrow is a module that the variant_get function relies on.

I guess variant_get could be seen as a superset of a hypothetical cast_variant_to_arrow function that is an inverse of cast_[arrow_]to_variant?

That said, I do think it's a good idea to step back and take a hard look at naming conventions as the code matures and the interactions (or lack thereof) become clearer:

  • cast_to_variant - converts fully strongly-typed data to binary variant
    • uses the row builders defined in arrow_to_variant module
  • shred_variant - shreds a binary variant input according to the requested shredding schema
    • uses its own row builders defined in the same module
  • unshred_variant - converts shredded variant back to binary variant
    • uses its own row builders defined in the same module
  • variant_get - can be used to extract fully strongly-typed data from variant (shredded or not)
    • tries to do columnar operations when possible, but falls back to row builders defined in the variant_to_arrow module when necessary

And, for good measure, some of the conversions probably should move to type_conversions module, and I don't know where the ListLikeArray trait should live?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess variant_get could be seen as a superset of a hypothetical cast_variant_to_arrow function that is an inverse of cast_[arrow_]to_variant?

Yes, I think that is a good way to think about it. I don't think we should add cast_to_variant_to_arrow at this time (I was very confused)

And, for good measure, some of the conversions probably should move to type_conversions module, and I don't know where the ListLikeArray trait should live?

Or maybe we could consolidate the conversions into arrow_to_variant or variant_to_arrow 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be two major pieces of functionality

1. Converting elements of Arrow Arrays to the corresponding Variant elements

2. Converting Variant arrays back to Arrow Arrays

I agree this is the intuitive view, and arrow_to_variant and unshred_variant both fall loosely in category 1/.

It seems to me like arrow_to_variant.rs and unshred_variant.rs still have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would get unshred support "for free" for other types like UnionArrays

That was the dream at anyways.

So this is tricky -- shredded variant typed_value columns must be one of the supported variant shredding types defined by the shredding spec.

Unsigned integer types are not on that list, so we'll never need to unshred them (even tho we can shred them, and can even variant_get them by converting back).

Complex types like Union and Map are also not on that list and so we'll never need to unshred them. But we can still convert them to variant: Whatever union branch is active for each row gets converted to variant, which works fine; maps are trickier -- I think our current code forcibly converts the map key column to string (with a cast 🙀) and then converts the result to variant object.

FixedLenBinary is also tricky, because it's not a valid shredded variant type, but UUID uses FixedLenBinary(16) as its physical type. So when converting to variant, all binary types (including fixed len) convert to Variant::Binary except that FixedLenBinary(16) with the UUID extension type would convert to Variant::Uuid.

So one immediate problem is that the two operations have an overlapping but not equivalent set of types. And some physical types that seem the same have different interpretation/semantics. Even the simplest -- the NULL builder -- has different semantics between the two operations.

Another problem is that the definition of "primitive type" differs between the two modules:

  • unshred_variant takes the variant perspective, so string, binary, and boolean arrays can all implement the generic AppendToVariantBuilder trait that UnshredPrimitiveRowBuilder relies on. But timestamps, which need extra state (timezone info) are not primitive and need their own builder.
  • arrow_to_variant takes the arrow perspective, so string, binary and boolean are not primitive types and thus need their own builder implementations. Additionally, all decimal and temporal types need special treatment and they get customer builders as well.

Another problem is the need to handle value column when unshredding, which is not needed when converting strongly typed data to variant.

Overall... I couldn't find a way to slice this better, in spite of my intuition screaming it should be possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and casting failures also... converting arrow value to variant can fail, and cast options decides whether the failure produces Variant::Null or an error. In theory, unshredding is infallible and any failure there is due to invalid data (so should always produce an error).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an LLM-generated compare/contrast, in case that's helpful:

Analysis of Current Consolidation Effort

Current State

The diff shows an attempt to consolidate primitive type handling by:

  1. Introducing shared ArrowToVariant trait - A zero-cost trait that both modules can use
  2. Adding define_arrow_to_variant macro - A simpler macro for basic type conversions
  3. Sharing timestamp conversion logic - shared_timestamp_to_variant function
  4. Partial adoption in unshred_variant.rs - Using the shared trait for most primitive types

Key Differences Between Approaches

arrow_to_variant.rs (Cast semantics)

  • Philosophy: Convert any Arrow type to Variant, with flexible error handling
  • Macro: define_row_builder! - Complex, feature-rich macro supporting:
    • Optional extra fields (like CastOptions, scale)
    • Fallible transformations with Option<T> return types
    • Strict vs non-strict error handling modes
    • Generic type parameters and where clauses
  • Primitive definition: T::Native: Into<Variant> constraint
  • Decimal support: Full support for all decimal types with overflow handling
  • FixedSizeBinary: Accepts any size, converts to Variant::Binary

unshred_variant.rs (Unshred semantics)

  • Philosophy: Reconstruct original Variant from shredded representation
  • Macro: define_arrow_to_variant! - Simpler macro supporting:
    • Basic value transformations
    • Simple error propagation
    • No extra configuration fields
  • Primitive definition: Any type implementing ArrowToVariant trait
  • Decimal support: Missing (not implemented)
  • FixedSizeBinary: Only size 16 allowed, converts to Variant::Uuid

Areas of Redundancy

1. Primitive Type Implementations

Both modules have nearly identical logic for:

  • Basic integer types (i8, i16, i32, i64, u8, u16, u32, u64)
  • Floating point types (f32, f64, f16)
  • Boolean, String, BinaryView
  • Date32, Time64Microsecond
  • Timestamp types (with timezone handling)

2. Enum Variants and Match Arms

Both have large enums with similar variants:

// arrow_to_variant.rs
PrimitiveInt8(PrimitiveArrowToVariantBuilder<'a, Int8Type>),
PrimitiveInt16(PrimitiveArrowToVariantBuilder<'a, Int16Type>),
// ... 12+ more primitive variants

// unshred_variant.rs  
PrimitiveInt8(UnshredPrimitiveRowBuilder<'a, PrimitiveArray<Int8Type>>),
PrimitiveInt16(UnshredPrimitiveRowBuilder<'a, PrimitiveArray<Int16Type>>),
// ... 10+ more primitive variants

3. Factory Pattern Logic

Both have similar DataType matching logic to create appropriate builders.

Semantic Differences That Prevent Full Sharing

  1. FixedSizeBinary Handling:

    • Cast: Any size → Variant::Binary
    • Unshred: Only size 16 → Variant::Uuid
  2. Error Handling Philosophy:

    • Cast: Configurable strict/non-strict modes, overflow becomes Variant::Null
    • Unshred: Strict validation, errors propagate up
  3. Decimal Support:

    • Cast: Full decimal support with scale handling and overflow detection
    • Unshred: No decimal support (missing data point)
  4. Method Signatures:

    • Cast: append_row(builder, index)
    • Unshred: append_row(builder, metadata, index) - needs metadata for unshredded values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my updated suggestion (basically just update comments)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think this is an improvement over what we have on main, even if we can still improve it further

@alamb
Copy link
Contributor

alamb commented Oct 1, 2025

Thank you @scovich -- I think it just needs to have conflicts resolved and it will be good to go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate parquet-variant parquet-variant* crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants