-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Harmonize primitive conversions variant unshredding and casting #8521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Attn @alamb -- I'm on the fence. This change actually increased code size a bit, but it does eliminate some duplication of logic. Maybe still be worthwhile? |
} | ||
|
||
/// Macro to define ArrowToVariant implementations with optional value transformation | ||
macro_rules! define_arrow_to_variant { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I don't love is we now have two complex macros that feel kind of similar. This new macro simplified the unshred_variant
code but has ~no effect on the cast_to_variant
code in this file. In fact, the builder definition didn't even change (L397 on the left side of the diff):
define_row_builder!(
struct PrimitiveArrowToVariantBuilder<'a, T: ArrowPrimitiveType>
where T::Native: Into<Variant<'a, 'a>>,
|array| -> PrimitiveArray<T> { array.as_primitive() }
);
So in retrospect, I'm not sure what duplication this PR actually eliminated?
Seems like it just moved some definitions around.
Any ideas on a better approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this PR doesn't seem to have eliminated much yet
There seems to be two major pieces of functionality
- Converting elements of Arrow Arrays to the corresponding Variant elements
- Converting Variant arrays back to Arrow Arrays
It seems to me like arrow_to_variant.rs
and unshred_variant.rs
still have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would get unshred
support "for free" for other types like UnionArray
s
That was the dream at anyways.
After I spent some time reviewing this PR I think I would like to propose some naming consolidation as we have two functions that are inverses of each other that are confusingly named;
cast_to_variant
which casts arrow arrays to variantvariant_to_arrow
which casts variants to arrow arrays
I will make a PR to propose cast_arrow_to_variant
and cast_variant_to_arrow
for the kernel names and we can keep the modules variant_to_arrow
and arrow_to_variant
with the actual code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After I spent some time reviewing this PR I think I would like to propose some naming consolidation as we have two functions that are inverses of each other that are confusingly named;
* `cast_to_variant` which casts arrow arrays to variant * `variant_to_arrow` which casts variants to arrow arrays
I will make a PR to propose
cast_arrow_to_variant
andcast_variant_to_arrow
for the kernel names and we can keep the modulesvariant_to_arrow
andarrow_to_variant
with the actual code
cast_to_variant
is a function that leverages the arrow_to_variant
module.
variant_to_arrow
is a module that the variant_get
function relies on.
I guess variant_get
could be seen as a superset of a hypothetical cast_variant_to_arrow
function that is an inverse of cast_[arrow_]to_variant
?
That said, I do think it's a good idea to step back and take a hard look at naming conventions as the code matures and the interactions (or lack thereof) become clearer:
cast_to_variant
- converts fully strongly-typed data to binary variant- uses the row builders defined in
arrow_to_variant
module
- uses the row builders defined in
shred_variant
- shreds a binary variant input according to the requested shredding schema- uses its own row builders defined in the same module
unshred_variant
- converts shredded variant back to binary variant- uses its own row builders defined in the same module
variant_get
- can be used to extract fully strongly-typed data from variant (shredded or not)- tries to do columnar operations when possible, but falls back to row builders defined in the
variant_to_arrow
module when necessary
- tries to do columnar operations when possible, but falls back to row builders defined in the
And, for good measure, some of the conversions probably should move to type_conversions
module, and I don't know where the ListLikeArray
trait should live?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Here is the PR [Variant] Improve documentation and includes for casts #8532
I guess variant_get could be seen as a superset of a hypothetical cast_variant_to_arrow function that is an inverse of cast_[arrow_]to_variant?
Yes, I think that is a good way to think about it. I don't think we should add cast_to_variant_to_arrow
at this time (I was very confused)
And, for good measure, some of the conversions probably should move to type_conversions module, and I don't know where the ListLikeArray trait should live?
Or maybe we could consolidate the conversions into arrow_to_variant
or variant_to_arrow
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be two major pieces of functionality
1. Converting elements of Arrow Arrays to the corresponding Variant elements 2. Converting Variant arrays back to Arrow Arrays
I agree this is the intuitive view, and arrow_to_variant and unshred_variant both fall loosely in category 1/.
It seems to me like
arrow_to_variant.rs
andunshred_variant.rs
still have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would getunshred
support "for free" for other types likeUnionArray
sThat was the dream at anyways.
So this is tricky -- shredded variant typed_value
columns must be one of the supported variant shredding types defined by the shredding spec.
Unsigned integer types are not on that list, so we'll never need to unshred them (even tho we can shred them, and can even variant_get
them by converting back).
Complex types like Union and Map are also not on that list and so we'll never need to unshred them. But we can still convert them to variant: Whatever union branch is active for each row gets converted to variant, which works fine; maps are trickier -- I think our current code forcibly converts the map key column to string (with a cast
🙀) and then converts the result to variant object.
FixedLenBinary is also tricky, because it's not a valid shredded variant type, but UUID uses FixedLenBinary(16)
as its physical type. So when converting to variant, all binary types (including fixed len) convert to Variant::Binary
except that FixedLenBinary(16)
with the UUID extension type would convert to Variant::Uuid
.
So one immediate problem is that the two operations have an overlapping but not equivalent set of types. And some physical types that seem the same have different interpretation/semantics. Even the simplest -- the NULL builder -- has different semantics between the two operations.
Another problem is that the definition of "primitive type" differs between the two modules:
unshred_variant
takes the variant perspective, so string, binary, and boolean arrays can all implement the genericAppendToVariantBuilder
trait thatUnshredPrimitiveRowBuilder
relies on. But timestamps, which need extra state (timezone info) are not primitive and need their own builder.arrow_to_variant
takes the arrow perspective, so string, binary and boolean are not primitive types and thus need their own builder implementations. Additionally, all decimal and temporal types need special treatment and they get customer builders as well.
Another problem is the need to handle value
column when unshredding, which is not needed when converting strongly typed data to variant.
Overall... I couldn't find a way to slice this better, in spite of my intuition screaming it should be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and casting failures also... converting arrow value to variant can fail, and cast options decides whether the failure produces Variant::Null
or an error. In theory, unshredding is infallible and any failure there is due to invalid data (so should always produce an error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an LLM-generated compare/contrast, in case that's helpful:
Analysis of Current Consolidation Effort
Current State
The diff shows an attempt to consolidate primitive type handling by:
- Introducing shared
ArrowToVariant
trait - A zero-cost trait that both modules can use - Adding
define_arrow_to_variant
macro - A simpler macro for basic type conversions - Sharing timestamp conversion logic -
shared_timestamp_to_variant
function - Partial adoption in unshred_variant.rs - Using the shared trait for most primitive types
Key Differences Between Approaches
arrow_to_variant.rs (Cast semantics)
- Philosophy: Convert any Arrow type to Variant, with flexible error handling
- Macro:
define_row_builder!
- Complex, feature-rich macro supporting:- Optional extra fields (like
CastOptions
,scale
) - Fallible transformations with
Option<T>
return types - Strict vs non-strict error handling modes
- Generic type parameters and where clauses
- Optional extra fields (like
- Primitive definition:
T::Native: Into<Variant>
constraint - Decimal support: Full support for all decimal types with overflow handling
- FixedSizeBinary: Accepts any size, converts to
Variant::Binary
unshred_variant.rs (Unshred semantics)
- Philosophy: Reconstruct original Variant from shredded representation
- Macro:
define_arrow_to_variant!
- Simpler macro supporting:- Basic value transformations
- Simple error propagation
- No extra configuration fields
- Primitive definition: Any type implementing
ArrowToVariant
trait - Decimal support: Missing (not implemented)
- FixedSizeBinary: Only size 16 allowed, converts to
Variant::Uuid
Areas of Redundancy
1. Primitive Type Implementations
Both modules have nearly identical logic for:
- Basic integer types (i8, i16, i32, i64, u8, u16, u32, u64)
- Floating point types (f32, f64, f16)
- Boolean, String, BinaryView
- Date32, Time64Microsecond
- Timestamp types (with timezone handling)
2. Enum Variants and Match Arms
Both have large enums with similar variants:
// arrow_to_variant.rs
PrimitiveInt8(PrimitiveArrowToVariantBuilder<'a, Int8Type>),
PrimitiveInt16(PrimitiveArrowToVariantBuilder<'a, Int16Type>),
// ... 12+ more primitive variants
// unshred_variant.rs
PrimitiveInt8(UnshredPrimitiveRowBuilder<'a, PrimitiveArray<Int8Type>>),
PrimitiveInt16(UnshredPrimitiveRowBuilder<'a, PrimitiveArray<Int16Type>>),
// ... 10+ more primitive variants
3. Factory Pattern Logic
Both have similar DataType matching logic to create appropriate builders.
Semantic Differences That Prevent Full Sharing
-
FixedSizeBinary Handling:
- Cast: Any size →
Variant::Binary
- Unshred: Only size 16 →
Variant::Uuid
- Cast: Any size →
-
Error Handling Philosophy:
- Cast: Configurable strict/non-strict modes, overflow becomes
Variant::Null
- Unshred: Strict validation, errors propagate up
- Cast: Configurable strict/non-strict modes, overflow becomes
-
Decimal Support:
- Cast: Full decimal support with scale handling and overflow detection
- Unshred: No decimal support (missing data point)
-
Method Signatures:
- Cast:
append_row(builder, index)
- Unshred:
append_row(builder, metadata, index)
- needs metadata for unshredded values
- Cast:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my updated suggestion (basically just update comments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think this is an improvement over what we have on main, even if we can still improve it further
Thank you @scovich -- I think it just needs to have conflicts resolved and it will be good to go |
Which issue does this PR close?
Struct
#8336Rationale for this change
See #8481 (comment) --
cast_to_variant
andunshred_variant
have significant overlap in the code that handles conversion of primitive types to variant.What changes are included in this PR?
Define a new
ArrowToVariant
trait that does what it says, and update both sets of code to use it.Are these changes tested?
All existing test coverage of the two conversion functions continues to pass.
Are there any user-facing changes?
No.