Conversation

@stevomitric (Contributor) commented Mar 26, 2024

What changes were proposed in this pull request?

Added normalization of map keys when they are put into ArrayBasedMapBuilder.

Why are the changes needed?

Map keys must be unique, so we need to normalize floating-point keys and prevent building a map such as Map(0.0, -0.0), where both zeros end up as distinct keys.
This also unblocks the GROUP BY statement for map types, as per this discussion.
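For illustration (this example is added here and was not part of the PR description): the duplicate keys can slip through a hash-based deduplication index on boxed keys, because boxed java.lang.Double treats 0.0 and -0.0 as distinct even though they compare equal as primitives. A minimal plain-Scala sketch, independent of Spark:

  // 0.0 and -0.0 are == as primitive doubles, but their boxed forms hash and
  // compare differently, so a java.util.HashMap-style key index keeps both.
  val index = new java.util.HashMap[java.lang.Double, Int]()
  index.put(0.0d, 1)
  index.put(-0.0d, 2)
  println(0.0d == -0.0d)  // true
  println(index.size())   // 2 -- both zeros survive unless keys are normalized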

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests in ArrayBasedMapBuilderSuite.

Was this patch authored or co-authored using generative AI tooling?

No

case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
case ArrayType(dt, _) =>
  new GenericArrayData(value.asInstanceOf[GenericArrayData].array.map { element =>
Contributor:

If we have an array of 1 million strings, we will go through each value even though we know we don't need to normalize strings.

What about doing the same as in NormalizeFloatingNumbers and first checking whether we need to perform normalization at all?
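A standalone sketch of the suggested pattern, added for illustration (simplified: it uses a plain Scala Array instead of GenericArrayData, shows only Double, and assumes the org.apache.spark.sql.types classes on the classpath; it is not the PR's actual code):

  // Decide once per data type whether a normalization pass is needed,
  // so an array of strings is returned without touching each element.
  import org.apache.spark.sql.types._

  def needNormalize(dt: DataType): Boolean = dt match {
    case DoubleType       => true  // FloatType would be analogous, omitted here
    case ArrayType(et, _) => needNormalize(et)
    case _                => false
  }

  def normalize(value: Any, dt: DataType): Any = dt match {
    case DoubleType => value match {
      case d: Double if d == 0.0d => 0.0d  // collapses -0.0 into +0.0
      case other                  => other
    }
    case ArrayType(et, _) if needNormalize(et) =>
      value.asInstanceOf[Array[Any]].map(normalize(_, et))
    case _ => value  // e.g. ArrayType(StringType): no per-element walk
  }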

Contributor Author:

Applied NormalizeFloatingNumbers.needNormalize here

"mapKeyDedupPolicy" -> "\"spark.sql.mapKeyDedupPolicy\"")
)

val builderStruct = new ArrayBasedMapBuilder(new StructType().add("i", "double"), IntegerType)
Contributor:

Maybe add a case where an array is inside of a struct.

)
}

test("apply key normalization when creating") {
Contributor:

add another test for successful normalization
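A rough sketch, added for illustration, of what such a test could look like; it assumes the builder's put/build usage seen elsewhere in ArrayBasedMapBuilderSuite and is not the test that was actually merged:

  test("put a -0.0 key and read back a normalized +0.0 key") {
    val builder = new ArrayBasedMapBuilder(DoubleType, IntegerType)
    builder.put(-0.0d, 1)
    val map = builder.build()
    assert(map.numElements() == 1)
    // raw bits distinguish -0.0 from +0.0 even though == does not
    assert(java.lang.Double.doubleToRawLongBits(map.keyArray().getDouble(0)) ==
      java.lang.Double.doubleToRawLongBits(0.0d))
  }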

@stefankandic (Contributor):

Please add info in the description on why the change is needed, i.e. right now we can create a map with both keys -0.0 and 0.0, etc.

case StructType(sf) =>
  new GenericInternalRow(
    value.asInstanceOf[GenericInternalRow].values.zipWithIndex.map { element =>
      normalize(element._1, sf(element._2).dataType)
Contributor:

You could also check if you need to do normalization here, right?

This way we would avoid normalizing all the fields of a struct if only one actually needs it.

Contributor Author:

As noted by @cloud-fan below, complex types have been dropped.


private lazy val keyNeedNormalize = NormalizeFloatingNumbers.needNormalize(keyType)

def normalize(value: Any, dataType: DataType): Any = dataType match {
Contributor:

We should return a lambda function that does the normalization based on the data type, instead of matching on the data type per row.

def normalize(value: Any, dataType: DataType): Any = dataType match {
case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
case ArrayType(dt, _) =>
Contributor:

No need to handle complex types, as we use a TreeMap for complex key types, which should handle floating points well.

@stevomitric requested a review from cloud-fan on March 26, 2024 at 15:42
private lazy val keyNeedNormalize =
  keyType.isInstanceOf[FloatType] || keyType.isInstanceOf[DoubleType]

def normalize(dataType: DataType): Any => Any = dataType match {
Contributor:

private lazy val keyNormalizer: Any => Any = keyType match {
  case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER
  case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER
  case _ => identity
}

Then they can just write:

  val keyNormalized = keyNormalizer(key)
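An added, simplified illustration of the design choice (class and parameter names here are hypothetical, not the actual ArrayBasedMapBuilder code): because the key type is fixed for the builder's lifetime, the type dispatch can happen once at construction time, and each put then pays only a single function call per key.

  class SketchBuilder(keyIsFloatingPoint: Boolean) {
    // chosen once per builder, so no per-row matching on the data type
    private val keyNormalizer: Any => Any =
      if (keyIsFloatingPoint) {
        (v: Any) => v match {
          case d: Double if d == 0.0d => 0.0d  // -0.0  -> +0.0
          case f: Float if f == 0.0f  => 0.0f  // -0.0f -> +0.0f
          case other                  => other
        }
      } else {
        identity
      }

    // a real put() would go on to dedup and store; here we just normalize
    def putKey(key: Any): Any = keyNormalizer(key)
  }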

@cloud-fan closed this in 87449c3 on Mar 27, 2024
@cloud-fan (Contributor):

thanks, merging to master!
