@mbutrovich
Contributor

@mbutrovich mbutrovich commented Aug 14, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

#11667 added some Utf8View support for regexp_replace, but only for the first argument. The other args only support Utf8, which likely isn't a huge deal since they're probably short literals, but adding support for other string types makes it easier to simplify the signature. Systems like Comet (which may emit string literals as Utf8View) will also benefit from being able to use this function.

What changes are included in this PR?

  • Change the signature to support more string types, but only when all the args have the same type.
  • Update end-to-end tests for other string types.

Are these changes tested?

Yes, there are new tests. See above.

Are there any user-facing changes?

The signature for regexp_replace changed, but the previous one was arguably wrong (it did not include LargeUtf8).

@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Aug 14, 2025
- let patterns = StringArray::from(patterns);
- let replacements = StringArray::from(replacement);
+ let patterns = <$U>::from(patterns);
+ let replacements = <$U>::from(replacement);
Contributor Author

I changed all of the other arg string types at once since they hit the same code path (fetch_string_arg), so I don't think we need to test them independently.

@mbutrovich mbutrovich changed the title feat: better Utf8View support for regexp_replace, signature cleanup feat: support Utf8View for more arguments of regexp_replace, signature cleanup Aug 14, 2025
@mbutrovich mbutrovich changed the title feat: support Utf8View for more arguments of regexp_replace, signature cleanup feat: support Utf8View for more args of regexp_replace, signature cleanup Aug 14, 2025
@mbutrovich mbutrovich changed the title feat: support Utf8View for more args of regexp_replace, signature cleanup feat: support Utf8View for more args of regexp_replace Aug 14, 2025
@mbutrovich mbutrovich changed the title feat: support Utf8View for more args of regexp_replace feat: support Utf8View for more args of regexp_replace Aug 14, 2025
@mbutrovich mbutrovich marked this pull request as draft August 14, 2025 20:32
@mbutrovich
Contributor Author

I'm afraid I might have lost the plot with all of those string args. I looked at other string functions to see if a huge match is how we handle this, but the worst I could find was 2 args, not 4.

@mbutrovich
Contributor Author

I'm not proud of the readability of the change, but at least there doesn't seem to be a meaningful performance regression:

main

regexp_replace_1000     time:   [1.5948 ms 1.5958 ms 1.5967 ms]

regexp_replace_1000 utf8view
                        time:   [1.5956 ms 1.5966 ms 1.5976 ms]

PR

regexp_replace_1000     time:   [1.5982 ms 1.5991 ms 1.6000 ms]

regexp_replace_1000 utf8view
                        time:   [1.6023 ms 1.6038 ms 1.6056 ms]
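
To put the numbers above in perspective, here is a minimal sketch that computes the relative change between the middle estimates reported for main and for the PR. The helper name `percent_change` is made up for illustration and is not part of the benchmark harness; the inputs are copied from the output above.

```rust
/// Relative change of `candidate_ms` versus `baseline_ms`, in percent.
/// Hypothetical helper for illustration only.
fn percent_change(baseline_ms: f64, candidate_ms: f64) -> f64 {
    (candidate_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    // Middle estimates taken from the benchmark output quoted above.
    let utf8 = percent_change(1.5958, 1.5991);
    let utf8view = percent_change(1.5966, 1.6038);
    println!("regexp_replace_1000:          {utf8:+.2}%");
    println!("regexp_replace_1000 utf8view: {utf8view:+.2}%");
}
```

Both differences come out well under one percent (roughly 0.2% and 0.45%), which is consistent with the "no meaningful regression" reading.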

@mbutrovich mbutrovich marked this pull request as ready for review August 15, 2025 19:54
…anymore, update the .slt test for string_view.slt, and understand why String(3) and String(4) is not equivalent to this.
@alamb
Contributor

alamb commented Aug 20, 2025

@mbutrovich -- thank you for this PR.

I still can't help but feel there is a mismatch between what this PR is doing (trying to expand the types of string arguments this function supports) and what you are actually seeing in Comet.

Can you explain what is happening in Comet? For example, is regexp_replace being called as regexp_replace(Utf8View, Utf8View, Utf8View)?

I ask because a mismatch between the actual argument types, and the types a function accepts can be fixed in at least 2 ways:

  1. Expand the types a function supports (this PR)
  2. Cast the arguments to some combination that the function supports (via coercion).
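
The two options can be sketched with a toy model. This is not DataFusion's API: `StrType`, `accepts`, and `coerce` are hypothetical names, and the model assumes every string type can be cast to every other string type.

```rust
// Toy stand-in for the three Arrow string types discussed here (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Option 1: the function accepts the call only if the argument types
/// appear verbatim in its list of supported combinations.
fn accepts(supported: &[Vec<StrType>], args: &[StrType]) -> bool {
    supported.iter().any(|sig| sig.as_slice() == args)
}

/// Option 2: if the call is not accepted as-is, rewrite the argument types
/// to the first supported combination with the same arity. A real system
/// would insert casts here; this model only returns the target types.
fn coerce(supported: &[Vec<StrType>], args: &[StrType]) -> Option<Vec<StrType>> {
    if accepts(supported, args) {
        return Some(args.to_vec());
    }
    supported.iter().find(|sig| sig.len() == args.len()).cloned()
}

fn main() {
    let args = [StrType::Utf8View; 3];

    // A narrow signature rejects the Utf8View call unless it is coerced.
    let narrow = vec![vec![StrType::Utf8; 3]];
    println!("narrow accepts: {}", accepts(&narrow, &args));
    println!("narrow coerced: {:?}", coerce(&narrow, &args));

    // An expanded signature accepts the same call without any casts.
    let expanded = vec![
        vec![StrType::Utf8; 3],
        vec![StrType::LargeUtf8; 3],
        vec![StrType::Utf8View; 3],
    ];
    println!("expanded accepts: {}", accepts(&expanded, &args));
}
```

In this model, a caller that skips the coercion step entirely only succeeds against the expanded list.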

Typically coercion is done in the LogicalPlan, and I know that Comet does not use LogicalPlans. However, @adriangb and @kosiew have been working on a version of this for PhysicalExprs, in the PhysicalExprAdapter.

I also reviewed the documentation on types and type signatures and I think there is some ambiguity about what a function "supporting argument types" means. I will make a PR to improve the documentation in this area.

@mbutrovich
Contributor Author

I haven't forgotten about this. I need to generate a new .slt file for the relevant tests.

@adriangb adriangb self-requested a review August 26, 2025 21:01
@mbutrovich
Contributor Author

> Can you explain what is happening in comet? For example, is regexp_replace being called with regexp_replace(UTF8View, Utf8View, Utf8View)

In my experimental branch adding StringView support to Comet, we need a way to represent string literals during serialization from the Spark side to DataFusion. Currently all string literals come over as Utf8 and that just works. However, with Utf8View columns coming out of the Parquet reader, Arrow complains about not being able to evaluate filter expressions with mismatched types. I changed all string literals to be Utf8View, which doesn't really change anything underneath for single ScalarValues. Now, however, I have problems with functions like regexp_replace, which expect literals to be Utf8. Since Comet does not use DataFusion's front-end, we don't get the cast operations inserted into the plan that the signature logic is designed for.

I am increasingly of the mind that Comet needs to start doing some passes over the physical plan, and type coercion like this might be one reason.

I think this PR is good to go, but also am okay if we think it's needless complexity.

Self {
signature: Signature::one_of(
vec![
TypeSignature::Exact(vec![Utf8, Utf8, Utf8]),
Contributor

I think this is an option too?

Uniform(3, vec![Utf8View, LargeUtf8, Utf8]),
Uniform(4, vec![Utf8View, LargeUtf8, Utf8]),

I tested it on a checked out copy of this PR and the slt tests passed
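
As a rough illustration of how the suggested Uniform entries differ from the Exact entry in the snippet under review, here is a toy matcher. `StrType`, `Sig`, and `sig_matches` are hypothetical stand-ins, not DataFusion types, and the model only checks direct matches; as far as I understand, the real planner may additionally insert casts to make a call fit a signature.

```rust
// Toy stand-ins for the string types and signature kinds (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

enum Sig {
    /// Argument types must equal this list, position by position.
    Exact(Vec<StrType>),
    /// Exactly `n` arguments, all of one type drawn from the valid list.
    Uniform(usize, Vec<StrType>),
}

/// Returns true if `args` satisfies `sig` without any casting.
fn sig_matches(sig: &Sig, args: &[StrType]) -> bool {
    match sig {
        Sig::Exact(types) => types.as_slice() == args,
        Sig::Uniform(n, valid) => {
            args.len() == *n
                && args.iter().all(|t| valid.contains(t))
                && args.windows(2).all(|pair| pair[0] == pair[1])
        }
    }
}

fn main() {
    use StrType::*;
    let exact = Sig::Exact(vec![Utf8, Utf8, Utf8]);
    let uniform3 = Sig::Uniform(3, vec![Utf8View, LargeUtf8, Utf8]);

    let all_view = [Utf8View, Utf8View, Utf8View];
    let mixed = [Utf8View, Utf8, Utf8];

    println!("Exact   accepts all-Utf8View: {}", sig_matches(&exact, &all_view));
    println!("Uniform accepts all-Utf8View: {}", sig_matches(&uniform3, &all_view));
    println!("Uniform accepts mixed types:  {}", sig_matches(&uniform3, &mixed));
}
```

One Uniform entry per arity stands in for a separate Exact entry per string type, which is what makes the two-line form compact.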

@mbutrovich mbutrovich requested a review from Jefffrey September 24, 2025 13:38
Contributor

@Jefffrey Jefffrey left a comment

👍

@mbutrovich
Contributor Author

Thanks for the reviews and helpful feedback @Jefffrey and @alamb! I'm hopeful that this will get Comet closer to fully supporting Utf8View.

@mbutrovich mbutrovich added this pull request to the merge queue Sep 24, 2025
Merged via the queue into apache:main with commit 564864b Sep 24, 2025
11 checks passed