Skip to content

Conversation

bpintea
Copy link
Contributor

@bpintea bpintea commented Dec 17, 2024

This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function.

This also allows later pushing down the predicate to lucene as a case-insensitive term-query.

Fixes #118304.

This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons
against a string into an InsensitiveEquals comparison. The rewrite can
also result right away into a TRUE/FALSE, in case the string doesn't
match the caseness of the function.

This also allows later pushing down the predicate to lucene as a
case-insensitive term-query.
@bpintea bpintea added >enhancement auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL v9.0.0 v8.18.0 labels Dec 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Dec 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @bpintea, I've created a changelog YAML for you.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


/** Similar to {@link String#strip()}, but removes the WS throughout the entire string. */
public static String stripThrough(String input) {
return WS_PATTERN.matcher(input).replaceAll(StringUtils.EMPTY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative to avoid the pattern matching, use Strings.deleteAny(input, " ") or Strings.replace(input, " ", "")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The \s class will match more ([ \t\n\x0B\f\r]). I use it to compare (multiline, "randomly" spaced) JSONs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

var foldedRight = BytesRefs.toString(bc.right().fold());
return changeCase.caseType().matchesCase(foldedRight)
? new InsensitiveEquals(bc.source(), changeCase.field(), bc.right())
: Literal.FALSE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Literal.of(bc, Boolean.FALSE) to preserve the original source.
Additionally, see whether it makes sense to add a warning when dealing with a literal (WHERE TO_UPPER(field) == "lowercase").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add an additional check to remove any nested casing function since only the top level one counts:
to_upper(to_lower(to_upper(field))) is the same as to_upper(field).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add an additional check to remove any nested casing function

Added.
It's basic; ideally, we'd be able to transform TO_UPPER(CONCAT(TO_LOWER(field), "foo")) to just TO_UPPER(CONCAT(field, "foo")), but not TO_UPPER((CONCAT(TO_LOWER(field), "-foo") == "x-foo")::KEYWORD). But maybe another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see whether it makes sense to add a warning when dealing with a literal (WHERE TO_UPPER(field) == "lowercase")

We definitely can, but do we want then to expand a bit the scope and do this when other optimisations apply (folding to null or booleans), especially in filters?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to add sorting or ignore order
Additionally add some tests with non-alphabetical strings such as 🐔✈🔥🎉👓

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added, both here and in the unit tests. The ones here are just functional, but we should probably add some unicode data in the CSV tests indices.

import org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.InsensitiveEquals;
import org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.NotEquals;

public class ReplaceCaseWithInsensitiveMatch extends OptimizerRules.OptimizerExpressionRule<BinaryComparison> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReplaceStringCasingWithInsensitiveEquals (avoid using Match).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

@bpintea bpintea added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Dec 20, 2024
@elasticsearchmachine elasticsearchmachine merged commit d521f89 into elastic:main Dec 23, 2024
16 checks passed
@bpintea bpintea deleted the enh/push_down_to_case branch December 23, 2024 10:08
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

bpintea added a commit to bpintea/elasticsearch that referenced this pull request Dec 23, 2024
This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons
against a string into an InsensitiveEquals comparison. The rewrite can
also result right away into a TRUE/FALSE, in case the string doesn't
match the caseness of the function.

This also allows later pushing down the predicate to lucene as a
case-insensitive term-query.

Fixes elastic#118304.
elasticsearchmachine pushed a commit that referenced this pull request Dec 23, 2024
* ESQL: Rewrite TO_UPPER/TO_LOWER comparisons (#118870)

This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons
against a string into an InsensitiveEquals comparison. The rewrite can
also result right away into a TRUE/FALSE, in case the string doesn't
match the caseness of the function.

This also allows later pushing down the predicate to lucene as a
case-insensitive term-query.

Fixes #118304.

* Disable `TO_UPPER(null)`-tests prior to 8.17 (#119213)

TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that
could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test
like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed
collaterally by #114334 (8.17.0)

Also, correct some of the tests skipping (that had however no impact,
due to testing range).

(cherry picked from commit edb3818)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: look to pushdown case insensitive functions
3 participants