-
Notifications
You must be signed in to change notification settings - Fork 25.5k
ESQL: Rewrite TO_UPPER/TO_LOWER comparisons #118870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Rewrite TO_UPPER/TO_LOWER comparisons #118870
Conversation
This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function. This also allows later pushing down the predicate to lucene as a case-insensitive term-query.
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @bpintea, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
||
/** Similar to {@link String#strip()}, but removes the WS throughout the entire string. */ | ||
public static String stripThrough(String input) { | ||
return WS_PATTERN.matcher(input).replaceAll(StringUtils.EMPTY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an alternative to avoid the pattern matching, use Strings.deleteAny(input, " ")
or Strings.replace(input, " ", "")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The \s
class will match more ([ \t\n\x0B\f\r]
). I use it to compare (multiline, "randomly" spaced) JSONs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
var foldedRight = BytesRefs.toString(bc.right().fold()); | ||
return changeCase.caseType().matchesCase(foldedRight) | ||
? new InsensitiveEquals(bc.source(), changeCase.field(), bc.right()) | ||
: Literal.FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Literal.of(bc, Boolean.FALSE)
to preserve the original source.
Additionally, see whether it makes sense to add a warning when dealing with a literal (WHERE TO_UPPER(field) == "lowercase").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add an additional check to remove any nested casing function since only the top level one counts:
to_upper(to_lower(to_upper(field)))
is the same as to_upper(field)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add an additional check to remove any nested casing function
Added.
It's basic; ideally, we'd be able to transform TO_UPPER(CONCAT(TO_LOWER(field), "foo"))
to just TO_UPPER(CONCAT(field, "foo"))
, but not TO_UPPER((CONCAT(TO_LOWER(field), "-foo") == "x-foo")::KEYWORD)
. But maybe another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see whether it makes sense to add a warning when dealing with a literal (WHERE TO_UPPER(field) == "lowercase")
We definitely can, but do we want then to expand a bit the scope and do this when other optimisations apply (folding to null
or booleans), especially in filters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to add sorting or ignore order
Additionally add some tests with non-alphabetical strings such as 🐔✈🔥🎉👓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added, both here and in the unit tests. The ones here are just functional, but we should probably add some unicode data in the CSV tests indices.
import org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.InsensitiveEquals; | ||
import org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.NotEquals; | ||
|
||
public class ReplaceCaseWithInsensitiveMatch extends OptimizerRules.OptimizerExpressionRule<BinaryComparison> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReplaceStringCasingWithInsensitiveEquals
(avoid using Match).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed.
💚 Backport successful
|
This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function. This also allows later pushing down the predicate to lucene as a case-insensitive term-query. Fixes elastic#118304.
* ESQL: Rewrite TO_UPPER/TO_LOWER comparisons (#118870) This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function. This also allows later pushing down the predicate to lucene as a case-insensitive term-query. Fixes #118304. * Disable `TO_UPPER(null)`-tests prior to 8.17 (#119213) TO_UPPER/TO_LOWER resolution incorrectly returned child's type (that could also be `null`, type `NULL`), instead of KEYWORD/TEXT. So a test like `TO_UPPER(null) == "..."` fails on type mismatch. This was fixed collaterally by #114334 (8.17.0) Also, correct some of the tests skipping (that had however no impact, due to testing range). (cherry picked from commit edb3818)
This adds an optimization rule to rewrite TO_UPPER/TO_LOWER comparisons against a string into an InsensitiveEquals comparison. The rewrite can also result right away into a TRUE/FALSE, in case the string doesn't match the caseness of the function.
This also allows later pushing down the predicate to lucene as a case-insensitive term-query.
Fixes #118304.