-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add nullable to Categorize #118176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add nullable to Categorize #118176
Conversation
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
nik9000
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instanceof certainly feels terrible, but is Nullable.FALSE the right way to express this?
It looks like FALSE means "this will never make null. @jan-elastic, is that true for this function? Even if you send it no results?
Haven't seen any obvious tests with |
alex-spies
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, thanks for taking a stab at this!
@ivancea and I still wanted to take a look and check if we really need the special casing in FoldNull. Unfortunately, I do not think it's correct to always assume Nullability.FALSE - actually, the opposite is correct, see below :/
|
|
||
| @Override | ||
| public Nullability nullable() { | ||
| return Nullability.FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this is conceptually not true. Actually, this should be Nullability.TRUE.
In the csv tests, we have test cases that emit nulls: either from actual null inputs, or from empty strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CATEGORIZE of null, empty string, only whitespace, or only stopword tokens, is null.
Just look at categorize.csv-spec and search for "null" for many examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why that check in FoldNull?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disclaimer: I'm not an expert on this part of the code. @nik9000 or @ivancea probably are the right persons to ask.
My thoughts: CATEGORIZE is a stateful function, that uses a specialized CategorizeBlockHash to gather the results. I think folding the nulls breaks that. (Note that commenting out the line also breaks a bunch of CSV tests.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivancea provided some context on this to me. I think this is required for the following case, which can be seen when removing the check in FoldNull before running:
row x = null | stats count() by categorize(x)
{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"org.elasticsearch.xpack.esql.core.expression.Attribute.id()\" because \"sourceAttribute\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"org.elasticsearch.xpack.esql.core.expression.Attribute.id()\" because \"sourceAttribute\" is null"},"status":500}
FoldNull propagates the null from the row even into the Aggregate:
[2024-12-09T11:08:22,575][INFO ][o.e.x.e.o.LogicalPlanOptimizer] [runTask-0] Rule logical.FoldNull applied
Limit[1000[INTEGER]] = Limit[1000[INTEGER]]
\_Aggregate[STANDARD,[CATEGORIZE(x{r}#104) AS categorize(x)],[COUNT([2a][KEYWORD],true[BOOLEAN]) AS count(), categorize(x){r ! \_Aggregate[STANDARD,[null[KEYWORD] AS categorize(x)],[COUNT([2a][KEYWORD],true[BOOLEAN]) AS count(), categorize(x){r}#106]]
}#106]] ! \_Row[[null[NULL] AS x]]
\_Row[[null[NULL] AS x]] !
The reason this can happen is that the CATEGORIZE must remain in the Aggregate's groupings.
I think the correct course of action is:
- Add the query from above as csv test
- Leave the check in place and document what it is needed for
- Override
Categorize.Nullableand set it toNullable.TRUEas we can only know if it'll produce a null after running/folding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Csv test proposal that I just had pass locally:
row null
required_capability: categorize_v5
ROW message = null, str = ["a", "b", "c"]
| STATS COUNT(), VALUES(str) BY category=CATEGORIZE(message)
;
COUNT():long | VALUES(str):keyword | category:keyword
1 | [a, b, c] | null
;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ivan is adding a similar test case here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And probably CI passed in this PR because we were missing the right tests, no? @alex-spies @ivancea @jan-elastic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I think the nullable method may currently only be relevant for FoldNull, where the explicit exemption for CATEGORIZE makes it kinda irrelevant.
That is to say, I can't think of a test case that will make your PR fail @astefan . (But I think we should still not use nullable() this way, as it would be a breach of contract).
FWIW, the row test case above, or the one from Ivan's PR at least make things fail when you just remove the extra check for CATEGORIZE from FoldNull without also overriding Categorize.nullable().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added here an extra testcase (ROW null), apart from the null-in-param one the PR had already: #118013
Also, updated the Categorize#nullable() to make sense (An empty string will also return null), and commented the why of the Categorize exception in FoldNull
|
Relates to #112392 |
Better to check
nullable()rather thaninstanceof.