Skip to content

Conversation

@astefan
Copy link
Contributor

@astefan astefan commented Dec 6, 2024

Better to check nullable() rather than instanceof.

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Dec 6, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The instanceof certainly feels terrible, but is Nullable.FALSE the right way to express this?

It looks like FALSE means "this will never make null. @jan-elastic, is that true for this function? Even if you send it no results?

@astefan
Copy link
Contributor Author

astefan commented Dec 6, 2024

The instanceof certainly feels terrible, but is Nullable.FALSE the right way to express this?

It looks like FALSE means "this will never make null. @jan-elastic, is that true for this function? Even if you send it no results?

Haven't seen any obvious tests with categorize(null) or categorize of a nullable thing, saw the "skip" in FoldNull and assumed this function is not nullable. If indeed is, I am curious why the change in FoldNull though. This PR (no reviewers yet :-) ) was to pseudo-validate my assumption of a non-nullable categorize. Happy to discuss further.

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya, thanks for taking a stab at this!

@ivancea and I still wanted to take a look and check if we really need the special casing in FoldNull. Unfortunately, I do not think it's correct to always assume Nullability.FALSE - actually, the opposite is correct, see below :/


@Override
public Nullability nullable() {
return Nullability.FALSE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this is conceptually not true. Actually, this should be Nullability.TRUE.

In the csv tests, we have test cases that emit nulls: either from actual null inputs, or from empty strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CATEGORIZE of null, empty string, only whitespace, or only stopword tokens, is null.

Just look at categorize.csv-spec and search for "null" for many examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why that check in FoldNull?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disclaimer: I'm not an expert on this part of the code. @nik9000 or @ivancea probably are the right persons to ask.

My thoughts: CATEGORIZE is a stateful function, that uses a specialized CategorizeBlockHash to gather the results. I think folding the nulls breaks that. (Note that commenting out the line also breaks a bunch of CSV tests.)

Copy link
Contributor

@alex-spies alex-spies Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivancea provided some context on this to me. I think this is required for the following case, which can be seen when removing the check in FoldNull before running:

row x = null | stats count() by categorize(x)

{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"org.elasticsearch.xpack.esql.core.expression.Attribute.id()\" because \"sourceAttribute\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"org.elasticsearch.xpack.esql.core.expression.Attribute.id()\" because \"sourceAttribute\" is null"},"status":500}

FoldNull propagates the null from the row even into the Aggregate:

[2024-12-09T11:08:22,575][INFO ][o.e.x.e.o.LogicalPlanOptimizer] [runTask-0] Rule logical.FoldNull applied
Limit[1000[INTEGER]]                                                                                                         = Limit[1000[INTEGER]]
\_Aggregate[STANDARD,[CATEGORIZE(x{r}#104) AS categorize(x)],[COUNT([2a][KEYWORD],true[BOOLEAN]) AS count(), categorize(x){r ! \_Aggregate[STANDARD,[null[KEYWORD] AS categorize(x)],[COUNT([2a][KEYWORD],true[BOOLEAN]) AS count(), categorize(x){r}#106]]
}#106]]                                                                                                                      !   \_Row[[null[NULL] AS x]]
  \_Row[[null[NULL] AS x]]                                                                                                   ! 
                             

The reason this can happen is that the CATEGORIZE must remain in the Aggregate's groupings.

I think the correct course of action is:

  • Add the query from above as csv test
  • Leave the check in place and document what it is needed for
  • Override Categorize.Nullable and set it to Nullable.TRUE as we can only know if it'll produce a null after running/folding it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Csv test proposal that I just had pass locally:

row null
required_capability: categorize_v5

ROW message = null, str = ["a", "b", "c"]
  | STATS COUNT(), VALUES(str) BY category=CATEGORIZE(message)
;

COUNT():long | VALUES(str):keyword | category:keyword
           1 | [a, b, c]           | null
;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ivan is adding a similar test case here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And probably CI passed in this PR because we were missing the right tests, no? @alex-spies @ivancea @jan-elastic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think the nullable method may currently only be relevant for FoldNull, where the explicit exemption for CATEGORIZE makes it kinda irrelevant.

That is to say, I can't think of a test case that will make your PR fail @astefan . (But I think we should still not use nullable() this way, as it would be a breach of contract).

FWIW, the row test case above, or the one from Ivan's PR at least make things fail when you just remove the extra check for CATEGORIZE from FoldNull without also overriding Categorize.nullable().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added here an extra testcase (ROW null), apart from the null-in-param one the PR had already: #118013
Also, updated the Categorize#nullable() to make sense (An empty string will also return null), and commented the why of the Categorize exception in FoldNull

@astefan
Copy link
Contributor Author

astefan commented Dec 9, 2024

Relates to #112392

@astefan astefan closed this Dec 9, 2024
@astefan astefan deleted the categorize_not_nullable branch December 9, 2024 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants