@@ -13,6 +13,7 @@
import org.elasticsearch.compute.operator.EvalOperator.ExpressionEvaluator;
import org.elasticsearch.xpack.esql.capabilities.Validatable;
import org.elasticsearch.xpack.esql.core.expression.Expression;
import org.elasticsearch.xpack.esql.core.expression.Nullability;
import org.elasticsearch.xpack.esql.core.tree.NodeInfo;
import org.elasticsearch.xpack.esql.core.tree.Source;
import org.elasticsearch.xpack.esql.core.type.DataType;
@@ -92,6 +93,11 @@ public boolean foldable() {
        return false;
    }

    @Override
    public Nullability nullable() {
        return Nullability.FALSE;
Contributor

Hm, this is conceptually not true. Actually, this should be Nullability.TRUE.

In the csv tests, we have test cases that emit nulls: either from actual null inputs, or from empty strings.

Contributor

CATEGORIZE of null, empty string, only whitespace, or only stopword tokens, is null.

Just look at categorize.csv-spec and search for "null" for many examples.

Contributor Author

Then why that check in FoldNull?

Contributor

Disclaimer: I'm not an expert on this part of the code. @nik9000 or @ivancea are probably the right people to ask.

My thoughts: CATEGORIZE is a stateful function that uses a specialized CategorizeBlockHash to gather the results. I think folding the nulls breaks that. (Note that commenting out the line also breaks a bunch of CSV tests.)

Contributor (@alex-spies, Dec 9, 2024)

@ivancea provided some context on this to me. I think this is required for the following case, which can be seen when removing the check in FoldNull before running:

row x = null | stats count() by categorize(x)

{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"org.elasticsearch.xpack.esql.core.expression.Attribute.id()\" because \"sourceAttribute\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"org.elasticsearch.xpack.esql.core.expression.Attribute.id()\" because \"sourceAttribute\" is null"},"status":500}

FoldNull propagates the null from the row even into the Aggregate:

[2024-12-09T11:08:22,575][INFO ][o.e.x.e.o.LogicalPlanOptimizer] [runTask-0] Rule logical.FoldNull applied
Before:
Limit[1000[INTEGER]]
\_Aggregate[STANDARD,[CATEGORIZE(x{r}#104) AS categorize(x)],[COUNT([2a][KEYWORD],true[BOOLEAN]) AS count(), categorize(x){r}#106]]
  \_Row[[null[NULL] AS x]]

After:
Limit[1000[INTEGER]]
\_Aggregate[STANDARD,[null[KEYWORD] AS categorize(x)],[COUNT([2a][KEYWORD],true[BOOLEAN]) AS count(), categorize(x){r}#106]]
  \_Row[[null[NULL] AS x]]

This fails because the CATEGORIZE must remain in the Aggregate's groupings; once it is replaced by a null literal, the Aggregate can no longer find it.

I think the correct course of action is:

  • Add the query from above as a csv test
  • Leave the check in place and document what it is needed for
  • Override Categorize.nullable() and have it return Nullability.TRUE, as we can only know whether it will produce a null after running/folding it.
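
To make the interaction concrete, here is a minimal, self-contained sketch of how the exemption and nullable() play together. It is a toy model, not the Elasticsearch code: FoldNullSketch, Expr, the simplified Literal and Categorize classes, and the exemptCategorize flag are all hypothetical stand-ins; only the shape of the condition mirrors the check in FoldNull.

```java
import java.util.List;

// Toy model only: none of these types are the real ES|QL classes.
public final class FoldNullSketch {

    enum Nullability { TRUE, FALSE, UNKNOWN }

    /** Minimal stand-in for an expression tree node. */
    interface Expr {
        Nullability nullable();
        List<Expr> children();
    }

    /** A literal; a null value models the null literal. */
    static final class Literal implements Expr {
        final Object value;
        Literal(Object value) { this.value = value; }
        public Nullability nullable() { return value == null ? Nullability.TRUE : Nullability.FALSE; }
        public List<Expr> children() { return List.of(); }
    }

    /** Stand-in for the CATEGORIZE grouping function. */
    static final class Categorize implements Expr {
        final Expr field;
        Categorize(Expr field) { this.field = field; }
        // The result can be null (null, empty, whitespace-only or stopword-only input),
        // so the honest answer here is TRUE.
        public Nullability nullable() { return Nullability.TRUE; }
        public List<Expr> children() { return List.of(field); }
    }

    static boolean isNull(Expr e) {
        return e instanceof Literal && ((Literal) e).value == null;
    }

    /** Simplified version of the null-folding condition; exemptCategorize toggles the extra check. */
    static Expr fold(Expr e, boolean exemptCategorize) {
        boolean exempt = exemptCategorize && e instanceof Categorize;
        if (e.nullable() == Nullability.TRUE
            && exempt == false
            && e.children().stream().anyMatch(FoldNullSketch::isNull)) {
            return new Literal(null); // the grouping is replaced by a null literal
        }
        return e;
    }

    public static void main(String[] args) {
        Expr grouping = new Categorize(new Literal(null));
        // Exemption in place: the grouping survives even though nullable() is TRUE.
        System.out.println(fold(grouping, true) == grouping); // prints true
        // Exemption removed: the grouping is folded away, which is the failing case above.
        System.out.println(isNull(fold(grouping, false)));    // prints true
    }
}
```

In this model, returning FALSE from nullable() would also prevent the fold, but only by misreporting what the function can produce; keeping the explicit exemption lets nullable() stay truthful.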

Contributor

CSV test proposal that I just got to pass locally:

row null
required_capability: categorize_v5

ROW message = null, str = ["a", "b", "c"]
  | STATS COUNT(), VALUES(str) BY category=CATEGORIZE(message)
;

COUNT():long | VALUES(str):keyword | category:keyword
           1 | [a, b, c]           | null
;

Contributor

Ivan is adding a similar test case here

Contributor Author

And probably CI passed in this PR because we were missing the right tests, no? @alex-spies @ivancea @jan-elastic

Contributor

Hm, I think the nullable method may currently only be relevant for FoldNull, where the explicit exemption for CATEGORIZE makes it kinda irrelevant.

That is to say, I can't think of a test case that will make your PR fail, @astefan. (But I think we should still not use nullable() this way, as it would be a breach of contract.)

FWIW, the row test case above, or the one from Ivan's PR, at least makes things fail when you just remove the extra check for CATEGORIZE from FoldNull without also overriding Categorize.nullable().

Contributor

I added an extra test case (ROW null) here, in addition to the null-in-param one the PR already had: #118013
I also updated Categorize#nullable() so that it makes sense (an empty string also returns null), and added a comment explaining why FoldNull needs the exception for Categorize.

    }

    @Override
    public ExpressionEvaluator.Factory toEvaluator(ToEvaluator toEvaluator) {
        throw new UnsupportedOperationException("CATEGORIZE is only evaluated during aggregations");
@@ -13,7 +13,6 @@
import org.elasticsearch.xpack.esql.core.expression.Literal;
import org.elasticsearch.xpack.esql.core.expression.Nullability;
import org.elasticsearch.xpack.esql.expression.function.aggregate.AggregateFunction;
import org.elasticsearch.xpack.esql.expression.function.grouping.Categorize;
import org.elasticsearch.xpack.esql.expression.predicate.operator.comparison.In;

public class FoldNull extends OptimizerRules.OptimizerExpressionRule<Expression> {
@@ -43,7 +42,6 @@ public Expression rule(Expression e) {
            }
        } else if (e instanceof Alias == false
            && e.nullable() == Nullability.TRUE
            && e instanceof Categorize == false
            && Expressions.anyMatch(e.children(), Expressions::isNull)) {
            return Literal.of(e, null);
        }