Fixes esql class cast bug in STATS at planning level #137511

ncordon · 2025-11-03T11:12:30Z

Addresses #133992 and #136598, partially.

What is still missing from this PR: at the moment the runtime tries to avoid double computations, which results in exceptions when the plan is correct but not optimal. In other words, queries like:

from airports 
rename scalerank AS x 
stats  a = count(x), b = count(x) + count(x), c = count_distinct(x)

should never have failed at runtime, even if the plan was not optimal for repeated aggregations.

elasticsearchmachine · 2025-11-03T11:18:18Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2025-11-03T11:19:07Z

Hi @ncordon, I've created a changelog YAML for you.

alex-spies

Heya, no review yet, except for a very quick first glance.

This also fixes #136598, nice! Let's note that down in the PR description so the other issue gets auto-closed on merge, as well.

That said, I don't think this addresses problems like

| stats median(foo), percentile(foo, 50), count_distinct(foo)

because the substitution median(foo) -> percentile(foo, 50) happens after ReplaceAggregateAggExpressionWithEval, right?

The PR description says this partially addresses #133992; what else is not yet addressed?

alex-spies

Heya, could you please add some tests to the logical plan optimizer tests that demonstrate what the plans for some relevant STATS queries will look like? Actually, we should create a test class similar to ReplaceStatsFilteredAggWithEvalTests; there are probably some tests in LogicalPlanOptimizerTests that could be moved there, too, but that's optional.

I'm interested in seeing a bunch of cases, esp. ones with a BY clause and with per-agg-function WHERE clauses. We seem to have little coverage of per-agg-function WHERE clauses that are different from their canonicalization (otherwise I'd have expected some test failures).

Other than that, I think the approach in the fix is good! Clearly, when deduplicating aggs in expressions, we need to be consistent between a single agg function and an expression with agg functions within it.

alex-spies · 2025-11-03T12:38:26Z

.../elasticsearch/xpack/esql/optimizer/rules/logical/ReplaceAggregateAggExpressionWithEval.java

                        if (alias == null) {
                            // create synthetic alias ove the found agg function
-                            alias = new Alias(af.source(), syntheticName(canonical, child, counter[0]++), canonical, null, true);
+                            alias = new Alias(af.source(), syntheticName(canonical, child, counter[0]++), af, null, true);


Why do we not want to use the canonicalized agg function here anymore?

I think we should keep af.canonicalize(). The canonicalization still affects the per-agg filter, as in STATS c = count(field) WHERE other_field*1 > 10

I don't understand the explanation of why it is important to keep af.canonical() here. I'd swear that at some point I had to change this because tests were breaking otherwise. But if all tests pass with it, does that mean that either it is not that important or that we are missing specific tests that would break because of this?

The canonicalization may not be important, but I have a hunch that this is under-tested, and it's a smaller change if we keep emitting the same agg functions as before (with canonicalization).

alex-spies · 2025-11-03T12:42:49Z

.../elasticsearch/xpack/esql/optimizer/rules/logical/ReplaceAggregateAggExpressionWithEval.java

                    Expression aggExpression = child.transformUp(AggregateFunction.class, af -> {
-                        AggregateFunction canonical = (AggregateFunction) af.canonical();
+                        // canonical representation, with resolved aliases
+                        AggregateFunction canonical = (AggregateFunction) af.canonical().transformUp(e -> aliases.resolve(e, e));


Maybe we should use a helper function for this line to prevent this from being different from how we canonicalize agg functions above (line 91)?

idegtiarenko · 2025-11-03T13:49:44Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec

+891    | 1782   | 8
+;
+
+fixClassCastBug2


Nit: It would be nice to avoid numbers in test cases.
Additional description could also hint what aspects were broken before:

combining results of two aggregate functions

nesting functions

multiplying result by constant

etc

idegtiarenko · 2025-11-03T13:52:08Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java

        PUSHING_DOWN_EVAL_WITH_SCORE,

+        /**
+         * Fix for ClassCastException in STATS


Suggested change

-         * Fix for ClassCastException in STATS
+         * Fix for ClassCastException in STATS
+         * https://github.com/elastic/elasticsearch/issues/133992

I realize it is a bit tricky to describe the change in a Javadoc comment.
It might be worth linking the issue, as it has a more detailed description.
There are some prior examples with such links.

ncordon · 2025-11-03T15:41:52Z

That said, I don't think this addresses problems like

| stats median(foo), percentile(foo, 50), count_distinct(foo)

@alex-spies, I've now included another planning phase after the constant folding that should take care of cases like these that @astefan suggested.

The PR description says this partially addresses #133992; what else is not yet addressed?

Philosophically I don't think the design of the compute engine part is correct at the moment. We try to make optimizations at runtime to avoid computing duplicated things and that breaks in case the plan is not optimal because we end up accessing wrong positions in our buffers.

For many (if not all) of the tests I added the plans were correct (but not optimal), and we are throwing at runtime. I've been in touch with @dnhatn about this part and he's helping me solve it.

ncordon · 2025-11-03T16:28:07Z

...rc/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/ReplaceDuplicatedAggs.java

+ * becomes
+ * stats a = min(x), c = count(*) by g | eval b = a, d = c | keep a, b, c, d, g
+ */
+public final class ReplaceDuplicatedAggs extends OptimizerRules.OptimizerRule<Aggregate> implements OptimizerRules.CoordinatorOnly {


Ignore the duplicate code in this file with respect to ReplaceAggregateAggExpressionWithEval.java, I'll try to share as much code as possible once I've checked this passes all tests

alex-spies

Thanks a lot for iterating on this @ncordon! This has some subtleties.

alex-spies · 2025-11-07T15:51:00Z

...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java

            new PropagateEvalFoldables(),
            new ConstantFolding(),
+            // then extract nested aggs top-level
+            new DeduplicateAggs(),


Interesting!

Is the constant folding required to run before this? Will this take care of cases that have percentile(foo, 25+25) and percentile(foo, 2*25)?

I'd have thought that maybe the aggs substitution needs to happen before we replace agg expressions with evals. But this is placing deduplication pretty late into the optimization.

It's not wrong, though! But maybe a bit unfortunate that we'll have 2 rules that dedupe; and we can't assume that aggs come out of the substitutions batch already deduped. But maybe that was never correct to assume.

Constant folding in aggregation is an endemic problem and tech debt. Some insights here #112392 (comment).

I don't think we can realistically do the right thing with constant folding in aggs right now (due to priorities), and there is already optimization work we duplicate because of this combination of surrogate expressions in aggs and constant folding issues; see

new SubstituteSurrogateAggregations(), new ReplaceAggregateNestedExpressionWithEval()

being called twice in the Substitutions batch. Calling deduplication twice for aggs, only because agg constant folding doesn't happen correctly and at the right time, is in my mind the right compromise at this point, given how mysterious the errors are and how likely users are to run into them. It is also not a completely wrong thing to do.
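
As a rough illustration of why the second deduplication pass runs after constant folding, here is a toy model (all class and method names are hypothetical, not the Elasticsearch ones). It assumes aggregates are compared structurally, so percentile(foo, 25+25) and percentile(foo, 2*25) only become equal to percentile(foo, 50) once their arguments are folded:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model; it does not use the real ES|QL expression classes.
final class FoldThenDedupeSketch {
    interface Expr {}
    record Lit(int value) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}
    record Mul(Expr left, Expr right) implements Expr {}
    record Percentile(String field, Expr percent) {}

    // Fold an arithmetic tree of literals into a single literal.
    static Expr fold(Expr e) {
        if (e instanceof Add a && fold(a.left()) instanceof Lit l && fold(a.right()) instanceof Lit r) {
            return new Lit(l.value() + r.value());
        }
        if (e instanceof Mul m && fold(m.left()) instanceof Lit l && fold(m.right()) instanceof Lit r) {
            return new Lit(l.value() * r.value());
        }
        return e;
    }

    // Count structurally distinct aggregates, optionally folding arguments first.
    static int distinctCount(List<Percentile> aggs, boolean foldFirst) {
        Set<Percentile> seen = new HashSet<>();
        for (Percentile p : aggs) {
            seen.add(foldFirst ? new Percentile(p.field(), fold(p.percent())) : p);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Percentile> aggs = List.of(
            new Percentile("foo", new Lit(50)),
            new Percentile("foo", new Add(new Lit(25), new Lit(25))),
            new Percentile("foo", new Mul(new Lit(2), new Lit(25)))
        );
        System.out.println("distinct without folding: " + distinctCount(aggs, false));
        System.out.println("distinct after folding:   " + distinctCount(aggs, true));
    }
}
```

In this sketch the three aggregates stay distinct without folding and collapse to one after folding, which is the situation a deduplication step placed before ConstantFolding would miss.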

ncordon · 2025-11-11T10:05:13Z

...java/org/elasticsearch/xpack/esql/optimizer/rules/logical/AbstractAggregateDeduplicator.java

+import java.util.List;
+import java.util.Map;
+
+abstract class AbstractAggregateDeduplicator extends OptimizerRules.OptimizerRule<Aggregate> {


I've extracted this to avoid duplicating code between DeduplicateAggs and ReplaceAggregateAggExpressionWithEval

ncordon · 2025-11-11T10:06:03Z

...gin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java

        );
    }

-    /**


I've moved the relevant tests from here to DeduplicateAggsTests

ncordon · 2025-11-11T10:07:25Z

...src/test/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/DeduplicateAggsTests.java

+     *       \_EsRelation[test][_meta_field{f}#17, emp_no{f}#11, first_name{f}#12, ..]
+     * }</pre>
+     */
+    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/100634")


I tried fixing the tests that have this annotation but we need extra work. It looks doable to me.

Deduplicating counts when we have a count(1) and a count(*), for example, is not exactly trivial and requires care, because the argument for the count could be a null, or the count could have a filter, etc.

astefan

I went through the whole code and the only thing that bothers me is the forced refactoring for reusing the deduplication code. One of the aims for the rules, in general, is to be as easily readable and understandable as possible by non-experts looking at them.

For example, for the sake of code reuse, the method processAlias gets a long list of 9 parameters that are hard to grasp. Whoever looks at this code trying to understand either DeduplicateAggs or AbstractAggregateDeduplicator will have a harder time than if that code had been duplicated in the first place, imo.

I don't think this refactoring achieved this goal and it could go for another try.
I would try, as a second attempt, to have DeduplicateAggs expose a static (if possible) method that does the deduplication, and have ReplaceAggregateAggExpressionWithEval reuse that method. I think this is logically easier to understand because:

DeduplicateAggs has as its main purpose deduplication of aggregations (even the name says it so)
ReplaceAggregateAggExpressionWithEval does something else as a main task, but also (secondary) the deduplication. It makes sense to reuse the deduplication (as an added optimization) from somewhere else.

astefan · 2025-11-13T12:53:13Z

...java/org/elasticsearch/xpack/esql/optimizer/rules/logical/AbstractAggregateDeduplicator.java

+        Holder<Boolean> changed = new Holder<>(false);
+        int[] counter = new int[] { 0 };
+
+        for (NamedExpression agg : aggregate.aggregates()) {


Suggested change

-        for (NamedExpression agg : aggregate.aggregates()) {
+        for (NamedExpression agg : aggs) {

astefan · 2025-11-13T13:09:08Z

.../elasticsearch/xpack/esql/optimizer/rules/logical/ReplaceAggregateAggExpressionWithEval.java

+            changed.set(true);
+            Expression aggExpression = child.transformUp(AggregateFunction.class, af -> {
+                // canonical representation, with resolved aliases
+                AggregateFunction canonical = (AggregateFunction) af.canonical().transformUp(e -> aliases.resolve(e, e));


The original code in this rule was different: AggregateFunction canonical = (AggregateFunction) af.canonical();. What is the purpose of this change in the PR? What exactly does it fix/change?

This is the original fix for the problem. When we deduplicate aggs, the canonical representative has already been calculated with this method. The same canonical representative should be used when fixing nested expressions. For example, for a query:

from airports | rename scalerank AS x | stats a = count(x), b = count(x) + count(x), c = count_distinct(x)

The first count(x) would be stored as already seen with the representative count(scalerank). For b, which is a nested expression, we should already know that we've seen each of the count(x) calls.
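
To make the explanation above concrete, here is a minimal sketch with a toy model (hypothetical names, not the actual optimizer classes). It assumes a plain rename map stands in for the alias resolution done by aliases.resolve: when every aggregate, top-level or nested, is keyed by its canonical form with renames resolved, count(x) and count(scalerank) share one entry; if the renames are not resolved, they are treated as different aggregates.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; Agg and the rename map are illustrative stand-ins.
final class AggDedupSketch {
    // An aggregate call such as count(x).
    record Agg(String function, String field) {}

    // Canonical form: resolve a renamed field back to its source field.
    static Agg canonical(Agg agg, Map<String, String> renames) {
        return new Agg(agg.function(), renames.getOrDefault(agg.field(), agg.field()));
    }

    // Assign one synthetic name per distinct canonical aggregate.
    static Map<Agg, String> dedupe(List<Agg> aggs, Map<String, String> renames) {
        Map<Agg, String> seen = new LinkedHashMap<>();
        for (Agg agg : aggs) {
            Agg key = canonical(agg, renames);
            seen.putIfAbsent(key, "$$" + key.function() + "$" + seen.size());
        }
        return seen;
    }

    public static void main(String[] args) {
        // a = count(x) (stored as count(scalerank)), b = count(x) + count(x), c = count_distinct(x)
        List<Agg> aggs = List.of(
            new Agg("count", "scalerank"),
            new Agg("count", "x"),
            new Agg("count", "x"),
            new Agg("count_distinct", "x")
        );
        System.out.println("with renames resolved:    " + dedupe(aggs, Map.of("x", "scalerank")).size());
        System.out.println("without resolving renames: " + dedupe(aggs, Map.of()).size());
    }
}
```

With the rename resolved, the sketch ends up with two distinct aggregates (count and count_distinct over scalerank); without it there are three, which mirrors the inconsistency between top-level and nested aggs described in this thread.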

astefan · 2025-11-13T13:47:56Z

...src/test/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/DeduplicateAggsTests.java

+        assumeTrue("requires FIX FOR CLASSCAST exception to be enabled", EsqlCapabilities.Cap.FIX_STATS_CLASSCAST_EXCEPTION.isEnabled());
+        String query = """
+                FROM airports
+                | rename scalerank AS x


Suggested change

-                | rename scalerank AS x
+                | RENAME scalerank AS x

Nitty nit: please try to keep the same code style as the rest of the class. It shows consistency and uniformity in approach. I know it's not functionally different, but imho it's a level of consistency that shows a high level of attention.

astefan · 2025-11-13T13:51:41Z

...src/test/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/DeduplicateAggsTests.java

+                | STATS a = 2*COUNT_DISTINCT(scalerank, 100),
+                b = 2*COUNT_DISTINCT(scalerank, 220 - 150 + 30),
+                c = 2*COUNT_DISTINCT(scalerank, 1 + 200 - 80 - 20 - 1)


Suggested change

| STATS a = 2*COUNT_DISTINCT(scalerank, 100),

b = 2*COUNT_DISTINCT(scalerank, 220 - 150 + 30),

c = 2*COUNT_DISTINCT(scalerank, 1 + 200 - 80 - 20 - 1)

| STATS a = 2*COUNT_DISTINCT(scalerank, 100),

b = 2*COUNT_DISTINCT(scalerank, 220 - 150 + 30),

c = 2*COUNT_DISTINCT(scalerank, 1 + 200 - 80 - 20 - 1)

astefan · 2025-11-13T13:53:35Z

...src/test/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/DeduplicateAggsTests.java

+     *       \_EsRelation[airports][abbrev{f}#12, city{f}#18, city_location{f}#19, coun..]
+     */
+    public void testDuplicatedAggWithFoldableIdenticalExpressions() {
+        assumeTrue("requires FIX FOR CLASSCAST exception to be enabled", EsqlCapabilities.Cap.FIX_STATS_CLASSCAST_EXCEPTION.isEnabled());


Do you really need this?

Not really 🙈

astefan · 2025-11-13T13:54:04Z

...src/test/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/DeduplicateAggsTests.java

+
+    /**
+     * Project[[a{r}#5, b{r}#8, c{r}#11]]
+     * \_Eval[[$$COUNTDISTINCT$2*COUNT_DISTINC>$0{r$}#20 * 2[INTEGER] AS a#5, $$COUNTDISTINCT$2*COUNT_DISTINC>$0{r$}#20 * 2[


Why is this $$COUNTDISTINCT$2*COUNT_DISTINC (missing a T at the end)?

In TemporaryNameUtils, we limit the length of these strings to 16 characters:

static String limitToString(String string) { return string.length() > TO_STRING_LIMIT ? string.substring(0, TO_STRING_LIMIT - 1) + ">" : string; }
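
A small standalone reproduction of that truncation, as a sketch: the method body is copied from the snippet above, TO_STRING_LIMIT = 16 is taken from this comment, and the input string is only an assumption about how the nested expression is rendered.

```java
// Standalone sketch; the real method lives in TemporaryNameUtils.
final class LimitToStringSketch {
    // Value taken from the comment above ("16 characters").
    static final int TO_STRING_LIMIT = 16;

    // Keep the first 15 characters and mark the cut with '>'.
    static String limitToString(String string) {
        return string.length() > TO_STRING_LIMIT ? string.substring(0, TO_STRING_LIMIT - 1) + ">" : string;
    }

    public static void main(String[] args) {
        // Assumed rendering of the nested expression; longer than 16 characters, so it is cut.
        System.out.println(limitToString("2*COUNT_DISTINCT(scalerank, 100)"));
        // Short strings are returned unchanged.
        System.out.println(limitToString("count(x)"));
    }
}
```

The first call yields 2*COUNT_DISTINC>, which is the fragment visible in the synthetic name $$COUNTDISTINCT$2*COUNT_DISTINC>$0 in the plan.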

ncordon · 2025-11-14T09:09:17Z

.../elasticsearch/xpack/esql/optimizer/rules/logical/ReplaceAggregateAggExpressionWithEval.java

+    private static AggregateFunction getCannonical(AggregateFunction af, AttributeMap<Expression> aliases) {
+        return (AggregateFunction) af.canonical().transformUp(e -> aliases.resolve(e, e));
+    }
+


I've extracted this as @alex-spies asked me to do

ncordon · 2025-11-14T09:10:53Z

.../elasticsearch/xpack/esql/optimizer/rules/logical/ReplaceAggregateAggExpressionWithEval.java

+                } else {
+                    newAggs.add(agg);
+                    newProjections.add(agg.toAttribute());


This block doesn't really do anything: it is only triggered if replaceNestedExpressions is false, and it never seems to be reached in that case, at least for the tests in our codebase at the moment.

But I'd rather have this safeguard (just leaving that aggregation as it is) than risk something important disappearing from the plan.

ncordon · 2025-11-14T09:14:46Z

I went through the whole code and the only thing that bothers me is the forced refactoring for reusing the deduplication code. One of the aims for the rules, in general, is to be as easily readable and understandable as possible by non-experts looking at them.

Now it should be better. I've added a boolean to the original class instead to gate the part of the code that treats nested aggs.

astefan

LGTM with a small comment. Thank you!

astefan · 2025-11-17T09:19:39Z

.../elasticsearch/xpack/esql/optimizer/rules/logical/ReplaceAggregateAggExpressionWithEval.java

 */
-public final class ReplaceAggregateAggExpressionWithEval extends OptimizerRules.OptimizerRule<Aggregate> {
+public class ReplaceAggregateAggExpressionWithEval extends OptimizerRules.OptimizerRule<Aggregate> {
+    private boolean replaceNestedExpressions = true;


Make this field final and always set it in the constructor.
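
A minimal sketch of what that could look like, using hypothetical stand-in classes rather than the real rules. It assumes the deduplication-only rule reuses the original class by subclassing it and passing false; that wiring is an assumption for illustration, not something stated in this thread.

```java
// Hypothetical stand-in for the rule that both deduplicates and extracts nested aggs.
class NestedAggRewriteSketch {
    // Final and assigned exactly once, in the constructor.
    private final boolean replaceNestedExpressions;

    NestedAggRewriteSketch() {
        this(true);
    }

    protected NestedAggRewriteSketch(boolean replaceNestedExpressions) {
        this.replaceNestedExpressions = replaceNestedExpressions;
    }

    String describe() {
        return replaceNestedExpressions ? "dedupe + extract nested aggs" : "dedupe only";
    }
}

// Hypothetical stand-in for the later, deduplication-only pass.
final class DedupeOnlySketch extends NestedAggRewriteSketch {
    DedupeOnlySketch() {
        super(false);
    }
}
```

Because the flag is final, each rule instance behaves the same for its whole lifetime, and the only place to look for its value is the constructor call.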

Fixes esql class cast bug in STATS at planning level

53f9dbd

elasticsearchmachine added the needs:triage (Requires assignment of a team area label) and v9.3.0 labels Nov 3, 2025

ncordon added the :Analytics/ES|QL (AKA ESQL) and Team:ES|QL labels and removed the needs:triage (Requires assignment of a team area label) label Nov 3, 2025

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 3, 2025

ncordon added the >bug label Nov 3, 2025

Update docs/changelog/137511.yaml

c66b6b5

Merge branch 'main' into stats-fix-classCastException-planning

5f7e040

astefan requested a review from alex-spies November 3, 2025 12:18

alex-spies reviewed Nov 3, 2025

idegtiarenko reviewed Nov 3, 2025

Adds intermediate phase to replace aggs

ad46b64

Adds link to the esql capability

f1fd40e

ncordon commented Nov 3, 2025

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 4, 2025

Mirror upstream elastic#137511 as single snapshot commit for AI review

304f2f5

BASE=3017e334274a7292997b0fea77f90d2c73b58eba HEAD=f1fd40ef2107a7fcf162a3baa2a9484c03cd5546 Branch=main

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 5, 2025

Mirror upstream elastic#137511 as single snapshot commit for AI review

6a50f89

BASE=3017e334274a7292997b0fea77f90d2c73b58eba HEAD=f1fd40ef2107a7fcf162a3baa2a9484c03cd5546 Branch=main

ncordon and others added 3 commits November 5, 2025 18:49

Reshapes some stuff

94682b3

Adds more planner tests

bce1940

[CI] Auto commit changes from spotless

71c1691

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025

Mirror upstream elastic#137511 as single snapshot commit for AI review

10cac80

BASE=3017e334274a7292997b0fea77f90d2c73b58eba HEAD=71c1691e34614a43230da1905611f4fa9dfeec01 Branch=main

alex-spies reviewed Nov 7, 2025

ncordon added 2 commits November 7, 2025 17:27

format

709717c

Adds and moves more deduplication tests

7c99692

ncordon and others added 2 commits November 10, 2025 15:51

Merge branch 'main' into stats-fix-classCastException-planning

c3cfc7d

Reshapes some code

15216e2

ncordon force-pushed the stats-fix-classCastException-planning branch from b541eb7 to 15216e2 Compare November 11, 2025 09:53

ncordon commented Nov 11, 2025

ncordon and others added 2 commits November 11, 2025 12:22

format

6981f16

Merge branch 'main' into stats-fix-classCastException-planning

f6e5316

astefan self-requested a review November 13, 2025 08:13

astefan reviewed Nov 13, 2025

ncordon added 2 commits November 14, 2025 09:24

Makes code sharing better

f968149

More small nits

e1695a0

ncordon commented Nov 14, 2025

ncordon mentioned this pull request Nov 14, 2025

Fixes esql class cast bug in STATS at runtime level #138085

Open

astefan approved these changes Nov 17, 2025

ncordon and others added 2 commits November 17, 2025 11:27

Merge branch 'main' into stats-fix-classCastException-planning

a1935ea

Small nit

d4bed3f

ncordon force-pushed the stats-fix-classCastException-planning branch from b78574e to fdde58c Compare November 17, 2025 21:50

Fixes merge conflict: test that was added to main that now needs changes

c483a52

ncordon force-pushed the stats-fix-classCastException-planning branch from fdde58c to c483a52 Compare November 17, 2025 22:01

Merge branch 'main' into stats-fix-classCastException-planning

219c8a8

ncordon merged commit 0a81875 into elastic:main Nov 18, 2025
35 checks passed

luigidellaquila mentioned this pull request Nov 20, 2025

ES|QL: Potential cycle detected on JOIN planning #138346

Open
