[SPARK-46741][SQL] Cache Table with CTE should work when CTE in plan expression subquery #53526
Conversation
ping @cloud-fan, could you take a look? Also added a corresponding unit test case and verified the behavior before the change.
@AngersZhuuuu can you re-trigger the failed CI jobs?
Done.
-object NormalizeCTEIds extends Rule[LogicalPlan]{
+object NormalizeCTEIds extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = {
     val curId = new java.util.concurrent.atomic.AtomicLong()
shouldn't we have unique CTE ids per query? then we need a new AtomicLong instance per apply invocation.
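For illustration, a minimal sketch of the shape under discussion, with the counter created inside apply as the reviewer suggests. The rewrite body (the transformUp and the copy calls) is my assumption for readability, not the PR's actual code:

import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.sql.catalyst.plans.logical.{CTERelationDef, CTERelationRef, LogicalPlan, WithCTE}
import org.apache.spark.sql.catalyst.rules.Rule

object NormalizeCTEIds extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // A fresh counter per apply() invocation: normalized ids restart at 0 for
    // every query, so equivalent plans always normalize to the same ids.
    val curId = new AtomicLong()
    plan transformDown {
      case withCTE: WithCTE =>
        // One id map per WithCTE -- the line questioned later in this thread.
        val defIdToNewId = withCTE.cteDefs.map(_.id)
          .map((_, curId.getAndIncrement())).toMap
        withCTE transformUp {
          case d: CTERelationDef => d.copy(id = defIdToNewId(d.id))
          case r: CTERelationRef =>
            r.copy(cteId = defIdToNewId.getOrElse(r.cteId, r.cteId))
        }
    }
  }
}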
I know this, but directly changing to transformDownWithSubqueries makes the UT for SPARK-51109 fail. For this query:
test("SPARK-51109: CTE in subquery expression as grouping column") {
withTable("t") {
Seq(1 -> 1).toDF("c1", "c2").write.saveAsTable("t")
withView("v") {
sql(
"""
|CREATE VIEW v AS
|WITH r AS (SELECT c1 + c2 AS c FROM t)
|SELECT * FROM r
|""".stripMargin)
checkAnswer(
sql("SELECT (SELECT max(c) FROM v WHERE c > id) FROM range(1) GROUP BY 1"),
Row(2)
)
}
}
}
The plan is normalized from
Aggregate [scalar-subquery#15 [id#16L]], [scalar-subquery#15 [id#16L] AS scalarsubquery(id)#21]
: :- Aggregate [max(c#18) AS max(c)#20]
: : +- Filter (cast(c#18 as bigint) > outer(id#16L))
: : +- SubqueryAlias spark_catalog.default.v
: : +- View (`spark_catalog`.`default`.`v`, [c#18])
: : +- Project [cast(c#17 as int) AS c#18]
: : +- WithCTE
: : :- CTERelationDef 1, false
: : : +- SubqueryAlias r
: : : +- Project [(c1#12 + c2#13) AS c#17]
: : : +- SubqueryAlias spark_catalog.default.t
: : : +- Relation spark_catalog.default.t[c1#12,c2#13] parquet
: : +- Project [c#17]
: : +- SubqueryAlias r
: : +- CTERelationRef 1, true, [c#17], false, false
: +- Aggregate [max(c#26) AS max(c)#27]
: +- Filter (cast(c#26 as bigint) > outer(id#16L))
: +- SubqueryAlias spark_catalog.default.v
: +- View (`spark_catalog`.`default`.`v`, [c#26])
: +- Project [cast(c#25 as int) AS c#26]
: +- WithCTE
: :- CTERelationDef 1, false
: : +- SubqueryAlias r
: : +- Project [(c1#22 + c2#23) AS c#24]
: : +- SubqueryAlias spark_catalog.default.t
: : +- Relation spark_catalog.default.t[c1#22,c2#23] parquet
: +- Project [c#25]
: +- SubqueryAlias r
: +- CTERelationRef 1, true, [c#25], false, false
+- Range (0, 1, step=1)
to
Aggregate [scalar-subquery#15 [id#16L]], [scalar-subquery#15 [id#16L] AS scalarsubquery(id)#21]
: :- Aggregate [max(c#18) AS max(c)#20]
: : +- Filter (cast(c#18 as bigint) > outer(id#16L))
: : +- SubqueryAlias spark_catalog.default.v
: : +- View (`spark_catalog`.`default`.`v`, [c#18])
: : +- Project [cast(c#17 as int) AS c#18]
: : +- WithCTE
: : :- CTERelationDef 0, false
: : : +- SubqueryAlias r
: : : +- Project [(c1#12 + c2#13) AS c#17]
: : : +- SubqueryAlias spark_catalog.default.t
: : : +- Relation spark_catalog.default.t[c1#12,c2#13] parquet
: : +- Project [c#17]
: : +- SubqueryAlias r
: : +- CTERelationRef 0, true, [c#17], false, false
: +- Aggregate [max(c#26) AS max(c)#27]
: +- Filter (cast(c#26 as bigint) > outer(id#16L))
: +- SubqueryAlias spark_catalog.default.v
: +- View (`spark_catalog`.`default`.`v`, [c#26])
: +- Project [cast(c#25 as int) AS c#26]
: +- WithCTE
: :- CTERelationDef 1, false
: : +- SubqueryAlias r
: : +- Project [(c1#22 + c2#23) AS c#24]
: : +- SubqueryAlias spark_catalog.default.t
: : +- Relation spark_catalog.default.t[c1#22,c2#23] parquet
: +- Project [c#25]
: +- SubqueryAlias r
: +- CTERelationRef 1, true, [c#25], false, false
+- Range (0, 1, step=1)
Within the same plan, the normalized CTE ids of the two copies of this subquery now differ (0 vs. 1), so the copies no longer compare as equal and the check throws:
[info] is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(id)"' is neither present in GROUP BY, nor in an aggregate function.
[info] Add it to GROUP BY using ordinal position or wrap it in `first()` (or `first_value`) if you don't care which value you get. SQLSTATE: 0A000; line 1 pos 7
[info] Previous schema:scalarsubquery(id)#21
I am still working out how to fix this.
This means we should handle the case when CTE def IDs can be duplicated. In such cases, we should not generate new IDs blindly.
val defIdToNewId = withCTE.cteDefs.map(_.id).map((_, curId.getAndIncrement())).toMap
We need to fix this line. The id map should be per apply invocation, not per WithCTE.
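Concretely, the suggested change amounts to something like this hypothetical contrast, where defIdToNewId becomes a mutable map created once in apply, next to curId:

// Per WithCTE (current): a def id duplicated across subquery copies is
// assigned a different new id on each encounter.
val defIdToNewId = withCTE.cteDefs.map(_.id).map((_, curId.getAndIncrement())).toMap

// Per apply invocation (suggested): the map outlives any single WithCTE, so a
// def id seen again in another subquery copy reuses its first assignment.
withCTE.cteDefs.foreach(d => defIdToNewId.getOrElseUpdate(d.id, curId.getAndIncrement()))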
Or use a global map in one traversal: #53333 (comment)
ping @cloud-fan, how about the current version?
This reverts commit 4b7e1d8.
-    plan transformDown {
+    val defIdToNewId = new HashMap[Long, Long]()
+    plan transformDownWithSubqueries {
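Assembled, the single-traversal shape this diff suggests might look like the following sketch. It is my reconstruction, not the merged code, and it assumes a WithCTE node is always visited before the defs and refs beneath it:

import java.util.concurrent.atomic.AtomicLong

import scala.collection.mutable.HashMap

import org.apache.spark.sql.catalyst.plans.logical.{CTERelationDef, CTERelationRef, LogicalPlan, WithCTE}
import org.apache.spark.sql.catalyst.rules.Rule

object NormalizeCTEIds extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    val curId = new AtomicLong()
    // One map for the whole traversal, shared across the main plan and all
    // subqueries, so duplicated def ids normalize consistently everywhere.
    val defIdToNewId = new HashMap[Long, Long]()
    plan transformDownWithSubqueries {
      case w: WithCTE =>
        // Register ids for this WithCTE's defs before their refs are visited;
        // getOrElseUpdate keeps the first assignment for a duplicated def id.
        w.cteDefs.foreach(d => defIdToNewId.getOrElseUpdate(d.id, curId.getAndIncrement()))
        w
      case d: CTERelationDef => d.copy(id = defIdToNewId(d.id))
      case r: CTERelationRef => r.copy(cteId = defIdToNewId.getOrElse(r.cteId, r.cteId))
    }
  }
}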
Here I tried foreachWithSubqueries, but met a strange exception. @cloud-fan
peter-toth left a comment:
Only a minor nit.
thanks, merging to master/4.1!
Closes #53526 from AngersZhuuuu/SPARK-46741-FOLLOWUP.
Lead-authored-by: Angerszhuuuu <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d65ee81)
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Follow up on the comment at #53333 (comment).
Why are the changes needed?
Support all cases.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
Was this patch authored or co-authored using generative AI tooling?
No