[SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables. #25040
Conversation
Test build #107145 has finished for PR 25040 at commit
Test build #107193 has finished for PR 25040 at commit
Test build #107204 has finished for PR 25040 at commit
case DescribeColumnStatement(
    CatalogObjectIdentifier(Some(catalog), ident), colName, isExtended) =>
  throw new AnalysisException("Describing columns is not supported for v2 tables.")
Should this be supported eventually, or is it redundant if DESCRIBE TABLE is available?
I think we need to support it eventually, if only to keep parity with V1 tables.
DescribeTable(catalog.asTableCatalog, ident, isExtended)
case DropTableStatement(AsTableIdentifier(tableName), ifExists, purge) =>
This was missing? Are there no tests for DROP TABLE?
Oh this might have been a bad copy-paste artifact. Even if this is missing, it doesn't belong in this PR.
private def toCatalystRow(strs: String*): InternalRow = {
  val encoder = RowEncoder(DescribeTableSchemas.DESCRIBE_TABLE_SCHEMA).resolveAndBind()
Minor: I'd rather not create the encoder each time a row is created. Can you move this and the method to a companion object?
Sort of - a couple of questions:
- Is RowEncoder thread-safe?
- I noticed that if I create the RowEncoder but immediately resolveAndBind it, and then reuse the resolved encoder, the tests break because the describe returns incorrect rows. Presumably there's some kind of reused-memory issue here. I didn't look into it that thoroughly - I think we can just reuse the unresolved encoder and resolveAndBind before creating each row.
@cloud-fan, can you help answer these questions?
throw new ParseException("DESC TABLE COLUMN for a specific partition is not supported", ctx)
} else {
  DescribeColumnCommand(
    visitTableIdentifier(ctx.tableIdentifier),
Now that these rules create DescribeColumnStatement and DescribeTableStatement, they should be moved into Catalyst. There isn't anything specific to the implementation any more.
Do you mean that DescribeColumnCommand and DescribeTableCommand should be moved to Catalyst? The V1 commands depend on a bunch of stuff that's in core, such as SparkSession and DDLUtils.
Overall this looks good, but it doesn't move the parser rules to Catalyst. We've been trying to move as much as we can to Catalyst, to keep the parser and the SQL implementation separate instead of keeping them mixed together. That has also required moving the parser tests to Catalyst and moving the SparkSqlParser tests to a suite that tests parsing and conversion to v1 plans.
I moved the parser rules and created a helper object for the encoder.
| test("SPARK-17328 Fix NPE with EXPLAIN DESCRIBE TABLE") { | ||
| assertEqual("describe t", | ||
| DescribeTableCommand(TableIdentifier("t"), Map.empty, isExtended = false)) |
These test cases should be moved into catalyst as well.
+1. One minor comment, but otherwise this looks good to me.
Test build #107484 has finished for PR 25040 at commit
Test build #107490 has finished for PR 25040 at commit
Test build #107546 has finished for PR 25040 at commit
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

private[sql] object DescribeTableSchemas {
  val DESCRIBE_TABLE_ATTRIBUTES = Seq(
We shouldn't define attributes in an object. AttributeReference will be assigned a unique ID when created, and in general we should create new attributes when creating a new logical plan.
For example, if you do df1 = sql("desc table t1"); df2 = sql("desc table "), then df1.join(df2) would fail.
Can you join the results of DESCRIBE?
scala> val df1 = sql("desc t1")
df1: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]
scala> val df2 = sql("desc t2")
df2: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]
scala> df1.crossJoin(df2).show
+--------+---------+-------+--------+---------+-------+
|col_name|data_type|comment|col_name|data_type|comment|
+--------+---------+-------+--------+---------+-------+
| i| int| null| j| int| null|
+--------+---------+-------+--------+---------+-------+
This is not a common use case but we don't have to break it.
Changed this from a value to a method, so it will generate new identifiers every time while still being shared amongst multiple contexts.
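For illustration, a minimal sketch of that value-to-method change (the object and method names here are illustrative, not the exact code in this PR):

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.StringType

private[sql] object DescribeTableSchema {
  // A def instead of a val: each call builds fresh AttributeReferences with new
  // expression IDs, so two DESCRIBE plans can coexist in one query (e.g. the
  // cross join shown above) while the definition stays shared in one place.
  def describeTableAttributes(): Seq[AttributeReference] = Seq(
    AttributeReference("col_name", StringType, nullable = false)(),
    AttributeReference("data_type", StringType, nullable = false)(),
    AttributeReference("comment", StringType, nullable = true)())
}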
    catalog: TableCatalog,
    ident: Identifier,
    isExtended: Boolean) extends Command {
  override lazy val output = DescribeTableSchemas.DESCRIBE_TABLE_ATTRIBUTES
We don't need lazy val here, as it's not a heavy computation.
    ident: Identifier,
    isExtended: Boolean) extends Command {
  override lazy val output = DescribeTableSchemas.DESCRIBE_TABLE_ATTRIBUTES
  override lazy val schema = DescribeTableSchemas.DESCRIBE_TABLE_SCHEMA
By default, schema is StructType.fromAttributes(output), so we don't need to override it.
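A small sketch of that simplification, reusing the hypothetical DescribeTableSchema helper sketched above (the class shape is simplified, not the exact PR code):

import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.Command

// Only output is overridden; QueryPlan already derives schema as
// StructType.fromAttributes(output), so no separate schema override is needed.
case class DescribeTableSketch(tableName: String, isExtended: Boolean) extends Command {
  override def output: Seq[Attribute] = DescribeTableSchema.describeTableAttributes()
}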
val DESCRIBE_TABLE_SCHEMA = StructType(
  DESCRIBE_TABLE_ATTRIBUTES.map(attr => StructField(attr.name, attr.dataType, attr.nullable)))
nit: StructType.fromAttributes(DESCRIBE_TABLE_ATTRIBUTES)
} else {
  rows += toCatalystRow(s"Table $ident does not exist.", "", "")
Shouldn't we throw an exception when the table is not found?
I think we can follow #24937: DescribeTable should contain an UnresolvedRelation, so that the analyzer can check table existence for us.
Followed the AlterTable approach in the latest commit.
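Roughly, the AlterTable-style shape looks like the sketch below; the field list, package locations, and helper names are approximations for illustration, not the merged code.

import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog}
import org.apache.spark.sql.catalyst.analysis.NamedRelation
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

// The relation is kept as a child of the command, so if the table doesn't exist
// the plan stays unresolved and the analyzer reports the missing table before
// the physical DescribeTableExec ever runs.
case class DescribeTable(
    catalog: TableCatalog,
    ident: Identifier,
    table: NamedRelation,
    isExtended: Boolean) extends Command {
  override def children: Seq[LogicalPlan] = Seq(table)
  override def output: Seq[Attribute] = DescribeTableSchema.describeTableAttributes()
}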
Ah but I didn't remove this - though I guess technically we should never hit this code path. We can throw an exception here instead.
private val EMPTY_ROW = toCatalystRow("", "", "")

private def toCatalystRow(strs: String*): InternalRow = {
  ENCODER.resolveAndBind().toRow(
The encoder only needs to call resolveAndBind once.
I don't necessarily think so, but it could also be how this class is built. I think the encoder's state needs to be reset. When I don't resolveAndBind every time, the tests yield wrong results entirely.
Fixed it - we have to copy the rows generated by the encoder since the encoder re-uses the same memory space.
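A sketch of that fix (the object and member names are illustrative, not the exact PR code): bind the encoder once, and copy every row it produces.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DescribeTableRows {
  private val schema = StructType(Seq(
    StructField("col_name", StringType, nullable = false),
    StructField("data_type", StringType, nullable = false),
    StructField("comment", StringType, nullable = true)))

  // Resolve and bind once; the bound encoder is then reused for every row
  // (assuming rows are produced from a single thread).
  private lazy val encoder = RowEncoder(schema).resolveAndBind()

  def toCatalystRow(strs: String*): InternalRow = {
    // copy() is the important part: the bound encoder serializes each row into the
    // same reusable buffer, so without copying, later rows overwrite earlier ones.
    encoder.toRow(new GenericRowWithSchema(strs.toArray[Any], schema)).copy()
  }
}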
case DescribeTable(catalog, ident, _, isExtended) =>
  DescribeTableExec(catalog, ident, isExtended) :: Nil
How about:
case DescribeTable(catalog, ident, r: DataSourceV2Relation, isExtended) =>
  DescribeTableExec(r.table, ident, isExtended) :: Nil
Then we don't need to look up the table again in DescribeTableExec.
Test build #108063 has finished for PR 25040 at commit
Test build #108064 has finished for PR 25040 at commit
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

private[sql] object DescribeTableSchemas {
Singular, DescribeTableSchema?
import org.apache.spark.sql.catalyst.catalog.BucketSpec
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.plans.logical.sql.{AlterTableAddColumnsStatement, AlterTableAlterColumnStatement, AlterTableDropColumnsStatement, AlterTableRenameColumnStatement, AlterTableSetLocationStatement, AlterTableSetPropertiesStatement, AlterTableUnsetPropertiesStatement, AlterViewSetPropertiesStatement, AlterViewUnsetPropertiesStatement, CreateTableAsSelectStatement, CreateTableStatement, DescribeColumnStatement, DescribeTableStatement, DropTableStatement, DropViewStatement, InsertIntoStatement, QualifiedColType, ReplaceTableAsSelectStatement, ReplaceTableStatement}
These are getting really really long, and in particular the merge conflicts are a bit tedious to resolve. I'm normally very averse to wildcard imports, but there might come a point where we'll have to do that. Or I wonder if we could have a helper object that bundles all of these, or factory methods for these, or matchers... somehow.
+1 for wildcard here.
This package may make sense for a wildcard import because it has no sub-packages and is unlikely to in the future. Still, because Scala will import sub-packages, I think it's probably best to keep avoiding wildcard imports, even here.
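For comparison, the wildcard form under discussion (the option not taken here) would collapse the whole list into a single line:

import org.apache.spark.sql.catalyst.plans.logical.sql._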
Everything that's been brought up has been taken care of. Let me know if there's anything else to address.
    override val properties: util.Map[String, String])
  extends Table with SupportsRead with SupportsWrite {

  def this(
Where do we use this new constructor?
cloud-fan left a comment:
LGTM except a few minor comments
Test build #108358 has finished for PR 25040 at commit
Test build #108362 has finished for PR 25040 at commit
@cloud-fan is this good to merge?
Ah sorry, I missed a few things - I'll address them.
Test build #108729 has finished for PR 25040 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Implements the DESCRIBE TABLE logical and physical plans for data source v2 tables.
How was this patch tested?
Added unit tests to DataSourceV2SQLSuite.
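For illustration, a test in DataSourceV2SQLSuite for this feature might take roughly the following shape; the catalog name, table name, provider, and assertion are made up here and this is not the actual test code (it also assumes a suite where spark is available and a v2 catalog named "testcat" is configured in the session conf).

test("DescribeTable using v2 catalog") {
  spark.sql("CREATE TABLE testcat.table_name (id bigint, data string) USING foo")
  val description = spark.sql("DESCRIBE TABLE testcat.table_name").collect()
  // Expect one output row per column, with col_name / data_type / comment fields.
  assert(description.map(_.getString(0)).contains("id"))
}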