Conversation

@mccheah (Contributor) commented Jul 3, 2019

What changes were proposed in this pull request?

Implements the DESCRIBE TABLE logical and physical plans for data source v2 tables.

How was this patch tested?

Added unit tests to DataSourceV2SQLSuite.
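
For context, a minimal sketch of what the new plans enable, assuming a v2 catalog registered the way DataSourceV2SQLSuite does (the catalog name, table, and catalogImplClassName below are illustrative, not from this PR):

  // Hypothetical setup: any v2 TableCatalog implementation registered under
  // the illustrative name `testcat`.
  spark.conf.set("spark.sql.catalog.testcat", catalogImplClassName)
  spark.sql("CREATE TABLE testcat.db.t (id BIGINT, data STRING) USING foo")

  // With this change, DESCRIBE TABLE on a v2 identifier is planned through the
  // new DescribeTable logical plan and DescribeTableExec physical plan.
  spark.sql("DESCRIBE TABLE testcat.db.t").show()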

@SparkQA commented Jul 3, 2019

Test build #107145 has finished for PR 25040 at commit d998c96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DescribeTable(
  • case class DescribeColumnStatement(
  • case class DescribeTableStatement(
  • case class DescribeTableExec(

@SparkQA commented Jul 3, 2019

Test build #107193 has finished for PR 25040 at commit 596832b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 4, 2019

Test build #107204 has finished for PR 25040 at commit bdae301.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


case DescribeColumnStatement(
    CatalogObjectIdentifier(Some(catalog), ident), colName, isExtended) =>
  throw new AnalysisException("Describing columns is not supported for v2 tables.")
Contributor:

Should this be supported eventually, or is it redundant if DESCRIBE TABLE is available?

Contributor Author:

I think we need to support it eventually, if only to keep parity with V1 tables.

}
DescribeTable(catalog.asTableCatalog, ident, isExtended)

case DropTableStatement(AsTableIdentifier(tableName), ifExists, purge) =>
Contributor:

This was missing? Are there no tests for DROP TABLE?

Contributor Author:

Oh, this might have been a bad copy-paste artifact. Even if the rule is missing, fixing that doesn't belong in this PR.

}

private def toCatalystRow(strs: String*): InternalRow = {
  val encoder = RowEncoder(DescribeTableSchemas.DESCRIBE_TABLE_SCHEMA).resolveAndBind()
Contributor:

Minor: I'd rather not create the encoder each time a row is created. Can you move this and the method to a companion object?

Contributor Author:

Sort of. A couple of questions:

  • Is RowEncoder thread-safe?
  • I noticed that if I create the RowEncoder, immediately resolveAndBind it, and then reuse the resolved encoder, the tests break: the describe returns incorrect rows. Presumably there's some kind of memory-reuse issue here. I didn't look into it thoroughly; I think we can just share the unresolved encoder and resolveAndBind before creating each row.
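
A minimal sketch of that pattern, assuming the DescribeTableSchemas constant from this PR (toRow is the ExpressionEncoder API of this era; names are illustrative):

  // Share the cheap, unresolved encoder across calls; resolveAndBind per row
  // so each call gets a fresh bound projection and a fresh output buffer.
  private val encoder = RowEncoder(DescribeTableSchemas.DESCRIBE_TABLE_SCHEMA)

  private def toCatalystRow(strs: String*): InternalRow =
    encoder.resolveAndBind().toRow(Row(strs: _*))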

Contributor:

@cloud-fan, can you help answer these questions?

throw new ParseException("DESC TABLE COLUMN for a specific partition is not supported", ctx)
} else {
  DescribeColumnCommand(
    visitTableIdentifier(ctx.tableIdentifier),
Contributor:

Now that these rules create DescribeColumnStatement and DescribeTableStatement, they should be moved into Catalyst. There isn't anything specific to the implementation any more.

Contributor Author:

Do you mean that DescribeColumnCommand and DescribeTableCommand should be moved to Catalyst? The V1 commands depend on a bunch of stuff that's in core, such as SparkSession and DDLUtils.

@rdblue (Contributor) commented Jul 6, 2019

Overall this looks good, but it doesn't move the parser rules to Catalyst. We've been trying to move as much as we can to Catalyst, to keep the parser and the SQL implementation separate instead of keeping them mixed together. That has also required moving the parser tests to Catalyst and moving the SparkSqlParser tests to a suite that tests parsing and conversion to v1 plans.

@mccheah (Contributor Author) commented Jul 10, 2019

I moved the parser rules and created a helper object for the encoder.


test("SPARK-17328 Fix NPE with EXPLAIN DESCRIBE TABLE") {
  assertEqual("describe t",
    DescribeTableCommand(TableIdentifier("t"), Map.empty, isExtended = false))
Contributor:

These test cases should be moved into catalyst as well.

@rdblue (Contributor) commented Jul 10, 2019

+1

One minor comment, but otherwise this looks good to me.

@SparkQA commented Jul 10, 2019

Test build #107484 has finished for PR 25040 at commit eb5c843.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 11, 2019

Test build #107490 has finished for PR 25040 at commit 25ea40b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 11, 2019

Test build #107546 has finished for PR 25040 at commit a947006.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

private[sql] object DescribeTableSchemas {
  val DESCRIBE_TABLE_ATTRIBUTES = Seq(
@cloud-fan (Contributor), Jul 12, 2019:

We shouldn't define attributes in an object. An AttributeReference is assigned a unique expression ID when created, and in general we should create new attributes when creating a new logical plan.

For example, if you do df1 = sql("desc table t1"); df2 = sql("desc table t2"), then df1.join(df2) would fail.

Contributor:

Can you join the results of DESCRIBE?

Contributor:

scala> val df1 = sql("desc t1")
df1: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]

scala> val df2 = sql("desc t2")
df2: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]

scala> df1.crossJoin(df2).show
+--------+---------+-------+--------+---------+-------+
|col_name|data_type|comment|col_name|data_type|comment|
+--------+---------+-------+--------+---------+-------+
|       i|      int|   null|       j|      int|   null|
+--------+---------+-------+--------+---------+-------+

This is not a common use case but we don't have to break it.

Contributor Author:

Changed this from a value to a method, so it generates new expression IDs on each call while still being shared among multiple contexts.
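
A minimal sketch of that change (the method name and attribute details are illustrative; the PR's actual metadata is omitted):

  // A def rather than a val: each call creates fresh AttributeReferences with
  // new expression IDs, so two DESCRIBE plans never share attribute IDs.
  def describeTableAttributes(): Seq[AttributeReference] = Seq(
    AttributeReference("col_name", StringType, nullable = false)(),
    AttributeReference("data_type", StringType, nullable = false)(),
    AttributeReference("comment", StringType, nullable = true)())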

    catalog: TableCatalog,
    ident: Identifier,
    isExtended: Boolean) extends Command {
  override lazy val output = DescribeTableSchemas.DESCRIBE_TABLE_ATTRIBUTES
Contributor:

We don't need a lazy val here, as it's not a heavy computation.

    ident: Identifier,
    isExtended: Boolean) extends Command {
  override lazy val output = DescribeTableSchemas.DESCRIBE_TABLE_ATTRIBUTES
  override lazy val schema = DescribeTableSchemas.DESCRIBE_TABLE_SCHEMA
Contributor:

by default schema is StructType.fromAttributes(output), so we don't need to override it.

)

  val DESCRIBE_TABLE_SCHEMA = StructType(
    DESCRIBE_TABLE_ATTRIBUTES.map(attr => StructField(attr.name, attr.dataType, attr.nullable)))
Contributor:

nit: StructType.fromAttributes(DESCRIBE_TABLE_ATTRIBUTES)
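
Taken together, these three comments suggest a shape like the following sketch (the field list is from the PR; the body is illustrative):

  case class DescribeTable(
      catalog: TableCatalog,
      ident: Identifier,
      isExtended: Boolean) extends Command {
    // Plain val: building the attributes is cheap, and since schema defaults
    // to StructType.fromAttributes(output), no schema override is needed.
    override val output: Seq[Attribute] = DescribeTableSchemas.describeTableAttributes()
  }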

}

} else {
  rows += toCatalystRow(s"Table $ident does not exist.", "", "")
Contributor:

Shouldn't we throw an exception when the table is not found?

Contributor:

I think we can follow #24937: DescribeTable should contain an UnresolvedRelation, so that the analyzer can check table existence for us.

Contributor Author:

Followed the AlterTable approach in the latest commit.

Contributor Author:

Ah, but I didn't remove this, though technically we should never hit this code path. We can throw an exception here instead.
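
A sketch of the AlterTable-style shape referenced above (the field layout is hypothetical, matching the four-field pattern the planner matches on later in this thread):

  // The plan carries an analyzer-resolved child, so a missing table fails
  // analysis with a proper error instead of ever reaching the exec node.
  case class DescribeTable(
      catalog: TableCatalog,
      ident: Identifier,
      table: LogicalPlan,
      isExtended: Boolean) extends Command {
    override def children: Seq[LogicalPlan] = Seq(table)
  }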

private val EMPTY_ROW = toCatalystRow("", "", "")

private def toCatalystRow(strs: String*): InternalRow = {
  ENCODER.resolveAndBind().toRow(
Contributor:

The encoder only needs resolveAndBind to be called once.

Contributor Author:

I don't necessarily think so, but it could also be how this class is built. I think the encoder's state needs to be reset: when I don't resolveAndBind every time, the tests yield entirely wrong results.

Contributor Author:

Fixed it: we have to copy the rows generated by the encoder, since the encoder reuses the same memory buffer.
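
A minimal sketch of that fix, assuming the ENCODER helper from this PR: toRow writes into a buffer shared across calls, so each row must be copied before the next call overwrites it.

  // resolveAndBind once is now safe, because every emitted row is copied out
  // of the encoder's shared buffer.
  private val resolvedEncoder = ENCODER.resolveAndBind()

  private def toCatalystRow(strs: String*): InternalRow =
    resolvedEncoder.toRow(Row(strs: _*)).copy()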

}

case DescribeTable(catalog, ident, _, isExtended) =>
  DescribeTableExec(catalog, ident, isExtended) :: Nil
Contributor:

how about

case DescribeTable(catalog, ident, r: DataSourceV2Relation, isExtended) =>
  DescribeTableExec(r.table, ident, isExtended) :: Nil

Then we don't need to look up the table again in DescribeTableExec.

@SparkQA commented Jul 24, 2019

Test build #108063 has finished for PR 25040 at commit c396d32.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 24, 2019

Test build #108064 has finished for PR 25040 at commit 18d688f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

private[sql] object DescribeTableSchemas {
Contributor:

Singular, DescribeTableSchema?

import org.apache.spark.sql.catalyst.catalog.BucketSpec
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.plans.logical.sql.{AlterTableAddColumnsStatement, AlterTableAlterColumnStatement, AlterTableDropColumnsStatement, AlterTableRenameColumnStatement, AlterTableSetLocationStatement, AlterTableSetPropertiesStatement, AlterTableUnsetPropertiesStatement, AlterViewSetPropertiesStatement, AlterViewUnsetPropertiesStatement, CreateTableAsSelectStatement, CreateTableStatement, DropTableStatement, DropViewStatement, InsertIntoStatement, QualifiedColType, ReplaceTableAsSelectStatement, ReplaceTableStatement}
import org.apache.spark.sql.catalyst.plans.logical.sql.{AlterTableAddColumnsStatement, AlterTableAlterColumnStatement, AlterTableDropColumnsStatement, AlterTableRenameColumnStatement, AlterTableSetLocationStatement, AlterTableSetPropertiesStatement, AlterTableUnsetPropertiesStatement, AlterViewSetPropertiesStatement, AlterViewUnsetPropertiesStatement, CreateTableAsSelectStatement, CreateTableStatement, DescribeColumnStatement, DescribeTableStatement, DropTableStatement, DropViewStatement, InsertIntoStatement, QualifiedColType, ReplaceTableAsSelectStatement, ReplaceTableStatement}
Contributor Author:

These import lists are getting really long, and in particular the merge conflicts are tedious to resolve. I'm normally very averse to wildcard imports, but there might come a point where we'll have to use them. Or perhaps we could have a helper object that bundles all of these, or factory methods for them, or matchers... somehow.

Contributor:

+1 for wildcard here.

Contributor:

This package may make sense for a wildcard import because it has no sub-packages and is unlikely to in the future. Still, because Scala will import sub-packages, I think it's probably best to keep avoiding wildcard imports, even here.

@mccheah (Contributor Author) commented Jul 30, 2019

Everything that's been brought up has been taken care of. Let me know if there's anything else to address.

    override val properties: util.Map[String, String])
  extends Table with SupportsRead with SupportsWrite {

  def this(
Contributor:

where do we use this new constructor?

@cloud-fan (Contributor) left a review comment:

LGTM except a few minor comments

@SparkQA commented Jul 30, 2019

Test build #108358 has finished for PR 25040 at commit 2b77c80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2019

Test build #108362 has finished for PR 25040 at commit 9e04c6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mccheah (Contributor Author) commented Aug 6, 2019

@cloud-fan is this good to merge?

@mccheah (Contributor Author) commented Aug 6, 2019

Ah, sorry, I missed a few things; I'll address them.

@SparkQA commented Aug 6, 2019

Test build #108729 has finished for PR 25040 at commit cff78a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:
thanks, merging to master!

@cloud-fan closed this in 44e607e on Aug 7, 2019