Skip to content

Conversation

@sunchao
Copy link
Member

@sunchao sunchao commented Jan 10, 2021

What changes were proposed in this pull request?

This passes original SQL text to CacheTableAsSelect command in DSv1 and v2 so that it will be stored instead of the analyzed logical plan, similar to CREATE VIEW command.

In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW is false (which is the default value).

Why are the changes needed?

Currently, after creating a temporary view with CACHE TABLE ... AS SELECT command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue.

For instance, in the following:

> CREATE TABLE t ...;
> CACHE TABLE v AS SELECT * FROM t;
> DROP TABLE t;
> SELECT * FROM v;

The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table t, but the temporary view v still exist.

On the other hand, the following:

> CREATE TABLE t ...;
> CREATE TEMPORARY VIEW v AS SELECT * FROM t;
> CACHE TABLE v;
> DROP TABLE t;
> SELECT * FROM v;

will throw "Table or view not found" error in the last select query.

This is related to #30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles CreateView case but not the CacheTableAsSelect case.

This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache.

Does this PR introduce any user-facing change?

Yes, now when SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated.

How was this patch tested?

Modified/Enhanced some existing tests.

@github-actions github-actions bot added the SQL label Jan 10, 2021
@sunchao sunchao changed the title [SPARK-34052][SQL] A cached view should become invalid after a table is dropped [SPARK-34052][SQL] A cached view should become invalid after the source table is dropped Jan 10, 2021
@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38469/

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38469/

@SparkQA
Copy link

SparkQA commented Jan 10, 2021

Test build #133880 has finished for PR 31107 at commit 69c81c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Could you take a look at the UT failure, @sunchao ?

@sunchao
Copy link
Member Author

sunchao commented Jan 11, 2021

@dongjoon-hyun yes I'm looking at some of the test failures. Will update this PR soon.

@sunchao
Copy link
Member Author

sunchao commented Jan 11, 2021

One of the test failure:

CachedTableSuite.CACHE TABLE tableName AS SELECT ...
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifier on unresolved object, tree: 'key

looks like a separate bug related to #30567, and I've opened SPARK-34076 to track it.

@SparkQA
Copy link

SparkQA commented Jan 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38619/

@SparkQA
Copy link

SparkQA commented Jan 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38619/

@SparkQA
Copy link

SparkQA commented Jan 14, 2021

Test build #134032 has finished for PR 31107 at commit bce8265.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Also, cc @MaxGekk

@SparkQA
Copy link

SparkQA commented Jan 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38651/

@SparkQA
Copy link

SparkQA commented Jan 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38651/

@sunchao sunchao marked this pull request as ready for review January 14, 2021 22:38
@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Test build #134065 has finished for PR 31107 at commit 380bbb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Copy link
Member Author

sunchao commented Jan 15, 2021

cc @cloud-fan @viirya @HyukjinKwon

@cloud-fan
Copy link
Contributor

thanks for catching this bug!

@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38680/

@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Test build #134094 has finished for PR 31107 at commit 120dd3d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38680/

@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38702/

catalog.getTempViewOrPermanentTableMetadata(tableName).viewText.isDefined
sparkSession.sharedState.cacheManager.uncacheQuery(
sparkSession.table(tableName), cascade = !isTempView)
sparkSession.table(tableName), cascade = !isTempView || hasViewText)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the PR title and description up-to-dated? Looks like the change in the PR is basically changing the uncaching behavior of view with view text.

Looks like we already did cascade uncache when dropping a source table here, so I'm wondering why the title is "A cached view should become invalid after the source table is dropped"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the PR title is not very good - I just updated it. I think the description is more aligned to what the PR does but I'll try to add more details there.

@sunchao sunchao changed the title [SPARK-34052][SQL] A cached view should become invalid after the source table is dropped [SPARK-34052][SQL] store SQL text for a temp view created using "CACHE TABLE .. AS SELECT" Jan 19, 2021
Comment on lines 944 to 945
// dropping a temp view trigger cache invalidation on dependents iff the config is
// turned on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is incorrect, right? It should be when the config is turned off.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I confused myself by considering "turned on" = "turned on storing SQL text"

@viirya
Copy link
Member

viirya commented Jan 19, 2021

CREATE TABLE t ...;
CACHE TABLE v AS SELECT * FROM t;
DROP TABLE t;
SELECT * FROM v;

BTW, for the example in the description, I think the t should be a view to show the issue?

@sunchao
Copy link
Member Author

sunchao commented Jan 19, 2021

I think it doesn't matter whether t is a table or view. The issue is v is stored as analyzed plan which doesn't get re-analyzed when t is dropped.

@viirya
Copy link
Member

viirya commented Jan 19, 2021

Hmm, but seems I don't see such example is added as a test?

@sunchao
Copy link
Member Author

sunchao commented Jan 19, 2021

I used to have a test here but removed it since I think it largely overlaps with some existing ones, but yeah I could add it back if you think that helps.

@viirya
Copy link
Member

viirya commented Jan 19, 2021

yeah, thanks, i think it is better to have it as it is used in the description as example.

@SparkQA
Copy link

SparkQA commented Jan 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38820/

@SparkQA
Copy link

SparkQA commented Jan 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38823/

@SparkQA
Copy link

SparkQA commented Jan 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38820/

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks.

@SparkQA
Copy link

SparkQA commented Jan 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38823/

@SparkQA
Copy link

SparkQA commented Jan 19, 2021

Test build #134235 has finished for PR 31107 at commit 12cd598.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 20, 2021

Test build #134237 has finished for PR 31107 at commit 40ef4b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Didn't take a close look too but sounds making sense.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 902a08b Jan 20, 2021
@cloud-fan
Copy link
Contributor

@sunchao can you open a backport PR for 3.1 since the new SQL temp view semantic is already there? Thanks!

@sunchao
Copy link
Member Author

sunchao commented Jan 20, 2021

Thanks for the review!

@sunchao can you open a backport PR for 3.1 since the new SQL temp view semantic is already there? Thanks!

I got lots of conflicts when backporting it - @imback82 is there any plan to backport SPARK-33654 to branch-3.1? I see its affected version is 3.1.0.

@cloud-fan
Copy link
Contributor

Ah 3.1 doesn't have CacheTableAsSelect. We need to re-implement this fix with the old CacheTableCommand.

@sunchao
Copy link
Member Author

sunchao commented Jan 20, 2021

I see. Let me implement it separately for branch 3.1.

sunchao added a commit to sunchao/spark that referenced this pull request Jan 22, 2021
…E TABLE .. AS SELECT"

This passes original SQL text to `CacheTableAsSelect` command in DSv1 and v2 so that it will be stored instead of the analyzed logical plan, similar to `CREATE VIEW` command.

In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value).

Currently, after creating a temporary view with `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue.

For instance, in the following:
```sql
> CREATE TABLE t ...;
> CACHE TABLE v AS SELECT * FROM t;
> DROP TABLE t;
> SELECT * FROM v;
```
The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exist.

On the other hand, the following:
```sql
> CREATE TABLE t ...;
> CREATE TEMPORARY VIEW v AS SELECT * FROM t;
> CACHE TABLE v;
> DROP TABLE t;
> SELECT * FROM v;
```
will throw "Table or view not found" error in the last select query.

This is related to apache#30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles `CreateView` case but not the `CacheTableAsSelect` case.

This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache.

Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated.

Modified/Enhanced some existing tests.

Closes apache#31107 from sunchao/SPARK-34052.

Lead-authored-by: Chao Sun <[email protected]>
Co-authored-by: Chao Sun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jan 24, 2021
…"CACHE TABLE .. AS SELECT ..."

This is a backport of #31107 to branch-3.1.

### What changes were proposed in this pull request?

This passes original SQL text to `CacheTableCommand` command in DSv1 so that it will be stored instead of the analyzed logical plan, similar to `CREATE VIEW` command.

In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value).

### Why are the changes needed?

Currently, after creating a temporary view with `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue.

For instance, in the following:
```sql
> CREATE TABLE t ...;
> CACHE TABLE v AS SELECT * FROM t;
> DROP TABLE t;
> SELECT * FROM v;
```
The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exist.

On the other hand, the following:
```sql
> CREATE TABLE t ...;
> CREATE TEMPORARY VIEW v AS SELECT * FROM t;
> CACHE TABLE v;
> DROP TABLE t;
> SELECT * FROM v;
```
will throw "Table or view not found" error in the last select query.

This is related to #30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles `CreateView` case but not the `CacheTableAsSelect` case.

This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache.

### Does this PR introduce _any_ user-facing change?

Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated.

### How was this patch tested?

Added new test cases. Also modified and enhanced some existing related tests.

Closes #31300 from sunchao/SPARK-34052-branch-3.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…E TABLE .. AS SELECT"

### What changes were proposed in this pull request?

This passes original SQL text to `CacheTableAsSelect` command in DSv1 and v2 so that it will be stored instead of the analyzed logical plan, similar to `CREATE VIEW` command.

In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value).

### Why are the changes needed?

Currently, after creating a temporary view with `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue.

For instance, in the following:
```sql
> CREATE TABLE t ...;
> CACHE TABLE v AS SELECT * FROM t;
> DROP TABLE t;
> SELECT * FROM v;
```
The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exist.

On the other hand, the following:
```sql
> CREATE TABLE t ...;
> CREATE TEMPORARY VIEW v AS SELECT * FROM t;
> CACHE TABLE v;
> DROP TABLE t;
> SELECT * FROM v;
```
will throw "Table or view not found" error in the last select query.

This is related to apache#30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles `CreateView` case but not the `CacheTableAsSelect` case.

This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache.

### Does this PR introduce _any_ user-facing change?

Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated.

### How was this patch tested?

Modified/Enhanced some existing tests.

Closes apache#31107 from sunchao/SPARK-34052.

Lead-authored-by: Chao Sun <[email protected]>
Co-authored-by: Chao Sun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
viirya pushed a commit that referenced this pull request Feb 10, 2021
…ade for temp views

### What changes were proposed in this pull request?

This PR includes the following changes:
1. in `CatalogImpl.uncacheTable`, invalidate caches in cascade when the target table is
 a temp view, and `spark.sql.legacy.storeAnalyzedPlanForView` is false (default value).
2. make `SessionCatalog.lookupTempView` public and return processed temp view plan (i.e., with `View` op).

### Why are the changes needed?

Following [SPARK-34052](https://issues.apache.org/jira/browse/SPARK-34052) (#31107), we should invalidate in cascade for `CatalogImpl.uncacheTable` when the table is a temp view, so that the behavior is consistent.

### Does this PR introduce _any_ user-facing change?

Yes, now `SQLContext.uncacheTable` will drop temp view in cascade by default.

### How was this patch tested?

Added a UT

Closes #31462 from sunchao/SPARK-34347.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants