
Conversation

@gengliangwang (Member) commented Jun 3, 2019

What changes were proposed in this pull request?

In #24068, @IvanVergiliev reports that OrcFilters.createBuilder has exponential complexity in the height of the filter tree, due to the way the check-and-build pattern is implemented: createBuilder is called twice recursively for each child of an And/Or/Not node, once to check convertibility and once to build, so the work doubles at every level of the tree (see the description in #24068 for details).

Compared to the approach in #24068, I propose a much simpler solution: rely on the result of convertibleFilters, which builds a fully convertible tree. With it, createBuilder no longer needs to worry about whether the children of a given node are convertible.
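For illustration, here is a minimal sketch of the problematic pattern, using toy Filter/And/Leaf types and a plain string in place of the real OrcFilters code and ORC's SearchArgument builder (And nodes only, for brevity):

sealed trait Filter
case class And(left: Filter, right: Filter) extends Filter
case class Leaf(name: String, convertible: Boolean = true) extends Filter

def createBuilder(f: Filter): Option[String] = f match {
  case And(left, right) =>
    // Check pass: recurse into both children to see if they convert.
    if (createBuilder(left).isDefined && createBuilder(right).isDefined) {
      // Build pass: recurse into the SAME children again. Every node
      // traverses its subtree twice, so the work doubles per level of
      // the tree: O(2^height) overall.
      Some(s"and(${createBuilder(left).get}, ${createBuilder(right).get})")
    } else {
      None
    }
  case Leaf(name, ok) => if (ok) Some(name) else None
}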

How was this patch tested?

Unit test

@gengliangwang (Member, Author) commented Jun 3, 2019

@SparkQA commented Jun 3, 2019

Test build #106119 has finished for PR 24783 at commit 341f7a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author) commented:

I have updated the benchmark results. This PR is ready for review.

@gengliangwang gengliangwang changed the title [WIP][SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion Jun 4, 2019
@IvanVergiliev (Contributor) commented:

It's true that this PR results in a smaller code change because it reuses the existing convertibleFilters function. However, it suffers from the same problems that I was trying to get away from in the other PR, which is why I ended up with the current state of the code.

Namely, I mentioned a few benefits of the "filter-and-build in the same case-match" approach in #24068 (comment):

@cloud-fan I took a stab at a slightly different approach to structuring the code in https://github.com/IvanVergiliev/spark/pull/2/files . The idea is to implement filtering and building in the same match expression, with an enum that tells us whether to perform a filter or a build operation. This has the following benefits:

  • All the logic for a given predicate is grouped in the same place. You don't have to scroll across the whole file to see what the filter action for an And is while you're looking at the build action.
  • You can't really add a new predicate to the set of filtered predicates without also defining a Build action for it, since that fails the exhaustiveness check on ActionType.

So, while I'm obviously biased, I still think that the code in the other PR results in a better end state for the implementation, despite the change being a bit larger. It also does exactly what your PR does, but it's structured in a different way (which I think has the benefits I mentioned above).
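For illustration, a rough sketch of the single-match idea described above, reusing the toy Filter types from the earlier sketch; the names Action, Trim, Build, and process are illustrative stand-ins rather than the identifiers used in #24068:

sealed trait Action
case object Trim extends Action  // drop unconvertible subtrees
case object Build extends Action // emit the (stringified) search argument

// Left: the trimmed filter (None if nothing is convertible);
// Right: the built result. Build is only run on already-trimmed trees.
def process(filter: Filter, action: Action): Either[Option[Filter], String] =
  (filter, action) match {
    // Trim and Build for And sit next to each other, and a new predicate
    // handled for one action but not the other fails exhaustiveness checks.
    case (And(l, r), Trim) =>
      (process(l, Trim).left.get, process(r, Trim).left.get) match {
        case (Some(a), Some(b)) => Left(Some(And(a, b)))
        case (a, b)             => Left(a.orElse(b)) // And may keep one side
      }
    case (And(l, r), Build) =>
      Right(s"and(${process(l, Build).right.get}, ${process(r, Build).right.get})")
    case (leaf @ Leaf(_, ok), Trim) => Left(if (ok) Some(leaf) else None)
    case (Leaf(n, _), Build)        => Right(n)
  }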

@gengliangwang (Member, Author) commented Jun 4, 2019

@IvanVergiliev I think the code in this PR is much simpler and more readable.

PR #24068 introduces two ActionTypes, TrimUnconvertibleFilters and BuildSearchArgument, and handles both actions in one function:

  1. For the And/Or/Not nodes, the logic is complex to understand.
  2. For most of the leaf nodes, the convertible result is always Some(node), so we can abstract that away as this PR does.

This PR builds a fully convertible tree first, and then converts that tree to a SearchArgument very straightforwardly. Splitting the two procedures into two functions makes the logic cleaner. We can also see that the method convertibleFilters is quite short, because it reuses the leaf-node handling in the method createBuilder.
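As an illustration of the two-phase shape described above (again with the toy types from the first sketch, not the real Spark code):

// Phase 1: trim the tree down to a fully convertible one.
def convertibleFilters(f: Filter): Option[Filter] = f match {
  case And(left, right) =>
    // For And, pushing down just one convertible child is still sound.
    (convertibleFilters(left), convertibleFilters(right)) match {
      case (Some(l), Some(r)) => Some(And(l, r))
      case (l, r)             => l.orElse(r)
    }
  case leaf @ Leaf(_, ok) => if (ok) Some(leaf) else None
}

// Phase 2: runs only on the fully convertible tree from phase 1, so it
// never re-checks children.
def buildFullyConvertible(f: Filter): String = f match {
  case And(l, r) =>
    s"and(${buildFullyConvertible(l)}, ${buildFullyConvertible(r)})"
  case Leaf(n, _) => n
}

Because each phase visits every node exactly once, the conversion is linear in the size of the filter tree, which is what removes the exponential blow-up.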

With respect, this PR uses the benchmark from #24068, and it will be co-authored with you. I know a lot of work went into #24068, but I prefer the simpler implementation in this one.

saveAsTable(df, dir)
val benchmark =
  new Benchmark("Select data with filters", numRows, minNumIters = 5, output = output)
Seq(100, 500, 1000).foreach { numFilter =>
@gengliangwang (Member, Author) commented on the diff:
I tried with 5000 filters, and the execution becomes very slow. For end-to-end tests we need a smaller size here, compared to the "Convert filters to ORC filter" benchmark.

@SparkQA commented Jun 4, 2019

Test build #106145 has finished for PR 24783 at commit 66d012b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion [SPARK-27105][SQL][test-hadoop3.2] Optimize away exponential complexity in ORC predicate conversion Jun 6, 2019
@gengliangwang (Member, Author) commented:

retest this please.

@SparkQA commented Jun 6, 2019

Test build #106232 has finished for PR 24783 at commit 66d012b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

                                      Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
Parquet Vectorized                        10561 / 10565         1.5         671.4       1.0X
Parquet Vectorized (Pushdown)                 711 / 716        22.1          45.2      14.9X
Native ORC Vectorized                       6791 / 6806         2.3         431.8       1.6X
Native ORC Ve
A reviewer (Member) commented on the diff:
Create a separate file?

@gengliangwang (Member, Author) replied:
All the benchmark results of FilterPushdownBenchmark will be in this file, unless we move the new benchmarks into a separate microbenchmark.

@cloud-fan (Contributor) commented:

Theoretically #24068 has better performance because it builds the SearchArgument only once, but it seems that doesn't matter, as the performance difference should be very small. Since @IvanVergiliev has spent a lot of effort on #24068 and the PR itself is correct, how about we merge #24068 first and then send a follow-up PR to simplify it?

@gengliangwang (Member, Author) commented:

Since @IvanVergiliev has spent a lot of effort on #24068 and the PR itself is correct, how about we merge #24068 first and then send a follow-up PR to simplify it?

Sure, I am fine with that.

@dongjoon-hyun (Member) commented:

Hi, @gengliangwang. Are you going to use this PR for the follow-up after #24068?

@gengliangwang (Member, Author) commented:

@dongjoon-hyun Yes, I think so.
If it is OK, I am also fine with merging this one directly.

@IvanVergiliev (Contributor) commented:

@cloud-fan cool, this sounds good to me too! I can also bring my PR back to a state similar to before I merged https://github.com/IvanVergiliev/spark/pull/2/files (with filter and build in separate functions), and then @gengliangwang can follow up with the change to reuse build for determining whether leaf nodes are convertible?

@gengliangwang (Member, Author) commented:

I have created a new PR for this: #24910
