
Conversation

@gengliangwang (Member) commented Jun 3, 2019

What changes were proposed in this pull request?

In #24068, @IvanVergiliev reports that OrcFilters.createBuilder has exponential complexity in the height of the filter tree, due to the way the check-and-build pattern is implemented: createBuilder is called twice recursively for each child of an And/Or/Not node, once to check convertibility and once to build, so the work doubles at every level of the tree (see the description in #24068 for details).

Compared to the approach in #24068, I propose a much simpler solution: rely on the result of convertibleFilters, which builds a fully convertible tree. With it, createBuilder no longer needs to worry about whether the children of a given node are convertible.
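For illustration, here is a minimal sketch of the problematic pattern, using toy Filter/And/Leaf types and a plain string in place of the real OrcFilters code and ORC's SearchArgument builder (And nodes only, for brevity):

sealed trait Filter
case class And(left: Filter, right: Filter) extends Filter
case class Leaf(name: String, convertible: Boolean = true) extends Filter

def createBuilder(f: Filter): Option[String] = f match {
  case And(left, right) =>
    // Check pass: recurse into both children to see if they convert.
    if (createBuilder(left).isDefined && createBuilder(right).isDefined) {
      // Build pass: recurse into the SAME children again. Every node
      // traverses its subtree twice, so the work doubles per level of
      // the tree: O(2^height) overall.
      Some(s"and(${createBuilder(left).get}, ${createBuilder(right).get})")
    } else {
      None
    }
  case Leaf(name, ok) => if (ok) Some(name) else None
}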

How was this patch tested?

Unit test

@gengliangwang (Member, Author) commented Jun 3, 2019

@SparkQA commented Jun 3, 2019

Test build #106119 has finished for PR 24783 at commit 341f7a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author) commented:

I have updated the benchmark results. This PR is ready for review.

@gengliangwang gengliangwang changed the title [WIP][SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion Jun 4, 2019
@IvanVergiliev (Contributor) commented:

It's true that this PR results in a smaller code change because it reuses the existing convertibleFilters function. However, it suffers from the same problems that I was trying to get away from in the other PR, which is why I ended up with the current state of the code.

Namely, I mentioned a few benefits of the "filter-and-build in the same case-match" approach in #24068 (comment):

@cloud-fan I took a stab at a slightly different approach to structuring the code in https://github.com/IvanVergiliev/spark/pull/2/files . The idea is to implement filtering and building in the same match expression, with an enum that tells us whether to perform a filter or a build operation. This has the following benefits:

  • All the logic for a given predicate is grouped in the same place. You don't have to scroll across the whole file to see what the filter action for an And is while you're looking at the build action.
  • You can't really add a new predicate to the set of filtered predicates without also defining a Build action for it, since that fails the exhaustiveness check on ActionType.

So, while I'm obviously biased, I still think that the code in the other PR results in a better end state for the implementation, despite the change being a bit larger. It also does exactly what your PR does, but it's structured in a different way (which I think has the benefits I mentioned above).
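For illustration, a rough sketch of the single-match idea described above, reusing the toy Filter types from the earlier sketch; the names Action, Trim, Build, and process are illustrative stand-ins rather than the identifiers used in #24068:

sealed trait Action
case object Trim extends Action  // drop unconvertible subtrees
case object Build extends Action // emit the (stringified) search argument

// Left: the trimmed filter (None if nothing is convertible);
// Right: the built result. Build is only run on already-trimmed trees.
def process(filter: Filter, action: Action): Either[Option[Filter], String] =
  (filter, action) match {
    // Trim and Build for And sit next to each other, and a new predicate
    // handled for one action but not the other fails exhaustiveness checks.
    case (And(l, r), Trim) =>
      (process(l, Trim).left.get, process(r, Trim).left.get) match {
        case (Some(a), Some(b)) => Left(Some(And(a, b)))
        case (a, b)             => Left(a.orElse(b)) // And may keep one side
      }
    case (And(l, r), Build) =>
      Right(s"and(${process(l, Build).right.get}, ${process(r, Build).right.get})")
    case (leaf @ Leaf(_, ok), Trim) => Left(if (ok) Some(leaf) else None)
    case (Leaf(n, _), Build)        => Right(n)
  }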

@gengliangwang (Member, Author) commented Jun 4, 2019

@IvanVergiliev I think the code in this PR is much simpler and more readable.

PR #24068 introduces two ActionTypes, TrimUnconvertibleFilters and BuildSearchArgument, and handles both actions in one function:

  1. For the And/Or/Not nodes, the logic is complex to understand.
  2. For most of the leaf nodes, the convertible result is always Some(node), so we can abstract that away as this PR does.

This PR builds a fully convertible tree first, and then converts that tree to a SearchArgument very straightforwardly. Splitting the two procedures into two functions makes the logic cleaner. We can also see that the method convertibleFilters is quite short, because it reuses the leaf-node handling in the method createBuilder.
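As an illustration of the two-phase shape described above (again with the toy types from the first sketch, not the real Spark code):

// Phase 1: trim the tree down to a fully convertible one.
def convertibleFilters(f: Filter): Option[Filter] = f match {
  case And(left, right) =>
    // For And, pushing down just one convertible child is still sound.
    (convertibleFilters(left), convertibleFilters(right)) match {
      case (Some(l), Some(r)) => Some(And(l, r))
      case (l, r)             => l.orElse(r)
    }
  case leaf @ Leaf(_, ok) => if (ok) Some(leaf) else None
}

// Phase 2: runs only on the fully convertible tree from phase 1, so it
// never re-checks children.
def buildFullyConvertible(f: Filter): String = f match {
  case And(l, r) =>
    s"and(${buildFullyConvertible(l)}, ${buildFullyConvertible(r)})"
  case Leaf(n, _) => n
}

Because each phase visits every node exactly once, the conversion is linear in the size of the filter tree, which is what removes the exponential blow-up.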

With respect, this PR uses the benchmark from #24068, and it will be co-authored with you. I know a lot of work went into #24068, but I prefer the simpler implementation in this one.

saveAsTable(df, dir)
val benchmark =
  new Benchmark("Select data with filters", numRows, minNumIters = 5, output = output)
Seq(100, 500, 1000).foreach { numFilter =>
@gengliangwang (Member, Author) commented on the diff:
I tried with 5000 filters, and the execution becomes very slow. For end-to-end tests we need a smaller size here, compared to the "Convert filters to ORC filter" benchmark.

@SparkQA commented Jun 4, 2019

Test build #106145 has finished for PR 24783 at commit 66d012b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion [SPARK-27105][SQL][test-hadoop3.2] Optimize away exponential complexity in ORC predicate conversion Jun 6, 2019
@gengliangwang (Member, Author) commented:

retest this please.

@SparkQA commented Jun 6, 2019

Test build #106232 has finished for PR 24783 at commit 66d012b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

                                      Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
Parquet Vectorized                        10561 / 10565         1.5         671.4       1.0X
Parquet Vectorized (Pushdown)                 711 / 716        22.1          45.2      14.9X
Native ORC Vectorized                       6791 / 6806         2.3         431.8       1.6X
Native ORC Ve
A reviewer (Member) commented on the diff:
Create a separate file?

@gengliangwang (Member, Author) replied:
All the benchmark results of FilterPushdownBenchmark will be in this file, unless we move the new benchmarks into a separate microbenchmark.

@cloud-fan (Contributor) commented:

Theoretically #24068 has better performance because it builds the SearchArgument only once, but it seems that doesn't matter, as the performance difference should be very small. Since @IvanVergiliev has spent a lot of effort on #24068 and the PR itself is correct, how about we merge #24068 first and then send a follow-up PR to simplify it?

@gengliangwang (Member, Author) commented:

Since @IvanVergiliev has spent a lot of effort on #24068 and the PR itself is correct, how about we merge #24068 first and then send a follow-up PR to simplify it?

Sure, I am fine with that.

@dongjoon-hyun (Member) commented:

Hi, @gengliangwang. Are you going to use this PR for the follow-up after #24068?

@gengliangwang (Member, Author) commented:

@dongjoon-hyun Yes, I think so.
If it is OK, I am also fine with merging this one directly.

@IvanVergiliev (Contributor) commented:

@cloud-fan cool, this sounds good to me too! I can also bring my PR back to a state similar to before I merged https://github.com/IvanVergiliev/spark/pull/2/files (with filter and build in separate functions), and then @gengliangwang can follow up with the change to reuse build for determining whether leaf nodes are convertible?

@gengliangwang (Member, Author) commented:

I have created a new PR for this: #24910
