Improve logic when to write column indexes

Currently, we always write column indexes. When the data is ordered (ASCENDING or DESCENDING), filtering benefits greatly from column indexes. When the data is UNORDERED, however, it is not obvious whether filtering based on column indexes makes sense. For example, if the data is random, the min/max values of the different pages are likely to be close to each other (every page spans roughly the same range), so in most cases filtering on these values would not drop any pages. On the other hand, UNORDERED does not mean that the values are random: they may be clustered or semi-ordered. We should detect these cases before writing the column indexes, and write them only if the min/max ranges of the pages do not overlap too much.

Another simple case is when we have only one page. In this case writing column indexes is useless, because there is no other page to skip.
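The two checks described above could be sketched roughly as follows. This is only an illustration, not parquet-java code: the class `ColumnIndexHeuristic`, its methods, the use of `long` min/max arrays, and the `maxOverlapRatio` threshold are all hypothetical, and the pairwise overlap ratio is just one possible way to measure "overlapping too much".

```java
// Hypothetical sketch; none of these names exist in parquet-java.
final class ColumnIndexHeuristic {
    private ColumnIndexHeuristic() {}

    /**
     * Returns the fraction (0.0 to 1.0) of page pairs whose inclusive
     * [min, max] ranges intersect. Sorted data yields a value near 0,
     * random data a value near 1.
     */
    static double overlapRatio(long[] mins, long[] maxs) {
        if (mins.length != maxs.length) {
            throw new IllegalArgumentException("mins and maxs must have the same length");
        }
        int n = mins.length;
        if (n < 2) {
            return 0.0; // no pairs to compare
        }
        long overlapping = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // Two closed intervals intersect iff each starts before the other ends.
                if (mins[i] <= maxs[j] && mins[j] <= maxs[i]) {
                    overlapping++;
                }
            }
        }
        long pairs = (long) n * (n - 1) / 2;
        return (double) overlapping / pairs;
    }

    /**
     * Decides whether a column index is worth writing: never for a single
     * page, otherwise only if the overlap ratio stays at or below the
     * (assumed, tunable) threshold.
     */
    static boolean shouldWriteColumnIndex(long[] mins, long[] maxs, double maxOverlapRatio) {
        if (mins.length <= 1) {
            return false; // a single page can never be skipped
        }
        return overlapRatio(mins, maxs) <= maxOverlapRatio;
    }
}
```

The pairwise comparison is quadratic in the number of pages, which is probably acceptable for a sketch but might need a cheaper estimate in a real writer. A suitable threshold would have to come from benchmarking rather than from this example.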

**Reporter**: [Gabor Szadovszky](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gszadovszky) / @gszadovszky
**Assignee**: [Gabor Szadovszky](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gszadovszky) / @gszadovszky
#### Related issues:
- [Column indexes](https://github.com/apache/parquet-java/issues/2123) (depends upon)
- [Benchmark filtering column-indexes](https://github.com/apache/parquet-java/issues/2235) (depends upon)

<sub>**Note**: *This issue was originally created as [PARQUET-1415](https://issues.apache.org/jira/browse/PARQUET-1415). Please see the [migration documentation](https://issues.apache.org/jira/browse/PARQUET-2502) for further details.*</sub>
