Description
Write the column indexes described in PARQUET-922.
This is the first phase of implementing the whole feature. The implementation is done in the following steps:
- Utility to read/write indexes in parquet-format
- Writing indexes in the parquet file
- Extend parquet-tools and parquet-cli to show the indexes
- Limit index size based on parquet properties
- Trim min/max values where possible based on parquet properties
- Filtering based on column indexes
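To illustrate the "trim min/max values" step, here is a minimal sketch in Python of how a binary min/max pair might be shortened while still bounding the original values. The function names `truncate_min` and `truncate_max` and the byte-wise strategy are assumptions made for illustration; they are not the parquet-mr implementation, which may also need to respect character encodings such as UTF-8.

```python
def truncate_min(value: bytes, length: int) -> bytes:
    """Hypothetical helper: a prefix never sorts after the original value,
    so it remains a valid lower bound."""
    return value[:length]


def truncate_max(value: bytes, length: int) -> bytes:
    """Hypothetical helper: return a short value that sorts at or after
    the original value, so it remains a valid upper bound."""
    if len(value) <= length:
        return value  # already short enough
    prefix = bytearray(value[:length])
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return value  # no shorter upper bound exists; keep the original
    prefix[-1] += 1
    return bytes(prefix)


original = b"parquet"
low = truncate_min(original, 3)
high = truncate_max(original, 3)
print(low, high)  # b'par' b'pas'
```

The truncated pair is less precise than the original, so a filter may read a few pages it could otherwise have skipped, but it never skips a page that holds a matching value.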
The work is done on the feature branch column-indexes. This JIRA will be resolved after the branch has been merged to master.
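The filtering step listed in the description can be sketched as follows: given per-page min/max values, a reader can skip pages whose value range cannot satisfy a predicate. This is a simplified model in Python, not the parquet-mr API. The `PageStats` class, its fields, and `pages_matching_range` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageStats:
    """Hypothetical per-page entry of a column index."""
    first_row: int             # index of the first row stored in the page
    row_count: int             # number of rows in the page
    min_value: Optional[int]   # None when the page contains only nulls
    max_value: Optional[int]


def pages_matching_range(pages: List[PageStats], low: int, high: int) -> List[int]:
    """Return the indexes of pages that may contain a value in [low, high]."""
    selected = []
    for i, page in enumerate(pages):
        if page.min_value is None or page.max_value is None:
            continue  # an all-null page cannot match a value predicate
        if page.max_value < low or page.min_value > high:
            continue  # the page's value range does not overlap the predicate
        selected.append(i)
    return selected


pages = [
    PageStats(0, 100, 1, 40),
    PageStats(100, 100, None, None),
    PageStats(200, 100, 35, 90),
    PageStats(300, 100, 95, 120),
]
print(pages_matching_range(pages, 38, 50))  # [0, 2]
```

A selected page is only a candidate: its rows still have to be checked against the predicate, because min/max values describe a range rather than the exact contents of the page.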
Reporter: Gabor Szadovszky / @gszadovszky
Assignee: Gabor Szadovszky / @gszadovszky
Subtasks:
- Column indexes: read/write API
- Column indexes: Show indexes in tools
- Column indexes: Limit index size
- Column indexes: Truncate min/max values
- Column indexes: Filtering
- Column Indexes: Invalid row indexes for pages starting with nulls
- Incorrect check for ASCENDING/DESCENDING at column index write path
- Fix issues of NaN and +-0.0 in case of float/double column indexes
- Improve value skipping at page synchronization
- appendRowGroup will loose pageIndex
Related issues:
- Don't write page level statistics (blocks)
- Write index page in parquet file (is duplicated by)
- Limit page size based on maximum row count (relates to)
- Make Spark SQL support Column indexes (relates to)
- Add index pages to the format to support efficient page skipping (depends upon)
- Improve logic when to write column indexes (is depended upon by)
- Benchmark filtering column-indexes (is depended upon by)
PRs and other links:
Note: This issue was originally created as PARQUET-1201. Please see the migration documentation for further details.