Column Indexes: Invalid row indexes for pages starting with nulls #1527

@asfimport

The current implementation for managing row indexes for the pages is not reliable. There is logic in MessageColumnIO which caches null values and flushes them just before opening a new group. This logic might cause pages to start with these cached nulls, which are not correctly counted in the written rows, so the rowIndexes are incorrect. It does not cause any issues if all the pages are read continuously, but it is a huge problem for column index based filtering, which relies on the row index of each page to skip pages.
The implementation described above is really complicated and I would not like to redesign it because of the mentioned issue. It is easier to simply count the 0 repetition levels as record boundaries at the column writer level.
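
To illustrate the proposed approach, here is a minimal sketch of counting repetition level 0 as a record boundary at the column writer level. It is not the parquet-mr implementation: the class `RowIndexTracker` and its methods are hypothetical names invented for this example, and it assumes that every page is closed on a record boundary (that is, the next value written after `endPage()` has repetition level 0).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of parquet-mr. */
final class RowIndexTracker {
    // Number of records started so far in this column chunk.
    private long rowsSeen = 0;
    // Row index of the first record in the page currently being written.
    private long pageFirstRow = 0;
    // Number of values (including nulls) written to the current page.
    private int valuesInPage = 0;
    // First row index recorded for each finished page.
    private final List<Long> firstRowIndexes = new ArrayList<>();

    /** Call once for every value or null written to the column. */
    void write(int repetitionLevel) {
        if (repetitionLevel == 0) {
            // A repetition level of 0 always marks the start of a new record,
            // regardless of whether the value itself is null.
            rowsSeen++;
        }
        valuesInPage++;
    }

    /** Closes the current page; assumed to be called on a record boundary. */
    void endPage() {
        if (valuesInPage == 0) {
            return; // nothing was written, so there is no page to record
        }
        firstRowIndexes.add(pageFirstRow);
        pageFirstRow = rowsSeen; // the next page starts at the next record
        valuesInPage = 0;
    }

    List<Long> firstRowIndexes() {
        return firstRowIndexes;
    }
}
```

Because the count is driven only by the repetition levels that actually reach the column writer, nulls that were cached upstream and flushed late are still attributed to the page they are written into, so a page that starts with nulls gets the row index of its first null record.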

Reporter: Gabor Szadovszky / @gszadovszky
Assignee: Gabor Szadovszky / @gszadovszky

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-1364. Please see the migration documentation for further details.
