[Parquet] PushDecoder: Add a peek API to support pre-fetching #8668

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Unlike streams of JSON / CSV, the data that the parquet reader needs next is not easy to predict, as it depends on the filters, the row groups, which columns are requested, etc.

Now that we have the initial PushDecoder in this PR, we will be in a position to add an API for the decoder to communicate what data will be needed next.

Describe the solution you'd like
I would like an API that gives users of the Parquet decoder more fine-grained control over peeking at the data the decoder will need next.

Describe alternatives you've considered

Here is an idea from @adriangb on https://github.com/apache/arrow-rs/pull/7997/files#r2444922393

a method along the lines of try_peek()? It'd be cool if it returned some structure that allowed fine grained control of the peeking:

let max_ranges = 32;
let max_bytes = 1024 * 1024 * 32;
let mut current_bytes = 0;
let mut ranges = Vec::new();
let mut peek = decoder.peek();
loop {
    match peek.next() {
        PeekResult::Range(range) => {
            // count the bytes before `range` is moved into the Vec
            current_bytes += range.end - range.start;
            ranges.push(range);
            if ranges.len() > max_ranges { break }
            if current_bytes > max_bytes { break }
        }
        PeekResult::End => break,
    }
}
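
To make the budgeted loop above concrete, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: `PeekResult` follows the shape suggested in the comment, `MockPeek` is a hypothetical stand-in that replays a fixed list of ranges, and `collect_prefetch_ranges` is a made-up helper name. None of these exist in the parquet crate today. Unlike the loop above, this version treats `max_ranges` as a hard cap, and it stops once the byte budget has been reached (the last range may overshoot it).

```rust
use std::ops::Range;

/// Hypothetical result type, following the `PeekResult` suggested above.
#[derive(Debug, Clone, PartialEq)]
pub enum PeekResult {
    /// A byte range the decoder currently expects to need.
    Range(Range<u64>),
    /// No further ranges are currently predicted.
    End,
}

/// Hypothetical stand-in for whatever `decoder.peek()` might return.
/// It simply replays a fixed list of ranges.
pub struct MockPeek {
    ranges: std::vec::IntoIter<Range<u64>>,
}

impl MockPeek {
    pub fn new(ranges: Vec<Range<u64>>) -> Self {
        Self { ranges: ranges.into_iter() }
    }

    /// Return the next predicted range, or `End` when there are none left.
    pub fn next(&mut self) -> PeekResult {
        match self.ranges.next() {
            Some(range) => PeekResult::Range(range),
            None => PeekResult::End,
        }
    }
}

/// Collect predicted ranges until `max_ranges` ranges have been gathered,
/// the byte total reaches `max_bytes`, or the peek is exhausted.
pub fn collect_prefetch_ranges(
    peek: &mut MockPeek,
    max_ranges: usize,
    max_bytes: u64,
) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    let mut current_bytes = 0u64;
    while ranges.len() < max_ranges && current_bytes < max_bytes {
        match peek.next() {
            PeekResult::Range(range) => {
                // Count the bytes before moving `range` into the Vec.
                current_bytes += range.end - range.start;
                ranges.push(range);
            }
            PeekResult::End => break,
        }
    }
    ranges
}
```

The caller picks both budgets, so the same peek could serve a local-disk reader (few, small ranges) and an object-store reader (many coalesced ranges) without the decoder knowing which one it is feeding.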

Here is another potential API from the original ticket:

// Create a decoder for decoding parquet data as above
let mut decoder: ParquetDecoderBuilder = ...;

// Ask the decoder up front what data it will need, and start prefetching data if desired
while let Some(pre_request) = decoder.peek_next_requests() {
    // note that this is a peek and if we call peek again in the
    // future, we may get a different set of pre_requests (for example
    // if the decoder has applied a row filter and ruled out
    // some row groups or data pages)
    start_prefetch(pre_request);
}

// push data to the decoder as before, but hopefully the reader
// will have already prefetched some of the data
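
Because a peek is only a prediction, the caller needs somewhere to park prefetched bytes and a way to discard them when a later peek no longer wants them. The sketch below shows one possible shape for that bookkeeping. `PrefetchCache` and its methods are hypothetical names invented for this illustration, not part of any existing API, and matching on exact range boundaries is a simplifying assumption.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical cache of speculatively fetched bytes, keyed by (start, end).
#[derive(Default)]
pub struct PrefetchCache {
    fetched: HashMap<(u64, u64), Vec<u8>>,
}

impl PrefetchCache {
    /// Record bytes that were fetched for a peeked range.
    pub fn insert(&mut self, range: Range<u64>, bytes: Vec<u8>) {
        self.fetched.insert((range.start, range.end), bytes);
    }

    /// Take the bytes for a range the decoder actually requested,
    /// if exactly that range was prefetched.
    pub fn take(&mut self, range: &Range<u64>) -> Option<Vec<u8>> {
        self.fetched.remove(&(range.start, range.end))
    }

    /// Drop prefetched ranges that the latest peek no longer lists,
    /// for example after a row filter ruled out a row group.
    pub fn retain_only(&mut self, still_wanted: &[Range<u64>]) {
        self.fetched.retain(|&(start, end), _| {
            still_wanted
                .iter()
                .any(|r| r.start == start && r.end == end)
        });
    }

    /// Number of ranges currently held.
    pub fn len(&self) -> usize {
        self.fetched.len()
    }
}
```

When the decoder later asks for a range, the caller would try `take` first and fall back to a normal fetch on a miss, so a wrong prediction costs wasted I/O but never affects correctness.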
Additional context

Labels: enhancement