Implement Push Parquet Decoder #7997
Update: I spent this morning creating a "push" based ParquetMetadataDecoder, as I found metadata is, of course, needed to configure the parquet decoding process. I am quite pleased with the results.
And the tests pass 🥁 🚀 Now I will do some test PRs to use this one implementation instead of the async reader (and I'll do the sync reader too).
I'm out of time right now; I will come back and take a look at parquet/src/arrow/push_decoder/reader_builder/mod.rs.
My 2 high-level questions for now are:
- How does this fit into the current decoders? Does it replace them for all users? Is it something users have to opt into via code, or a feature flag? I'm sure this is obvious somewhere and I just haven't seen it.
- You mentioned this unlocks the possibility to widen IO by asking the push decoder to "look ahead" at what might need to be decoded next. I want to dig into what that would look like. I guess it would be something like asking `RowGroupDecoderState` for "what do you think is the next bit of data you will need"? In the `StartData` or `WaitingOnData` states that might be obvious / accurate, but in the `Start`, `Filters` or `WaitingOnFilterData` states that would be either a conservative or an aggressive guess.
metadata: &'a ParquetMetaData,
}

impl InMemoryRowGroup<'_> {
Yes, I was going to comment the same thing. Is there a reason for not moving it? Since it's just an `impl`, moving it should not break any imports, etc.
impl InMemoryRowGroup<'_> {
    /// Returns the byte ranges to fetch for the columns specified in
    /// `projection` and `selection`.
    pub(crate) fn fetch_ranges(
Came here looking to confirm this. It looks like you've taken one big `async fn fetch` and split it up into multiple smaller `pub(crate) fn fetch_ranges` and `pub(crate) fn fill_column_chunks`, which seems reasonable to me. No new public APIs. Smaller functions. 👍🏻
// If we have a `RowSelection` and an `OffsetIndex` then only fetch pages required for the
// `RowSelection`
let mut page_start_offsets: Vec<Vec<u64>> = vec![];
FWIW that was not pre-allocated in the existing code either so I don't think it's required to get this PR across the line.
    _ => (),
}

// Expand selection to batch boundaries only for cached columns
As a first-time reader of this code it would be nice to get one more line of comments here, with a code pointer or an explanation of how the caching works, e.g. what a cached column is.
Good call -- I added some additional comments with context about the predicate cache and links
/// Column chunk data representing only a subset of data pages
Sparse {
    /// Length of the full column chunk
    length: usize,
    /// Subset of data pages included in this sparse chunk.
    ///
    /// Each element is a tuple of (page offset within file, page data).
    /// Each entry is a complete page and the list is ordered by offset.
    data: Vec<(usize, Bytes)>,
},
/// Full column chunk and the offset within the original file
Dense { offset: usize, data: Bytes },
👍🏻 very nice structure
Thanks -- this is just giving a name to what was already happening in arrow-rs/parquet/src/arrow/async_reader/mod.rs, lines 989 to 1083 in b9c2bf7:
let metadata = self.metadata.row_group(self.row_group_idx);
if let Some((selection, offset_index)) = selection.zip(self.offset_index) {
    let expanded_selection =
        selection.expand_to_batch_boundaries(batch_size, self.row_count);
    // If we have a `RowSelection` and an `OffsetIndex` then only fetch pages required for the
    // `RowSelection`
    let mut page_start_offsets: Vec<Vec<u64>> = vec![];
    let fetch_ranges = self
        .column_chunks
        .iter()
        .zip(metadata.columns())
        .enumerate()
        .filter(|&(idx, (chunk, _chunk_meta))| {
            chunk.is_none() && projection.leaf_included(idx)
        })
        .flat_map(|(idx, (_chunk, chunk_meta))| {
            // If the first page does not start at the beginning of the column,
            // then we need to also fetch a dictionary page.
            let mut ranges: Vec<Range<u64>> = vec![];
            let (start, _len) = chunk_meta.byte_range();
            match offset_index[idx].page_locations.first() {
                Some(first) if first.offset as u64 != start => {
                    ranges.push(start..first.offset as u64);
                }
                _ => (),
            }
            // Expand selection to batch boundaries only for cached columns
            let use_expanded = cache_mask.map(|m| m.leaf_included(idx)).unwrap_or(false);
            if use_expanded {
                ranges.extend(
                    expanded_selection.scan_ranges(&offset_index[idx].page_locations),
                );
            } else {
                ranges.extend(selection.scan_ranges(&offset_index[idx].page_locations));
            }
            page_start_offsets.push(ranges.iter().map(|range| range.start).collect());
            ranges
        })
        .collect();
    let mut chunk_data = input.get_byte_ranges(fetch_ranges).await?.into_iter();
    let mut page_start_offsets = page_start_offsets.into_iter();
    for (idx, chunk) in self.column_chunks.iter_mut().enumerate() {
        if chunk.is_some() || !projection.leaf_included(idx) {
            continue;
        }
        if let Some(offsets) = page_start_offsets.next() {
            let mut chunks = Vec::with_capacity(offsets.len());
            for _ in 0..offsets.len() {
                chunks.push(chunk_data.next().unwrap());
            }
            *chunk = Some(Arc::new(ColumnChunkData::Sparse {
                length: metadata.column(idx).byte_range().1 as usize,
                data: offsets
                    .into_iter()
                    .map(|x| x as usize)
                    .zip(chunks.into_iter())
                    .collect(),
            }))
        }
    }
} else {
    let fetch_ranges = self
        .column_chunks
        .iter()
        .enumerate()
        .filter(|&(idx, chunk)| chunk.is_none() && projection.leaf_included(idx))
        .map(|(idx, _chunk)| {
            let column = metadata.column(idx);
            let (start, length) = column.byte_range();
            start..(start + length)
        })
        .collect();
    let mut chunk_data = input.get_byte_ranges(fetch_ranges).await?.into_iter();
    for (idx, chunk) in self.column_chunks.iter_mut().enumerate() {
        if chunk.is_some() || !projection.leaf_included(idx) {
            continue;
        }
        if let Some(data) = chunk_data.next() {
            *chunk = Some(Arc::new(ColumnChunkData::Dense {
                offset: metadata.column(idx).byte_range().0 as usize,
                data,
            }));
        }
    }
}
(but was stored in local variable names)
fn get(&self, start: u64) -> crate::errors::Result<Bytes> {
    match &self {
        ColumnChunkData::Sparse { data, .. } => data
            .binary_search_by_key(&start, |(offset, _)| *offset as u64)
impl Length for ColumnChunkData {
    /// Return the total length of the full column chunk
    fn len(&self) -> u64 {
        match &self {
            ColumnChunkData::Sparse { length, .. } => *length as u64,
            ColumnChunkData::Dense { data, .. } => data.len() as u64,
        }
    }
}
👏🏻
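To make the Sparse / Dense distinction concrete, here is a small runnable sketch using a simplified stand-in for the internal enum (not the crate's exact type or error handling):

use bytes::Bytes;

// Simplified stand-in for the internal ColumnChunkData enum quoted above
enum ColumnChunkData {
    Sparse { length: usize, data: Vec<(usize, Bytes)> },
    Dense { offset: usize, data: Bytes },
}

impl ColumnChunkData {
    // Fetch the page that starts exactly at `start`, mirroring the
    // binary-search lookup shown in the diff
    fn get(&self, start: u64) -> Option<Bytes> {
        match self {
            // `data` is ordered by page offset, so an exact-match binary
            // search finds the page (readers only ever ask for page starts)
            ColumnChunkData::Sparse { data, .. } => data
                .binary_search_by_key(&start, |(offset, _)| *offset as u64)
                .ok()
                .map(|idx| data[idx].1.clone()),
            // Dense chunks hold the whole column chunk; translate the file
            // offset into an offset within the buffer
            ColumnChunkData::Dense { offset, data } => {
                Some(data.slice((start as usize - *offset)..))
            }
        }
    }
}

fn main() {
    let chunk = ColumnChunkData::Sparse {
        length: 100,
        data: vec![(0, Bytes::from_static(b"dict")), (40, Bytes::from_static(b"page1"))],
    };
    assert_eq!(chunk.get(40).unwrap(), Bytes::from_static(b"page1"));
    assert!(chunk.get(41).is_none()); // only exact page starts resolve
}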
    self.offset = offset;
    self
}
It seems like you've added `pub fn clear_ranges(&mut self, ranges_to_clear: &[Range<u64>])`?
Overall this looks great @alamb!
I left a lot of nits, but nothing major!
One question I have left is performance, but I think the right place to address that is a follow-up PR where you replace the current internal machinery with this; then we can do a main vs. PR benchmark.
/// These must be owned by FilterInfo because they may be mutated as part of
/// evaluation so there is a bunch of complexity of handing them back and forth
Having some familiarity with what is happening in the levels above this, I believe it needs to be mutable because the mask of one filter is fed into the next / it's a stack of filters we pop from. I will try to confirm as I read through the rest of the code, but a comment explaining why / what the mutable state is used for would be helpful.
The core driver is that `ArrowPredicate` takes `&mut self`: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html

pub trait ArrowPredicate: Send + 'static {
    // Required methods
    fn projection(&self) -> &ProjectionMask;
    fn evaluate(
        &mut self,
        batch: RecordBatch,
    ) -> Result<BooleanArray, ArrowError>;
}

This means that to evaluate the predicate there must be exactly one owner, and to get the lifetimes to work out I needed to pass ownership around like this. I will clarify this in the comments.
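For example (a hypothetical predicate, not one from this PR), any predicate that carries state across batches needs `&mut self`, which is why there can only be one owner during evaluation:

use arrow_array::{BooleanArray, RecordBatch};
use arrow_schema::ArrowError;
use parquet::arrow::arrow_reader::ArrowPredicate;
use parquet::arrow::ProjectionMask;

// A predicate that keeps state across batches; `evaluate` must take
// `&mut self` to update `rows_seen`
struct CountingPredicate {
    projection: ProjectionMask,
    rows_seen: usize,
}

impl ArrowPredicate for CountingPredicate {
    fn projection(&self) -> &ProjectionMask {
        &self.projection
    }

    fn evaluate(&mut self, batch: RecordBatch) -> Result<BooleanArray, ArrowError> {
        self.rows_seen += batch.num_rows(); // mutation: needs exclusive access
        // Trivially keep every row; a real predicate would compute a mask
        Ok(BooleanArray::from(vec![true; batch.num_rows()]))
    }
}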
/// evaluation so there is a bunch of complexity of handing them back and forth
filter: RowFilter,
/// The next filter to be evaluated
next_predicate: NonZeroUsize,
Seeing now that we just maintain a mutable index into `RowFilter` (i.e. don't mutate `RowFilter` itself, I'm guessing).
I tried to clarify this in comments
let Some(filter) = self.filter.take() else {
    // no filter, start trying to read data immediately
    return Ok(NextState::again(RowGroupDecoderState::StartData {
        row_group_info,
        column_chunks,
        cache_info: None,
    }));
};
// no predicates in filter, so start reading immediately
if filter.predicates.is_empty() {
    return Ok(NextState::again(RowGroupDecoderState::StartData {
        row_group_info,
        column_chunks,
        cache_info: None,
    }));
};
Could we collapse these two states in `RowGroupReaderBuilder::new()`?
Since it happens for each row group (aka `row_group_info` changes on each row group) I don't think it could be in `RowGroupReaderBuilder::new()`. We could possibly inline it in `RowGroupReaderBuilder::next_row_group()` 🤔
https://github.com/apache/arrow-rs/blob/10aae0fc6bc8020f77d584a5fa3cc0c5da605211/parquet/src/arrow/push_decoder/reader_builder/mod.rs#L244-L243
But I am not sure how much better that would be, as it just moves the code around?
    self.exclude_nested_columns_from_cache(&cache_projection)
}

/// Exclude leaves belonging to roots that span multiple parquet leaves (i.e. nested columns)
Curious why this is? Does this mean that if there is a query like `select 1 from t where nested["field"] < 15 AND nested["field"] > 5` we will read the leaf for `nested["field"]` twice?
Short answer is I don't know -- this just preserves the existing semantics
Maybe @XiangpengHao can explain in more detail
///
/// // In a loop, ask the decoder what it needs next, and provide it with the required data
/// loop {
///     match decoder.try_decode().unwrap() {
I'm guessing if I wanted to "look ahead" we'd add a method along the lines of `try_peek()`? It'd be cool if it returned some structure that allowed fine-grained control of the peeking:

let max_ranges = 32;
let max_bytes = 1024 * 1024 * 32;
let mut current_bytes = 0;
let mut ranges = Vec::new();
let mut peek = decoder.peek();
loop {
    match peek.next() {
        PeekResult::Range(range) => {
            current_bytes += range.end - range.start;
            ranges.push(range);
            if ranges.len() > max_ranges { break }
            if current_bytes > max_bytes { break }
        }
        PeekResult::End => break,
    }
}
Yes indeed, excellent call -- captured in
///         DecodeResult::NeedsData(ranges) => {
///             // The decoder needs more data. Fetch the data for the given ranges
///             let data = ranges.iter().map(|r| get_range(r)).collect::<Vec<_>>();
///             // Push the data to the decoder
///             decoder.push_ranges(ranges, data).unwrap();
///             // After pushing the data, we can try to decode again on the next iteration
///         }
///         DecodeResult::Data(batch) => {
///             // Successfully decoded a batch of data
///             assert!(batch.num_rows() > 0);
///         }
///         DecodeResult::Finished => {
///             // The decoder has finished decoding; exit the loop
///             break;
///         }
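Stitched together with the loop header quoted earlier, the whole doc example reads roughly as follows (where `get_range` stands in for whatever IO the caller performs):

loop {
    match decoder.try_decode().unwrap() {
        DecodeResult::NeedsData(ranges) => {
            // Fetch the requested byte ranges and push them into the decoder
            let data = ranges.iter().map(|r| get_range(r)).collect::<Vec<_>>();
            decoder.push_ranges(ranges, data).unwrap();
        }
        DecodeResult::Data(batch) => {
            // A decoded RecordBatch, ready for processing
            assert!(batch.num_rows() > 0);
        }
        DecodeResult::Finished => break, // all done
    }
}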
This is a very nice high level API!
I am also pleased, as it mirrors the metadata decoder API: https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataPushDecoder.html
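For comparison, driving the metadata decoder follows the same shape. This is a sketch based on the linked docs; the exact paths and signatures are assumptions and should be checked against the docs. `file_len` is the parquet file's total size and `read_range` is a hypothetical helper returning `Bytes` for a byte range, inside a function returning parquet's Result:

use parquet::file::metadata::ParquetMetaDataPushDecoder;

let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len)?;
let metadata = loop {
    match decoder.try_decode()? {
        DecodeResult::NeedsData(ranges) => {
            // Fetch the requested ranges and push them back in
            let buffers = ranges.iter().cloned().map(read_range).collect();
            decoder.push_ranges(ranges, buffers)?;
        }
        DecodeResult::Data(metadata) => break metadata,
        DecodeResult::Finished => unreachable!("metadata is produced as Data"),
    }
};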
/// Methods for building a ParquetDecoder. See the base [`ArrowReaderBuilder`] for
/// more options that can be configured.
impl ParquetPushDecoderBuilder {
Any particular reason for not putting the impl in the same file as the struct?
If you mean alongside `ArrowReaderBuilder` in `arrow_reader/mod.rs`, it is because I felt that file was already pretty big, as it has both generic options along with specific options for the sync and async readers.
I agree -- I expect this code to have exactly the same performance as the current async decoder -- it is the same algorithm, just the state machine(s) are now explicit rather than implicit via Rust's `async`/`await` structures.
Indeed. Here is that PR:
Currently this is an entirely new (3rd) decoder. My plan (and I have already done it in #8159) is to rewrite the existing async decoder to use this push decoder.
Yes, exactly. Another thing that I would like to do is to make it possible to buffer only some of the pages needed for a row group rather than all of them (aka what is stored in `InMemoryRowGroup`). This would reduce memory requirements for files with large row groups. However, it would also increase the number of IO requests (aka object store requests), so it would have to be configurable to let people trade off the IOs and the memory requirements.
IMO the parquet decoder should produce as granular as possible ranges of data to read, and the object store implementation can handle coalescing them as needed. Currently some object stores (remote ones that talk to S3, GCS, etc.) already coalesce ranges to be more efficient, but to what degree is hidden behind hardcoded constants (https://github.com/apache/arrow-rs-object-store/blob/ad1d70f4876b0c2ea6c6a5e34dc158c63f861384/src/util.rs#L90-L95). Maybe that's what should be used to tweak the tradeoff between IO round-trips and memory? I guess we still need to decide when to hit object storage (i.e. how many ranges or bytes we accumulate before making a request?).
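To illustrate the tradeoff being discussed, a minimal runnable sketch of gap-based range coalescing (a hypothetical helper, not the object_store implementation):

use std::ops::Range;

// Merge ranges whose gap is at most `max_gap` bytes: fewer IO round-trips
// at the cost of reading (and buffering) the bytes in the gaps
fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate request
            Some(last) if r.start <= last.end + max_gap => last.end = last.end.max(r.end),
            _ => out.push(r),
        }
    }
    out
}

fn main() {
    // 0..10 and 12..20 merge (gap of 2 <= 4); 100..110 stays separate
    assert_eq!(
        coalesce_ranges(vec![0..10, 12..20, 100..110], 4),
        vec![0..20, 100..110]
    );
}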
Thank you so much @adriangb and @mbrobbel for your comments on this PR. Super super helpful.
I'll plan to merge this over the next few days, after I:
- file a few more follow on tasks
- Verify benchmarks on this PR again
- get the PR to rewrite the async decoder ready
I also plan a blog post about this work:
metadata: &'a ParquetMetaData,
}

impl InMemoryRowGroup<'_> {
Moved!
/// Offset to apply to remaining row groups (decremented as rows are read)
offset: Option<usize>,

/// The size in bytes of the predicate cache
added a link to RowGroupCache
pub(crate) fn try_build(
    &mut self,
) -> Result<DecodeResult<ParquetRecordBatchReader>, ParquetError> {
    loop {
It isn't called concurrently. `try_transition` can change the state as decoding proceeds. If the decoder already has enough data to potentially move on to the next state, it begins doing so immediately via this loop -- I will make this clearer in comments.
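A minimal sketch of the pattern being described (invented names, not the crate's actual types): apply transitions while progress is possible without new input, then yield back to the caller for IO:

// `Again` means the state changed and more progress may be possible
// immediately; `WaitingOnData` means the caller must push more bytes first
enum NextState<S> {
    Again(S),
    WaitingOnData(S),
}

// Keep transitioning until the state machine needs data it does not have
fn drive<S>(mut state: S, mut try_transition: impl FnMut(S) -> NextState<S>) -> S {
    loop {
        match try_transition(state) {
            NextState::Again(next) => state = next,        // loop immediately
            NextState::WaitingOnData(next) => return next, // yield for IO
        }
    }
}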
I will file a ticket to track this.
I agree it would be good to provide granular requests for data, but I think that is orthogonal to the question of what data to wait for before decoding can start. Right now, the readers (including the push decoder) wait until all the data for a row group (after filtering) is fetched. Once the decoder can tell the caller how much data it really needs to decode something, then I think we'll be in a much better position to control CPU vs memory.
Which issue does this PR close?
- Part of [Epic] Parquet Reader Improvement Plan / Proposal - July 2025 #8000
- Closes Decouple IO and CPU operations in the Parquet Reader (push decoder) #7983
Rationale for this change
This PR is the first part of separating IO and decode operations in the Rust parquet decoder.
Decoupling IO and CPU enables several important use cases, including sharing the decoding logic between `ParquetRecordBatchStreamBuilder` and `ParquetRecordBatchReaderBuilder`.
What changes are included in this PR?
Adds `ParquetDecoderBuilder` and `ParquetDecoder`, and tests. It is effectively an explicit version of the state machine that is used in the existing async reader (where the state machine is encoded as Rust `async`/`await` structures).
structures)Are these changes tested?
Yes -- there are extensive tests for the new code
Note that this PR actually adds a 3rd path for control flow (even as I claim this will remove duplication!). In follow-on PRs I will convert the existing readers to use this new pattern, similarly to the sequence I did for the metadata decoder:
Here is a preview of a PR that consolidates the async reader to use the push decoder internally (and removes one duplicate):
WIP: Rewrite `ParquetRecordBatchStream` in terms of the PushDecoder #8159
Closes [Parquet] Refactor InMemoryRowGroup to separate CPU and IO #8022
Are there any user-facing changes?
Yes, a new API, but no changes to the existing APIs.