19 changes: 15 additions & 4 deletions parquet/src/column/writer/mod.rs
@@ -1104,12 +1104,23 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
             rep_levels_byte_len + def_levels_byte_len + values_data.buf.len();
 
         // Data Page v2 compresses values only.
-        match self.compressor {
+        let is_compressed = match self.compressor {
             Some(ref mut cmpr) => {
+                let buffer_len = buffer.len();
                 cmpr.compress(&values_data.buf, &mut buffer)?;
+                if uncompressed_size <= buffer.len() - buffer_len {
Member Author:

This can also be a "score", like `if uncompressed_size <= (buffer.len() - buffer_len) * 0.9 {`

Contributor:

I'd suggest maybe going the opposite direction. If compression doesn't give, say, a 10% size reduction, then don't compress... it will be faster to read. 10% might be too much... maybe we could make it configurable.

Anyway, that's probably out of scope for this PR. Let's just go with the simple test for now... it's still an improvement.
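
(Illustration only, not part of the PR: one way to phrase the "compress only if it saves at least N%" rule from the comment above in integer arithmetic. The helper name and the 10% threshold are hypothetical.)

    /// Hypothetical helper: keep the compressed bytes only if they are at
    /// least `min_saving_pct` percent smaller than the uncompressed input.
    fn compression_worthwhile(uncompressed: usize, compressed: usize, min_saving_pct: usize) -> bool {
        // Same as compressed <= uncompressed * (100 - min_saving_pct) / 100,
        // kept in integer math to avoid float comparisons on byte counts.
        compressed * 100 <= uncompressed * (100 - min_saving_pct)
    }

    fn main() {
        assert!(compression_worthwhile(1000, 900, 10)); // exactly 10% smaller: keep
        assert!(!compression_worthwhile(1000, 950, 10)); // only 5% smaller: skip
    }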

Contributor:

Yeah, I do think that would need to be configurable, as I think some people would treat a 10% size increase as a regression, even if they got a faster reader.

A good follow-on ticket for sure.

Contributor:

I'll follow up once there's a plan for this future ticket. Just a note from me: the score is still a highly empirical value (and it depends on the decompression speed), so I've embedded it into my rewriter and have been extensively rewriting Parquet files based on it.

Contributor:

In my opinion, given there is a tradeoff here (file size vs. decode speed), we can't hard-code some heuristic that changes the default behavior.

However, I think it would be a good improvement to add some sort of tuning knob for people who want to make that tradeoff more explicitly, as long as the default setting of the knob preserves the existing behavior.
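
(A rough sketch of the kind of knob being discussed; `skip_compression_ratio` and everything around it are hypothetical, not an existing `WriterProperties` option. A ratio of 1.0 reproduces this PR's default behavior: keep the compressed form only when it is strictly smaller than the uncompressed one.)

    /// Hypothetical tuning knob: keep compressed data only when
    /// compressed_len < uncompressed_len * skip_compression_ratio.
    struct CompressionPolicy {
        skip_compression_ratio: f64,
    }

    impl CompressionPolicy {
        fn keep_compressed(&self, uncompressed_len: usize, compressed_len: usize) -> bool {
            (compressed_len as f64) < (uncompressed_len as f64) * self.skip_compression_ratio
        }
    }

    fn main() {
        // Default: any strict size reduction keeps the compressed page.
        let default_policy = CompressionPolicy { skip_compression_ratio: 1.0 };
        assert!(default_policy.keep_compressed(1000, 999));
        assert!(!default_policy.keep_compressed(1000, 1000));

        // Stricter: require at least a 10% reduction before compressing.
        let strict = CompressionPolicy { skip_compression_ratio: 0.9 };
        assert!(!strict.keep_compressed(1000, 950));
        assert!(strict.keep_compressed(1000, 899));
    }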

+                    buffer.truncate(buffer_len);
+                    buffer.extend_from_slice(&values_data.buf);
+                    false
+                } else {
+                    true
+                }
             }
-            None => buffer.extend_from_slice(&values_data.buf),
-        }
+            None => {
+                buffer.extend_from_slice(&values_data.buf);
+                false
+            }
+        };

         let data_page = Page::DataPageV2 {
             buf: buffer.into(),
@@ -1119,7 +1130,7 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
             num_rows: self.page_metrics.num_buffered_rows,
             def_levels_byte_len: def_levels_byte_len as u32,
             rep_levels_byte_len: rep_levels_byte_len as u32,
-            is_compressed: self.compressor.is_some(),
+            is_compressed,
             statistics: page_statistics,
         };
