Support writing GeospatialStatistics in Parquet writer #8524
Conversation
/// Explicitly specify the Parquet schema to be used
pub fn with_parquet_schema(self, schema_descr: SchemaDescriptor) -> Self {
    Self {
        schema_descr: Some(schema_descr),
        ..self
    }
}
This is just me ensuring I can test the Arrow-specific byte array implementation (I couldn't figure out how to get a byte array column with a Geometry logical type otherwise)
fn flush_geospatial_statistics(&mut self) -> Option<Box<GeospatialStatistics>>;
The ColumnValueEncoder was not my first choice for where to put this; however, putting it at a higher level (e.g., the ColumnMetrics) is more disruptive because a reference to the bounder would have to be passed through all the write/encode methods. I'm open to suggestions 🙂
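The hook described above can be sketched with std-only types (all names here are hypothetical stand-ins, not the actual parquet crate API): giving the encoder trait a flush method with a default no-op body means only geospatial-aware encoders need to override it, and the column writer can call it unconditionally.

```rust
// Hypothetical sketch: an encoder trait with a default no-op flush hook for
// geospatial statistics, so ordinary encoders need no changes at all.
#[derive(Debug, PartialEq)]
struct GeospatialStatistics {
    xmin: f64,
    xmax: f64,
}

trait ValueEncoder {
    fn encode(&mut self, value: f64);

    /// Default: encoders with no geospatial awareness report nothing.
    fn flush_geospatial_statistics(&mut self) -> Option<Box<GeospatialStatistics>> {
        None
    }
}

/// An ordinary encoder: inherits the no-op default.
struct PlainEncoder;

impl ValueEncoder for PlainEncoder {
    fn encode(&mut self, _value: f64) {}
}

/// A geospatial-aware encoder: tracks an x-range and drains it on flush.
#[derive(Default)]
struct GeoEncoder {
    x_range: Option<(f64, f64)>,
}

impl ValueEncoder for GeoEncoder {
    fn encode(&mut self, x: f64) {
        let (min, max) = self.x_range.get_or_insert((x, x));
        *min = min.min(x);
        *max = max.max(x);
    }

    fn flush_geospatial_statistics(&mut self) -> Option<Box<GeospatialStatistics>> {
        // take() resets the accumulator so the next page starts fresh
        self.x_range
            .take()
            .map(|(xmin, xmax)| Box::new(GeospatialStatistics { xmin, xmax }))
    }
}

fn main() {
    let mut geo = GeoEncoder::default();
    geo.encode(1.0);
    geo.encode(11.0);
    let stats = geo.flush_geospatial_statistics();
    assert_eq!(
        stats,
        Some(Box::new(GeospatialStatistics { xmin: 1.0, xmax: 11.0 }))
    );

    let mut plain = PlainEncoder;
    plain.encode(1.0);
    assert!(plain.flush_geospatial_statistics().is_none());
    println!("ok");
}
```

The default method is what keeps the change minimal: callers flush every encoder, and non-geospatial ones simply return None.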
impl GeoStatsAccumulatorFactory for DefaultGeoStatsAccumulatorFactory {
    fn new_accumulator(&self, _descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator> {
        #[cfg(feature = "geospatial")]
        if let Some(crate::basic::LogicalType::Geometry) = _descr.logical_type() {
            return Box::new(ParquetGeoStatsAccumulator::default());
        } else {
            return Box::new(VoidGeospatialStatisticsAccumulator::default());
        }

        #[cfg(not(feature = "geospatial"))]
        return Box::new(VoidGeospatialStatisticsAccumulator::default());
    }
}
This was my strategy for avoiding #[cfg(...)] all over the encoders/column writer.
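The strategy can be illustrated with a self-contained sketch (names and shapes are hypothetical, not the parquet crate's actual types): the feature check lives in a single factory, which hands out a no-op "void" accumulator when the feature is off, so every call site stays free of #[cfg(...)].

```rust
// Hypothetical sketch of the factory pattern: one decision point returns a
// boxed trait object, and callers never branch on the feature flag.
trait StatsAccumulator {
    fn update(&mut self, value: f64);
    /// Returns (min, max) if anything was accumulated, None otherwise.
    fn flush(&mut self) -> Option<(f64, f64)>;
}

/// The real accumulator, conceptually compiled in behind the feature flag.
#[derive(Default)]
struct RealAccumulator {
    bounds: Option<(f64, f64)>,
}

impl StatsAccumulator for RealAccumulator {
    fn update(&mut self, value: f64) {
        let (min, max) = self.bounds.get_or_insert((value, value));
        *min = min.min(value);
        *max = max.max(value);
    }
    fn flush(&mut self) -> Option<(f64, f64)> {
        self.bounds.take()
    }
}

/// The no-op accumulator handed out when the feature is disabled.
#[derive(Default)]
struct VoidAccumulator;

impl StatsAccumulator for VoidAccumulator {
    fn update(&mut self, _value: f64) {}
    fn flush(&mut self) -> Option<(f64, f64)> {
        None
    }
}

fn new_accumulator(feature_enabled: bool) -> Box<dyn StatsAccumulator> {
    // In the real code this branch is a compile-time #[cfg(feature = "...")]
    // check; a runtime bool stands in here so the sketch runs on its own.
    if feature_enabled {
        Box::new(RealAccumulator::default())
    } else {
        Box::new(VoidAccumulator::default())
    }
}

fn main() {
    let mut acc = new_accumulator(true);
    acc.update(1.0);
    acc.update(11.0);
    assert_eq!(acc.flush(), Some((1.0, 11.0)));

    let mut void = new_accumulator(false);
    void.update(1.0);
    assert_eq!(void.flush(), None);
    println!("ok");
}
```

The trade-off is one virtual call per update versus scattering conditional compilation through the writer's hot path.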
-    bbox: Option<BoundingBox>,
+    pub bbox: Option<BoundingBox>,
     /// Optional list of geometry type identifiers, where None represents lack of information
-    geospatial_types: Option<Vec<i32>>,
+    pub geospatial_types: Option<Vec<i32>>,
I can remove these once the PR with the geospatial/thrift update merges (which exposes getters to inspect these).
let arrow_schema = Arc::new(Schema::new(vec![Field::new(
    "geom",
    DataType::Binary,
    true,
)]));
let batch = RecordBatch::try_new(
    arrow_schema.clone(),
    vec![wkb_array_xy([(1.0, 2.0), (11.0, 12.0)])],
)
.unwrap();
let expected_geometry_types = vec![1];
let expected_bounding_box = BoundingBox::new(1.0, 11.0, 2.0, 12.0);
This tests the whole point of this PR: ensuring statistics are accumulated and written by the Arrow-specific byte array encoder...
let column_values = [wkb_item_xy(1.0, 2.0), wkb_item_xy(11.0, 12.0)].map(ByteArray::from);
let expected_geometry_types = vec![1];
let expected_bounding_box = BoundingBox::new(1.0, 11.0, 2.0, 12.0);
...and by the generic encoder.
Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.
if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
    *self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
I think this should execute regardless of whether geo stats are enabled. The variable_length_bytes are ultimately written to the SizeStatistics, which are useful even without min/max statistics.
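The accumulation idiom in the snippet above is worth spelling out in a std-only sketch (the PageMetrics struct and update signature here are hypothetical simplifications): the counter stays None until the first variable-length batch is seen, then get_or_insert(0) materializes it and += accumulates across batches.

```rust
// Hypothetical sketch: Option<i64> as a "never seen" vs "accumulated" counter,
// mirroring the variable_length_bytes update in the diff above.
#[derive(Default)]
struct PageMetrics {
    variable_length_bytes: Option<i64>,
}

impl PageMetrics {
    // Stand-in for T::T::variable_length_bytes(slice): fixed-length types
    // report None, variable-length types report a byte count.
    fn update(&mut self, var_bytes: Option<i64>) {
        if let Some(var_bytes) = var_bytes {
            // get_or_insert(0) turns None into Some(0) exactly once,
            // then returns &mut i64 so += accumulates in place.
            *self.variable_length_bytes.get_or_insert(0) += var_bytes;
        }
    }
}

fn main() {
    let mut m = PageMetrics::default();
    m.update(None); // fixed-length batch: counter stays None
    assert_eq!(m.variable_length_bytes, None);
    m.update(Some(10));
    m.update(Some(5));
    assert_eq!(m.variable_length_bytes, Some(15));
    println!("ok");
}
```

Keeping the counter an Option (rather than defaulting to 0) lets the writer distinguish "no variable-length data" from "zero bytes", which matters when deciding whether to emit the field at all.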
drop(file_writer);

// Check that statistics exist in thrift output
thrift_metadata.row_groups[0].columns[0]
Heads up that when the thrift stuff merges this will no longer be a format::FileMetaData but a file::metadata::ParquetMetaData.
Thank you for the review! I will clean this up on Monday and add a few more tests.
Which issue does this PR close?
Rationale for this change
One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.
What changes are included in this PR?
This PR inserts the minimum required change to enable this support (behind a feature flag).
This also fixes a "bug" (though it is unlikely anyone had been using it to write Geospatial types): those types must not have page-level min/max statistics written.
Are these changes tested?
They will be! (Work in progress)
Are there any user-facing changes?
No, this is behind a newly invented feature flag. In the unlikely event anybody had been writing Geometry/Geography logical types, they would continue to not have geospatial statistics written.