Conversation

paleolimbot (Member)

Which issue does this PR close?

Rationale for this change

One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.

What changes are included in this PR?

This PR inserts the minimum required change to enable this support (behind a feature flag).

This also fixes a "bug" (though it is unlikely anyone had been using arrow-rs to write Geospatial types): those types must not have page min/max statistics written.

Are these changes tested?

They will be! (Work in progress)

Are there any user-facing changes?

No, this is behind a newly invented feature flag. In the unlikely event anybody had been writing Geometry/Geography logical types, they would continue to not have geospatial statistics written.

github-actions bot added the parquet (Changes to the parquet crate) label on Oct 1, 2025
@paleolimbot (Member, Author) left a comment

It works!

@alamb @etseidl I'm aware this would need some tests/improved documentation at a lower level; however, I'd love some feedback on the approach before I go through and clean this up more thoroughly (whenever time allows!)

Comment on lines +503 to +509
/// Explicitly specify the Parquet schema to be used
pub fn with_parquet_schema(self, schema_descr: SchemaDescriptor) -> Self {
Self {
schema_descr: Some(schema_descr),
..self
}
}
paleolimbot (Member, Author):

This is just me ensuring I can test the Arrow-specific byte array implementation (I couldn't figure out how to get a byte array column with a Geometry logical type otherwise)
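The method above uses Rust's struct update syntax (`..self`) so the builder can override one field and keep the rest. A minimal self-contained sketch of that idiom (all type and field names here are illustrative stand-ins, not the arrow-rs API):

```rust
// Illustrative only: a consuming builder method that replaces one field and
// keeps every other field via `..self` struct update syntax, mirroring the
// shape of with_parquet_schema above.
#[derive(Debug, Clone, PartialEq)]
struct WriterOptionsSketch {
    // Hypothetical stand-in for an explicitly supplied Parquet schema.
    schema_name: Option<String>,
    batch_size: usize,
}

impl WriterOptionsSketch {
    fn new() -> Self {
        Self { schema_name: None, batch_size: 1024 }
    }

    /// Consume self, set one field, keep all others unchanged.
    fn with_schema_name(self, name: &str) -> Self {
        Self { schema_name: Some(name.to_string()), ..self }
    }
}

fn main() {
    let opts = WriterOptionsSketch::new().with_schema_name("geom_schema");
    assert_eq!(opts.schema_name.as_deref(), Some("geom_schema"));
    assert_eq!(opts.batch_size, 1024); // untouched field preserved
    println!("{opts:?}");
}
```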

Comment on lines +128 to +129

fn flush_geospatial_statistics(&mut self) -> Option<Box<GeospatialStatistics>>;
paleolimbot (Member, Author):

The ColumnValueEncoder was not my first choice for where to put this; however, putting it at a higher level (e.g., the ColumnMetrics) is more disruptive because then a reference to the bounder has to be passed through all the write/encode methods. I'm open to suggestions 🙂
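For context, a `flush_*` method like this typically hands back the accumulated state and resets the accumulator, which `Option::take` expresses directly. A minimal self-contained sketch of that contract (the types below are stand-ins of my own, not the actual arrow-rs trait):

```rust
// Illustrative sketch of a flush-style accumulator: flushing returns the
// accumulated statistics (if any) and leaves the accumulator empty.
#[derive(Debug, Default, PartialEq)]
struct GeoStatsSketch {
    xmin: f64,
    xmax: f64,
}

#[derive(Default)]
struct AccumulatorSketch {
    stats: Option<Box<GeoStatsSketch>>,
}

impl AccumulatorSketch {
    fn update(&mut self, x: f64) {
        // Lazily create the stats on first value, then widen the range.
        let s = self
            .stats
            .get_or_insert_with(|| Box::new(GeoStatsSketch { xmin: x, xmax: x }));
        s.xmin = s.xmin.min(x);
        s.xmax = s.xmax.max(x);
    }

    /// Analogous shape to flush_geospatial_statistics: take the stats,
    /// leaving None behind so the next chunk starts fresh.
    fn flush(&mut self) -> Option<Box<GeoStatsSketch>> {
        self.stats.take()
    }
}

fn main() {
    let mut acc = AccumulatorSketch::default();
    acc.update(1.0);
    acc.update(11.0);
    let stats = acc.flush().expect("stats accumulated");
    assert_eq!((stats.xmin, stats.xmax), (1.0, 11.0));
    assert!(acc.flush().is_none()); // second flush: state was reset
}
```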

Comment on lines +44 to +56
impl GeoStatsAccumulatorFactory for DefaultGeoStatsAccumulatorFactory {
fn new_accumulator(&self, _descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator> {
#[cfg(feature = "geospatial")]
if let Some(crate::basic::LogicalType::Geometry) = _descr.logical_type() {
Box::new(ParquetGeoStatsAccumulator::default())
} else {
Box::new(VoidGeospatialStatisticsAccumulator::default())
}

#[cfg(not(feature = "geospatial"))]
return Box::new(VoidGeospatialStatisticsAccumulator::default());
}
}
paleolimbot (Member, Author):

This was my strategy for avoiding #[cfg(...)] all over the encoders/column writer.
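The general shape of this pattern is: one factory function carries the conditional, and it always returns a trait object, handing out a no-op implementation when the feature is unavailable, so call sites stay branch-free. A minimal self-contained sketch (all names are illustrative; a `bool` stands in for the `#[cfg(...)]` check):

```rust
// Illustrative sketch of the "factory returns a no-op implementation"
// pattern: callers always receive a Box<dyn Trait> and never need to know
// whether the real feature is compiled in.
trait StatsSink {
    fn observe(&mut self, value: i32);
    fn is_active(&self) -> bool;
}

/// No-op implementation, analogous to VoidGeospatialStatisticsAccumulator.
#[derive(Default)]
struct VoidSink;

impl StatsSink for VoidSink {
    fn observe(&mut self, _value: i32) {} // intentionally does nothing
    fn is_active(&self) -> bool {
        false
    }
}

/// Real implementation, standing in for the feature-gated accumulator.
#[derive(Default)]
struct CountingSink {
    count: usize,
}

impl StatsSink for CountingSink {
    fn observe(&mut self, _value: i32) {
        self.count += 1;
    }
    fn is_active(&self) -> bool {
        true
    }
}

/// The single place where the decision is made; in the PR this branch is a
/// #[cfg(feature = "geospatial")] plus a logical-type check.
fn new_sink(feature_enabled: bool) -> Box<dyn StatsSink> {
    if feature_enabled {
        Box::new(CountingSink::default())
    } else {
        Box::new(VoidSink::default())
    }
}

fn main() {
    let mut sink = new_sink(true);
    sink.observe(7);
    assert!(sink.is_active());
    assert!(!new_sink(false).is_active());
}
```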

Comment on lines -48 to +50
-    bbox: Option<BoundingBox>,
+    pub bbox: Option<BoundingBox>,
     /// Optional list of geometry type identifiers, where None represents lack of information
-    geospatial_types: Option<Vec<i32>>,
+    pub geospatial_types: Option<Vec<i32>>,
paleolimbot (Member, Author):

I can remove these once the PR with the geospatial/thrift update merges (which exposes getters to inspect these).

Comment on lines +51 to +62
let arrow_schema = Arc::new(Schema::new(vec![Field::new(
"geom",
DataType::Binary,
true,
)]));
let batch = RecordBatch::try_new(
arrow_schema.clone(),
vec![wkb_array_xy([(1.0, 2.0), (11.0, 12.0)])],
)
.unwrap();
let expected_geometry_types = vec![1];
let expected_bounding_box = BoundingBox::new(1.0, 11.0, 2.0, 12.0);
paleolimbot (Member, Author):

The whole point of this PR (ensuring statistics are accumulated and written by the Arrow-specific byte array encoder)...

Comment on lines +115 to +117
let column_values = [wkb_item_xy(1.0, 2.0), wkb_item_xy(11.0, 12.0)].map(ByteArray::from);
let expected_geometry_types = vec![1];
let expected_bounding_box = BoundingBox::new(1.0, 11.0, 2.0, 12.0);
paleolimbot (Member, Author):

...and by the generic encoder.
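For readers unfamiliar with the test fixtures: helpers like `wkb_item_xy` presumably produce well-known binary (WKB) point values. A minimal sketch of the little-endian WKB layout for an XY point (the helper name below is my own, not from the PR):

```rust
// Minimal sketch of little-endian WKB encoding for an XY point:
// 1 byte-order flag (1 = little-endian), a u32 geometry type (1 = Point),
// then x and y as little-endian f64 values: 21 bytes total.
fn encode_wkb_point_xy(x: f64, y: f64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(21);
    buf.push(1u8); // byte-order flag: little-endian
    buf.extend_from_slice(&1u32.to_le_bytes()); // geometry type: Point
    buf.extend_from_slice(&x.to_le_bytes());
    buf.extend_from_slice(&y.to_le_bytes());
    buf
}

fn main() {
    let wkb = encode_wkb_point_xy(1.0, 2.0);
    assert_eq!(wkb.len(), 21);
    assert_eq!(wkb[0], 1); // little-endian flag
    assert_eq!(u32::from_le_bytes(wkb[1..5].try_into().unwrap()), 1);
    assert_eq!(f64::from_le_bytes(wkb[5..13].try_into().unwrap()), 1.0);
}
```

Two such points, (1.0, 2.0) and (11.0, 12.0), are consistent with the expected bounding box of (1.0, 11.0, 2.0, 12.0) in the tests above.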

@etseidl (Contributor) left a comment

Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.

Comment on lines +166 to +168
if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
*self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
etseidl (Contributor):

I think this should execute regardless of whether geo stats are enabled. The variable_length_bytes are ultimately written to the SizeStatistics which are useful even without min/max statistics.
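The accumulation idiom in the quoted snippet is worth noting: `get_or_insert(0)` initializes the `Option` on first use and then adds to it, so the running total exists only if at least one batch was observed. A self-contained sketch of that shape (the function and its length-summing stand-in for `variable_length_bytes` are illustrative):

```rust
// Illustrative sketch of the `*opt.get_or_insert(0) += delta` idiom used to
// accumulate variable_length_bytes: None means "never observed", Some(n)
// means "n bytes seen so far".
fn add_variable_length_bytes(total: &mut Option<i64>, values: &[&[u8]]) {
    // Stand-in for T::T::variable_length_bytes: sum of the value lengths.
    let var_bytes: i64 = values.iter().map(|v| v.len() as i64).sum();
    *total.get_or_insert(0) += var_bytes;
}

fn main() {
    let mut total: Option<i64> = None;
    add_variable_length_bytes(&mut total, &[b"ab".as_slice(), b"cde".as_slice()]);
    add_variable_length_bytes(&mut total, &[b"f".as_slice()]);
    assert_eq!(total, Some(6)); // 2 + 3 + 1
}
```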

drop(file_writer);

// Check that statistics exist in thrift output
thrift_metadata.row_groups[0].columns[0]
etseidl (Contributor):

Heads up that when the thrift stuff merges this will no longer be a format::FileMetaData but a file::metadata::ParquetMetaData.

@paleolimbot (Member, Author):

Thank you for the review! I will clean this up on Monday and add a few more tests.

Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Support writing GeospatialStatistics in Parquet writer
2 participants