@Kontinuation Kontinuation commented Sep 6, 2024

Rationale for this change

This is a continuation of #43196

In apache/parquet-format#240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation (apache/parquet-java#1379). This is a PR to explore what an implementation would look like in C++.

What changes are included in this PR?

We are still in the process of completing the changes needed to integrate geometry logical type support into the C++ implementation.

  • [DONE] Adding geometry logical type
  • [DONE] Adding geometry column statistics
  • [DONE] Support reading/writing parquet files containing geometry columns

Are these changes tested?

The tests added only cover very basic use cases. Comprehensive tests will be added in future commits.

Are there any user-facing changes?

Yes! (And will eventually be documented)

@github-actions github-actions bot added the awaiting change review label Sep 16, 2024
@Kontinuation
Member Author

Please ping me when ready for review again. Thanks!

@wgtmac I have added ColumnIndex and covering support. Please help review this PR when you have time, thank you.

Member

@wgtmac wgtmac left a comment

I just finished reviewing it for the 2nd pass. Thanks for the great work!

My main concern is the difference with Java PoC, which generates min/max values in the statistics and page index as if the GEOMETRY column is a pure BYTE_ARRAY column. Otherwise we need to revise the spec to add a lot of exceptions for geometry type. WDYT?

Geometry(std::string crs, LogicalType::GeometryEdges::edges edges,
LogicalType::GeometryEncoding::geometry_encoding encoding,
std::string metadata)
: LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNKNOWN),
Member

Suggested change
- : LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNKNOWN),
+ : LogicalType::Impl(LogicalType::Type::GEOMETRY, SortOrder::UNSIGNED),

Member

SortOrder::UNSIGNED is the default sort order of the BYTE_ARRAY type. Could we just use this so that you don't have to change a line in column_writer.cc? The good thing is that the ColumnIndex of the geometry type can also be generated automatically, though the min/max values are derived from the binary values and are useless. This is the same practice used in the Java PoC impl.
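
To illustrate why byte-wise min/max values say little about geometry, here is a small Python sketch (not code from this PR; the `wkb_point` helper is hypothetical). It encodes a few points as little-endian WKB and takes the min/max the way an unsigned byte-wise sort order would.

```python
import struct


def wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Byte order flag 1 = little endian, geometry type 1 = Point.
    return struct.pack("<BIdd", 1, 1, x, y)


points = [(1.0, 5.0), (-2.0, 0.0), (256.0, 1.0)]
blobs = [wkb_point(x, y) for x, y in points]

# Python compares bytes lexicographically as unsigned values, which
# mirrors an unsigned byte-wise sort order over BYTE_ARRAY values.
byte_min = min(blobs)
byte_max = max(blobs)

# Decode the x/y of the byte-wise extremes for inspection.
decoded_min = struct.unpack("<BIdd", byte_min)[2:]
decoded_max = struct.unpack("<BIdd", byte_max)[2:]

# The largest x coordinate among the input points, for comparison.
spatial_max_x = max(x for x, _ in points)
```

In this sketch the byte-wise maximum is the point with x = 1.0 rather than the one with x = 256.0, because the comparison starts at the low-order bytes of the little-endian doubles.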

Member Author

This is life-saving. Now all the special handling of geometry statistics for unknown sort order has gone away.

out.mmax = maxes[3];

if (coverings_.empty()) {
// Generate coverings from bounding box if coverings is not present
Member

When will coverings_ be empty? Is it the default behavior? I'm not sure whether we need to check that the edges are planar, since a bbox is not accurate for spherical edges. BTW, if we don't have a good implementation for coverings, I think we can just ignore them for now.

Member Author

This is for generating coverings from the bounding box when assembling the encoded representation of the geometry statistics. I've added a member called generate_covering_ to make it more explicit.
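
As a rough sketch of the idea (in Python, not the C++ code in this PR), a covering could be derived from a bounding box by writing the box as a WKB polygon. This assumes the covering payload is a WKB polygon and that edges are planar; the function name `bbox_to_wkb_polygon` is hypothetical.

```python
import struct


def bbox_to_wkb_polygon(xmin: float, ymin: float, xmax: float, ymax: float) -> bytes:
    """Encode a bounding box as a little-endian WKB polygon (hypothetical helper)."""
    # A closed ring: the first and last vertices are the same.
    ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    # Byte order 1 = little endian, geometry type 3 = Polygon, 1 ring.
    out = struct.pack("<BII", 1, 3, 1)
    out += struct.pack("<I", len(ring))
    for x, y in ring:
        out += struct.pack("<dd", x, y)
    return out


covering = bbox_to_wkb_polygon(0.0, 0.0, 10.0, 20.0)
```

For spherical edges a rectangle in coordinate space is not guaranteed to cover the geometries, which is the concern raised in the review comment above.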


class GeometryStatisticsImpl;

class PARQUET_EXPORT GeometryStatistics {
Member

Thanks for adding these!

return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
Member

Why not directly add const EncodedGeometryStatistics* geometry_statistics = NULLPTR to the end of the existing function signature?

@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Sep 19, 2024
@Kontinuation
Member Author

> I just finished reviewing it for the 2nd pass. Thanks for the great work!
>
> My main concern is the difference with Java PoC, which generates min/max values in the statistics and page index as if the GEOMETRY column is a pure BYTE_ARRAY column. Otherwise we need to revise the spec to add a lot of exceptions for geometry type. WDYT?

In the last commit I changed the min/max statistics of geometry columns to be the WKB representations of the lower-left and upper-right corners of the bounding box, following apache/iceberg#10981 (comment).
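
A minimal Python sketch of that idea, assuming 2D coordinates and little-endian WKB points (the helper names `wkb_point` and `corner_min_max` are hypothetical and this is not the C++ code in the PR):

```python
import struct


def wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    return struct.pack("<BIdd", 1, 1, x, y)


def corner_min_max(coords):
    """Return (min, max) statistics as WKB points for the bbox corners."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    lower_left = wkb_point(min(xs), min(ys))
    upper_right = wkb_point(max(xs), max(ys))
    return lower_left, upper_right


coords = [(1.0, 5.0), (-2.0, 0.0), (256.0, 1.0)]
stat_min, stat_max = corner_min_max(coords)
```

Here the min/max values carry the bounding box itself, so a reader that understands WKB can decode them into usable bounds.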

Member

@wgtmac wgtmac left a comment

Thanks for the quick change! I've left some minor comments.

Comment on lines 247 to 251
EncodedGeometryStatistics encoded_geometry_stats;
if (stats.__isset.geometry_stats) {
encoded_geometry_stats = FromThrift(stats.geometry_stats);
}
page_statistics.set_geometry(encoded_geometry_stats);
Member

Suggested change
- EncodedGeometryStatistics encoded_geometry_stats;
- if (stats.__isset.geometry_stats) {
- encoded_geometry_stats = FromThrift(stats.geometry_stats);
- }
- page_statistics.set_geometry(encoded_geometry_stats);
+ page_statistics.set_geometry(FromThrift(stats.geometry_stats));

return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
Member

IMO, we'd better not add another overload. If there is a compelling reason to do so, we can add an ARROW_DEPRECATED macro to the old one.

@Kontinuation Kontinuation force-pushed the kontinuation-parquet-geometry branch from 8a50947 to da55a55 Compare October 30, 2024 15:01
}

uint32_t value;
memcpy(&value, data_, sizeof(uint32_t));
Contributor

WKBGenericSequenceBounder(const WKBGenericSequenceBounder&) = default;

void ReadPoint(WKBBuffer* src, Dimensions::dimensions dimensions, bool swap) {
if (ARROW_PREDICT_TRUE(!swap)) {
Contributor

Are there benchmarks that show the swap template variable adds a lot of performance? I'd expect branch prediction to do a pretty good job here.

Contributor

reading the comment above, I see that might be an open question.

Member

I refactored the utility file down to about 400 lines in #45459 (and removed all the templating!)
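
For readers unfamiliar with the byte-order question in this thread, here is a hedged Python sketch of reading a WKB point with the byte order chosen at runtime from the leading flag byte. Python has no templates, so this only illustrates the runtime-branch variant; `read_wkb_point` is a hypothetical name and not a function from either PR.

```python
import struct


def read_wkb_point(buf: bytes):
    """Read a 2D WKB point, honoring its byte-order flag (hypothetical helper)."""
    # First byte: 1 = little endian, 0 = big endian.
    fmt = "<" if buf[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(fmt + "I", buf, 1)
    if geom_type != 1:
        raise ValueError(f"expected a Point (type 1), got type {geom_type}")
    # Coordinates start after the 1-byte flag and the 4-byte type.
    return struct.unpack_from(fmt + "dd", buf, 5)


little = struct.pack("<BIdd", 1, 1, 3.0, 4.0)
big = struct.pack(">BIdd", 0, 1, 3.0, 4.0)
point_from_little = read_wkb_point(little)
point_from_big = read_wkb_point(big)
```

Both encodings decode to the same coordinates; the only difference is which branch picks the format prefix.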

ptr[0] = kWkbNativeEndianness;
uint32_t geom_type = geometry::GeometryType::ToWKB(
geometry::GeometryType::geometry_type::POINT, has_z, has_m);
memcpy(&ptr[1], &geom_type, 4);
Contributor

SafeLoadAs?

Member

Done in #45459 !


namespace parquet::geometry {

constexpr double kInf = std::numeric_limits<double>::infinity();
Contributor

Small comment here: for new code, it might be nicer to use Arrow Status, so that this can potentially be moved to the main Arrow library for things not directly related to Parquet (at least the WKB generating/parsing parts).

Member

Fixed in #45459 !

paleolimbot added a commit that referenced this pull request Apr 30, 2025
…implementations (#45459)

### Rationale for this change

The GEOMETRY and GEOGRAPHY logical types are being proposed as an addition to the Parquet format.

### What changes are included in this PR?

This is a continuation of @Kontinuation's initial PR (#43977) implementing apache/parquet-format#240, which included:

- Added geometry logical types (printing, serialization, deserialization)
- Added geometry column statistics (serialization, deserialization, writing)
- Support reading/writing parquet files containing geometry columns

Changes after this were:

- Rebasing on the latest apache/arrow
- Split geography/geometry types
- Synchronize the final parameter names (e.g., no more "encoding", "edges" -> "algorithm")
- Simplify geometry_util_internal.h and use Status instead of exceptions according to suggestions from the previous PR

In order to write test files, I also:

- Implemented conversion to/from the GeoArrow extension type
- Wired the requisite options to pyarrow so that the files could be written from Python

Those last two are probably a bit much for this particular PR, and I'm happy to move them.

Some things that aren't in this PR (but should be in this one or a future PR):

- Update the bounding box logic to implement the "wraparound" bounding boxes where `xmin > xmax` (and generally make sure the stats for geography are written for trivial cases)
- Test more invalid WKB cases
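
As a rough illustration of the "wraparound" case (a sketch, not code from this PR), a reader checking whether an X value falls inside a bounding box could treat `xmin > xmax` as a box that crosses the antimeridian. The function name `x_in_bbox` is hypothetical.

```python
def x_in_bbox(x: float, xmin: float, xmax: float) -> bool:
    """Check an X value against a bbox that may wrap around (hypothetical helper)."""
    if xmin <= xmax:
        # Ordinary box: a single contiguous interval.
        return xmin <= x <= xmax
    # Wraparound box: everything from xmin eastward plus everything up to xmax.
    return x >= xmin or x <= xmax


inside_wrapped = x_in_bbox(179.0, 170.0, -170.0)
outside_wrapped = x_in_bbox(0.0, 170.0, -170.0)
inside_ordinary = x_in_bbox(5.0, 0.0, 10.0)
```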

### Are these changes tested?

Yes!

### Are there any user-facing changes?

Yes!

Example from the included Python bindings:

```python
import geopandas
import geopandas.testing
import geoarrow.pyarrow as _ # for extension type registration
import pyarrow as pa
from pyarrow import parquet

# More example files at
# https://github.com/geoarrow/geoarrow-data
gdf = geopandas.read_file(
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0-rc6/example-crs/files/example-crs_vermont-utm.fgb"
)
gdf.total_bounds
#> array([ 625858.19400305, 4733644.25036889,  775539.58040423,
#>        4989817.92403143])
gdf.crs.to_authority()
#> ('EPSG', '32618')

tab = pa.table(gdf.to_arrow())

# Use store_schema=False to explicitly check conversion to Parquet LogicalType
# This example also works with store_schema=True (the default) and without
# an explicit arrow_extensions_enabled=True on read.
parquet.write_table(tab, "vermont.parquet", store_schema=False)

f = parquet.ParquetFile("vermont.parquet", arrow_extensions_enabled=True)
f.schema
#> <pyarrow._parquet.ParquetSchema object at 0x1402e5940>
#> required group field_id=-1 schema {
#>   optional binary field_id=-1 geometry (Geometry(crs={"type":"ProjectedCRS", ...}));
#> }

f.metadata.row_group(0).column(0).geo_statistics
#> <pyarrow._parquet.GeoStatistics object at 0x127df3eb0>
#>   geospatial_types: [3]
#>   xmin: 625858.1940030524, xmax: 775539.5804042327
#>   ymin: 4733644.250368893, ymax: 4989817.92403143
#>   zmin: None, zmax: None
#>   mmin: None, mmax: None

gdf2 = geopandas.GeoDataFrame.from_arrow(f.read())
gdf2.crs.to_authority()
#> ('EPSG', '32618')

geopandas.testing.assert_geodataframe_equal(gdf2, gdf)
```

* GitHub Issue: #45522

Lead-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>