Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
136 changes: 136 additions & 0 deletions arrow-schema/src/extension/canonical/geometry.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
use crate::extension::ExtensionType;
use crate::ArrowError;

/// Geospatial features in the WKB format with linear/planar edges interpolation
#[derive(Debug, Default, Clone, PartialEq, Eq, Hash)]
pub struct Geometry {
crs: Option<String>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the verbage in the spec I'm wondering if it would be better to have crs be required instead of optional.

The default CRS OGC:CRS84 means that the geospatial features must be stored
in the order of longitude/latitude based on the WGS84 datum.

Custom CRS can be specified by a string value. It is recommended to use an
identifier-based approach like srid.

My reading of this is that CRS is actually required for all Geometry and Geography types, and users may supply a custom string, but if not it will always be String::from("OGC:CRS84"). I suspect having this be required may also simplify some of the downstream code because there will be less Option handling that needs to happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the Parquet end, the CRS is never absent (or: if omitted, we know it's lon/lat). From the GeoArrow end, an omitted CRS means "the producer does not know", the verbage of which comes from GeoParquet and basically follows what a CRS of None is in GeoPandas. All to say that I think this part is correct (but you're right that it's annoying to handle for everybody).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, so if I understand correctly this implies that it's always possible to convert Parquet geospatial types into valid GeoArrow (CRS is always known, either explicit or the spec defined default), but not all GeoArrow may be safely serialized as Parquet (an unset CRS is ambiguous).

If the above is true, it seems like this may be a trade off between ensuring correctness and broader ecosystem compatibility when serializing to parquet. This is one of those scenarios where there's probably not a "right" answer. I would personally err on the side of ensuring correctness, and error when the input Arrow data cannot be safely serialized to parquet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, an "unset" GeoArrow CRS should probably error on conversion to a Parquet type. I don't feel too strongly about this...Arrow C++ doesn't error here at the moment because it was pragmatic at the time (it allowed a suite of tests to pass without a JSON parser, which at the time required adding a C++ dependency that I wasn't sure would work).

}

impl Geometry {
/// Create a new Geometry extension type with an optional CRS.
pub fn new(crs: Option<String>) -> Self {
Self { crs }
}

/// Get the CRS of the Geometry type, if any.
pub fn crs(&self) -> Option<&str> {
self.crs.as_deref()
}
}

impl ExtensionType for Geometry {
const NAME: &'static str = "geoarrow.wkb";

type Metadata = ();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this type be String? In geoarrow-schema I have a separate Metadata struct defined, and store this type as Arc<Metadata>: https://docs.rs/geoarrow-schema/latest/geoarrow_schema/struct.WkbType.html#associatedtype.Metadata

In this case we might have different metadata for the Geometry/Geography types.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A string in theory allows for fully arbitrary metadata, which seems undesirable. I think the metadata would normalize between GEOMETRY and GEOGRAPHY if we enforced the default values for CRS and Edge Interpolation for GEOMETRY types:

GEOMETRY is used for geospatial features in the Well-Known Binary (WKB) format with linear/planar edges interpolation.

Copy link
Member Author

@kylebarron kylebarron Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important for edge interpolation, at least, to be stored under an Option, because that's the primary thing that separates a geometry from a geography (and that's how to discern whether input data with extension name geoarrow.wkb is a geometry or a geography).

FWIW it may be preferable to vendor the existing WkbType struct that's already in use in geoarrow-schema.

It's not clear that we should have separate Geometry/Geography arrow extension types. They're implemented here to parallel the Parquet implementation, but perhaps should be removed in favor of a single WkbType or similar.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that makes sense for the edge interpolation needing an Option. I didn't initially realize the spec defined edge interpolation algorithms didn't include a LINEAR variant.

My reading of the parquet spec suggests the only difference between the two types is linear vs non-linear edge interpolation. Am I missing something? If not, are you privy to why they are separate types? On the surface it feels odd to carry them as separate types.

Copy link
Member Author

@kylebarron kylebarron Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're separate types in the Parquet spec because some execution operations that are valid on Geometry types are not valid on Geography types and vice versa. And so it's useful to know in query planning whether the described operation is valid.

@paleolimbot probably has ideas on why this is the case too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Parquet spec chose to separate them because this is how those types are separated in most database systems. GeoArrow/GeoParquet predated the Parquet spec change, had fairly wide implementation support already, and thus couldn't make the breaking change to match it so here we all are 🙂 .

For what it's worth I usually just store enum EdgeType { Planar, Spherical, ... } and handle the fact that in JSON the "edge_type" key is supposed to be omitted on serialization/serialization.


fn metadata(&self) -> &Self::Metadata {
&()
}

fn serialize_metadata(&self) -> Option<String> {
None
}

fn deserialize_metadata(_metadata: Option<&str>) -> Result<Self::Metadata, ArrowError> {
Ok(())
}

fn supports_data_type(&self, data_type: &crate::DataType) -> Result<(), ArrowError> {
match data_type {
crate::DataType::Binary
| crate::DataType::LargeBinary
| crate::DataType::BinaryView => Ok(()),
data_type => Err(ArrowError::InvalidArgumentError(format!(
"Geometry data type mismatch, expected one of Binary, LargeBinary, BinaryView, found {data_type}"
))),
}
}

fn try_new(data_type: &crate::DataType, _metadata: Self::Metadata) -> Result<Self, ArrowError> {
// TODO: fix
let geo = Self { crs: None };
geo.supports_data_type(data_type)?;
Ok(geo)
}
}

/// Edge interpolation algorithm for Geography logical type
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum GeographyAlgorithm {
/// Edges are interpolated as geodesics on a sphere.
SPHERICAL,

/// <https://en.wikipedia.org/wiki/Vincenty%27s_formulae>
VINCENTY,

/// Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970
THOMAS,

/// Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
ANDOYER,

/// Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55
KARNEY,
}

/// Geospatial features in the [WKB format](https://libgeos.org/specifications/wkb/) with an
/// explicit (non-linear/non-planar) edges interpolation algorithm.
#[derive(Debug, Default, Clone, PartialEq, Eq, Hash)]
pub struct Geography {
crs: Option<String>,
algorithm: Option<GeographyAlgorithm>,
}

impl Geography {
/// Create a new Geography extension type with an optional CRS and algorithm.
pub fn new(crs: Option<String>, algorithm: Option<GeographyAlgorithm>) -> Self {
Self { crs, algorithm }
}

/// Get the CRS of the Geography type, if any.
pub fn crs(&self) -> Option<&str> {
self.crs.as_deref()
}

/// Get the edge interpolation algorithm of the Geography type, if any.
pub fn algorithm(&self) -> Option<&GeographyAlgorithm> {
self.algorithm.as_ref()
}
}

impl ExtensionType for Geography {
const NAME: &'static str = "geoarrow.wkb";

type Metadata = ();

fn metadata(&self) -> &Self::Metadata {
&()
}

fn serialize_metadata(&self) -> Option<String> {
None
}

fn deserialize_metadata(_metadata: Option<&str>) -> Result<Self::Metadata, ArrowError> {
Ok(())
}

fn supports_data_type(&self, data_type: &crate::DataType) -> Result<(), ArrowError> {
match data_type {
crate::DataType::Binary
| crate::DataType::LargeBinary
| crate::DataType::BinaryView => Ok(()),
data_type => Err(ArrowError::InvalidArgumentError(format!(
"Geography data type mismatch, expected one of Binary, LargeBinary, BinaryView, found {data_type}"
))),
}
}

fn try_new(data_type: &crate::DataType, _metadata: Self::Metadata) -> Result<Self, ArrowError> {
// TODO: fix
let geo = Self::default();
geo.supports_data_type(data_type)?;
Ok(geo)
}
}
2 changes: 2 additions & 0 deletions arrow-schema/src/extension/canonical/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,8 @@ mod uuid;
pub use uuid::Uuid;
mod variable_shape_tensor;
pub use variable_shape_tensor::{VariableShapeTensor, VariableShapeTensorMetadata};
mod geometry;
pub use geometry::{Geography, GeographyAlgorithm, Geometry};

use crate::{ArrowError, Field};

Expand Down
58 changes: 57 additions & 1 deletion parquet/src/arrow/schema/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ use std::sync::Arc;

use arrow_ipc::writer;
#[cfg(feature = "arrow_canonical_extension_types")]
use arrow_schema::extension::{Json, Uuid};
use arrow_schema::extension::{Geography, Geometry, Json, Uuid};
use arrow_schema::{DataType, Field, Fields, Schema, TimeUnit};

use crate::basic::{
Expand Down Expand Up @@ -399,9 +399,32 @@ pub fn parquet_to_arrow_field(parquet_column: &ColumnDescriptor) -> Result<Field
}
#[cfg(feature = "arrow_canonical_extension_types")]
if let Some(logical_type) = basic_info.logical_type() {
use arrow_schema::extension::Geography;

match logical_type {
LogicalType::Uuid => ret.try_with_extension_type(Uuid)?,
LogicalType::Json => ret.try_with_extension_type(Json::default())?,
LogicalType::Geometry { crs } => ret.try_with_extension_type(Geometry::new(crs))?,
LogicalType::Geography { crs, algorithm } => {
use arrow_schema::extension::GeographyAlgorithm;

use crate::format::EdgeInterpolationAlgorithm;

let algorithm = match algorithm {
Some(EdgeInterpolationAlgorithm::ANDOYER) => Some(GeographyAlgorithm::ANDOYER),
Some(EdgeInterpolationAlgorithm::KARNEY) => Some(GeographyAlgorithm::KARNEY),
Some(EdgeInterpolationAlgorithm::SPHERICAL) => {
Some(GeographyAlgorithm::SPHERICAL)
}
Some(EdgeInterpolationAlgorithm::THOMAS) => Some(GeographyAlgorithm::THOMAS),
Some(EdgeInterpolationAlgorithm::VINCENTY) => {
Some(GeographyAlgorithm::VINCENTY)
}
None => None,
_ => None,
};
Comment on lines +408 to +425
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor comment is that it would be nice to encapsulate the geography types a bit more, for so, for example, this looks something more like the following

Suggested change
LogicalType::Geography { crs, algorithm } => {
use arrow_schema::extension::GeographyAlgorithm;
use crate::format::EdgeInterpolationAlgorithm;
let algorithm = match algorithm {
Some(EdgeInterpolationAlgorithm::ANDOYER) => Some(GeographyAlgorithm::ANDOYER),
Some(EdgeInterpolationAlgorithm::KARNEY) => Some(GeographyAlgorithm::KARNEY),
Some(EdgeInterpolationAlgorithm::SPHERICAL) => {
Some(GeographyAlgorithm::SPHERICAL)
}
Some(EdgeInterpolationAlgorithm::THOMAS) => Some(GeographyAlgorithm::THOMAS),
Some(EdgeInterpolationAlgorithm::VINCENTY) => {
Some(GeographyAlgorithm::VINCENTY)
}
None => None,
_ => None,
};
LogicalType::Geography { crs, algorithm } => {
let algorithm = algorithm.map(|algorithm| EdgeInterpolationAlgorithm::from);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment applies below

ret.try_with_extension_type(Geography::new(crs, algorithm))?
}
_ => {}
}
}
Expand Down Expand Up @@ -606,6 +629,39 @@ fn arrow_to_parquet_type(field: &Field, coerce_types: bool) -> Result<Type> {
Type::primitive_type_builder(name, PhysicalType::BYTE_ARRAY)
.with_repetition(repetition)
.with_id(id)
.with_logical_type(
#[cfg(feature = "arrow_canonical_extension_types")]
if let Ok(t) = field.try_extension_type::<Geometry>() {
Some(LogicalType::Geometry {
crs: t.crs().map(|s| s.to_string()),
})
} else if let Ok(t) = field.try_extension_type::<Geography>() {
Some(LogicalType::Geography {
crs: t.crs().map(|s| s.to_string()),
algorithm: t.algorithm().map(|alg| match alg {
arrow_schema::extension::GeographyAlgorithm::ANDOYER => {
crate::format::EdgeInterpolationAlgorithm::ANDOYER
}
arrow_schema::extension::GeographyAlgorithm::KARNEY => {
crate::format::EdgeInterpolationAlgorithm::KARNEY
}
arrow_schema::extension::GeographyAlgorithm::SPHERICAL => {
crate::format::EdgeInterpolationAlgorithm::SPHERICAL
}
arrow_schema::extension::GeographyAlgorithm::THOMAS => {
crate::format::EdgeInterpolationAlgorithm::THOMAS
}
arrow_schema::extension::GeographyAlgorithm::VINCENTY => {
crate::format::EdgeInterpolationAlgorithm::VINCENTY
}
}),
})
} else {
None
},
#[cfg(not(feature = "arrow_canonical_extension_types"))]
None,
)
.build()
}
DataType::FixedSizeBinary(length) => {
Expand Down
38 changes: 26 additions & 12 deletions parquet/src/basic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ use std::str::FromStr;
use std::{fmt, str};

pub use crate::compression::{BrotliLevel, GzipLevel, ZstdLevel};
use crate::format as parquet;
use crate::format::{self as parquet, EdgeInterpolationAlgorithm, GeographyType, GeometryType};

use crate::errors::{ParquetError, Result};

Expand Down Expand Up @@ -231,9 +231,18 @@ pub enum LogicalType {
/// A Variant value.
Variant,
/// A geospatial feature in the Well-Known Binary (WKB) format with linear/planar edges interpolation.
Geometry,
Geometry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this would be a breaking API change I think so we either have to wait for 57 in October or make some backwards compatibility adjustments

Copy link
Member Author

@kylebarron kylebarron Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, finding this was a bummer. I figure that we'll have to wait till v57 to merge, because we need to be able to carry on the information from the GeometryType and GeographyType.

/// A custom CRS. If unset, it defaults to "OGC:CRS84", which means that the geometries
/// must be stored in longitude, latitude based on the WGS84 datum.
crs: Option<String>,
},
/// A geospatial feature in the WKB format with an explicit (non-linear/non-planar) edges interpolation.
Geography,
Geography {
/// A custom CRS. If unset, the CRS defaults to "OGC:CRS84".
crs: Option<String>,
/// Edge interpolation method.
algorithm: Option<EdgeInterpolationAlgorithm>,
},
}

// ----------------------------------------------------------------------
Expand Down Expand Up @@ -584,9 +593,9 @@ impl ColumnOrder {
LogicalType::Unknown => SortOrder::UNDEFINED,
LogicalType::Uuid => SortOrder::UNSIGNED,
LogicalType::Float16 => SortOrder::SIGNED,
LogicalType::Variant | LogicalType::Geometry | LogicalType::Geography => {
SortOrder::UNDEFINED
}
LogicalType::Variant
| LogicalType::Geometry { .. }
| LogicalType::Geography { .. } => SortOrder::UNDEFINED,
},
// Fall back to converted type
None => Self::get_converted_sort_order(converted_type, physical_type),
Expand Down Expand Up @@ -850,8 +859,11 @@ impl From<parquet::LogicalType> for LogicalType {
parquet::LogicalType::UUID(_) => LogicalType::Uuid,
parquet::LogicalType::FLOAT16(_) => LogicalType::Float16,
parquet::LogicalType::VARIANT(_) => LogicalType::Variant,
parquet::LogicalType::GEOMETRY(_) => LogicalType::Geometry,
parquet::LogicalType::GEOGRAPHY(_) => LogicalType::Geography,
parquet::LogicalType::GEOMETRY(t) => LogicalType::Geometry { crs: t.crs },
parquet::LogicalType::GEOGRAPHY(t) => LogicalType::Geography {
crs: t.crs,
algorithm: t.algorithm,
},
Comment on lines +862 to +866
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be fixed to do a small CRS transform here, since Parquet defines a string: projjson:<crs body>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Arrow C++ this was a bit tricky to deal with:

  • projjson:crs_key defines a CRS where the PROJJSON payload is stored in the Parquet schema metadata. This requires passing the schema metadata into this function (or something between this LogicalType and the Arrow Field output), which by the looks of it is a bit tricky to do here. In Arrow C++ we do handle this case (i.e., we'll return read a Parquet projjson:crs_key as GeoArrow whose CRS payload is the actual PROJJSON that was in the Parquet schema metadata.
  • srid:1234 can be converted to GeoArrow as {"crs": "1234", "crs_type": "srid"}.
  • <any other string> can be converted to GeoArrow as {"crs": "<any other string>"}. In Arrow C++ I also made the choice to ensure that if <any other string> happened to be valid JSON, that it shouldn't be double-escaped as a JSON string.

There are test files with all three of these in the apache/parquet-testing repo 🙂

https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/cpp/src/parquet/geospatial/util_json_internal.cc#L105-L125

}
}
}
Expand Down Expand Up @@ -894,8 +906,10 @@ impl From<LogicalType> for parquet::LogicalType {
LogicalType::Uuid => parquet::LogicalType::UUID(Default::default()),
LogicalType::Float16 => parquet::LogicalType::FLOAT16(Default::default()),
LogicalType::Variant => parquet::LogicalType::VARIANT(Default::default()),
LogicalType::Geometry => parquet::LogicalType::GEOMETRY(Default::default()),
LogicalType::Geography => parquet::LogicalType::GEOGRAPHY(Default::default()),
LogicalType::Geometry { crs } => parquet::LogicalType::GEOMETRY(GeometryType { crs }),
LogicalType::Geography { crs, algorithm } => {
parquet::LogicalType::GEOGRAPHY(GeographyType { crs, algorithm })
}
Comment on lines +909 to +912
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And likewise for converting back from GeoArrow CRS to parquet logical type CRS

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Arrow C++ I tried to detect crs values that were definitely lon/lat and ensure the Parquet logical type was "unset" to maximize compatibility with readers that didn't want to or couldn't handle CRS values.

  • "EPSG:4326"
  • "OGC:CRS84"
  • {..., "id": {"authority": "EPSG", "code": 4326}
  • {..., "id": {"authority": "EPSG", "code": "4326"}
  • {..., "id": {"authority": "OGC", "code": "CRS84"}

https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/cpp/src/parquet/geospatial/util_json_internal.cc#L47-L81

}
}
}
Expand Down Expand Up @@ -948,8 +962,8 @@ impl From<Option<LogicalType>> for ConvertedType {
LogicalType::Uuid
| LogicalType::Float16
| LogicalType::Variant
| LogicalType::Geometry
| LogicalType::Geography
| LogicalType::Geometry { .. }
| LogicalType::Geography { .. }
| LogicalType::Unknown => ConvertedType::NONE,
},
None => ConvertedType::NONE,
Expand Down
4 changes: 2 additions & 2 deletions parquet/src/schema/printer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -327,8 +327,8 @@ fn print_logical_and_converted(
LogicalType::Map => "MAP".to_string(),
LogicalType::Float16 => "FLOAT16".to_string(),
LogicalType::Variant => "VARIANT".to_string(),
LogicalType::Geometry => "GEOMETRY".to_string(),
LogicalType::Geography => "GEOGRAPHY".to_string(),
LogicalType::Geometry { .. } => "GEOMETRY".to_string(),
LogicalType::Geography { .. } => "GEOGRAPHY".to_string(),
LogicalType::Unknown => "UNKNOWN".to_string(),
},
None => {
Expand Down
Loading