@wgtmac
Member

wgtmac commented May 11, 2024

What changes were proposed in this pull request?

Add geometry and geography types to Apache ORC.

Why are the changes needed?

Geospatial support is missing from Apache ORC, even though many popular databases, query engines, and computing frameworks already provide it.

How was this patch tested?

N/A

wgtmac marked this pull request as ready for review May 12, 2024 08:15
wgtmac changed the title from "WIP: Add geometry type" to "ORC-1717: Add geometry type" May 12, 2024
@dongjoon-hyun
Member

Thank you, @wgtmac . Is it for Apache ORC 1.1.0? If this is final, I'm +1.

@dongjoon-hyun
Member

Also, cc @williamhyun

@wgtmac
Member Author

wgtmac commented Jul 19, 2024

@dongjoon-hyun This is not finalized yet. I will update it once ready.

@dongjoon-hyun
Member

Thank you! Once it's finalized, please send a heads-up email to dev@orc once more.

@dongjoon-hyun
Member

Thank you for updating. Is this ready, @wgtmac ?

@wgtmac
Member Author

wgtmac commented Aug 22, 2024

Yes, I believe so. And I will start PoC implementations later.

@dongjoon-hyun
Member

Got it. When do you want to release orc-format v1.1?

  1. Now to help the PoC?
  2. After finishing PoC?

@wgtmac
Member Author

wgtmac commented Aug 22, 2024

I think we can wait until both Java and C++ PoC implementations are finished.

@dongjoon-hyun
Member

+1 for the decision. Thank you for the confirmation.

wgtmac changed the title from "ORC-1717: Add geometry type" to "ORC-1717: Add geometry and geography types" Apr 3, 2025
@dongjoon-hyun
Member

Thank you. Is this ready, @wgtmac ?

@wgtmac
Member Author

wgtmac commented Apr 7, 2025

Yes, this is ready for review @dongjoon-hyun

cc @ffacs

optional TimestampStatistics timestampStatistics = 9;
optional bool hasNull = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
Member

Can we handle the two lines above independently, since they are unrelated to geometry, @wgtmac ?

Member Author

I've created #22 to address this.

// Statistics specific to Geometry or Geography type
message GeospatialStatistics {
  // A bounding box of geospatial instances
  optional BoundingBox bbox = 1;
Member

Is bbox a well-known name? I'm just wondering if we can use bounding_box like the other field names.

Member Author

Yes, bbox is a well-known geospatial acronym.

Member

dongjoon-hyun left a comment

Thank you, @wgtmac .

I have three comments.

  1. bytes_on_disk and collection_statistics (https://github.com/apache/orc-format/pull/18/files#r2032264096)
  2. bbox (https://github.com/apache/orc-format/pull/18/files#r2032265481)
  3. This PR is proposed to support Apache Iceberg. However, are you sure that Apache Iceberg community keeps Apache ORC support in the next table format?

If we are not sure about (3), let's revise the PR description to avoid mentioning the Apache Iceberg community.
I believe this PR is meaningful to the Apache ORC community on its own. I'm +1 for this additional feature.

cc @williamhyun too

@dongjoon-hyun
Member

Gentle ping, @wgtmac .

@wgtmac
Member Author

wgtmac commented Apr 9, 2025

> Are you sure that Apache Iceberg community keeps Apache ORC support in the next table format?

I'm not sure about the long-term goal. The recent discussion about the file format API still considers Apache ORC: https://lists.apache.org/thread/ovyh52m2b6c1hrg4fhw3rx92bzr793n2

@dongjoon-hyun
Member

Thank you, @wgtmac .

I'm referring to the latest Apache Iceberg community sync-up meeting on March 16th, which superseded the above discussion from February 12th.

https://youtu.be/9BBZKTfcU0s?t=44m10s

@wgtmac
Member Author

wgtmac commented Apr 9, 2025

Thanks for the info! I just watched that video and understand the concern from the Iceberg community. There are several issues, such as missing features (variant, geometry, etc.) and implementation gaps (default values, schema evolution, etc.). For the latter, does anyone in the Apache ORC community know the details?

Member

dongjoon-hyun left a comment

+1, LGTM. Thank you for updating, @wgtmac .

dongjoon-hyun merged commit e807a18 into apache:main Apr 9, 2025
2 checks passed
dongjoon-hyun added this to the 1.1.0 milestone Apr 9, 2025
@dongjoon-hyun
Member

Since this is an improvement without a breaking change, shall we release as Apache ORC v1.1.0? WDYT, @wgtmac ?

@wgtmac
Member Author

wgtmac commented Apr 10, 2025

@dongjoon-hyun Sounds good! I think this can help the implementation a lot.

dongjoon-hyun added a commit to apache/orc that referenced this pull request Apr 18, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade ORC Format to 1.1.0.

### Why are the changes needed?

To bring in the latest features and bug fixes.
- https://github.com/apache/orc-format/milestone/2?closed=1
  - apache/orc-format#18
  - apache/orc-format#19

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #2190 from dongjoon-hyun/ORC-1876.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit to apache/orc that referenced this pull request Apr 18, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade ORC Format to 1.1.0.

### Why are the changes needed?

To bring in the latest features and bug fixes.
- https://github.com/apache/orc-format/milestone/2?closed=1
  - apache/orc-format#18
  - apache/orc-format#19

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #2190 from dongjoon-hyun/ORC-1876.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit f5e2413)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit to apache/spark that referenced this pull request Apr 18, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC Format to 1.1.0 for Apache Spark 4.1.0.

### Why are the changes needed?

Apache ORC Format v1.1.0 was released on 2025-04-18.
- https://github.com/apache/orc-format/releases/tag/v1.1.0

To bring in the latest features and bug fixes.
- https://github.com/apache/orc-format/milestone/2?closed=1
  - apache/orc-format#18
  - apache/orc-format#19

### Does this PR introduce _any_ user-facing change?

No, there is no behavior change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50588 from dongjoon-hyun/SPARK-51801.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>