[C++][ORC] Fix timestamp type mapping between orc and arrow #34590

@wgtmac

Describe the enhancement requested

Background: There was an effort to fix inconsistent timestamp types across different SQL-on-Hadoop engines: https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q

Apache ORC provides two timestamp types:

  • TIMESTAMP: timestamp type without timezone; the timestamp value is stored in the writer timezone.
  • TIMESTAMP_INSTANT: timestamp type with local timezone; the timestamp value is stored in the UTC timezone.

arrow::TimestampType has an optional timezone field: https://github.com/apache/arrow/blob/main/cpp/src/arrow/type.h#L1385

  • If a timezone is provided, values are normalized to UTC.
  • If the timezone is missing, values can be in any timezone.

Therefore, the type mapping should be as below:

  • orc::TIMESTAMP <=> arrow::TimestampType w/o timezone
  • orc::TIMESTAMP_INSTANT <=> arrow::TimestampType w/ timezone
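
To make the proposed mapping concrete, here is a minimal, self-contained sketch. It does not use the real Arrow or ORC headers: `OrcTimestampKind`, `ArrowTimestampSketch`, `OrcToArrow`, and `ArrowToOrc` are hypothetical stand-ins invented for illustration, and attaching `"UTC"` as the timezone string for TIMESTAMP_INSTANT is an assumption, not something this issue specifies.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the two ORC timestamp kinds discussed above.
enum class OrcTimestampKind { TIMESTAMP, TIMESTAMP_INSTANT };

// Hypothetical stand-in for arrow::TimestampType, reduced to its
// optional timezone field. An empty optional means "no timezone".
struct ArrowTimestampSketch {
  std::optional<std::string> timezone;
};

// ORC -> Arrow: only TIMESTAMP_INSTANT yields a timezone-aware type.
ArrowTimestampSketch OrcToArrow(OrcTimestampKind kind) {
  switch (kind) {
    case OrcTimestampKind::TIMESTAMP:
      return ArrowTimestampSketch{std::nullopt};
    case OrcTimestampKind::TIMESTAMP_INSTANT:
      // Assumption: values are stored in UTC, so "UTC" is used as the zone.
      return ArrowTimestampSketch{std::string("UTC")};
  }
  throw std::invalid_argument("unknown ORC timestamp kind");
}

// Arrow -> ORC: the presence of a timezone selects TIMESTAMP_INSTANT.
OrcTimestampKind ArrowToOrc(const ArrowTimestampSketch& type) {
  return type.timezone.has_value() ? OrcTimestampKind::TIMESTAMP_INSTANT
                                   : OrcTimestampKind::TIMESTAMP;
}
```

With this sketch, converting either ORC kind to the Arrow stand-in and back returns the original kind, which is the round-trip property the mapping above is meant to guarantee.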

Component(s)

C++
