@@ -86,10 +86,19 @@
*/
public class SchemaConverter {

// Indicates if Int96 should be converted to Arrow Timestamp
private final boolean convertInt96ToArrowTimestamp;

/**
* For when we'll need this to be configurable
*/
public SchemaConverter() {
this(false);
}

// TODO(PARQUET-1511): pass the parameters in a configuration object
public SchemaConverter(final boolean convertInt96ToArrowTimestamp) {
this.convertInt96ToArrowTimestamp = convertInt96ToArrowTimestamp;
Contributor

Will it be possible that in the future we need to convert INT96 to some other Arrow type (or handle any other type conversion)? If that's likely to happen, perhaps it would be better to pass the parameter in a configuration object (such as org.apache.hadoop.conf.Configuration).

Member

In the context of Parquet files, INT96 was really only used for nanosecond timestamps, so we probably don't need to provide an option to convert it to any other type.

Please note that the use of INT96 is discouraged in general. There is now a new TIMESTAMP_NANOS type to replace it, and TIMESTAMP_MILLIS is the general default for timestamps nowadays.
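
As a side illustration (not part of this PR), here is a minimal sketch of how a 12-byte INT96 value decodes to nanoseconds since the Unix epoch, assuming the layout commonly written for such timestamps: 8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day number. This is why Timestamp(NANOSECOND) is the natural Arrow target; the class and method names below are invented for the example.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Int96DecodeSketch {
  // Julian day number of the Unix epoch (1970-01-01).
  private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;
  private static final long NANOS_PER_DAY = 86_400L * 1_000_000_000L;

  // Decodes a 12-byte INT96 value into nanoseconds since the Unix epoch.
  static long toEpochNanos(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();              // first 8 bytes
    long julianDay = buf.getInt() & 0xFFFFFFFFL;  // last 4 bytes
    return (julianDay - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanosOfDay;
  }
}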

Contributor

In practice, INT96 is only used for timestamps; however, its semantics depend on which component wrote the file. Hive, Spark and Impala all contain a check like footerFileMetaData.getCreatedBy().startsWith("parquet-mr"), because if the file was written by Hive, Spark, or any other application using the parquet-mr library, the timestamps are normalized to UTC, whereas a file written by Impala uses LocalDateTime semantics.

Member

We need to pass this later on as an option from the outside. The current routines don't know anything about the file creator. @yongyanw can you add this as a TODO comment and open a JIRA for it? I don't expect that we can solve this in this PR.

Contributor Author

PARQUET-1511 was created and the TODO comment was added.
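
To make the earlier point about file creators concrete, here is a minimal sketch (not part of this PR; the class and enum names are invented) of how a reader could later pick INT96 timestamp semantics from the footer's createdBy string, following the check mentioned above:

import org.apache.parquet.hadoop.metadata.FileMetaData;

final class Int96SemanticsSketch {
  enum Semantics { UTC_NORMALIZED, LOCAL_DATE_TIME }

  // Files written through parquet-mr (Hive, Spark, ...) carry UTC-normalized
  // INT96 timestamps; files written by Impala use LocalDateTime semantics.
  static Semantics semanticsFor(FileMetaData footerFileMetaData) {
    String createdBy = footerFileMetaData.getCreatedBy();
    return (createdBy != null && createdBy.startsWith("parquet-mr"))
        ? Semantics.UTC_NORMALIZED
        : Semantics.LOCAL_DATE_TIME;
  }
}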

}

/**
@@ -492,8 +501,11 @@ private String getTimeZone(LogicalTypeAnnotation.TimestampLogicalTypeAnnotation

@Override
public TypeMapping convertINT96(PrimitiveTypeName primitiveTypeName) throws RuntimeException {
// Possibly timestamp
return field(new ArrowType.Binary());
if (convertInt96ToArrowTimestamp) {
return field(new ArrowType.Timestamp(TimeUnit.NANOSECOND, null));
} else {
return field(new ArrowType.Binary());
}
}

@Override
@@ -47,6 +47,7 @@
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.FLOAT;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT96;

import java.io.IOException;
import java.util.List;
@@ -439,6 +440,27 @@ public void testParquetFixedBinaryToArrowDecimal() {
Assert.assertEquals(expected, converter.fromParquet(parquet).getArrowSchema());
}

@Test
public void testParquetInt96ToArrowBinary() {
MessageType parquet = Types.buildMessage()
.addField(Types.optional(INT96).named("a")).named("root");
Schema expected = new Schema(asList(
field("a", new ArrowType.Binary())
));
Assert.assertEquals(expected, converter.fromParquet(parquet).getArrowSchema());
}

@Test
public void testParquetInt96ToArrowTimestamp() {
final SchemaConverter converterInt96ToTimestamp = new SchemaConverter(true);
MessageType parquet = Types.buildMessage()
.addField(Types.optional(INT96).named("a")).named("root");
Schema expected = new Schema(asList(
field("a", new ArrowType.Timestamp(TimeUnit.NANOSECOND, null))
));
Assert.assertEquals(expected, converterInt96ToTimestamp.fromParquet(parquet).getArrowSchema());
}

@Test(expected = IllegalStateException.class)
public void testParquetInt64TimeMillisToArrow() {
converter.fromParquet(Types.buildMessage()