Skip to content

Conversation

@yongtang
Copy link
Member

This fix is the continuation of PR tensorflow/tensorflow#19461, which adds Apache Parquet format support in tensorflow-io.

The PR tensorflow/tensorflow#19461 essentially was cherry-picked with some additional fix ups to correct import path.

This fix fixes #12

@yongtang yongtang force-pushed the parquet branch 3 times, most recently from d23eda7 to 4d02894 Compare December 13, 2018 02:12
@yongtang
Copy link
Member Author

The last commit in this PR is not necessary, once the other PR #22 is merged in.

Apache Parquet is a widely used columnar storage format available in
the Hadoop ecosystem. Its popularity spans across big data community
with usage in many production environments. For example, most of our
big data projects use Parquet format for storage on S3 processed by
Spark/EMR. We started transition some of projects to TensorFlow.
However, as there was no Parquet support in TensorFlow, we have to
transform our existing Parquet data to formats accepted by TensorFlow first.

This PR is a preliminary attempt to add Apache Parquet support for
TensorFlow Dataset. It may not cover all the use cases though it could
be served as a starting point for further improvement in the future.

The ParquetDataset depends on parquet-cpp (Apache) project as well
as other dependencies. The ParquetDataset only builds on Linux at the
moment. This PR also adds the option in `./configure` so
that those dependencies could be skipped.

Signed-off-by: Yong Tang <[email protected]>
@yongtang
Copy link
Member Author

The PR is updated to capture PR #22, think the PR is complete.

@yongtang
Copy link
Member Author

/cc @mrry, also /cc @BryanCutler as I think some of the external libraries (e.g., arrow.BUILD and boost.BUILD) may help the Apache Arrow support.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I just noticed a couple minor things while skimming the PR

deps = [
"@local_config_tf//:libtensorflow_framework",
"@local_config_tf//:tf_header_lib",
"@parquet//:parquet",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: alignment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler, the alignment has been updated.

@@ -0,0 +1,7719 @@
diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this file intentional to add?

Copy link
Member

@BryanCutler BryanCutler Dec 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, I see it is applied as part of the parquet build. Mind if I ask what needed to be patched?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler. The patch file actually just added two files:

src/parquet/parquet_types.cpp
src/parquet/parquet_types.h

Those two files are thrift type files. The files are generated automatically. This is also a one time effort for each version and will not change when regenerate. However, it need both flex and bison to generate. I tried to build flex and bison through bazel but realized that they are more complicated than thrift itself. So I think maybe we could just build those two thrift files offline (and not to build flex and bison to generate those two files in compile time).

Also, I could not find an option in Bazel to support "override" or "additional extra" file on top of the http_archive to add those two files in the build. So I use a patch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, sounds good. Thanks for the explanation!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to document how to build this patch, either as a comment block or a separate "build instructions" file? It might be useful if we want to refresh this patch down the road.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we want to check in the thrift type files used to generate this patch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler @mhong The PR has been updated with detailed steps to generate the parquet.patch. Now it is possible to reproduce the parquet.patch by running the following command in third_party directory:

docker run -i -t --rm -v $PWD:/v -w /v ubuntu:16.04 bash -x /v/parquet.type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhong The parquet.thrift file (used to generate the parquet_types.[cpp|h]) is part of the parquet source code src/parquet/parquet.thrift in apache-parquet-cpp-1.4.0.tar.gz.

Signed-off-by: Yong Tang <[email protected]>
@yongtang
Copy link
Member Author

/cc @mhong for Apache Parquet support.

@BryanCutler
Copy link
Member

@yongtang does this need to be linked in the root BUILD file here to be included in the pip package?

as otherwise it will not be included in the dist.

Signed-off-by: Yong Tang <[email protected]>
@yongtang
Copy link
Member Author

@BryanCutler Thanks! Yes adding to root BUILD file is needed. The PR has been updated.

Copy link

@mhong mhong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you for the great work! Left some minor comments.

@@ -0,0 +1,7719 @@
diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to document how to build this patch, either as a comment block or a separate "build instructions" file? It might be useful if we want to refresh this patch down the road.

@@ -0,0 +1,7719 @@
diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we want to check in the thrift type files used to generate this patch?

// Loop until we find a row to read or there are no more files left to
// read
while (true) {
if (parquet_reader_) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: consider inverting the if condition (if (!parquet_reader_)) to reduce indentation?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to further reduce indentation, should we refactor some code into a helper function, say a helper to read all row groups from the current file (the body of the while loop below)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhong The if (parquet_reader_) may not return at the end and has a fullback out of the if scope. So I think it may not be easy to change the order.

mutex_lock l(mu_);

// Loop until we find a row to read or there are no more files left to
// read
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider extending this comment a bit more to describe the conceptual hierarchy. If I understand correctly, the hierarchical levels from coasest to finest are: current_file_index_ -> current_row_group_ -> current_row_?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhong Done. The comment has been updated.

}

template <typename DType>
Status FillTensorValue(parquet::ColumnReader* column_reader,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this and some functions below be private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhong Done. Thanks for the suggestion.

.Attr("output_types: list(type) >= 1")
.Attr("output_shapes: list(shape) >= 1")
.SetIsStateful()
.SetShapeFn(shape_inference::ScalarShape); // TODO (yongtang): check that
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i saw some shape checking code above. Has this TODO been addressed?

v2 = i * 1000 * 1000 * 1000 * 1000
v4 = 1.1 * i
v5 = 1.1111111 * i
self.assertAllClose((v0, v1, v2, v4, v5), sess.run(get_next))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice!

so that those files could be generated.

Signed-off-by: Yong Tang <[email protected]>
@yongtang
Copy link
Member Author

@BryanCutler @mhong Thanks for the review. The PR has been updated.

@mhong
Copy link

mhong commented Dec 22, 2018

LGTM. Thank you!

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

#
# We use the following step to generate the parquet_types.h and parquet_types.cpp files:
# - In third_party directory, run `docker run -i -t --rm -v $PWD:/v -w /v ubuntu:16.04 bash -x /v/parquet.type`
# - Once complete, a parquet.patch file will be generated which could be used as a patch in bazel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this note!

@yongtang yongtang merged commit 24eb10d into tensorflow:master Dec 22, 2018
@yongtang yongtang deleted the parquet branch December 22, 2018 05:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support parquet format for tensorflow-io

4 participants