Support Apache Parquet format #21

yongtang · 2018-12-13T00:02:05Z

This PR is the continuation of PR tensorflow/tensorflow#19461 and adds Apache Parquet format support to tensorflow-io.

PR tensorflow/tensorflow#19461 was essentially cherry-picked, with some additional fix-ups to correct the import paths.

This fix fixes #12

yongtang · 2018-12-13T02:50:03Z

The last commit in this PR will no longer be necessary once PR #22 is merged.

Apache Parquet is a widely used columnar storage format in the Hadoop ecosystem. It is popular across the big data community and is used in many production environments. For example, most of our big data projects store data in Parquet format on S3 and process it with Spark/EMR. We have started transitioning some of our projects to TensorFlow. However, because there was no Parquet support in TensorFlow, we had to convert our existing Parquet data to formats accepted by TensorFlow first.

This PR is a preliminary attempt to add Apache Parquet support for TensorFlow Dataset. It may not cover all use cases, but it can serve as a starting point for further improvement.

The ParquetDataset depends on the Apache parquet-cpp project as well as other dependencies, and it only builds on Linux at the moment. This PR also adds an option in `./configure` so that those dependencies can be skipped.

Signed-off-by: Yong Tang <[email protected]>

yongtang · 2018-12-13T15:52:52Z

The PR has been updated to pick up PR #22. I think the PR is now complete.

Signed-off-by: Yong Tang <[email protected]>

yongtang · 2018-12-13T17:28:21Z

/cc @mrry, also /cc @BryanCutler as I think some of the external libraries (e.g., arrow.BUILD and boost.BUILD) may help the Apache Arrow support.

BryanCutler

LGTM, I just noticed a couple minor things while skimming the PR

BryanCutler · 2018-12-13T20:13:56Z

tensorflow_io/parquet/BUILD

+    deps = [
+        "@local_config_tf//:libtensorflow_framework",
+        "@local_config_tf//:tf_header_lib",
+	"@parquet//:parquet",


nit: alignment

Thanks @BryanCutler, the alignment has been updated.

BryanCutler · 2018-12-13T21:57:22Z

third_party/parquet.patch

@@ -0,0 +1,7719 @@
+diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp


Was it intentional to add this file?

nvm, I see it is applied as part of the parquet build. Mind if I ask what needed to be patched?

Thanks @BryanCutler. The patch file actually just added two files:

src/parquet/parquet_types.cpp src/parquet/parquet_types.h

Those two files are Thrift type files and are generated automatically. Generating them is a one-time effort for each version, and the output does not change when they are regenerated. However, generating them needs both flex and bison. I tried to build flex and bison through Bazel but realized that they are more complicated to build than Thrift itself. So I think we could just generate those two Thrift files offline (rather than building flex and bison to generate them at compile time).

Also, I could not find an option in Bazel to support "override" or "additional extra" file on top of the http_archive to add those two files in the build. So I use a patch.
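To illustrate what a patch that only adds new files looks like, here is a minimal sketch. It uses Python's difflib as a stand-in for the `diff -ru -p1 --new-file` command that appears in the patch header, so the output format is similar but not identical. The helper name `new_file_patch` and the placeholder file content are assumptions made for illustration only; the real generated files are thousands of lines long.

```python
import difflib


def new_file_patch(path: str, content: str) -> str:
    """Build a unified diff that adds `path`, treating the original as empty.

    Hypothetical helper: this mimics the idea behind `diff --new-file`,
    where a file missing on one side is compared as if it were empty.
    """
    new_lines = content.splitlines(keepends=True)
    diff = difflib.unified_diff(
        [],            # the file does not exist upstream, so "before" is empty
        new_lines,     # "after" is the full generated file
        fromfile=path,
        tofile=path,
    )
    return "".join(diff)


# Placeholder content standing in for the generated Thrift type header.
patch = new_file_patch(
    "src/parquet/parquet_types.h",
    "// generated by the thrift compiler\n#pragma once\n",
)
print(patch)
```

Because the "before" side is empty, the hunk header starts at `-0,0`, which is the same shape as the `@@ -0,0 +1,7719 @@` header in third_party/parquet.patch.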

Ah ok, sounds good. Thanks for the explanation!

Would it be useful to document how to build this patch, either as a comment block or a separate "build instructions" file? It might be useful if we want to refresh this patch down the road.

Also, do we want to check in the thrift type files used to generate this patch?

@BryanCutler @mhong The PR has been updated with detailed steps to generate the parquet.patch. Now it is possible to reproduce the parquet.patch by running the following command in third_party directory:

docker run -i -t --rm -v $PWD:/v -w /v ubuntu:16.04 bash -x /v/parquet.type

@mhong The parquet.thrift file (used to generate the parquet_types.[cpp|h]) is part of the parquet source code src/parquet/parquet.thrift in apache-parquet-cpp-1.4.0.tar.gz.

Signed-off-by: Yong Tang <[email protected]>

yongtang · 2018-12-13T23:35:09Z

/cc @mhong for Apache Parquet support.

BryanCutler · 2018-12-14T23:29:54Z

@yongtang does this need to be linked in the root BUILD file here to be included in the pip package?

yongtang · 2018-12-15T01:10:51Z

@BryanCutler Thanks! Yes, adding it to the root BUILD file is needed. The PR has been updated.

mhong

LGTM

Thank you for the great work! Left some minor comments.

mhong · 2018-12-18T19:27:00Z

third_party/parquet.patch

@@ -0,0 +1,7719 @@
+diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp


Would it be useful to document how to build this patch, either as a comment block or a separate "build instructions" file? It might be useful if we want to refresh this patch down the road.

mhong · 2018-12-18T19:27:42Z

third_party/parquet.patch

@@ -0,0 +1,7719 @@
+diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp


Also, do we want to check in the thrift type files used to generate this patch?

mhong · 2018-12-18T19:30:28Z

tensorflow_io/parquet/kernels/parquet_dataset_ops.cc

+        // Loop until we find a row to read or there are no more files left to
+        // read
+        while (true) {
+          if (parquet_reader_) {


minor: consider inverting the if condition (if (!parquet_reader_)) to reduce indentation?

Also to further reduce indentation, should we refactor some code into a helper function, say a helper to read all row groups from the current file (the body of the while loop below)?

@mhong The if (parquet_reader_) block may not return at its end; it can fall through to the code after the if scope. So I think it may not be easy to invert the condition.

mhong · 2018-12-18T19:34:36Z

tensorflow_io/parquet/kernels/parquet_dataset_ops.cc

+        mutex_lock l(mu_);
+
+        // Loop until we find a row to read or there are no more files left to
+        // read


Consider extending this comment a bit more to describe the conceptual hierarchy. If I understand correctly, the hierarchical levels from coarsest to finest are: current_file_index_ -> current_row_group_ -> current_row_?

@mhong Done. The comment has been updated.
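The three-level hierarchy discussed above can be sketched as follows. This is a simplified Python illustration, not the C++ kernel from this PR: the nested lists stand in for files, row groups, and rows, and the function name `iterate_rows` is hypothetical. Only the three cursor names mirror the member variables mentioned in the review comment.

```python
from typing import Iterator, List

# Hypothetical in-memory stand-in: a list of files, each a list of
# row groups, each a list of row values.
Files = List[List[List[int]]]


def iterate_rows(files: Files) -> Iterator[int]:
    """Yield rows in order, advancing file -> row group -> row cursors."""
    current_file_index = 0   # coarsest level
    current_row_group = 0    # middle level, reset when the file changes
    current_row = 0          # finest level, reset when the row group changes

    # Loop until we find a row to read or there are no more files left.
    while True:
        if current_file_index >= len(files):
            return  # no more files left to read

        row_groups = files[current_file_index]
        if current_row_group >= len(row_groups):
            # Current file is exhausted: move to the next file.
            current_file_index += 1
            current_row_group = 0
            current_row = 0
            continue

        rows = row_groups[current_row_group]
        if current_row >= len(rows):
            # Current row group is exhausted: move to the next row group.
            current_row_group += 1
            current_row = 0
            continue

        yield rows[current_row]
        current_row += 1


# Two row groups in the first file, an empty file, then one more file.
result = list(iterate_rows([[[1, 2], [3]], [], [[4]]]))
print(result)  # [1, 2, 3, 4]
```

Each level resets the finer cursors when it advances, so empty files and empty row groups are skipped without special cases.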

mhong · 2018-12-18T19:38:45Z

tensorflow_io/parquet/kernels/parquet_dataset_ops.cc

+      }
+
+      template <typename DType>
+      Status FillTensorValue(parquet::ColumnReader* column_reader,


should this and some functions below be private?

@mhong Done. Thanks for the suggestion.

mhong · 2018-12-18T19:40:28Z

tensorflow_io/parquet/ops/dataset_ops.cc

+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetIsStateful()
+    .SetShapeFn(shape_inference::ScalarShape);  // TODO (yongtang): check that


I saw some shape checking code above. Has this TODO been addressed?

mhong · 2018-12-18T19:41:25Z

tensorflow_io/parquet/python/kernel_tests/parquet_test.py

+          v2 = i * 1000 * 1000 * 1000 * 1000
+          v4 = 1.1 * i
+          v5 = 1.1111111 * i
+          self.assertAllClose((v0, v1, v2, v4, v5), sess.run(get_next))


yongtang · 2018-12-21T03:29:06Z

@BryanCutler @mhong Thanks for the review. The PR has been updated.

mhong · 2018-12-22T01:57:39Z

LGTM. Thank you!

BryanCutler

LGTM

BryanCutler · 2018-12-22T04:47:06Z

WORKSPACE

+#
+# We use the following step to generate the parquet_types.h and parquet_types.cpp files:
+#  - In third_party directory, run `docker run -i -t --rm -v $PWD:/v -w /v ubuntu:16.04 bash -x /v/parquet.type`
+#  - Once complete, a parquet.patch file will be generated which could be used as a patch in bazel


Thanks for adding this note!

yongtang force-pushed the parquet branch 3 times, most recently from d23eda7 to 4d02894 on December 13, 2018 02:12

yongtang added 6 commits December 13, 2018 15:48

Register ParquetDataset to dataset_ops.cc

1667fe3

Signed-off-by: Yong Tang <[email protected]>

Add python wrapper for ParquetDataset

2aa05d2

Signed-off-by: Yong Tang <[email protected]>

Expose tensorflow_io.parquet.ParquetDataset in tensorflow_io namespace

ccc829b

Signed-off-by: Yong Tang <[email protected]>

Add test case based on sample parquet file parquet_cpp_example.parquet

15e59e9

Signed-off-by: Yong Tang <[email protected]>

Add Bazel BUILD file for tensorflow_io/parquet/BUILD

3b3e3c6

Signed-off-by: Yong Tang <[email protected]>

yongtang force-pushed the parquet branch from 4d02894 to 2ff2f67 on December 13, 2018 15:48

yongtang added 8 commits December 13, 2018 15:53

Add Snappy BUILD file

a2732f3

Signed-off-by: Yong Tang <[email protected]>

Add Apache Arrow BUILD file

6fc1370

Signed-off-by: Yong Tang <[email protected]>

Add Boost C++ Library BUILD file.

aba397e

Signed-off-by: Yong Tang <[email protected]>

Add Apache Thrift Library BUILD file

e7ef1de

Signed-off-by: Yong Tang <[email protected]>

Add Apache Parquet library BUILD file

5ad836e

Signed-off-by: Yong Tang <[email protected]>

Fix parquet_dataset_ops.py file

543da95

Signed-off-by: Yong Tang <[email protected]>

Update parquet_test.py fiel

add8b87

Signed-off-by: Yong Tang <[email protected]>

Add __init__.py files to parquet modules

0018608

Signed-off-by: Yong Tang <[email protected]>

yongtang force-pushed the parquet branch from 2ff2f67 to 0018608 on December 13, 2018 15:53

yongtang mentioned this pull request Dec 13, 2018

Add Apache Parquet support for TensorFlow Dataset tensorflow/tensorflow#19461

Closed

BryanCutler reviewed Dec 13, 2018

Fix alignment

0232009

Signed-off-by: Yong Tang <[email protected]>

Add //tensorflow_io/parquet:parquet_py to BUILD file

2e7afd0

as otherwise it will not be included in the dist. Signed-off-by: Yong Tang <[email protected]>

mhong approved these changes Dec 18, 2018

yongtang added 2 commits December 21, 2018 02:58

Add instructons to build parquet_types.h and parquet_types.cpp

3e42daf

so that those files could be generated. Signed-off-by: Yong Tang <[email protected]>

Address further review feedbacks

1532441

Signed-off-by: Yong Tang <[email protected]>

BryanCutler mentioned this pull request Dec 21, 2018

Add Apache Arrow dataset support #36

Merged

BryanCutler approved these changes Dec 22, 2018

yongtang merged commit 24eb10d into tensorflow:master Dec 22, 2018

yongtang deleted the parquet branch December 22, 2018 05:38
