-
Notifications
You must be signed in to change notification settings - Fork 306
Support Apache Parquet format #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d23eda7 to
4d02894
Compare
|
The last commit in this PR is not necessary, once the other PR #22 is merged in. |
Apache Parquet is a widely used columnar storage format available in the Hadoop ecosystem. Its popularity spans across big data community with usage in many production environments. For example, most of our big data projects use Parquet format for storage on S3 processed by Spark/EMR. We started transition some of projects to TensorFlow. However, as there was no Parquet support in TensorFlow, we have to transform our existing Parquet data to formats accepted by TensorFlow first. This PR is a preliminary attempt to add Apache Parquet support for TensorFlow Dataset. It may not cover all the use cases though it could be served as a starting point for further improvement in the future. The ParquetDataset depends on parquet-cpp (Apache) project as well as other dependencies. The ParquetDataset only builds on Linux at the moment. This PR also adds the option in `./configure` so that those dependencies could be skipped. Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
|
The PR is updated to capture PR #22, think the PR is complete. |
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
|
/cc @mrry, also /cc @BryanCutler as I think some of the external libraries (e.g., arrow.BUILD and boost.BUILD) may help the Apache Arrow support. |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I just noticed a couple minor things while skimming the PR
tensorflow_io/parquet/BUILD
Outdated
| deps = [ | ||
| "@local_config_tf//:libtensorflow_framework", | ||
| "@local_config_tf//:tf_header_lib", | ||
| "@parquet//:parquet", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: alignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler, the alignment has been updated.
third_party/parquet.patch
Outdated
| @@ -0,0 +1,7719 @@ | |||
| diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this file intentional to add?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, I see it is applied as part of the parquet build. Mind if I ask what needed to be patched?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler. The patch file actually just added two files:
src/parquet/parquet_types.cpp
src/parquet/parquet_types.h
Those two files are thrift type files. The files are generated automatically. This is also a one time effort for each version and will not change when regenerate. However, it need both flex and bison to generate. I tried to build flex and bison through bazel but realized that they are more complicated than thrift itself. So I think maybe we could just build those two thrift files offline (and not to build flex and bison to generate those two files in compile time).
Also, I could not find an option in Bazel to support "override" or "additional extra" file on top of the http_archive to add those two files in the build. So I use a patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, sounds good. Thanks for the explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to document how to build this patch, either as a comment block or a separate "build instructions" file? It might be useful if we want to refresh this patch down the road.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we want to check in the thrift type files used to generate this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler @mhong The PR has been updated with detailed steps to generate the parquet.patch. Now it is possible to reproduce the parquet.patch by running the following command in third_party directory:
docker run -i -t --rm -v $PWD:/v -w /v ubuntu:16.04 bash -x /v/parquet.type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhong The parquet.thrift file (used to generate the parquet_types.[cpp|h]) is part of the parquet source code src/parquet/parquet.thrift in apache-parquet-cpp-1.4.0.tar.gz.
Signed-off-by: Yong Tang <[email protected]>
|
/cc @mhong for Apache Parquet support. |
as otherwise it will not be included in the dist. Signed-off-by: Yong Tang <[email protected]>
|
@BryanCutler Thanks! Yes adding to root BUILD file is needed. The PR has been updated. |
mhong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you for the great work! Left some minor comments.
third_party/parquet.patch
Outdated
| @@ -0,0 +1,7719 @@ | |||
| diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to document how to build this patch, either as a comment block or a separate "build instructions" file? It might be useful if we want to refresh this patch down the road.
third_party/parquet.patch
Outdated
| @@ -0,0 +1,7719 @@ | |||
| diff -ru -p1 --new-file src/parquet/parquet_types.cpp src/parquet/parquet_types.cpp | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we want to check in the thrift type files used to generate this patch?
| // Loop until we find a row to read or there are no more files left to | ||
| // read | ||
| while (true) { | ||
| if (parquet_reader_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: consider inverting the if condition (if (!parquet_reader_)) to reduce indentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also to further reduce indentation, should we refactor some code into a helper function, say a helper to read all row groups from the current file (the body of the while loop below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhong The if (parquet_reader_) may not return at the end and has a fullback out of the if scope. So I think it may not be easy to change the order.
| mutex_lock l(mu_); | ||
|
|
||
| // Loop until we find a row to read or there are no more files left to | ||
| // read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider extending this comment a bit more to describe the conceptual hierarchy. If I understand correctly, the hierarchical levels from coasest to finest are: current_file_index_ -> current_row_group_ -> current_row_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhong Done. The comment has been updated.
| } | ||
|
|
||
| template <typename DType> | ||
| Status FillTensorValue(parquet::ColumnReader* column_reader, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this and some functions below be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhong Done. Thanks for the suggestion.
| .Attr("output_types: list(type) >= 1") | ||
| .Attr("output_shapes: list(shape) >= 1") | ||
| .SetIsStateful() | ||
| .SetShapeFn(shape_inference::ScalarShape); // TODO (yongtang): check that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i saw some shape checking code above. Has this TODO been addressed?
| v2 = i * 1000 * 1000 * 1000 * 1000 | ||
| v4 = 1.1 * i | ||
| v5 = 1.1111111 * i | ||
| self.assertAllClose((v0, v1, v2, v4, v5), sess.run(get_next)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice!
so that those files could be generated. Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
|
@BryanCutler @mhong Thanks for the review. The PR has been updated. |
|
LGTM. Thank you! |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| # | ||
| # We use the following step to generate the parquet_types.h and parquet_types.cpp files: | ||
| # - In third_party directory, run `docker run -i -t --rm -v $PWD:/v -w /v ubuntu:16.04 bash -x /v/parquet.type` | ||
| # - Once complete, a parquet.patch file will be generated which could be used as a patch in bazel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this note!
This fix is the continuation of PR tensorflow/tensorflow#19461, which adds Apache Parquet format support in tensorflow-io.
The PR tensorflow/tensorflow#19461 essentially was cherry-picked with some additional fix ups to correct import path.
This fix fixes #12