[FLINK-27805][Connectors/ORC] bump orc version to 1.7.8 #22481
Conversation
@lirui-apache @JingsongLi Does one of you want to review this PR?
Also cc @pnowojski / @akalash, who might be interested.
Please add @liujiawinds as a co-author [1] to the commit.
    <exclusions>
      <exclusion>
        <groupId>ch.qos.reload4j</groupId>
        <artifactId>reload4j</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-reload4j</artifactId>
      </exclusion>
    </exclusions>
Why do we need these excludes?
That's because we don't allow Reload4J dependencies due to their conflict with Log4j 2 -- we use maven-enforcer rules for that (Line 1773 in 85efa13):

    <message>Log4j 1 and Reload4J dependencies are not allowed because they conflict with Log4j 2. If the dependency absolutely requires the Log4j 1 API, use 'org.apache.logging.log4j:log4j-1.2-api'.</message>
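For illustration, a minimal sketch of what such a maven-enforcer bannedDependencies rule can look like (an illustrative configuration, not a copy of Flink's actual pom; the banned coordinates are assumptions based on the discussion above):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-enforcer-plugin</artifactId>
      <executions>
        <execution>
          <id>ban-log4j1-and-reload4j</id>
          <goals>
            <goal>enforce</goal>
          </goals>
          <configuration>
            <rules>
              <bannedDependencies>
                <excludes>
                  <!-- Log4j 1 and its Reload4j fork conflict with Log4j 2 on the classpath. -->
                  <exclude>log4j:log4j</exclude>
                  <exclude>ch.qos.reload4j:reload4j</exclude>
                  <exclude>org.slf4j:slf4j-log4j12</exclude>
                  <exclude>org.slf4j:slf4j-reload4j</exclude>
                </excludes>
                <message>Log4j 1 and Reload4J dependencies are not allowed because they conflict with Log4j 2.</message>
              </bannedDependencies>
            </rules>
          </configuration>
        </execution>
      </executions>
    </plugin>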
Review threads were opened (and resolved) on:
- ...rc-nohive/src/test/java/org/apache/flink/orc/nohive/OrcColumnarRowSplitReaderNoHiveTest.java
- flink-formats/flink-orc/src/test/java/org/apache/flink/orc/OrcColumnarRowSplitReaderTest.java
- flink-formats/flink-orc/src/test/java/org/apache/flink/orc/util/OrcBulkWriterTestUtil.java (outdated)
- flink-formats/flink-orc/src/test/java/org/apache/flink/orc/util/OrcBulkWriterTestUtil.java (outdated)
    // Don't close the internal stream here to avoid
    // Stream Closed or ClosedChannelException when Flink performs a checkpoint.
Is this tested in any way? How do we close files then, to avoid leaking resources?
I have the same question -- although it looks like the original customized PhysicalWriterImpl does the same thing.
Exactly, it's the same functionality -- looks like we need the stream open for snapshotting; it is then cleaned up as part of the snapshotContext.closeExceptionally method.
I also replicated the ClosedChannelException issue described above (it appears when the stream is not kept open) in the existing tests, so I believe we are good here.
PS: we also do the same for other formats, e.g., Avro.
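To make the pattern concrete, here is a minimal sketch (a hypothetical class, not the actual Flink writer) of a close() that flushes but deliberately leaves the caller-owned stream open:

    import java.io.IOException;
    import java.io.OutputStream;

    /**
     * Sketch of a writer whose close() flushes its own buffers but leaves the
     * underlying stream open: the stream's lifecycle is owned by the caller
     * (in Flink's case the sink), which still needs it for checkpointing and
     * closes it via its own cleanup path (e.g. closeExceptionally).
     */
    final class KeepStreamOpenWriter {
        private final OutputStream out; // owned by the caller, not by this writer

        KeepStreamOpenWriter(OutputStream out) {
            this.out = out;
        }

        void write(byte[] data) throws IOException {
            out.write(data);
        }

        void close() throws IOException {
            // Flush buffered bytes, but do NOT call out.close() here: closing the
            // caller-owned stream would trigger ClosedChannelException when Flink
            // later performs a checkpoint against the same stream.
            out.flush();
        }
    }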
Thank you, @pgaref and @dmvk. cc @williamhyun
    * <p>NOTE: If the ORC dependency version is updated, this file may have to be updated as well to be
    * in sync with the new version's PhysicalFsWriter.
@pgaref quick question before I dive deeper:
I assume that the new PhysicalFsWriter using the provided FSDataOutputStream has exactly the same functionality as what was implemented with the original custom PhysicalWriterImpl and NoHivePhysicalWriterImpl? I did not do a line-by-line cross-check, but, for example, this Javadoc in the original PhysicalWriterImpl has me wondering.
Hey @tzulitai, that's correct -- the original (removed) PhysicalWriterImpl was a copy of ORC's PhysicalFsWriter with added support for FSDataOutputStream: https://github.com/apache/orc/blob/a85b4c8852a894a701ddb73c15fb84ed1035abb9/java/core/src/java/org/apache/orc/impl/PhysicalFsWriter.java
ORC-1198 recently introduced a PhysicalFsWriter constructor that takes an FSDataOutputStream as a parameter, so there is no need to maintain this internally anymore 🥳
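For reference, a rough sketch of how a writer can now be wired against a stream instead of a Path. The constructor shape is assumed from the ORC-1198 description (stream, options, encryption variants) -- check the 1.7.x sources for the exact signature; in Flink, the sink-managed stream would additionally be adapted to Hadoop's FSDataOutputStream:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.impl.PhysicalFsWriter;
    import org.apache.orc.impl.writer.WriterEncryptionVariant;

    final class OrcStreamWriterSketch {
        // Sketch only: hand an already-open, caller-managed stream to ORC directly,
        // via the constructor introduced by ORC-1198, instead of copying and
        // maintaining a PhysicalWriter implementation in Flink.
        static OrcFile.WriterOptions optionsFor(FSDataOutputStream stream) throws IOException {
            TypeDescription schema = TypeDescription.fromString("struct<_col0:int,_col1:string>");
            OrcFile.WriterOptions opts =
                    OrcFile.writerOptions(new Configuration()).setSchema(schema);
            // Assumed ORC-1198 signature: (FSDataOutputStream, WriterOptions,
            // WriterEncryptionVariant[]); an empty array means no column encryption.
            opts.physicalWriter(new PhysicalFsWriter(stream, opts, new WriterEncryptionVariant[0]));
            return opts;
        }
    }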
(force-pushed from 85efa13 to 446bb61)
@flinkbot run azure
If we need some patches in order to help this PR, the Apache ORC community can cut RC2 accordingly to help the Apache Flink community.
Hey @dongjoon-hyun -- thanks for keeping an eye on this!
Thanks @dongjoon-hyun and @pgaref. I will keep an eye on it and prepare RC2 if required. BTW, it would be better to test 1.7.9-RC1 instead of 1.7.9-SNAPSHOT, but I think they have the same content except for the version.
(force-pushed from ed0f291 to e4d2d89)
FYI @dongjoon-hyun @wgtmac we had a green run for 1.7.9. Switching back to 1.7.8 to get this PR merged; I will create a new ticket for the 1.7.9 bump when it's ready.
It's great! Thank you so much, @pgaref.
BTW, could you cast your +1 with that information on the Apache ORC 1.7.9 vote thread?
Sure, will do.
Also cc @williamhyun once more.
Co-authored-by: Jia Liu <[email protected]>
tzulitai left a comment:
Code changes look good. To summarize what I gathered from understanding the PR changes:
1. Main change: the previously custom-maintained NoHivePhysicalWriterImpl and PhysicalWriterImpl were removed, in favor of the new PhysicalFsWriter that takes an FSDataOutputStream for instantiation. The code in the old PhysicalWriterImpl was already a copy of ORC's PhysicalFsWriter (with the addition of accepting an FSDataOutputStream), so we are not losing any functional features or introducing behavioral changes with this PR.
2. Test change #1: renamed all f* field names to _col* due to the naming convention change in ORC 1.7.x (see the sketch after this list).
3. Test change #2: extended the ORC writer/reader tests to cover the new compression schemes in the new ORC version.
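As a concrete illustration of the rename in test change #1 (the schema strings here are made up for illustration, not taken from the actual tests):

    import org.apache.orc.TypeDescription;

    final class FieldNameConvention {
        // Test schemas/expectations moved from the f0, f1, ... naming used in the
        // ORC 1.5.x-era tests to the _col0, _col1, ... convention of ORC 1.7.x.
        static final TypeDescription OLD_STYLE =
                TypeDescription.fromString("struct<f0:int,f1:string>");
        static final TypeDescription NEW_STYLE =
                TypeDescription.fromString("struct<_col0:int,_col1:string>");
    }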
I'm slightly worried about 2) specifically, i.e. this field name convention change in the ORC upgrade. Would it cause any compatibility issues for Flink?
Context:
- Flink filesystem sinks use 2PC to write with exactly-once guarantees.
- This means that a Flink savepoint/checkpoint may contain staged "pre-committed" files in ORC format waiting to be committed, if the Flink job uses a sink that writes using ORC (see the sketch after this list).
- When restoring from that savepoint, those pre-committed files will be resumed from.
So, imagine the savepoint was taken with a Flink version that was using ORC 1.5.x, but then the savepoint was restored with a Flink version using ORC 1.7.x. Would the field naming convention changes be an issue there?
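For context, a hedged sketch of the kind of sink setup this question concerns; MyRecord, the vectorizer, and the output path are hypothetical, while FileSink, OrcBulkWriterFactory, and OnCheckpointRollingPolicy are the Flink APIs involved:

    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.orc.vector.Vectorizer;
    import org.apache.flink.orc.writer.OrcBulkWriterFactory;
    import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

    final class OrcSinkSketch {
        static final class MyRecord {} // hypothetical record type

        // Part files roll on every checkpoint; files staged between checkpoints are
        // exactly the "pre-committed" files that end up referenced by a savepoint
        // and are resumed/committed after restore.
        static FileSink<MyRecord> orcSink(Vectorizer<MyRecord> vectorizer) {
            return FileSink
                    .forBulkFormat(new Path("hdfs:///tmp/orc-out"),
                            new OrcBulkWriterFactory<>(vectorizer))
                    .withRollingPolicy(OnCheckpointRollingPolicy.build())
                    .build();
        }
    }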
This PR is being marked as stale since it has not had any activity in the last 180 days. If you are having difficulty finding a reviewer, please reach out to the community. If this PR is no longer valid or desired, please feel free to close it.
This PR has been closed since it has not had any activity in 120 days.
https://issues.apache.org/jira/browse/FLINK-27805
Apache ORC 1.5.x is EOL -- the last hotfix release happened in Sep 2021: https://orc.apache.org/news/2021/09/15/ORC-1.5.13/
We need to bump to 1.7.x -- Release Notes
ORC now supports writers backed by an FSDataOutputStream (instead of just paths previously), so we can clean up NoHivePhysicalWriterImpl and PhysicalWriterImpl.