
Conversation

@pgaref
Contributor

@pgaref pgaref commented Apr 25, 2023

https://issues.apache.org/jira/browse/FLINK-27805

Apache ORC 1.5.x is EOL -- the last maintenance release was in Sep 2021: https://orc.apache.org/news/2021/09/15/ORC-1.5.13/

We need to bump to 1.7.x -- Release Notes
ORC now supports writers backed by an FSDataOutputStream (previously only paths were accepted), so we can clean up NoHivePhysicalWriterImpl and PhysicalWriterImpl.

@flinkbot
Collaborator

flinkbot commented Apr 25, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure -- re-run the last Azure build

@MartijnVisser
Contributor

@lirui-apache @JingsongLi Does one of you want to review this PR?

@pgaref
Contributor Author

pgaref commented Apr 25, 2023

Also cc @pnowojski / @akalash, who might be interested

@dmvk
Member

dmvk commented May 3, 2023

Please add @liujiawinds as a co-author [1] to the commit.

[1] https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors

Comment on lines +97 to +106
<exclusions>
<exclusion>
<groupId>ch.qos.reload4j</groupId>
<artifactId>reload4j</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-reload4j</artifactId>
</exclusion>
</exclusions>
Member

Why do we need these excludes?

Contributor Author

@pgaref pgaref May 4, 2023

That's because we don't allow Reload4J dependencies, due to their conflict with Log4j 2 -- we use a maven-enforcer rule for that:

flink/pom.xml, line 1773 in 85efa13:

<message>Log4j 1 and Reload4J dependencies are not allowed because they conflict with Log4j 2. If the dependency absolutely requires the Log4j 1 API, use 'org.apache.logging.log4j:log4j-1.2-api'.</message>
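For context, a maven-enforcer bannedDependencies rule of the kind referenced above looks roughly like this. This is a sketch, not Flink's exact configuration -- the execution id, the banned coordinates, and the message text are illustrative:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <!-- hypothetical execution id -->
      <id>ban-log4j1-and-reload4j</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <!-- ban Log4j 1 and Reload4J in all scopes -->
              <exclude>log4j:log4j</exclude>
              <exclude>ch.qos.reload4j:reload4j</exclude>
              <exclude>org.slf4j:slf4j-log4j12</exclude>
              <exclude>org.slf4j:slf4j-reload4j</exclude>
            </excludes>
            <message>Log4j 1 and Reload4J dependencies are not allowed because they conflict with Log4j 2.</message>
          </bannedDependencies>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With such a rule active, any transitive Reload4J dependency (like the one pulled in by the new ORC version) fails the build unless it is excluded, which is why the `<exclusions>` block above is needed.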

Comment on lines +43 to +44
// Don't close the internal stream here to avoid
// Stream Closed or ClosedChannelException when Flink performs checkpoint.
Member

Is this tested in any way? How do we close files then, to avoid leaking resources?

Contributor

I have the same question - although it looks like the original customized PhysicalWriterImpl does the same thing

Contributor Author

@pgaref pgaref May 4, 2023

Exactly, it's the same functionality -- it looks like we need the stream open for snapshotting, and it is then cleaned up as part of the snapshotContext.closeExceptionally method.

I also reproduced the ClosedChannelException issue described above in the existing tests (it surfaces when the stream is closed here), so I believe we are good.

PS: we also do the same for other formats, e.g., Avro
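To illustrate the pattern under discussion: a minimal sketch (hypothetical class name, not Flink's actual implementation) of a wrapper whose close() flushes buffered bytes but deliberately leaves the underlying stream open, so a writer can be closed without closing the checkpoint-managed stream:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: close() flushes but does NOT close the wrapped stream,
// leaving its lifecycle to the surrounding framework (e.g. checkpointing).
class NonClosingOutputStream extends FilterOutputStream {
    NonClosingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void close() throws IOException {
        // Flush pending bytes, but keep the underlying stream open.
        flush();
    }
}

public class Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream inner = new ByteArrayOutputStream();
        OutputStream wrapped = new NonClosingOutputStream(inner);
        wrapped.write("orc".getBytes(StandardCharsets.US_ASCII));
        wrapped.close(); // flushes only; inner remains writable
        inner.write('!'); // would fail on a truly closed channel
        System.out.println(inner.toString()); // prints orc!
    }
}
```

The real code achieves the same effect by simply not calling close() on the internal stream, rather than wrapping it; the sketch just shows why a writer's close must not propagate to a stream that a later checkpoint still needs.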

@dongjoon-hyun
Member

Thank you, @pgaref and @dmvk.

cc @williamhyun

Comment on lines -53 to -54
* <p>NOTE: If the ORC dependency version is updated, this file may have to be updated as well to be
* in sync with the new version's PhysicalFsWriter.
Contributor

@pgaref quick question before I dive deeper:

I assume that the new PhysicalFsWriter, using the provided FSDataOutputStream, has exactly the same functionality as what was implemented with the original custom PhysicalWriterImpl and NoHivePhysicalWriterImpl? I did not do a line-by-line cross check, but, for example, this Javadoc in the original PhysicalWriterImpl has me wondering.

Contributor Author

Hey @tzulitai, that's correct: the original (removed) PhysicalWriterImpl was a copy of ORC's PhysicalFsWriter with added support for FSDataOutputStream: https://github.com/apache/orc/blob/a85b4c8852a894a701ddb73c15fb84ed1035abb9/java/core/src/java/org/apache/orc/impl/PhysicalFsWriter.java

ORC-1198 recently introduced a PhysicalFsWriter constructor that accepts an FSDataOutputStream as a parameter, so there is no need to maintain this copy internally anymore 🥳

@pgaref pgaref changed the title [FLINK-27805] bump orc version to 1.7.8 [FLINK-27805][Connectors/ORC] bump orc version to 1.7.8 May 4, 2023
@pgaref pgaref force-pushed the FLINK-27805 branch 3 times, most recently from 85efa13 to 446bb61 Compare May 4, 2023 06:10
@pgaref
Contributor Author

pgaref commented May 4, 2023

@flinkbot run azure

@dongjoon-hyun
Member

BTW, @pgaref, as you know, @wgtmac started Apache ORC 1.7.9 RC1. Can we test RC1 in this PR?

@dongjoon-hyun
Member

If we need some patches in order to help this PR, Apache ORC community can cut RC2 accordingly to help Apache Flink community.

@pgaref
Contributor Author

pgaref commented May 4, 2023

Hey @dongjoon-hyun -- thanks for keeping an eye on this!
Just triggered a run on 1.7.9-SNAPSHOT

@wgtmac
Member

wgtmac commented May 5, 2023

Thanks @dongjoon-hyun and @pgaref. I will keep an eye on it and prepare RC2 if required.

BTW, it would be better to test 1.7.9-RC1 instead of 1.7.9-SNAPSHOT. But I think they have the same content except the version.

@pgaref pgaref force-pushed the FLINK-27805 branch 2 times, most recently from ed0f291 to e4d2d89 Compare May 5, 2023 06:04
@pgaref
Contributor Author

pgaref commented May 5, 2023

Thanks @dongjoon-hyun and @pgaref. I will keep an eye on it and prepare RC2 if required.

BTW, it would be better to test 1.7.9-RC1 instead of 1.7.9-SNAPSHOT. But I think they have the same content except the version.

FYI @dongjoon-hyun @wgtmac we had a green run for 1.7.9.

Switching back to 1.7.8 to get this PR merged, and I will create a new ticket for the 1.7.9 bump when it's ready.

@dongjoon-hyun
Member

It's great! Thank you so much, @pgaref.

@dongjoon-hyun
Member

BTW, could you cast your +1 with that information on Apache ORC 1.7.9 vote thread?

@pgaref
Contributor Author

pgaref commented May 5, 2023

BTW, could you cast your +1 with that information on Apache ORC 1.7.9 vote thread?

Sure, will do

@dongjoon-hyun
Member

Also, cc @williamhyun once more.

@pgaref
Contributor Author

pgaref commented May 5, 2023

@tzulitai / @dmvk can you please take another look? I think we are almost there.

Contributor

@tzulitai tzulitai left a comment

Code changes look good. To summarize what I gathered from understanding the PR changes:

  1. Main change: the previously custom-maintained NoHivePhysicalWriterImpl and PhysicalWriterImpl were removed, in favor of the new PhysicalFsWriter that takes an FSDataOutputStream for instantiation. The code in the old PhysicalWriterImpl was already a copy of ORC's PhysicalFsWriter (with the addition of accepting an FSDataOutputStream), so we are not losing any functional features and have no behavioral changes with this PR.
  2. Test change #1: renamed all f* field names to _col* due to the naming convention change in ORC 1.7.x.
  3. Test change #2: extended the ORC writer/reader tests to cover the new compression schemes in the new ORC version.

I'm slightly worried about 2. specifically, i.e. the field name convention change in the ORC upgrade. Would it cause any compatibility issues for Flink?

Context:

  • Flink filesystem sinks use 2PC to write with exactly-once guarantees.
  • This means that a Flink savepoint/checkpoint may contain staged "pre-committed" files with ORC format waiting to be committed, if the Flink job uses a sink that writes using ORC.
  • When restoring from that savepoint, those pre-commit files will be resumed from.

So, imagine if the savepoint was taken with a Flink version that was using ORC 1.5.x, but then the savepoint was restored using a Flink version that was using ORC 1.7.x. Would that be an issue with the field naming convention changes?

@github-actions

This PR is being marked as stale since it has not had any activity in the last 180 days.
If you would like to keep this PR alive, please leave a comment asking for a review.
If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out to the
community, contact details can be found here: https://flink.apache.org/what-is-flink/community/

If this PR is no longer valid or desired, please feel free to close it.
If no activity occurs in the next 90 days, it will be automatically closed.

@github-actions github-actions bot added the stale label Jan 16, 2025
@github-actions

github-actions bot commented Apr 3, 2025

This PR has been closed since it has not had any activity in 120 days.
If you feel like this was a mistake, or you would like to continue working on it,
please feel free to re-open the PR and ask for a review.
