Skip to content

Conversation

@liupc
Copy link

@liupc liupc commented Jan 2, 2019

What's the problem?

Snappy compressor/decompressor will not release direct buffer in time, it might cause direct memory oom, especially for those application whose direct memory is limited while heap memory is redundant.

What changes were proposed in this pull request?

This PR is to fix this issue by calling cleaner.clean to release directly memory in SnappyCompressor/SnappyDecompressor, it's implemented through reflection to solve incompatibility issues on java9+.

This PR also includes some minor changes: add initial buffer size to boost the memory allocation and avoid memory wasting in SnappyCompressor/SnappyDecompressor.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this code is that it is incompatible with java9+. We are currently working to make the build compatible with java11.
If the fix of this issue is urgent I would suggest using reflection to get the Cleaner and invoke its clean method. Otherwise, I would suggest postponing the fix until we are on java11 an then we can use the official java.lang.ref.Cleaner.

@liupc
Copy link
Author

liupc commented Feb 9, 2019

@gszadovszky Thanks for review, I will update. This is an urgent problem, it may causes direct memory OOM, I'd prefer your first suggestion -- using reflection to refactor this code.
I was also wondering if there are some compatibility problems(e.g. runtime library changes) when moving to java11, can it also work under java7 or java8? so use reflection here maybe the best choice?

@liupc liupc changed the title [Parquet-1485]Fix Snappy direct memory leak [PARQUET-1485]Fix Snappy direct memory leak Feb 9, 2019
@liupc liupc changed the title [PARQUET-1485]Fix Snappy direct memory leak PARQUET-1485:Fix Snappy direct memory leak Feb 9, 2019
@liupc liupc changed the title PARQUET-1485:Fix Snappy direct memory leak PARQUET-1485: Fix Snappy direct memory leak Feb 9, 2019
@liupc
Copy link
Author

liupc commented Feb 10, 2019

@gszadovszky I have updated this PR as your suggestion, and passed the tests. can you review again?

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for rewriting your code to use reflections. I have some notes with the new CleanUtil though.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional comments about CleanUtil...

cleanerField.setAccessible(true);
Object cleaner = cleanerField.get(buf);
cleanMethod = cleaner.getClass().getDeclaredMethod("clean");
} catch (Throwable e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching Throwable is not nice. I think, it would be better to catch only the known exceptions and let all the others (some serious JVM errors like OOM) throw out.

Copy link
Author

@liupc liupc Feb 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It's better, but it's a little bit long, I think if we just need to handle NonFatal exceptions, there should be a NonFatal(Customed exception) exception which can match all non-fatal concrete exceptions rather than list all them all.
I know scala can do this, I'm not sure whether java can do this also.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the changes. It looks good to me.
I'll wait another 24hours for additional comments.

@gszadovszky gszadovszky merged commit 7dcdcdc into apache:master Feb 12, 2019
@liupc
Copy link
Author

liupc commented Feb 12, 2019

Thanks for merging! @gszadovszky

LantaoJin pushed a commit to LantaoJin/parquet-mr that referenced this pull request Jun 15, 2021
shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Mar 1, 2023
Summary:
This is to merge from upstream
PARQUET-1305: Backward incompatible change introduced in 1.8 (apache#483)

PARQUET-1452: Deprecate old logical types API (apache#535)

PARQUET-1414: Simplify next row count check calculation (apache#537)

PARQUET-1435: Benchmark filtering column-indexes (apache#536)

PARQUET-1365: Don't write page level statistics (apache#549)

Page level statistics were never used in production and became pointless after adding column indexes.

PARQUET-1456: Use page index, ParquetFileReader throw ArrayIndexOutOfBoundsException (apache#548)

The usage of static caching in the page index implementation did not allow using multiple readers at the same time.

PARQUET-1407: Avro: Fix binary values returned from dictionary encoding (apache#552)

* PARQUET-1407: Add test case for PARQUET-1407 to demonstrate the issue
* PARQUET-1407: Fix binary values from dictionary encoding.

Closes apache#551.

PARQUET-1460: Fix javadoc errors and include javadoc checking in Travis checks (apache#554)

Experiment.

Revert "Experiment."

This reverts commit 97a880c.

PARQUET-1434: Update CHANGES.md for 1.11.0 release.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1461: Third party code does not compile after parquet-mr minor version update (apache#556)

PARQUET-1434: Update CHANGES.md for 1.11.0 release candidate 2.

PARQUET-1258: Update scm developer connection to github

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1462: Allow specifying new development version in prepare-release.sh (apache#557)

Before this change, prepare-release.sh only took the release version as a
parameter, the new development version was asked interactively for each
individual pom.xml file, which made answering them tedious.

PARQUET-1472: Dictionary filter fails on FIXED_LEN_BYTE_ARRAY (apache#562)

PARQUET-1474: Less verbose and lower level logging for missing column/offset indexes (apache#563)

PARQUET-1476: Don't emit a warning message for files without new logical type (apache#577)

Update CHANGES.md for 1.11.0 release candidate 2.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1478: Can't read spec compliant, 3-level lists via parquet-proto (apache#578)

PARQUET-1489: Insufficient documentation for UserDefinedPredicate.keep(T) (apache#588)

PARQUET-1487: Do not write original type for timezone-agnostic timestamps (apache#585)

Update CHANGES.md for 1.11.0 release candidate 3.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1490: Add branch-specific Travis steps (apache#590)

The possiblity of branch-specific scripts allows feature branches to build
SNAPSHOT versions of parquet-format (and depend on them in the POM files). Even
if such branch-specific scripts get merged into master accidentally, they will
not have any effect there.

The script for the main branch checks the POM files to make sure that SNAPSHOT
dependencies are not added to or merged into master accidentally.

PARQUET-1280: [parquet-protobuf] Use maven protoc plugin (apache#506)

PARQUET-1466: Upgrade to the latest guava 27.0-jre (apache#559)

PARQUET-1475: Fix lack of cause propagation in DirectCodecFactory.ParquetCompressionCodecException. (apache#564)

PARQUET-1492: Remove protobuf build (apache#592)

We do not need to build protobuf (protoc) ourselves since we rely on maven protoc plugin to compile protobuf.
This should save about 10 minutes travis build time (time for building protobuf itself).

PARQUET-1498: Add instructions to install thrift via homebrew (apache#595)

PARQUET-1502: Convert FIXED_LEN_BYTE_ARRAY to arrow type in logicalTypeAnnotation if it is not null (apache#593)

[PARQUET-1506] Migrate  maven-thrift-plugin to thrift-maven-plugin (apache#600)

maven-thrift-plugin (Aug 13, 2013) https://mvnrepository.com/artifact/org.apache.thrift.tools/maven-thrift-plugin/0.1.11
thrift-maven-plugin (Jan 18, 2017) https://mvnrepository.com/artifact/org.apache.thrift/thrift-maven-plugin/0.10.0

The maven-thrift-plugin is the old one which has been migrated to the ASF
and continued as thrift-maven-plugin:
https://issues.apache.org/jira/browse/THRIFT-4083

[PARQUET-1500] Replace Closeables with try-with-resources (apache#597)

PARQUET-1503: Remove Ints Utility Class (apache#598)

PARQUET-1513: Update HiddenFileFilter to avoid extra startsWith (apache#606)

PARQUET-1504: Add an option to convert Int96 to Arrow Timestamp (apache#594)

PARQUET-1504: Add an option to convert Parquet Int96 to Arrow Timestamp

PARQUET-1509: Note Hive deprecation in README. (apache#602)

PARQUET-1510: Fix notEq for optional columns with null values. (apache#603)

Dictionaries cannot contain null values, so notEq filters cannot
conclude that a block cannot match using only the dictionary. Instead,
it must also check whether the block may have at least one null value.
If there are no null values, then the existing check is correct.

[PARQUET-1507] Bump Apache Thrift to 0.12.0 (apache#601)

PARQUET-1518: Use Jackson2 version 2.9.8 in parquet-cli (apache#609)

There are some vulnerabilities:
https://ossindex.sonatype.org/vuln/1205a1ec-0837-406f-b081-623b9fb02992
https://ossindex.sonatype.org/vuln/b85a00e3-7d9b-49cf-9b19-b73f8ee60275
https://ossindex.sonatype.org/vuln/4f7e98ad-2212-45d3-ac21-089b3b082e6c
https://ossindex.sonatype.org/vuln/ab9013f0-09a2-4f01-bce5-751dc7437494
https://ossindex.sonatype.org/vuln/3f596fc0-9615-4b93-b30a-d4e0532e667f
https://ossindex.sonatype.org/vuln/4f7e98ad-2212-45d3-ac21-089b3b082e6c

PARQUET-138: Allow merging more restrictive field in less restrictive field (apache#550)

* Allow merging more restrictive field in less restrictive field
* Make class and function names more explicit

Add javax.annotation-api dependency for JDK >= 9 (apache#604)

PARQUET-1470: Inputstream leakage in ParquetFileWriter.appendFile (apache#611)

PARQUET-1514: ParquetFileWriter Records Compressed Bytes instead of Uncompressed Bytes (apache#607)

PARQUET-1505: Use Java 7 NIO StandardCharsets (apache#599)

PARQUET-1480 INT96 to avro not yet implemented error should mention deprecation (apache#579)

PARQUET-1485: Fix Snappy direct memory leak (apache#581)

PARQUET-1527:  [parquet-tools] cat command throw java.lang.ClassCastException (apache#612)

PARQUET-1529: Shade fastutil in all modules where used (apache#617)

Update CHANGES.md for 1.11.0rc4

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

Merge from upstream

Test Plan: Testing in Spark/Hive/Presto wil be performed before roll out to production!

Reviewers: pavi, leisun

Reviewed By: leisun

Differential Revision: https://code.uberinternal.com/D2512639

Revert "PARQUET-1485: Fix Snappy direct memory leak (apache#581)"

This reverts commit 7dcdcdc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants