Skip to content

Conversation

@Juli0q
Copy link
Contributor

@Juli0q Juli0q commented Mar 28, 2025

Binary license files are now detected and excluded during license info resolution. In previous cases, such files caused the reporter to enter an endless loop during report generation, as the archiver attempted to include the binary content in the report.

To verify this behavior, a test file named LICENSE-BIN is added to archive.zip. It contains 4 arbitrary bytes created with a hex editor to simulate a non-text license file.

Apache Tika is introduced to detect MIME types and distinguish between text and non-text files. This ensures that only valid, readable license files are processed and included in the final report.

@codecov
Copy link

codecov bot commented Mar 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.66%. Comparing base (0c6934d) to head (bbcceb9).
Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10109      +/-   ##
============================================
+ Coverage     56.65%   56.66%   +0.01%     
- Complexity     1615     1619       +4     
============================================
  Files           332      332              
  Lines         12277    12281       +4     
  Branches       1138     1139       +1     
============================================
+ Hits           6955     6959       +4     
  Misses         4877     4877              
  Partials        445      445              
Flag Coverage Δ
funTest-docker 70.89% <ø> (ø)
funTest-non-docker 33.56% <0.00%> (+0.14%) ⬆️
test-ubuntu-24.04 40.62% <100.00%> (+0.02%) ⬆️
test-windows-2022 40.60% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sschuberth

This comment was marked as resolved.

@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from d1a62bf to ad3e8b6 Compare March 28, 2025 13:53
@Juli0q

This comment was marked as resolved.

@mnonnenmacher

This comment was marked as resolved.

@Juli0q Juli0q marked this pull request as ready for review April 2, 2025 11:43
@Juli0q Juli0q requested a review from a team as a code owner April 2, 2025 11:43
@sschuberth

This comment was marked as resolved.

@fviernau

This comment was marked as resolved.

@sschuberth

This comment was marked as resolved.

@sschuberth

This comment was marked as resolved.

@fviernau
Copy link
Member

fviernau commented Apr 2, 2025

Mind elaborating on the use case for that? The original use case for these archives was to maintain original license texts for using them in attribution documents. Binary files do not play a role there.

To allow having a feature toggle for the filtering, for the reasons I outlined above. And also if there is a bug in the isBinaryFile() heuristic, it can be more easily fixed.

@fviernau

This comment was marked as resolved.

@sschuberth
Copy link
Member

Mind elaborating on the use case for that? The original use case for these archives was to maintain original license texts for using them in attribution documents. Binary files do not play a role there.

To allow having a feature toggle for the filtering, for the reasons I outlined above.

But why implement a feature toggle to begin with? Why ever maintaining binary files as part of the license file archive?

@mnonnenmacher
Copy link
Member

mnonnenmacher commented Apr 2, 2025

I would also prefer to not include license files in the archives at all, even if there is a certain risk that the detection might fail in some cases. That's also missing in this PR, IIUC it currently only throws an exception if an archive already contains a binary file which I guess means that the report will not be created at all. But there is no way to recover from that, because even when the archive file is deleted it will be recreated with the same input an contain a binary file again.

@sschuberth

This comment was marked as resolved.

@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from ad3e8b6 to 36d89b8 Compare April 4, 2025 10:59
@sschuberth
Copy link
Member

Outcomes from today's core dev meeting:

  • Binary files should be filtered out before creation of the archive
    • Probably at least log filtered out files
  • Implement tests for the binary check
    • Verify correct behavior for UTF8 / UTF16 encoded text files, e.g. with Asian characters
    • Verify that all of ScanCode's license text files are recognized as text
  • Revisit the need for Apache Tika; maybe just implement a context-based check ourselves instead

@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from 36d89b8 to 8c7ac6b Compare May 9, 2025 12:09
@Juli0q Juli0q marked this pull request as draft May 9, 2025 12:10
@Juli0q Juli0q force-pushed the exclude-binary-license-files branch 2 times, most recently from a064c99 to 735bbb0 Compare May 13, 2025 13:45
@Juli0q Juli0q marked this pull request as ready for review May 13, 2025 14:24
@Juli0q Juli0q requested a review from mnonnenmacher May 13, 2025 14:33
@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from 735bbb0 to 5488b2e Compare May 15, 2025 08:08
@sschuberth

This comment was marked as outdated.

@Juli0q Juli0q requested a review from oheger-bosch May 15, 2025 09:16
@oheger-bosch
Copy link
Member

My points have been addressed, but I would leave the approval to the guys who have been deeper involved in this topic.

@sschuberth sschuberth requested a review from a team May 15, 2025 19:28
Copy link
Member

@sschuberth sschuberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm basically ok with introducing Tika as a dependency now, as I've checked that the library is not huge, and we mostly already use its transitive dependencies elsewhere.

@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from 5488b2e to 59cb538 Compare May 16, 2025 13:07
@Juli0q Juli0q force-pushed the exclude-binary-license-files branch 2 times, most recently from a94cd09 to 03173e5 Compare May 16, 2025 14:49
@Juli0q Juli0q requested a review from sschuberth May 16, 2025 15:00
@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from 03173e5 to a9b856c Compare May 16, 2025 15:38
@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from a9b856c to 58b3d3f Compare May 19, 2025 12:11
Copy link
Member

@sschuberth sschuberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two final nits.

Detect and exclude binary license files using Apache Tika. When a
non-text file is found during the license info creation process, a
warning is printed, and it is excluded from the final report.

This prevents the inclusion of binary files that previously caused the
reporter to enter an endless loop during report generation.

Signed-off-by: Julian Olderdissen <[email protected]>
@Juli0q Juli0q force-pushed the exclude-binary-license-files branch from 58b3d3f to bbcceb9 Compare May 19, 2025 12:58
@Juli0q Juli0q requested a review from sschuberth May 19, 2025 12:58
@sschuberth sschuberth dismissed mnonnenmacher’s stale review May 19, 2025 13:13

Comments addressed.

@sschuberth sschuberth enabled auto-merge (rebase) May 19, 2025 13:13
@sschuberth sschuberth merged commit feecb6d into oss-review-toolkit:main May 20, 2025
31 of 32 checks passed
@Juli0q Juli0q deleted the exclude-binary-license-files branch May 20, 2025 07:41
Comment on lines +179 to +188
"include utf8 file with japanese chars" {
createFile("License") { writeText("ぁあぃいぅうぇえぉおかが") }

val archiver = FileArchiver.createDefault()
archiver.archive(workingDir, PROVENANCE)
val result = archiver.unarchive(targetDir, PROVENANCE)

result shouldBe true
targetDir should containFile("License")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this does not work as expected. Can you have a look at that thread in Slack, @Juli0q?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants