Skip NameObject when building outline #1068

michaeleveringham · 2022-07-07T02:55:14Z

Fixes #193. Uses @hannal's solution to resolve read/merge issues with wkhtmltopdf PDFs by skipping NameObject entities. Related to #778.
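As a rough illustration of the idea only (not the actual diff in this PR), the sketch below skips bare name entries while collecting outline items. The `NameObject` class and `build_outline` function here are hypothetical stand-ins written for this example; they are not PyPDF2's real classes or internals, and the real change may look different.

```python
# Hypothetical stand-ins for illustration; NOT PyPDF2's real classes.
class NameObject(str):
    """Stand-in for a PDF name object such as /Fit or /XYZ."""


def build_outline(entries):
    """Collect outline items, skipping bare NameObject entries.

    `entries` is assumed to be an iterable that mixes dict-like outline
    items with stray NameObject values.
    """
    outline = []
    for entry in entries:
        if isinstance(entry, NameObject):
            # A bare name carries no title or destination, so skip it
            # instead of raising an error.
            continue
        outline.append(entry)
    return outline


raw = [{"/Title": "Intro"}, NameObject("/anchor"), {"/Title": "Usage"}]
titles = [item["/Title"] for item in build_outline(raw)]
```

With this approach, an input made up only of stray names simply yields an empty outline rather than an exception.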

michaeleveringham · 2022-07-07T02:55:47Z

@MartinThoma Think I got it now. Thanks again for the help.

michaeleveringham · 2022-07-07T03:00:48Z

Never mind, need to update some tests it seems.

michaeleveringham · 2022-07-07T04:02:41Z

I tried numerous "bad" PDFs but couldn't get this error to raise. Will have to look into creating a PDF with garbage anchors and uploading to the samples repo.

MartinThoma · 2022-07-10T04:39:11Z

@LamerLink Could it be that your PR just solved an issue that test_unexpected_destination has? Maybe we could simply adjust the test to:

def test_unexpected_destination():
    url = "https://corpora.tika.apache.org/base/docs/govdocs1/913/913678.pdf"
    name = "tika-913678.pdf"
    reader = PdfReader(BytesIO(get_pdf_from_url(url, name=name)))
    merger = PdfMerger()
    # No exception check here: appending is simply expected to succeed.
    merger.append(reader)

in this PR?

michaeleveringham · 2022-07-11T04:25:19Z

@MartinThoma Good call, updated. I thought about adding an assert len(reader.outlines) == 0, since no valid destinations will be added to the outline in this case, but I decided not to since it's a little ambiguous for this case. Not to mention that some normal PDFs have no outlines, especially if they're pure images.

Looks like all checks have passed!

codecov · 2022-07-11T04:25:26Z

Codecov Report

Merging #1068 (9d34e86) into main (c529884) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1068   +/-   ##
=======================================
  Coverage   92.26%   92.26%           
=======================================
  Files          24       24           
  Lines        4794     4794           
  Branches      990      990           
=======================================
  Hits         4423     4423           
  Misses        230      230           
  Partials      141      141


MartinThoma · 2022-07-23T06:27:58Z

@LamerLink Thank you for your effort. Another PR, #1076, also solved the same issue, and it is clearer to me why that one solves it. As the issue is now (hopefully) gone, this PR no longer seems necessary.

Please let me know if I got something wrong. I can always re-open a PR.

michaeleveringham · 2022-07-23T06:29:14Z

@MartinThoma No problem at all, I read over that PR and it makes sense to me! Thanks for all the effort.

MartinThoma · 2022-07-23T06:54:04Z

I've just noticed that your PR increases coverage quite a bit. That is unexpected to me, and I want to understand it better.

tests/test_reader.py

MartinThoma · 2022-07-24T05:58:58Z

I'm not sure why it originally showed that the test coverage increased by that much. If we did it now with #1158, the coverage would decrease slightly (which is what I expected).

Skip NameObject when building outline

22ff519

MartinThoma added the PdfReader label (The PdfReader component is affected) Jul 10, 2022

Update test test_unexpected_destination to not check for exception.

9aaef4c

mtd91429 mentioned this pull request Jul 15, 2022

ROB: Handle outlines without destination #1076

Merged

MartinThoma closed this in #1076 Jul 23, 2022

MartinThoma closed this in 89c0ff2 Jul 23, 2022

MartinThoma reopened this Jul 23, 2022

Merge branch 'main' into wkhtmltopdf-outline-193

db7bae5

MartinThoma reviewed Jul 24, 2022

tests/test_reader.py

Update tests/test_reader.py

9d34e86

MartinThoma mentioned this pull request Jul 24, 2022

Skip NameObjects in _build_outline #1158

Closed

MartinThoma closed this Jul 24, 2022
