Skip to content

Conversation

michaeleveringham
Copy link

Fixes #193, uses @hannal's solution to resolve read/merge issues with wkhtmltopdf pdfs by skipping NameObject entities. Related to #778.

@michaeleveringham
Copy link
Author

@MartinThoma Think I got it now. Thanks again for the help.

@michaeleveringham
Copy link
Author

Never mind, need to update some tests it seems.

@michaeleveringham
Copy link
Author

I tried numerous "bad" PDFs but couldn't get this error to raise. Will have to look into creating a PDF with garbage anchors and uploading to the samples repo.

@MartinThoma
Copy link
Member

@LamerLink Could it be that your PR just solved an issue that test_unexpected_destination has? Maybe we could simply adjust the test to:

def test_unexpected_destination():
    url = "https://corpora.tika.apache.org/base/docs/govdocs1/913/913678.pdf"
    name = "tika-913678.pdf"
    reader = PdfReader(BytesIO(get_pdf_from_url(url, name=name)))
    merger = PdfMerger()
    merger.append(reader)

in this PR?

@MartinThoma MartinThoma added the PdfReader The PdfReader component is affected label Jul 10, 2022
@michaeleveringham
Copy link
Author

@MartinThoma Good call, updated. I thought about adding an assert len(reader.outlines) == 0 since there'll be no valid destinations added to the outline in this case, but I decided not too since it's a little ubiquitous fir this case. Not to mention some normal PDFs can have no outlines, especially if they're pure images.

Looks like all checks have passed!

@codecov
Copy link

codecov bot commented Jul 11, 2022

Codecov Report

Merging #1068 (9d34e86) into main (c529884) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1068   +/-   ##
=======================================
  Coverage   92.26%   92.26%           
=======================================
  Files          24       24           
  Lines        4794     4794           
  Branches      990      990           
=======================================
  Hits         4423     4423           
  Misses        230      230           
  Partials      141      141           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c529884...9d34e86. Read the comment docs.

@MartinThoma
Copy link
Member

@LamerLink Thank you for your effort. There was another PR #1076 which also solved the same issue, but that one is clearer to me why it solves the issue. As the issue is now (hopefully) gone, this PR seems no longer necessary.

Please let me know if I got something wrong - I can always re-open a PR

@michaeleveringham
Copy link
Author

@MartinThoma No problem at all, I read over that PR and it makes sense to me! Thanks for all the effort.

@MartinThoma MartinThoma reopened this Jul 23, 2022
@MartinThoma
Copy link
Member

I've just noticed that your pr increases coverage quite a bit. That is unexpected to me. I want to understand that better

@MartinThoma
Copy link
Member

I'm not sure why it originally showed that the test coverage was increased by that much. If we would do it now with #1158, the coverage would decrease slightly (which is what I expected)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PdfReader The PdfReader component is affected

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PyPDF2.utils.PdfReadError: Unexpected destination '/__WKANCHOR_2'

2 participants