PDF.js does not take into account /ActualText #12237

@TimotheAlbouy

ESET_Okrum_and_Ketrican.pdf

I'm using PDF.js 2.4.456. The issue I'm describing here is independent of the OS and the web browser.

To reproduce the problem, open the attached document with PDF.js (a quick way to reproduce the bug is to open the document in Firefox), then copy and paste some of the text. In the extracted text, the period characters are missing for the main font:

The Ke3chang group, also known as APT15, is a threat group believed to be operating out of China Its attacks were first reported in 2012,

This is due to a bad mapping in the font's /ToUnicode CMap: the character code 0x2E (the period in ASCII) is mapped to U+0020 (a space).

But when we copy-paste the content of the PDF using Adobe Acrobat, all the period characters are extracted correctly. This is because Acrobat takes into account the /ActualText property of the marked-content sequences inside the PDF.

This is what we see when we open the document using a PDF inspector like PDFDebugger of PDFBox:

(, also known as APT15, is a threat group believed to be operating out of China)Tj
/Span<</ActualText<FEFF002E>>> BDC 
(.)Tj
EMC

We see that the 0x2E code in (.)Tj, which according to the /ToUnicode map represents a space character, is wrapped in a marked-content sequence whose /ActualText is U+FEFF (a BOM) followed by U+002E, i.e. a period character.
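To make the decoding of that /ActualText value concrete, here is a minimal Python sketch. The helper name `decode_pdf_text_string` is hypothetical and is not part of PDF.js or PDFBox; it only assumes that a hex string starting with the bytes FE FF is UTF-16BE text, and it approximates the no-BOM case with Latin-1 rather than implementing PDFDocEncoding.

```python
import codecs


def decode_pdf_text_string(hex_string: str) -> str:
    """Decode a PDF hex string such as '<FEFF002E>' into text (hypothetical helper)."""
    raw = bytes.fromhex(hex_string.strip("<>"))
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Drop the 2-byte BOM (FE FF) and decode the rest as UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Assumption: strings without a BOM use PDFDocEncoding; Latin-1 is only
    # an approximation of it, good enough for this illustration.
    return raw.decode("latin-1")


print(repr(decode_pdf_text_string("<FEFF002E>")))
```

So the BOM is only an encoding marker, and the replacement text carried by the marked-content sequence is the single character U+002E.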

Thus, Acrobat Reader extracts the periods correctly in the given report because it takes into account the /ActualText content, whereas PDF.js doesn't extract the periods because it only considers the /ToUnicode map.
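The difference between the two behaviours can be sketched with a toy extractor in Python. This is not how PDF.js or Acrobat is implemented: the operator list, the `TO_UNICODE` table and the `extract` function are all hypothetical, the /ActualText value is assumed to be already decoded, and every code other than 0x2E is assumed to map to its ASCII character.

```python
# Toy /ToUnicode table reproducing the faulty mapping from this report:
# code 0x2E maps to a space. All other codes map to themselves (assumption).
TO_UNICODE = {0x2E: "\u0020"}

# Simplified operator stream: (operator, operand). For BDC the operand is the
# decoded /ActualText, or None if the property list has none.
OPS = [
    ("Tj", b"operating out of China"),
    ("BDC", "."),
    ("Tj", b"."),
    ("EMC", None),
    ("Tj", b" Its attacks"),
]


def extract(ops, honor_actual_text: bool) -> str:
    out = []
    stack = []  # one entry per open marked-content sequence
    for op, arg in ops:
        if op == "BDC":
            already_replaced = any(t is not None for t in stack)
            stack.append(arg)
            # Emit the replacement text once, when the sequence opens.
            if honor_actual_text and arg is not None and not already_replaced:
                out.append(arg)
        elif op == "EMC":
            stack.pop()
        elif op == "Tj":
            # Glyphs inside an /ActualText sequence are replaced, not decoded.
            if honor_actual_text and any(t is not None for t in stack):
                continue
            out.append("".join(TO_UNICODE.get(code, chr(code)) for code in arg))
    return "".join(out)


print(extract(OPS, honor_actual_text=False))
print(extract(OPS, honor_actual_text=True))
```

With `honor_actual_text=False` the period comes out as a space, which matches the text pasted from PDF.js above; with `honor_actual_text=True` the period is restored.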

Is there an ongoing effort to fix this problem?
