PDF.js does not take into account /ActualText

[ESET_Okrum_and_Ketrican.pdf](https://github.com/mozilla/pdf.js/files/5089612/ESET_Okrum_and_Ketrican.pdf)

I'm using PDF.js 2.4.456. The issue I'm describing here is independent of the OS and the web browser.

To reproduce the problem, open the attached document with PDF.js (a quick way to reproduce the bug is to open the document in Firefox), then copy and paste some of the text. In the extracted text, the period characters are missing for the main font:

> The Ke3chang group, also known as APT15, is a threat group believed to be operating out of China  Its attacks were first reported in 2012,

This is due to a bad mapping in the font's `/ToUnicode` CMap: the `0x2E` code (i.e. the period character in ASCII) in the text is mapped to `U+0020` (a space).

But when we copy and paste the content of the PDF using Adobe Acrobat, all the period characters are extracted correctly. This is because Acrobat takes into account the `/ActualText` marked-content properties inside the PDF.

This is what we see when we open the document using a PDF inspector like PDFDebugger of PDFBox:

```
(, also known as APT15, is a threat group believed to be operating out of China)Tj
/Span<</ActualText<FEFF002E>>> BDC 
(.)Tj
EMC
```

We see that the `0x2E` code in `(.)Tj`, which according to the `/ToUnicode` map represents a space character, is marked to *actually* represent `U+002E`, a period character. The leading `FEFF` is the byte order mark (BOM) that identifies the string as UTF-16BE.
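To make the decoding of `<FEFF002E>` concrete, here is a minimal Python sketch. The function name `decode_pdf_hex_text_string` is hypothetical and is not part of PDF.js or PDFBox. It assumes the value is a PDF hex string, and it approximates the non-UTF-16 case (PDFDocEncoding) with Latin-1, which is a simplification.

```python
def decode_pdf_hex_text_string(hex_digits: str) -> str:
    """Decode the hex digits of a PDF hex string (without the < > delimiters).

    Hypothetical helper for illustration only.
    """
    # Whitespace inside a PDF hex string is ignored.
    digits = "".join(hex_digits.split())
    # An odd number of digits is padded with a trailing 0.
    if len(digits) % 2:
        digits += "0"
    raw = bytes.fromhex(digits)
    # A leading FE FF marks the text string as UTF-16BE.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Simplification: PDFDocEncoding is approximated with Latin-1 here.
    return raw.decode("latin-1")


actual_text = decode_pdf_hex_text_string("FEFF002E")
print(repr(actual_text))
```

With the value from the content stream above, the BOM is stripped and the remaining two bytes decode to a single period.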

Thus, Acrobat Reader extracts the periods correctly in the given report because it takes into account the `/ActualText` content, whereas PDF.js loses the periods because it only considers the `/ToUnicode` map.
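The difference between the two behaviours can be sketched in Python as follows. This is not how PDF.js or Acrobat is implemented; it is a toy model under stated assumptions: the content stream is already parsed into `(operator, operand)` pairs, the `/ActualText` value is already decoded to a string, and the names `extract_text`, `ops` and `to_unicode` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical, simplified operation: (operator name, operand).
#   ("Tj", bytes)             show text, one byte per character code
#   ("BDC", str or None)      begin marked content; operand is the decoded
#                             /ActualText value, or None if absent
#   ("EMC", None)             end marked content
Op = Tuple[str, object]


def extract_text(ops: List[Op], to_unicode: Dict[int, str],
                 honor_actual_text: bool = True) -> str:
    out: List[str] = []
    stack: List[Optional[str]] = []  # one entry per open BDC
    for name, arg in ops:
        if name == "BDC":
            actual = arg if honor_actual_text else None
            inside_replacement = any(s is not None for s in stack)
            # Emit the replacement text once, for the outermost span only.
            if actual is not None and not inside_replacement:
                out.append(actual)
            stack.append(actual)
        elif name == "EMC":
            if stack:
                stack.pop()
        elif name == "Tj":
            # Glyphs inside an /ActualText span are replaced, not extracted.
            if any(s is not None for s in stack):
                continue
            out.append("".join(to_unicode.get(code, chr(code)) for code in arg))
    return "".join(out)


# The font's faulty /ToUnicode mapping described above: 0x2E -> space.
to_unicode = {0x2E: " "}
ops = [
    ("Tj", b", also known as APT15, is a threat group believed to be operating out of China"),
    ("BDC", "."),
    ("Tj", b"."),
    ("EMC", None),
]

print(repr(extract_text(ops, to_unicode, honor_actual_text=True)))
print(repr(extract_text(ops, to_unicode, honor_actual_text=False)))
```

With `honor_actual_text=True` the extracted string ends in `China.`; with `honor_actual_text=False` it ends in `China ` followed by a space, which matches the missing periods seen when copying from PDF.js.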

Is there an ongoing effort to fix this problem?

PDF.js does not take into account /ActualText #12237
