Description
I am trying to read the content of a PDF, but text extraction fails with PdfReadError: EI stream not found.
Environment
$ python -m platform
Linux-6.16.12+deb14+1-amd64-x86_64-with-glibc2.36
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.2.0, crypt_provider=('cryptography', '3.4.8'), PIL=9.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

reader = PdfReader("/path-to-file.pdf")
for page in reader.pages:
    text = page.extract_text()

The PDF is cedolini_esempio-1.pdf.
While debugging, I found out that the image it is trying to parse is:
\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80\xff\x00\x00?\x00\xff\x00\xfe\xfe\x00\xfc\xff\x00\x80\xff\x00\x00[...]\xfbU\x00\x7f\x80\r\nEI
The problem seems to be that this line (pypdf/pypdf/generic/_image_inline.py, line 131 in 85b53d8):
pos_tok = data_buffered.find(b"\x80")
finds a \x80 inside the image data, so the tokens that follow it are not EI as expected.
I read the PDF documentation (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf) and it says:
The value 128 is placed at the end of the compressed data, as an EOD marker.
but I can't see such value 128.
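To make the RunLengthDecode rule concrete, here is a minimal sketch of how the EOD marker could be located by walking the run-length records instead of searching for the first 0x80 byte. In this encoding, a length byte of 0 to 127 is followed by length + 1 literal bytes, a length byte of 129 to 255 is followed by a single byte to repeat, and 128 marks end of data. A literal or repeated data byte can itself be 0x80: as I read it, the dump above starts with such a case (\x00\x80 is a one-byte literal run whose data byte is 0x80). The function name `find_run_length_eod` and the sample bytes are hypothetical; this is not pypdf's code.

```python
def find_run_length_eod(data: bytes) -> int:
    """Return the index of the RunLengthDecode EOD byte (128), or -1 if absent."""
    pos = 0
    while pos < len(data):
        length = data[pos]
        if length == 128:
            # A length byte of 128 is the end-of-data marker.
            return pos
        if length < 128:
            # Skip the length byte plus (length + 1) literal bytes.
            pos += length + 2
        else:
            # Skip the length byte plus the single byte that gets repeated.
            pos += 2
    return -1


# Hypothetical sample: a literal 0x80 data byte, a repeat run, then the real EOD.
sample = b"\x00\x80\xfe\xff\x80\r\nEI"
naive = sample.find(b"\x80")           # stops at the literal data byte
walked = find_run_length_eod(sample)   # stops at the actual EOD marker
print(naive, walked)
```

On this sample the plain `find` stops on the data byte, while the record-by-record walk lands on the byte immediately before the `\r\nEI` terminator.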
Why aren't we looking for EI directly, as is done in the default handler (pypdf/pypdf/generic/_image_inline.py, line 199 in 85b53d8)?
# Read the inline image, while checking for EI (End Image) operator.
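For illustration, a direct scan for the EI operator could look roughly like the sketch below. It treats EI as a token only when it is preceded by PDF whitespace and followed by whitespace or the end of the data. The helper name `find_ei` and the sample stream are hypothetical, and this is not how pypdf's default handler is actually written.

```python
import re

# PDF whitespace characters: NUL, tab, line feed, form feed, carriage return, space.
_EI_TOKEN = re.compile(rb"[\x00\t\n\x0c\r ]EI(?=[\x00\t\n\x0c\r ]|$)")


def find_ei(data: bytes) -> int:
    """Return the index of the 'E' in the first standalone EI token, or -1."""
    match = _EI_TOKEN.search(data)
    return -1 if match is None else match.start() + 1


# Hypothetical content-stream fragment: image bytes, then EI, then another operator.
stream = b"\x00\x80\xfe\xff\x80\r\nEI\nQ"
print(find_ei(stream))
```

One caveat with this approach: binary image data can itself contain a whitespace-delimited EI by coincidence, so a scan like this is a heuristic rather than a guarantee.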
Traceback
This is the relevant part of the traceback I see:
...
    for value in page.extract_text().split():
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 2043, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1726, in _extract_text
    for operands, operator in content.operations:
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1406, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1285, in _parse_content_stream
    ii = self._read_inline_image(stream)
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1328, in _read_inline_image
    data = extract_inline_RL(stream)
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_image_inline.py", line 142, in extract_inline_RL
    raise PdfReadError("EI stream not found")
pypdf.errors.PdfReadError: EI stream not found