Raises "EI stream not found" while reading RunLengthDecode (RL) inline image #3517

@SirPyTech

Description

I am trying to read the content of a PDF, but text extraction fails with PdfReadError: EI stream not found.

Environment

$ python -m platform
Linux-6.16.12+deb14+1-amd64-x86_64-with-glibc2.36

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.2.0, crypt_provider=('cryptography', '3.4.8'), PIL=9.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader("/path-to-file.pdf")

for page in reader.pages:
    text = page.extract_text()

The PDF is cedolini_esempio-1.pdf.

While debugging, I found out that the image it is trying to parse is:

\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80\xff\x00\x00?\x00\xff\x00\xfe\xfe\x00\xfc\xff\x00\x80\xff\x00\x00[...]\xfbU\x00\x7f\x80\r\nEI

The problem seems to be that

pos_tok = data_buffered.find(b"\x80")

finds the \x80 inside the image data, so the tokens that follow are not EI as expected.
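To illustrate, here is a small sketch that runs the same kind of search on the first 12 bytes of the image quoted above (the rest of the image is elided in this report, so only the prefix is used). The variable name prefix is mine. Under the RunLengthDecode rules, a length byte of 0 means "one literal byte follows", so the first \x80 that find() hits is ordinary image data.

```python
# First 12 bytes of the inline image as quoted in this report.
prefix = b"\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80\xff\x00"

# Same kind of search as the line quoted above.
pos_tok = prefix.find(b"\x80")

print(pos_tok)                          # offset of the first 0x80 byte
print(prefix[pos_tok - 1])              # length byte in front of it (0 = one literal byte)
print(prefix[pos_tok + 1:pos_tok + 3])  # bytes after it: more image data, not EI
```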

I read the PDF documentation (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf) and it says:

The value 128 is placed at the end of the compressed data, as an EOD marker.

but I can't see such a value of 128.
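As a sketch of what the specification describes, the EOD marker can be located by walking the stream run by run instead of searching for the first \x80 byte. A length byte of 0 to 127 is followed by length + 1 literal bytes, a length byte of 129 to 255 is followed by one byte that is repeated 257 - length times, and 128 ends the data. The function name find_rl_eod is hypothetical and this is not pypdf's implementation. The sample reuses the quoted prefix and appends a synthetic EOD and EI, because the middle of the real image is elided above.

```python
def find_rl_eod(data: bytes, start: int = 0) -> int:
    """Return the index just past the EOD byte (128), walking run by run."""
    pos = start
    while pos < len(data):
        length = data[pos]
        if length == 128:
            # EOD marker: the compressed data ends here.
            return pos + 1
        if length < 128:
            # Literal run: skip the length byte plus (length + 1) data bytes.
            pos += length + 2
        else:
            # Repeated run: skip the length byte plus the single repeated byte.
            pos += 2
    raise ValueError("EOD marker (128) not found")


# Quoted prefix + synthetic tail (assumption: the real stream ends similarly).
stream = b"\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80\xff\x00" + b"\x80\r\nEI"
end = find_rl_eod(stream)
print(end, stream[end:])
```

With this approach the \x80 inside a literal run is skipped as data, and what follows the real EOD is whitespace and then EI.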

Why aren't we looking for the EI directly, as is done in the default handler?

# Read the inline image, while checking for EI (End Image) operator.
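For comparison, a direct search for EI could look roughly like the sketch below. This is only my assumption of the idea, not the code of the default handler: the name find_ei and the regular expression are hypothetical. It looks for EI preceded by a PDF whitespace byte and followed by whitespace or the end of the data. Binary image data can contain such a sequence by chance, so this is a heuristic rather than a guarantee.

```python
import re

# "EI" preceded by PDF whitespace and followed by whitespace or end of data.
_EI = re.compile(rb"[\x00\t\n\x0c\r ]EI(?=[\x00\t\n\x0c\r ]|\Z)")


def find_ei(data: bytes, start: int = 0) -> int:
    """Return the index of the 'E' of the first delimited EI, or -1 if absent."""
    match = _EI.search(data, start)
    return -1 if match is None else match.start() + 1


# Synthetic sample: a literal 0x80, a repeated run, EOD, then the EI operator.
sample = b"\x00\x80\xff\x00\x80\r\nEI Q"
print(find_ei(sample))
```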

Traceback

This is the relevant part of the traceback I see:

...
    for value in page.extract_text().split():
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 2043, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1726, in _extract_text
    for operands, operator in content.operations:
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1406, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1285, in _parse_content_stream
    ii = self._read_inline_image(stream)
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1328, in _read_inline_image
    data = extract_inline_RL(stream)
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_image_inline.py", line 142, in extract_inline_RL
    raise PdfReadError("EI stream not found")
pypdf.errors.PdfReadError: EI stream not found

Metadata

Labels

needs-discussion: The PR/issue needs more discussion before we can continue
workflow-images: From a user's perspective, image handling is the affected feature/workflow
