bug: pdf partitioning fails for certain NASA docs #2212

@cragwolfe

Describe the bug
curl -X POST 'http://localhost:5000/general/v0/general' --form files="@$file" --form strategy=hi_res
{"detail":"invalid length: 6"}

  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "/home/notebook-user/unstructured/unstructured/partition/pdf.py", line 541, in pdfminer_interpreter_init_resources
    return wrapped(resources)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 384, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 234, in get_font
    font = self.get_font(None, subspec)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 225, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/pdffont.py", line 1078, in __init__
    CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/cmapdb.py", line 294, in run
    self.nextobject()
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/psparser.py", line 656, in nextobject
    self.do_keyword(pos, token)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/cmapdb.py", line 454, in do_keyword
    self.cmap.add_cid2unichr(nunpack(cid), code)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/pdfminer/utils.py", line 363, in nunpack
    raise TypeError("invalid length: %d" % length)
TypeError: invalid length: 6

where the posted file is
NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf

or
NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-Pg54.pdf
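
For context, the traceback ends in pdfminer's `nunpack`, which raises `TypeError` when the CID byte string read from the font's CMap stream is 6 bytes long. The snippet below is only a rough sketch of a length-agnostic big-endian unpack built on `int.from_bytes`. It is not the fix adopted by pdfminer or unstructured, and the name `nunpack_any_length` and the sample bytes are hypothetical, chosen purely for illustration.

```python
def nunpack_any_length(data: bytes, default: int = 0) -> int:
    """Interpret `data` as a big-endian unsigned integer of any length.

    Hypothetical helper for illustration; not part of pdfminer or unstructured.
    """
    if not data:
        # Mirror the usual "empty input returns a default" convention.
        return default
    return int.from_bytes(data, byteorder="big", signed=False)


if __name__ == "__main__":
    # A made-up 6-byte value, the length that triggers "invalid length: 6".
    six_byte_cid = bytes.fromhex("000000000041")
    print(nunpack_any_length(six_byte_cid))
```

Whether accepting such values is the right behavior, or whether the malformed CMap entry should be skipped instead, is a design question this sketch does not settle.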
