KeyError and RuntimeError occurred when opening a document (docx) #1645

@chaos798

Bug

When using docling to parse a DOCX file, a KeyError and a RuntimeError occurred.

Steps to reproduce

Parsing a DOCX file with the llama_index.readers.docling package raises the KeyError and RuntimeError shown in the log below.

An unexpected error occurred while opening the document 322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx
Traceback (most recent call last):
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 80, in __init__
    self.docx_obj = Document(str(self.path_or_stream))
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 25, in from_file
    sparts = PackageReader._load_serialized_parts(
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 53, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 86, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 81, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/phys_pkg.py", line 83, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1486, in read
    with self.open(name, "r", pwd) as fp:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1525, in open
    zinfo = self.getinfo(name)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1452, in getinfo
    raise KeyError(
KeyError: "There is no item named 'NULL' in the archive"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 136, in __init__
    self._init_doc(backend, path_or_stream)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 185, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 84, in __init__
    raise RuntimeError(
RuntimeError: MsPowerpointDocumentBackend could not load document with hash f6b61b74e024c498bc7b3611d4733e00f765bd3d553cd3275952b533596de09b
Failed to load file /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx with error: Input document /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx is not valid.. Skipping...
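
The KeyError is raised while python-docx follows the package's relationship files and asks the ZIP archive for a member named 'NULL'. One way to check whether the DOCX itself contains a relationship pointing at a part that is not in the archive is to inspect the `.rels` files with the standard library only. This is a diagnostic sketch, not part of docling or python-docx: the helper name `find_dangling_targets` is hypothetical, and it assumes the file is a regular ZIP-based DOCX whose relationships use the standard OPC relationships namespace.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Standard OPC namespace used by *.rels files inside a DOCX package.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_dangling_targets(docx_path):
    """Return (rels_file, target) pairs whose internal target is missing.

    Hypothetical helper for diagnosis only; `docx_path` may be a path or a
    binary file-like object, as accepted by zipfile.ZipFile.
    """
    dangling = []
    with zipfile.ZipFile(docx_path) as zf:
        names = set(zf.namelist())
        for rels_name in sorted(n for n in names if n.endswith(".rels")):
            # "word/_rels/document.xml.rels" describes parts relative to "word/".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external links are not stored in the archive
                target = rel.get("Target", "")
                if target.startswith("/"):
                    # Absolute target: resolved from the package root.
                    member = target.lstrip("/")
                else:
                    # Relative target: resolved from the source part's folder.
                    member = posixpath.normpath(posixpath.join(base_dir, target))
                if member not in names:
                    dangling.append((rels_name, target))
    return dangling
```

Calling `find_dangling_targets(path)` on the failing file should list every relationship whose target is absent from the archive. If an entry resolving to `NULL` shows up, the document carries a broken relationship, which would explain why python-docx fails before docling can process the content.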

Docling version

docling-2.34.0

Python version

Python 3.10.16

Labels

bug (Something isn't working), docx (issue related to docx backend)