Description
Bug
Parsing a DOCX file with docling fails with a KeyError, which is then re-raised as a RuntimeError.
Steps to reproduce
Parse the DOCX file with the llama_index.readers.docling package. Opening the document fails with the following KeyError and RuntimeError:
An unexpected error occurred while opening the document 322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx
Traceback (most recent call last):
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 80, in __init__
self.docx_obj = Document(str(self.path_or_stream))
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/api.py", line 27, in Document
document_part = cast("DocumentPart", Package.open(docx).main_document_part)
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/package.py", line 127, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 25, in from_file
sparts = PackageReader._load_serialized_parts(
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 53, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 86, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 81, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/phys_pkg.py", line 83, in blob_for
return self._zipf.read(pack_uri.membername)
File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1486, in read
with self.open(name, "r", pwd) as fp:
File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1525, in open
zinfo = self.getinfo(name)
File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1452, in getinfo
raise KeyError(
KeyError: "There is no item named 'NULL' in the archive"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 136, in __init__
self._init_doc(backend, path_or_stream)
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 185, in _init_doc
self._backend = backend(self, path_or_stream=path_or_stream)
File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 84, in __init__
raise RuntimeError(
RuntimeError: MsPowerpointDocumentBackend could not load document with hash f6b61b74e024c498bc7b3611d4733e00f765bd3d553cd3275952b533596de09b
Failed to load file /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx with error: Input document /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx is not valid.. Skipping...
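The KeyError is raised by zipfile while python-docx walks the package relationships, which suggests (this is an assumption, not something confirmed on the file itself) that one of the `.rels` files in the DOCX points at a part named `NULL` that does not exist in the archive. A minimal diagnostic sketch using only the standard library is below; `find_dangling_targets` is a hypothetical helper name, not part of docling or python-docx.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by OPC relationship (.rels) files.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_dangling_targets(docx_path):
    """Return (rels_file, rel_id, target) for every internal relationship
    whose target part is missing from the archive.

    Hypothetical helper for diagnosis only.
    """
    dangling = []
    with zipfile.ZipFile(docx_path) as zf:
        names = set(zf.namelist())
        for rels_name in sorted(n for n in names if n.endswith(".rels")):
            # "word/_rels/document.xml.rels" resolves targets relative to "word/".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.iter(RELS_NS + "Relationship"):
                # External targets (hyperlinks etc.) are not archive members.
                if rel.get("TargetMode") == "External":
                    continue
                target = rel.get("Target", "")
                if target.startswith("/"):
                    member = target.lstrip("/")
                else:
                    member = posixpath.normpath(posixpath.join(base_dir, target))
                if member not in names:
                    dangling.append((rels_name, rel.get("Id"), target))
    return dangling
```

Calling `find_dangling_targets(path_to_docx)` on the failing file and printing the result should show which relationship, if any, refers to the missing `NULL` part. An empty list would mean the assumption above does not hold for this document.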
Docling version
docling-2.34.0
Python version
Python 3.10.16