bug/notion ingestion not working #56

@ribhu97

Describe the bug
When ingesting a Notion page, the run fails with the following error:

2024-08-18 20:49:08,267 SpawnPoolWorker-2 ERROR    failed to get data associated with source doc: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "notion-ingest-output", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/notion/fe36aac3b4", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"access_config": {"notion_api_key": "*******"}, "page_ids": ["***"], "database_ids": ["***"], "recursive": false}, "_source_metadata": null, "_date_processed": null, "database_id": "***", "retry_strategy_config": null, "registry_name": "notion_database", "base_filename": "/****.html", "filename": "/root/.cache/unstructured/ingest/notion/***.html", "_output_filename": "notion-ingest-output/****.json", "record_locator": null, "unique_id": "/root/.cache/unstructured/ingest/notion/fe36aac3b4/***.html"}, Database.__init__() got an unexpected keyword argument 'in_trash'
Traceback (most recent call last):
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/pipeline/source.py", line 62, in run
    return self.get_single(doc=doc, ingest_doc_dict=ingest_doc_dict)
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/pipeline/source.py", line 36, in get_single
    doc.get_file()
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/interfaces.py", line 523, in wrapper
    return func(self, *args, **kwargs)
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/utils/dep_check.py", line 45, in wrapper
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/connector/notion/connector.py", line 222, in get_file
    text_extraction = extract_database_html(
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/connector/notion/helpers.py", line 149, in extract_database_html
    database: Database = client.databases.retrieve(database_id=database_id)  # type: ignore
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/connector/notion/client.py", line 106, in retrieve
    return Database.from_dict(data=resp)
  File "/root/.pyenv/versions/bstacks/lib/python3.10/site-packages/unstructured_ingest/connector/notion/types/database.py", line 50, in from_dict
    page = cls(
TypeError: Database.__init__() got an unexpected keyword argument 'in_trash'

Database ID, Page ID, and API keys are redacted here.
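
For context, here is a minimal sketch of the failure mode the traceback points to. It assumes the Notion API response now contains an `in_trash` key that the connector's `Database` type does not declare. `StrictDatabase` and `TolerantDatabase` are hypothetical stand-ins written for illustration, not the actual classes in `unstructured_ingest`, and the field names other than `in_trash` are assumptions as well.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class StrictDatabase:
    """Hypothetical stand-in: forwards every payload key to __init__."""

    id: str
    archived: bool = False

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "StrictDatabase":
        # Any key the class does not declare raises TypeError here.
        return cls(**data)


@dataclass
class TolerantDatabase:
    """Hypothetical variant: declares in_trash and ignores unknown keys."""

    id: str
    archived: bool = False
    in_trash: Optional[bool] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TolerantDatabase":
        # Keep only the keys that match declared dataclass fields.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


# Assumed shape of the API response; real payloads carry many more keys.
payload = {"id": "db-123", "archived": False, "in_trash": False}

strict_error = ""
try:
    StrictDatabase.from_dict(payload)
except TypeError as exc:
    strict_error = str(exc)

tolerant = TolerantDatabase.from_dict(payload)
print(strict_error)
print(tolerant)
```

Filtering on the declared fields means a newly added response key would be dropped silently instead of aborting the whole ingest. Whether that trade-off is acceptable for the connector is a design decision for the maintainers.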

To Reproduce
I followed the Notion connector tutorial and used the code from this page verbatim: https://docs.unstructured.io/api-reference/ingest/source-connectors/notion

I am using Python 3.10 on Ubuntu 22.04 LTS.

Expected behavior
The Notion page is ingested and processed.

Environment Info

OS version:  Linux-5.15.0-25-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.15.5
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
Traceback (most recent call last):
  File "/root/unstructured/scripts/collect_env.py", line 242, in <module>
    main()
  File "/root/unstructured/scripts/collect_env.py", line 234, in main
    libreoffice_version = get_libreoffice_version()
  File "/root/unstructured/scripts/collect_env.py", line 163, in get_libreoffice_version
    result = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'
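
Side note on the traceback above: the environment script stopped because the `libreoffice` executable is not on this machine, which is unrelated to the Notion error. A minimal sketch of a version probe that reports a missing tool instead of raising is shown below. `get_tool_version` is a hypothetical helper written for illustration, not the function used in `collect_env.py`.

```python
import shutil
import subprocess
from typing import Optional


def get_tool_version(executable: str) -> Optional[str]:
    """Return the first line of `<executable> --version`, or None if unavailable."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(executable) is None:
        return None
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Treat launch failures and hangs the same as "not installed".
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


version = get_tool_version("libreoffice")
print("LibreOffice version: ", version or "not installed")
```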
