@mart-r mart-r commented Jul 17, 2025

This PR adds the option to save the output to disk when running multiprocessing.
The main goal was to keep the saved output as similar as possible to that of v1's cat.multiprocessing_batch_char_size.

It does so by adding a pair of new arguments to the CAT.get_entities_multi_texts method:

  • save_dir_path: Optional[str] = None - if specified, the directory the output is saved to
  • batches_per_save: int = 20 - the number of batches to save at once

In addition, it adds a new module:

  • medcat.storage.mp_ents_save
    • It is responsible for the saving action
    • It keeps the saving logic from cluttering the cat.py module
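
For readers unfamiliar with this kind of part-based saving, here is a minimal sketch of the idea. It is not the code in medcat.storage.mp_ents_save: the class name `PartSaver`, its methods, the part numbering starting at 0, and the contents of annotated_ids.pickle (a plain list of ids here) are all assumptions made for illustration. Only the file names part_<num>.pickle and annotated_ids.pickle and the `batches_per_save` idea come from the description above.

```python
import os
import pickle
from typing import Any, Dict, List


class PartSaver:
    """Hypothetical helper: buffers batch outputs and writes them out in parts."""

    def __init__(self, save_dir_path: str, batches_per_save: int = 20) -> None:
        self.save_dir_path = save_dir_path
        self.batches_per_save = batches_per_save
        self._buffer: Dict[Any, Any] = {}      # text id -> output, not yet saved
        self._batches_in_buffer = 0
        self._annotated_ids: List[Any] = []    # ids already written to disk
        self._num_parts = 0                    # assumed numbering: 0, 1, 2, ...
        os.makedirs(save_dir_path, exist_ok=True)

    def add_batch(self, batch: Dict[Any, Any]) -> None:
        """Buffer one batch; write a part once enough batches have accumulated."""
        self._buffer.update(batch)
        self._batches_in_buffer += 1
        if self._batches_in_buffer >= self.batches_per_save:
            self.flush()

    def flush(self) -> None:
        """Write the buffered output as the next part and update the id index."""
        if not self._buffer:
            return
        part_path = os.path.join(
            self.save_dir_path, f"part_{self._num_parts}.pickle")
        with open(part_path, "wb") as f:
            pickle.dump(self._buffer, f)
        self._annotated_ids.extend(self._buffer.keys())
        ids_path = os.path.join(self.save_dir_path, "annotated_ids.pickle")
        with open(ids_path, "wb") as f:
            pickle.dump(self._annotated_ids, f)
        self._num_parts += 1
        self._buffer = {}
        self._batches_in_buffer = 0
```

In this sketch the caller has to call `flush()` once more after the last batch, so that a final, partially filled part is not lost.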

There's also a small test suite to make sure that:

  • The saved files are in order
    • All indices are saved in the annotated_ids.pickle
    • All parts (part_<num>.pickle) are accounted for
  • All the output is saved
    • The output for every text is saved in one of the parts
    • The output from the method is equal to the saved data
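
The checks listed above could look roughly like the following. This is only a sketch and not the test code from this PR: the function name `check_saved_output` is made up, and it assumes that annotated_ids.pickle holds a plain collection of ids, that every part file holds a dict keyed by text id, and that parts are numbered from 0 without gaps.

```python
import os
import pickle
import re
from typing import Any, Dict


def check_saved_output(save_dir_path: str, returned: Dict[Any, Any]) -> None:
    """Raise AssertionError if the saved files disagree with the returned output."""
    # Collect the numbers of all part_<num>.pickle files in the directory.
    part_nums = sorted(
        int(m.group(1))
        for name in os.listdir(save_dir_path)
        if (m := re.fullmatch(r"part_(\d+)\.pickle", name))
    )
    # Assumption: parts are numbered consecutively from 0, so gaps mean a lost part.
    assert part_nums == list(range(len(part_nums))), f"missing parts: {part_nums}"

    # Merge the contents of every part into one mapping.
    merged: Dict[Any, Any] = {}
    for num in part_nums:
        with open(os.path.join(save_dir_path, f"part_{num}.pickle"), "rb") as f:
            merged.update(pickle.load(f))

    with open(os.path.join(save_dir_path, "annotated_ids.pickle"), "rb") as f:
        annotated_ids = pickle.load(f)

    # Every returned id must be indexed, and the saved data must match exactly.
    assert set(annotated_ids) == set(returned), "saved ids differ from returned ids"
    assert merged == returned, "saved data differs from returned data"
```

Here `returned` stands for whatever the multiprocessing call handed back, collected into a dict keyed by text id.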

PS:
The PR also fixes a fundamental multiprocessing issue where not all the data was actually processed,
and adds more comprehensive tests to make sure that does not become an issue going forward.

@mart-r mart-r merged commit 9ea9137 into main Jul 18, 2025
18 checks passed
@mart-r mart-r deleted the CU-8699upt9a-add-save-dir-path-to-get-entities branch July 18, 2025 08:57
mart-r added a commit that referenced this pull request Jul 18, 2025
* CU-8699upt9a: Add option to save multiprocessing output

* CU-8699upt9a: Add a test for multiprocessing saved data.

Make sure all the data is saved. That all the files are present. That the saved data is equal to the returned data.

* CU-8699upt9a: Fix typo in output saving

* CU-8699upt9a: Add more comprehensive multiprocessing tests with proper batching

* CU-8699upt9a: Fix issue with limited number of jobs submitted per process.

* CU-8699upt9a: Add a few more tests regarding multiprocessing with batches for saved data