CU-8699upt9a Allow saving output onto disk when multiprocessing #52

mart-r · 2025-07-17T14:48:19Z

This PR allows saving output to disk when multiprocessing.
The main goal was to keep the saved output as similar as possible to that of v1's cat.multiprocessing_batch_char_size.

It does so by adding two new arguments to the CAT.get_entities_multi_texts method:

- save_dir_path: Optional[str] = None - if specified, the directory path where the output is saved
- batches_per_save: int = 20 - the number of batches to save at once
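A minimal sketch of how the two new arguments might be passed. The helper name `annotate_and_save` is hypothetical, and the `(text_id, text)` input format and the lazily produced return value are assumptions that are not confirmed by this PR description; only `save_dir_path` and `batches_per_save` come from the text above.

```python
from typing import Any, Iterable


def annotate_and_save(
    cat: Any,
    texts: Iterable[Any],
    out_dir: str,
    batches_per_save: int = 20,
) -> list:
    """Run multi-text annotation and ask for the output to be saved to disk.

    `cat` is expected to be a loaded CAT instance (assumption: it exposes
    get_entities_multi_texts with the two arguments added in this PR).
    """
    results = cat.get_entities_multi_texts(
        texts,
        save_dir_path=out_dir,              # new argument: where to save the data
        batches_per_save=batches_per_save,  # new argument: batches written at once
    )
    # Wrap in list() in case the method returns a lazy iterator (assumption).
    return list(results)
```

Leaving `save_dir_path` at its default of `None` should keep the previous behaviour, where nothing is written to disk.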

In addition, it adds a new module, medcat.storage.mp_ents_save, which is responsible for the saving, so that it doesn't clutter the cat.py module.
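To illustrate the general idea of saving every `batches_per_save` batches, here is a rough sketch. It is not the code in medcat.storage.mp_ents_save: the function `save_in_parts`, the dict-per-batch input, and the contents written to each file are all assumptions made for illustration. Only the file names `part_<num>.pickle` and `annotated_ids.pickle` come from this PR.

```python
import pickle
from pathlib import Path
from typing import Any, Dict, Iterable, List


def _dump(path: Path, obj: Any) -> None:
    with path.open("wb") as f:
        pickle.dump(obj, f)


def save_in_parts(
    batches: Iterable[Dict[str, Any]],
    save_dir_path: str,
    batches_per_save: int = 20,
) -> int:
    """Buffer batches and write one part file per `batches_per_save` batches.

    Each batch is assumed to map a text id to its output (hypothetical format).
    Returns the number of part files written.
    """
    out_dir = Path(save_dir_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    buffer: Dict[str, Any] = {}
    buffered = 0
    part_num = 0
    annotated_ids: List[str] = []
    for batch in batches:
        buffer.update(batch)
        annotated_ids.extend(batch.keys())
        buffered += 1
        if buffered == batches_per_save:
            _dump(out_dir / f"part_{part_num}.pickle", buffer)
            part_num += 1
            buffer, buffered = {}, 0
    if buffered:
        # Flush the final, possibly smaller, group of batches.
        _dump(out_dir / f"part_{part_num}.pickle", buffer)
        part_num += 1
    _dump(out_dir / "annotated_ids.pickle", annotated_ids)
    return part_num
```

Buffering several batches per file trades a little memory for fewer, larger writes; a smaller `batches_per_save` means less output is lost if a run is interrupted.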

There's also a small test suite to make sure that:

- The saved files are in order
  - All indices are saved in annotated_ids.pickle
  - All parts (part_<num>.pickle) are accounted for
- All the output is saved
  - The output for all the texts is saved across the various parts
  - The output from the method is equal to the saved data
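The checks above rely on reading the saved files back. A minimal sketch of doing that follows, assuming only the file names mentioned in this PR (`annotated_ids.pickle` and `part_<num>.pickle`). The helper names are hypothetical, and nothing is assumed about what each pickle contains.

```python
import pickle
import re
from pathlib import Path
from typing import Any, Iterator, Tuple

# Matches file names such as part_0.pickle or part_12.pickle
PART_PATTERN = re.compile(r"part_(\d+)\.pickle")


def iter_saved_parts(save_dir_path: str) -> Iterator[Tuple[int, Any]]:
    """Yield (part number, unpickled content) in numeric part order."""
    parts = []
    for path in Path(save_dir_path).iterdir():
        match = PART_PATTERN.fullmatch(path.name)
        if match:
            parts.append((int(match.group(1)), path))
    # Sort numerically so that part_10 comes after part_2.
    for num, path in sorted(parts):
        with path.open("rb") as f:
            yield num, pickle.load(f)


def load_annotated_ids(save_dir_path: str) -> Any:
    """Load the saved index file as-is."""
    with (Path(save_dir_path) / "annotated_ids.pickle").open("rb") as f:
        return pickle.load(f)
```

Only unpickle files from a directory you trust, since loading a pickle can execute arbitrary code.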

PS:
The PR also fixes a fundamental multiprocessing issue where not all the data was actually processed,
and adds more comprehensive tests to make sure that's not an issue going forward.


* CU-8699upt9a: Add option to save multiprocessing output
* CU-8699upt9a: Add a test for multiprocessing saved data.
  Make sure all the data is saved. That all the files are present. That the saved data is equal to the returned data.
* CU-8699upt9a: Fix typo in output saving
* CU-8699upt9a: Add more comprehensive multiprocessing tests with proper batching
* CU-8699upt9a: Fix issue with limited number of jobs submitted per process.
* CU-8699upt9a: Add a few more tests regarding multiprocessing with batches for saved data

mart-r added 6 commits July 17, 2025 14:50

CU-8699upt9a: Add option to save multiprocessing output

1be9f68

CU-8699upt9a: Add a test for multiprocessing saved data.

38a0496

Make sure all the data is saved. That all the files are present. That the saved data is equal to the returned data.

CU-8699upt9a: Fix typo in output saving

074f2fc

CU-8699upt9a: Add more comprehensive multiprocessing tests with proper batching

18d9f18

CU-8699upt9a: Fix issue with limited number of jobs submitted per process.

c599a1c

CU-8699upt9a: Add a few more tests regarding multiprocessing with batches for saved data

ef03148

mart-r merged commit 9ea9137 into main Jul 18, 2025
18 checks passed

mart-r deleted the CU-8699upt9a-add-save-dir-path-to-get-entities branch July 18, 2025 08:57
