Skip to content

Conversation

mart-r
Copy link
Collaborator

@mart-r mart-r commented Jul 8, 2025

Turns out the current multiprocessing pipeline had a few issues.
It would work fine if number of process wasn't specified (since it would do them in sequence).

However, if the number of processes specified was 2 (or greater) processing an empty list of texts would fail. This was becaue the assumption of the method was that work would be distributed accross all workers, and subsequently there's a future for every worked. But in case of an empty input there were no futures.

Furthermore, if the number of processes was 3 (or greater), passing a small list of input texts also often resulted in an exception being raised. This is because with 2 processes, only 1 external worker is spawned, and it gets some texts to manage, and subsequently leaves nothing for the main process (which is fine). But if there's 3 processes, only 1 of the 2 extra processes gets any work. And as such, the assumption of the number of futures available to be waited for was incorrect.

This PR fixes the above issues.
It also adds a few simple tests to make sure they remain fixed.
NOTE: I ran the new tests without the change in this PR and they did in fact error out (as was expected due to the above).

@mart-r mart-r merged commit 9b3ea10 into main Jul 8, 2025
13 checks passed
@mart-r mart-r deleted the CU-8699qbr8e-multiprocessing-empty-batches branch July 8, 2025 11:08
mart-r added a commit that referenced this pull request Jul 8, 2025
* CU-8699qbr8e: Fix issue with multiprocessing in case of empty batches

* CU-8699qbr8e: Add a few simple tests to make sure empty and small datasets can be multiprocessed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant