Description
This is opened as the continuation of GH-7962, GH-8161, GH-8162, GH-3981 and GH-4654 and is one of the approaches to solve GH-825.
Why multithreading?
One common inconvenience with using `pip` is the delay for networking, since most package indices are not really fast[citation needed] and during package management `pip` needs to fetch many things (the package list, the packages themselves, etc.). Parallelization is one obvious solution to tackle this, and I hope it will be the cheaper one, hence this issue is opened to ensure that the implementation process will not be labor-expensive work.
Until next year when Python 2 support is dropped, there are two options: multithreading and multiprocessing. While the latter is safer, (1) not every platform has multiple CPU cores and (2) the modified code would need to undergo a huge refactoring to give each process the data it needs. So we are left with multithreading. The Python 3 `asyncio` is not an immediate solution either (plus it would also require making many existing routines awaitable).
What is the problem with multithreading?
Putting thread-safety aside (not because it's not a problem, but rather because I think everyone knows how problematic it is), the most obvious solution Python provides, `multiprocessing.dummy.Pool`, requires `sem_open` (bpo-3770), which seems to raise `ImportError` during initialization of the pool's attributes when it is missing. Since `sem_open` is to be provided by the operating system, this raises two questions: whether `multiprocessing.dummy` is supported on all platforms that `pip` cares to support, and whether (the more generic?) `threading` suffers the same issue if we implement the `Pool` ourselves. How about `concurrent.futures` (GH-3981)? Would it be worth it, from the developers' perspective as well as that of our users, if things go wrong on their platform?
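For reference, a minimal sketch of what the `concurrent.futures` route could look like (this is my own illustration, not pip code, and the name `thread_map` is made up). To my knowledge `ThreadPoolExecutor` builds on `threading` primitives and does not go through `multiprocessing`'s `sem_open`-backed semaphores, which is part of its appeal here:

```python
from concurrent.futures import ThreadPoolExecutor

def thread_map(func, iterable, max_workers=8):
    # Run func over iterable in a thread pool; executor.map preserves
    # input order, so results line up with the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, iterable))
```

Whether this degrades gracefully on every platform pip supports is exactly the open question.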
If we decide to do it anyway, how?
From GH-8162, IMHO it is safe to assume (this is a really dangerous thing to say 😞) that we can fall back to `map` if `multiprocessing.dummy.Pool` can't have `sem_open`. If this works, I personally suggest declaring a higher-order function to reuse in other places, namely for parallel downloading of packages (GH-825). Still under the assumption that this is correct, we can easily mock the failing behavior for testing. However, with my modest experience in threading and the overwhelming responsibility of not breaking thousands[citation needed, could be millions] of people's workflows, please do not take my words for granted and kindly share your thoughts on this particular matter.