
Limit the reading size from Unix sockets to avoid memory overallocation #123557


Open
aplaikner opened this issue Sep 1, 2024 · 0 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement


aplaikner commented Sep 1, 2024

Feature or enhancement

Proposal:

This issue is strongly related to the following issue, where code was merged to limit the reading size of pipes to avoid memory overallocation and runtime slowdowns: #121313

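import multiprocessing


def sender(pipe, data):
    # Send the whole payload as a single message, then close this end.
    pipe.send(data)
    pipe.close()


def receiver(pipe):
    # recv() blocks until the complete message has been read.
    r = pipe.recv()
    print(f"Received data size: {len(r)} Bytes")


# The guard keeps the script working with the "spawn" start method as well.
if __name__ == "__main__":
    socket1, socket2 = multiprocessing.Pipe()

    data = 'a' * (512**3)  # 128 MiB of ASCII characters

    p1 = multiprocessing.Process(target=sender, args=(socket2, data))
    p2 = multiprocessing.Process(target=receiver, args=(socket1,))

    p1.start()
    p2.start()

    p1.join()
    p2.join()
    print("Transfer complete.")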

Consider the simple Python script above, which uses multiprocessing.Pipe() to create what appears to be a pipe between two processes so that one process can send data to the other. One would expect a Unix pipe to be created, but that is not the case.

By default, multiprocessing.Pipe() creates what the documentation describes as a bidirectional pipe. A closer look at the multiprocessing source code shows that this is actually a pair of Unix sockets.
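The claim above can be checked directly by inspecting the file type of each connection end. The following is a minimal sketch that assumes a Unix system; the helper names endpoint_kind() and pipe_kinds() are made up for illustration and are not part of multiprocessing.

```python
import multiprocessing
import os
import stat


def endpoint_kind(conn):
    """Classify the file descriptor behind a Connection (hypothetical helper)."""
    mode = os.fstat(conn.fileno()).st_mode
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISFIFO(mode):
        return "pipe"
    return "other"


def pipe_kinds(duplex):
    """Create a Pipe() and report what kind of object backs each end."""
    a, b = multiprocessing.Pipe(duplex=duplex)
    try:
        return endpoint_kind(a), endpoint_kind(b)
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print("duplex=True: ", pipe_kinds(True))
    print("duplex=False:", pipe_kinds(False))
```

On Unix, the duplex case should report sockets, while Pipe(duplex=False) should report real pipes, which is the case already covered by #121313.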

When reading from such a Unix socket, the same behavior can be observed as when reading from a pipe: the _recv() function is called with the total remaining amount of data to be read, and that value is passed down to os_read_impl(). This results in the allocation of a huge VMA, the installation of a PMD-sized THP, and then the resizing and finally the unmapping of the VMA. Since the problem is nearly identical for sockets and pipes, I've slightly extended the previously merged pipe solution to mitigate it for sockets as well:

        is_pipe = is_socket = False
        if size > self._default_pipe_size > 0:
            mode = os.fstat(handle).st_mode
            is_pipe = stat.S_ISFIFO(mode)
            is_socket = stat.S_ISSOCK(mode)
        limit = self._default_pipe_size if is_pipe or is_socket else remaining
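To make the effect of the fragment above easier to see in isolation, here is a standalone sketch of a capped read loop. It is not the code from multiprocessing.connection: the names CHUNK_LIMIT, chunk_limit_for() and read_exact() are hypothetical, and the 64 KiB constant is hard-coded here as an assumption instead of being derived from the system.

```python
import io
import os
import stat

# Hypothetical constant for this sketch; 64 KiB mirrors the limit discussed in this issue.
CHUNK_LIMIT = 64 * 1024


def chunk_limit_for(fd, size, default=CHUNK_LIMIT):
    """Return the per-read cap: `default` for large reads from pipes or sockets, else `size`."""
    if size > default > 0:
        mode = os.fstat(fd).st_mode
        if stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode):
            return default
    return size


def read_exact(fd, size, default=CHUNK_LIMIT):
    """Read exactly `size` bytes from `fd`, never requesting more than the cap per read."""
    buf = io.BytesIO()
    remaining = size
    limit = chunk_limit_for(fd, size, default)
    while remaining > 0:
        chunk = os.read(fd, min(limit, remaining))
        if not chunk:
            raise EOFError("peer closed the connection before all data arrived")
        buf.write(chunk)
        remaining -= len(chunk)
    return buf.getvalue()
```

With the cap in place, each os.read() call asks for at most 64 KiB, so the per-read buffer stays small regardless of how much data is still outstanding.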

The only difference between pipes and sockets is the amount of data returned by each read() system call. Pipes have a default limit of 64 KiB, whereas strace shows that Unix sockets can, by default, transfer around 200 KiB-300 KiB per read().
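As an alternative to strace, the amount of data returned per read() can be observed from Python itself. This is a rough sketch under the assumption of a Unix system; observed_read_sizes() is a hypothetical helper, and the sizes it reports depend on the kernel, its settings, and scheduling.

```python
import os
import socket
import threading


def observed_read_sizes(total=4 * 1024 * 1024):
    """Send `total` bytes over a Unix socket pair and record what each os.read() returns."""
    reader, writer = socket.socketpair()
    payload = b"a" * total

    def send_all():
        writer.sendall(payload)
        writer.close()

    t = threading.Thread(target=send_all)
    t.start()

    sizes = []
    remaining = total
    fd = reader.fileno()
    while remaining > 0:
        # Deliberately request everything that is left, as the unpatched code does.
        chunk = os.read(fd, remaining)
        if not chunk:
            break
        sizes.append(len(chunk))
        remaining -= len(chunk)

    t.join()
    reader.close()
    return sizes


if __name__ == "__main__":
    sizes = observed_read_sizes()
    print(f"{len(sizes)} reads, largest {max(sizes)} bytes, smallest {min(sizes)} bytes")
```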

Note that on Linux the default socket buffer sizes can be printed from /proc/sys/net/core/rmem_default and /proc/sys/net/core/wmem_default. These values are not really relevant here since, due to process interleaving, reading more data than that is possible.
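For reference, these defaults can also be read programmatically. The sketch below assumes Linux; core_default() and socketpair_buffer_sizes() are hypothetical helper names, and the /proc entries may be absent in some containers, in which case None is returned.

```python
import socket
from pathlib import Path


def core_default(name):
    """Read an integer from /proc/sys/net/core/<name>, or return None if unavailable."""
    path = Path("/proc/sys/net/core") / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def socketpair_buffer_sizes():
    """Return (SO_RCVBUF, SO_SNDBUF) as reported for one end of a fresh socket pair."""
    a, b = socket.socketpair()
    try:
        return (
            a.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        )
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print("rmem_default:", core_default("rmem_default"))
    print("wmem_default:", core_default("wmem_default"))
    print("socketpair (rcvbuf, sndbuf):", socketpair_buffer_sizes())
```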

Furthermore, although it is possible to read 200 KiB-300 KiB at once, testing has shown that using the same limit as the pipe buffer solution (64 KiB on systems with 4 KiB base pages) results in the best performance.

Here are some performance numbers: without the patch, the runtime of the code snippet above is approx. 0.234 s; with the patch it is approx. 0.135 s, a speedup of about 1.7x.

I've also done some testing with larger limits closer to the actual read size of a Unix socket, such as 250 KiB. The problem there is that, although no new VMA is created for the above code (meaning the data is put onto the default heap), the heap top is constantly shifted because the input buffers are deleted after each read round.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse.

Links to previous discussion of this feature:

https://discuss.python.org/t/request-for-review-of-gh-121313-limit-the-reading-size-from-pipes-to-their-default-buffer-size-on-unix-systems/62389/3

Linked PRs

@aplaikner aplaikner added the type-feature A feature request or enhancement label Sep 1, 2024
@picnixz picnixz added the stdlib Python modules in the Lib dir label Sep 1, 2024