Optimized Large-Scale IPC in Python with Shared Memory Queues


Install using:

pip install py-sharedmemory

On Windows, Python's standard multiprocessing.Queue relies on _winapi.CreateFile for inter-process communication (IPC): every item is pickled and the entire payload is pushed through a pipe, introducing significant I/O overhead. This can become a performance bottleneck in demanding applications such as Reinforcement Learning, scientific computing, or distributed systems that move large amounts of data (e.g. NumPy arrays or tensors) between processes (actors, replay buffers, trainers, etc.).

py-sharedmemory provides an alternative that stores the actual payload in multiprocessing.shared_memory (backed by _winapi.CreateFileMapping on Windows) and sends only lightweight metadata through the queue. This eliminates most inter-process I/O, reducing system load and latency. If you're hitting performance limits with standard queues, py-sharedmemory may help.
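To make the mechanism concrete, here is a minimal standard-library sketch of the same idea (an illustration only, not py-sharedmemory's actual implementation): the payload lives in a multiprocessing.shared_memory block, and only its name and size travel through the queue. The manual acknowledgement and cleanup below are exactly the lifetime management the library handles for you.

import multiprocessing as mp
from multiprocessing import shared_memory

def producer(q: mp.Queue, ack: mp.Queue):
    payload = b"x" * 10_000_000                  # ~10 MB payload
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload             # write the payload into shared memory
    q.put((shm.name, len(payload)))              # only lightweight metadata crosses the queue
    ack.get()                                    # keep the block alive until the consumer is done
    shm.close()

def consumer(q: mp.Queue, ack: mp.Queue):
    name, size = q.get()
    shm = shared_memory.SharedMemory(name=name)  # attach by name; no payload I/O through the pipe
    data = bytes(shm.buf[:size])                 # copy the payload out of shared memory
    shm.close()
    shm.unlink()                                 # release the block
    ack.put(True)
    print(f"received {len(data):,} bytes")

if __name__ == "__main__":
    q, ack = mp.Queue(), mp.Queue()
    p = mp.Process(target=producer, args=(q, ack))
    c = mp.Process(target=consumer, args=(q, ack))
    p.start(); c.start()
    p.join(); c.join()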


Usage example

import multiprocessing as mp
from memory import create_shared_memory_pair, SharedMemorySender, SharedMemoryReceiver

def producer_sm(sender: SharedMemorySender):
    your_data = "your data"

    sender.put(your_data) # blocks until space is available
    sender.put(your_data, timeout=3) # raises queue.Full exception after 3s
    sender.put(your_data, block=False) # raises queue.Full exception if no space available
    sender.put_nowait(your_data) # ^ equivalent to above
    # ...

    # wait for all data to be received before closing the sender
    # to properly close all shared memory objects
    sender.wait_for_all_ack()

def consumer_sm(receiver: SharedMemoryReceiver):
    data = receiver.get() # blocks
    data = receiver.get(timeout=3) # raises queue.Empty exception after 3s
    data = receiver.get(block=False) # raises queue.Empty exception if no data available
    data = receiver.get_nowait() # ^ equivalent to above
    # ...

if __name__ == '__main__':
    sender, receiver = create_shared_memory_pair(capacity=5)
    mp.Process(target=producer_sm, args=(sender,)).start()
    mp.Process(target=consumer_sm, args=(receiver,)).start()
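Since the motivating workloads move NumPy arrays, here is a hedged sketch of the same pattern with an array payload (assuming, as the description above suggests, that put()/get() accept arbitrary picklable objects such as arrays):

import numpy as np

def producer_array(sender: SharedMemorySender):
    batch = np.random.rand(1024, 1024)   # ~8 MB of float64 data
    sender.put(batch)                    # the array payload travels via shared memory
    sender.wait_for_all_ack()

def consumer_array(receiver: SharedMemoryReceiver):
    batch = receiver.get()
    print(batch.shape, batch.dtype)      # (1024, 1024) float64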

Considerations

There is a certain overhead to allocating shared memory, which is especially noticeable for smaller objects. Use the following heuristic depending on the size of the data you are handling (a code sketch follows the table):

Data size       Recommended
10B – 100KB     mp.Queue()
1MB – 10GB      py-sharedmemory
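If your payload sizes vary, the rule of thumb can be written down directly. The helper below is a hypothetical sketch (choose_transport and the ~1MB threshold are illustrative, derived from the benchmark below, not part of the library):

import pickle

def choose_transport(obj, threshold: int = 1_000_000) -> str:
    # Rough guide: below ~1 MB the standard queue wins; at or above, shared memory.
    size = len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return "py-sharedmemory" if size >= threshold else "mp.Queue"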

Performance Testing

I benchmarked data transfer performance using both the standard multiprocessing.Queue and my py-sharedmemory implementation:

(Figure: performance comparison, multiprocessing.Queue vs. py-sharedmemory across message sizes)

Starting around 1MB per message, py-sharedmemory matches or slightly trails the standard queue in speed. However, the key advantage is that it avoids generating I/O, which becomes critical at larger data sizes. Notably, the standard implementation fails with 10GB messages, while py-sharedmemory handles them reliably.
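To reproduce a comparison like this on your own machine, a simple round-trip harness is enough. The sketch below times the standard multiprocessing.Queue; a py-sharedmemory variant would swap in create_shared_memory_pair with the same put/get calls (this is an illustrative harness, not the exact benchmark used above):

import multiprocessing as mp
import time

def _echo(q_in: mp.Queue, q_out: mp.Queue, n: int):
    for _ in range(n):
        q_out.put(q_in.get())   # bounce each message straight back

def bench_mp_queue(size: int, n: int = 20) -> float:
    payload = b"x" * size
    q_in, q_out = mp.Queue(), mp.Queue()
    p = mp.Process(target=_echo, args=(q_in, q_out, n))
    p.start()
    start = time.perf_counter()
    for _ in range(n):
        q_in.put(payload)
        q_out.get()
    elapsed = time.perf_counter() - start
    p.join()
    return elapsed / n          # mean round-trip time per message

if __name__ == "__main__":
    for size in (1_000, 1_000_000, 10_000_000):
        print(f"{size:>12,} B: {bench_mp_queue(size):.4f} s/round-trip")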

Here’s the I/O load on my Windows system using the standard queue:

(Figure: I/O load with the standard multiprocessing.Queue)

And here’s the I/O load using py-sharedmemory:

(Figure: I/O load with py-sharedmemory)

In practice, py-sharedmemory delivers smoother and more stable performance, with consistent put/get times and no slowdowns, especially under high data throughput.
