Description
CUDA pinned memory is important for efficient execution because it allows for faster data transfers and non-blocking CUDA copies.
The copy from regular (pageable) memory to pinned memory can take significant time. On my computer, copying a batch stored as a 256x3x224x224 FloatTensor takes about 110 ms. Currently we can only do the copy in the main process, because Tensors and Storages shared between processes are copied to shared memory that is not page-locked. For small conv nets on fast GPUs, we probably need to do the copy in the background.
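To put the 110 ms figure in perspective, here is a quick back-of-the-envelope calculation of the batch size and the effective copy bandwidth it implies. The shape and timing come from the text above; the helper names are only illustrative, and the 4 bytes per element assumes a 32-bit FloatTensor.

```python
# Rough size and bandwidth estimate for the batch described above.
BATCH_SHAPE = (256, 3, 224, 224)  # batch shape quoted in the text
BYTES_PER_FLOAT32 = 4             # a FloatTensor holds 32-bit floats
COPY_SECONDS = 0.110              # the ~110 ms copy time reported above


def batch_bytes(shape, itemsize=BYTES_PER_FLOAT32):
    """Return the number of bytes needed for a dense tensor of this shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


def effective_bandwidth(nbytes, seconds):
    """Return the implied copy throughput in bytes per second."""
    return nbytes / seconds


size = batch_bytes(BATCH_SHAPE)
bandwidth = effective_bandwidth(size, COPY_SECONDS)
print(f"{size / 1e6:.1f} MB per batch, {bandwidth / 1e9:.2f} GB/s effective")
# prints "154.1 MB per batch, 1.40 GB/s effective"
```

So each batch is roughly 154 MB, and the copy runs at about 1.4 GB/s, which is why keeping it on the main process can become a bottleneck for small networks.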
I believe we can page-lock the shared memory via cudaHostRegister. We would probably need to unregister it via cudaHostUnregister before freeing the memory.
This would require some knowledge of CUDA in the shared memory code, or at least a free hook from which to call cudaHostUnregister.
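A minimal sketch of the free-hook idea, using only the Python standard library. It is not how the shared memory code is actually structured: `HookedSharedBuffer`, `on_alloc`, and `on_free` are hypothetical names, and the hooks here just record that they ran. In a real implementation the hooks would be the place to call cudaHostRegister and cudaHostUnregister, so the shared memory code itself would not need to know about CUDA.

```python
from multiprocessing import shared_memory


class HookedSharedBuffer:
    """Shared-memory block that runs caller-supplied hooks around its lifetime.

    Hypothetical illustration: on_alloc stands in for a cudaHostRegister
    wrapper and on_free for a cudaHostUnregister wrapper.
    """

    def __init__(self, nbytes, on_alloc=None, on_free=None):
        self._shm = shared_memory.SharedMemory(create=True, size=nbytes)
        self._on_free = on_free
        self._closed = False
        if on_alloc is not None:
            try:
                on_alloc(self._shm.buf)
            except Exception:
                # Registration failed: release the memory without the free hook.
                self._on_free = None
                self.close()
                raise

    def close(self):
        """Run the free hook first, then release the shared memory."""
        if self._closed:
            return
        self._closed = True
        try:
            if self._on_free is not None:
                self._on_free(self._shm.buf)
        finally:
            self._shm.close()
            self._shm.unlink()


events = []
buf = HookedSharedBuffer(
    1024,
    on_alloc=lambda view: events.append("register"),
    on_free=lambda view: events.append("unregister"),
)
buf.close()
print(events)  # prints "['register', 'unregister']"
```

The point of the ordering in `close()` is that the unregister hook always runs while the memory is still mapped, which matches the requirement above that the memory be unregistered before it is freed.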