🐛 Describe the bug
With CPU offload enabled in FSDP, the pin_memory operation intermittently fails with "CUDA error: invalid argument". We observed this on A40 but not on H100. See https://github.com/hao-ai-lab/FastVideo/actions/runs/15932900017/job/44946105934
Looking at the line, it seems confusing that the parameter is moved to CPU and then pinned to the GPU. AFAIK, pinning CPU memory (page-locked memory) can accelerate host-to-device transfers, but I don't know whether it even makes sense to pin a CPU tensor on a GPU.
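For reference, here is a minimal sketch of the pattern I understand this to be: copy to CPU, page-lock the host copy, then transfer back asynchronously. It is not the FSDP code itself, and the helper name offload_to_pinned_cpu is hypothetical. As far as I can tell, pin_memory() leaves the tensor on the CPU and only page-locks its host memory.

```python
import torch


def offload_to_pinned_cpu(param: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: copy a tensor to CPU and page-lock the copy if CUDA is usable."""
    cpu_tensor = param.detach().cpu()
    # Pinning goes through the CUDA runtime, so skip it on machines without CUDA.
    if torch.cuda.is_available() and not cpu_tensor.is_pinned():
        cpu_tensor = cpu_tensor.pin_memory()
    return cpu_tensor


device = "cuda" if torch.cuda.is_available() else "cpu"
param = torch.randn(4, 4, device=device)

offloaded = offload_to_pinned_cpu(param)
# The offloaded tensor still lives on the CPU; only its host memory is page-locked.
print(offloaded.device, offloaded.is_pinned())

if torch.cuda.is_available():
    # Pinned host memory lets the host-to-device copy run asynchronously.
    restored = offloaded.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
```

I have not confirmed that this standalone snippet reproduces the error on A40; it only illustrates the operation in question.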
Versions
The failure occurred in remote CI, so it is hard to collect the full environment details, but we used torch 2.7.1+cu128 on an A40 GPU.