
Commit 9df97a7

Zhengjun Xing authored and meta-codesync[bot] committed
Fix OSError: [Errno 24] Too many open files in multi-copy benchmark (#5083)
Summary:
Pull Request resolved: #5083
X-link: https://github.com/facebookresearch/FBGEMM/pull/2089

When running benchmarks with a large number of copies, the process may raise:
OSError: [Errno 24] Too many open files.

Example command:

(fbgemm_gpu_env)$ ulimit -n 1048576
(fbgemm_gpu_env)$ python ./bench/tbe/tbe_inference_benchmark.py nbit-cpu \
    --num-embeddings=40000000 --bag-size=2 --embedding-dim=96 \
    --batch-size=162 --num-tables=8 --weights-precision=int4 \
    --output-dtype=fp32 --copies=96 --iters=30000

PyTorch multiprocessing provides two shared-memory strategies:

1. file_descriptor (default)
2. file_system

The default file_descriptor strategy uses file descriptors as shared-memory handles, which can result in a large number of open FDs when many tensors are shared. If the total number of open FDs exceeds the system limit and cannot be raised, the file_system strategy should be used instead.

This patch allows switching to the file_system strategy by setting:

export PYTORCH_SHARE_STRATEGY='file_system'

Reference: https://pytorch.org/docs/stable/multiprocessing.html#sharing-strategies

Pull Request resolved: #5037
Reviewed By: spcyppt
Differential Revision: D86135817
Pulled By: q10
fbshipit-source-id: 15f6fe7e1de5e9fef828f5a1496dc1cf9b41c293
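The switch described above relies only on public torch.multiprocessing APIs. A minimal standalone sketch of the same env-var-driven switch (not verbatim from the commit; the get_all_sharing_strategies() guard is an extra safeguard added here for illustration):

import os

import torch.multiprocessing as mp

# On Linux, the available strategies are "file_descriptor" and "file_system".
strategy = os.environ.get("PYTORCH_SHARE_STRATEGY")
if strategy is not None and strategy in mp.get_all_sharing_strategies():
    if mp.get_sharing_strategy() != strategy:
        # With "file_system", shared-memory handles are file names rather
        # than inherited file descriptors, so sharing many tensors no
        # longer consumes one open FD per shared block.
        mp.set_sharing_strategy(strategy)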
1 parent be1b514 commit 9df97a7

File tree: 1 file changed (+7, −0 lines)

fbgemm_gpu/fbgemm_gpu/tbe/bench/bench_runs.py

Lines changed: 7 additions & 0 deletions
@@ -153,6 +153,13 @@ def benchmark_cpu_requests_mp(
         float: The average runtime per iteration in seconds.

    """
+    import os
+
+    strategy = os.environ.get("PYTORCH_SHARE_STRATEGY")
+    current_strategy = torch.multiprocessing.get_sharing_strategy()
+    if strategy is not None and current_strategy != strategy:
+        torch.multiprocessing.set_sharing_strategy(strategy)
+
     cpu_bm_barrier.create_barrier(num_copies)
     worker_pool = torch.multiprocessing.Pool(num_copies)
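For context, a small self-contained demo of what the file_system strategy changes (a hypothetical example, not part of the commit; the bump worker and tensor shape are illustrative):

import torch
import torch.multiprocessing as mp

def bump(t: torch.Tensor) -> None:
    # In-place update; visible to the parent because t lives in shared memory.
    t.add_(1)

if __name__ == "__main__":
    # Shared-memory handles become file names instead of open FDs, so many
    # shared tensors do not exhaust the per-process file-descriptor limit.
    mp.set_sharing_strategy("file_system")
    t = torch.zeros(4).share_memory_()
    p = mp.Process(target=bump, args=(t,))
    p.start()
    p.join()
    print(t)  # tensor([1., 1., 1., 1.])

Per the linked PyTorch docs, file_system can leak shared-memory files if processes die unexpectedly, which is why file_descriptor remains the default and the patch only switches when the environment variable is set.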
