
Commit efd3ec3

pdesupinski authored and markc-614 committed
Support NUMA Binding for Callable Entrypoints, Take 2 (pytorch#161183)
# Context In pytorch#160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See pytorch#160006 for a full description of the challenges. Although pytorch#160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set. # This PR On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](pytorch#160006 (comment)) Specifically, for each local rank `i`: 1. The parent process sets its own CPU affinity to what local rank `i`'s should be. 2. Then, the parent spawns the subprocess for local rank `i`. 3. Finally, the parent resets its own CPU affinity to what it was originally. There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable`, and it's pretty simple. This required a bit of refactoring: 1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`. 2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.). 3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process. 4. 
Use the context manager for both `str` and `Callable` paths # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled. Pull Request resolved: pytorch#161183 Approved by: https://github.com/d4l3k
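The three-step trick above (bind parent, spawn child, restore parent) can be sketched as a small context manager. This is a minimal illustration of the technique, not the actual PyTorch implementation; the names `temporarily_bind_to` and `spawn_with_affinity` are hypothetical.

```python
import contextlib
import multiprocessing
import os


@contextlib.contextmanager
def temporarily_bind_to(cpus):
    """Apply `cpus` as this process's CPU affinity; restore the original on exit."""
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)


def spawn_with_affinity(target, cpus):
    # The child inherits the parent's affinity at spawn time, so it
    # starts out bound to `cpus` without the parent ever invoking numactl.
    with temporarily_bind_to(cpus):
        proc = multiprocessing.Process(target=target)
        proc.start()
    return proc
```

Because the binding is applied and restored around each `Process.start()`, the same context manager works whether the entrypoint is a `str` command or a Python `Callable`. Note that `os.sched_setaffinity` is Linux-only.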
1 parent a73f248 commit efd3ec3

File tree

6 files changed

+353
-489
lines changed


docs/source/conf.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -1221,6 +1221,9 @@
     "reduce_typed_storage_child",
     "storage_from_cache",
     # torch.multiprocessing.spawn
+    # Added docstring for this but I think we need to go through
+    # and add the entire torch.multiprocessing.spawn module to a .rst...
+    "should_use_parallel_start",
     "start_processes",
     # torch.nn.functional
     "adaptive_max_pool1d_with_indices",  # documented as adaptive_max_pool1d
```
