Closed
Labels: fabric, lightning.fabric.Fabric, refactor, strategy: ddp, DistributedDataParallel
Description
Outline & Motivation
The DDP strategy and its subclasses have a feature called "deadlock detection and process reconciliation". It ensures that all processes terminate properly when an error occurs on a subset of the ranks. Without this feature, the processes on which no error occurred would continue to run and hang waiting at the next collective.
Pros:
- Can save costs when running in the cloud.
- No zombie processes to kill manually.
Cons:
- The implementation is hardcoded into DDPStrategy and does not work well with inheritance.
- It makes the trainer's exception handling complex.
- It assumes a shared filesystem.
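The shared-filesystem assumption comes from how such reconciliation is typically implemented: the failing rank drops a sentinel file, and the healthy ranks check for it before blocking on the next collective, exiting instead of hanging. A minimal sketch of that idea, assuming a shared directory visible to all ranks (class and method names here are illustrative, not Lightning's actual API):

```python
import os
import tempfile


class FileSystemReconciler:
    """Sketch of deadlock detection via a shared filesystem.

    The rank that hits an exception writes a sentinel file; other
    ranks poll for any sentinel before entering a collective and can
    terminate instead of hanging. Illustrative only, not Lightning's API.
    """

    def __init__(self, shared_dir: str, rank: int) -> None:
        self.shared_dir = shared_dir
        self.rank = rank

    def signal_failure(self) -> None:
        # Called on the rank where the exception occurred.
        path = os.path.join(self.shared_dir, f".rank_{self.rank}_failed")
        with open(path, "w") as f:
            f.write("failed")

    def peer_failed(self) -> bool:
        # Called by healthy ranks before blocking on a collective.
        return any(
            name.startswith(".rank_") and name.endswith("_failed")
            for name in os.listdir(self.shared_dir)
        )
```

Hardcoding this logic into one strategy class is exactly what the pitch below argues against; as a standalone plugin it could be reused by any strategy that satisfies the filesystem assumption.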
Pitch
- Remove the feature from PL strategies (Remove deadlock detection / process reconciliation logic #16204)
- Introduce it as a plugin under fabric
- Introduce Strategy.on_exception that the exception handler can call in a standardized way
- Re-introduce it in PL strategies once flattening preparations for Fabric integration are done
- Subtask: Add the plugin to other strategies such as deepspeed: DeepSpeed Training Deadlock #16518
A strategy can enable the plugin like so:
if not torch_greater_equal_foo and has_shared_filesystem:
    enable_plugin()
and by implementing
def on_exception(self, exception):
    self.plugin.reconciliate_processes(...)
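Putting the two pieces together, the trainer's exception handler could delegate to the strategy through the single standardized hook, and the strategy forwards to the plugin only when it was enabled. A hedged sketch of that call flow (the handler function and plugin wiring below are assumptions for illustration, not the proposed implementation):

```python
class Strategy:
    """Base strategy with an optional reconciliation plugin (sketch)."""

    def __init__(self, plugin=None):
        # Hypothetical attribute: the deadlock-detection plugin, or None.
        self.plugin = plugin

    def on_exception(self, exception: BaseException) -> None:
        # Default is a no-op; subclasses override as needed.
        pass


class DDPStrategy(Strategy):
    def on_exception(self, exception: BaseException) -> None:
        # Forward to the plugin only when it was enabled for this run.
        if self.plugin is not None:
            self.plugin.reconciliate_processes(str(exception))


def run_with_exception_handling(strategy: Strategy, fn) -> None:
    # Standardized trainer-side handler: one hook, identical for every
    # strategy, instead of DDP-specific logic in the trainer.
    try:
        fn()
    except Exception as e:
        strategy.on_exception(e)
        raise
```

The benefit is that the trainer no longer needs to know which strategies support reconciliation: strategies that don't enable the plugin simply inherit the no-op.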
Additional context
Credit for the ideas: @carmocca