Closed
Labels: fabric, lightning.fabric.Fabric, refactor, strategy: ddp, DistributedDataParallel
Description
Outline & Motivation
The DDP strategy and its subclasses have a feature called "deadlock detection and process reconciliation". It ensures that all processes terminate properly when an error occurs on a subset of the ranks. Without this feature, the processes on which no error occurred would continue to run and hang waiting at the next collective.
Pros:
- Can save costs when running in the cloud.
- No zombie processes to kill manually.
Cons:
- The implementation is hardcoded into DDPStrategy and does not work well with inheritance.
- It makes the trainer's exception handling complex.
- It assumes a shared filesystem.
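The shared-filesystem assumption comes from how such reconciliation is typically implemented: the failing rank drops a sentinel file, and the healthy ranks check for it before blocking on the next collective, exiting instead of hanging. A minimal sketch of that idea, assuming a shared directory visible to all ranks (class and method names here are illustrative, not Lightning's actual API):

```python
import os
import tempfile


class FileSystemReconciler:
    """Sketch of deadlock detection via a shared filesystem.

    The rank that hits an exception writes a sentinel file; other
    ranks poll for any sentinel before entering a collective and can
    terminate instead of hanging. Illustrative only, not Lightning's API.
    """

    def __init__(self, shared_dir: str, rank: int) -> None:
        self.shared_dir = shared_dir
        self.rank = rank

    def signal_failure(self) -> None:
        # Called on the rank where the exception occurred.
        path = os.path.join(self.shared_dir, f".rank_{self.rank}_failed")
        with open(path, "w") as f:
            f.write("failed")

    def peer_failed(self) -> bool:
        # Called by healthy ranks before blocking on a collective.
        return any(
            name.startswith(".rank_") and name.endswith("_failed")
            for name in os.listdir(self.shared_dir)
        )
```

Hardcoding this logic into one strategy class is exactly what the pitch below argues against; as a standalone plugin it could be reused by any strategy that satisfies the filesystem assumption.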
Pitch
- Remove the feature from PL strategies (Remove deadlock detection / process reconciliation logic #16204)
- Introduce it as a plugin under fabric
- Introduce Strategy.on_exception that the exception handler can call in a standardized way
- Re-introduce it in PL strategies once flattening preparations for Fabric integration are done
- Subtask: Add the plugin to other strategies such as deepspeed: DeepSpeed Training Deadlock #16518
A strategy can enable the plugin like so:
if not torch_greater_equal_foo and has_shared_filesystem:
    enable_plugin()
and by implementing
def on_exception(self, exception):
    self.plugin.reconciliate_processes(...)
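Putting the two pieces together, the trainer's exception handler could delegate to the strategy through the single standardized hook, and the strategy forwards to the plugin only when it was enabled. A hedged sketch of that call flow (the handler function and plugin wiring below are assumptions for illustration, not the proposed implementation):

```python
class Strategy:
    """Base strategy with an optional reconciliation plugin (sketch)."""

    def __init__(self, plugin=None):
        # Hypothetical attribute: the deadlock-detection plugin, or None.
        self.plugin = plugin

    def on_exception(self, exception: BaseException) -> None:
        # Default is a no-op; subclasses override as needed.
        pass


class DDPStrategy(Strategy):
    def on_exception(self, exception: BaseException) -> None:
        # Forward to the plugin only when it was enabled for this run.
        if self.plugin is not None:
            self.plugin.reconciliate_processes(str(exception))


def run_with_exception_handling(strategy: Strategy, fn) -> None:
    # Standardized trainer-side handler: one hook, identical for every
    # strategy, instead of DDP-specific logic in the trainer.
    try:
        fn()
    except Exception as e:
        strategy.on_exception(e)
        raise
```

The benefit is that the trainer no longer needs to know which strategies support reconciliation: strategies that don't enable the plugin simply inherit the no-op.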
Additional context
Credit for the ideas: @carmocca