Process reconciliation as a plugin #16410

@awaelchli

Outline & Motivation

The DDP strategy and its subclasses have a feature called "deadlock detection and process reconciliation". It ensures that all processes terminate properly when an error occurs on only a subset of the ranks. Without this feature, the ranks that did not hit an error would keep running and then hang at the next collective, waiting for ranks that have already failed.

Pro:

  • Can save you costs when running in the cloud.
  • No zombie processes that you have to kill manually.

Con:

  • Implementation is hardcoded into the DDPStrategy and does not work well with inheritance
  • Makes trainer exception handling complex
  • Assumes a shared filesystem

Pitch

  1. Remove the feature from PL strategies (see #16204, "Remove deadlock detection / process reconciliation logic")
  2. Introduce it as a plugin under fabric
  3. Introduce Strategy.on_exception that the exception handler can call in a standardized way
  4. Re-introduce it in PL strategies once flattening preparations for Fabric integration are done

A strategy can enable the plugin like so:

if not torch_greater_equal_foo and has_shared_filesystem:
    enable_plugin()

and by implementing

def on_exception(self, exception):
    self.plugin.reconciliate_processes(...)
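
To make the pitch more concrete, here is a minimal sketch of how the pieces could fit together. Everything in it is hypothetical and only for illustration: the class name `ProcessReconciliationPlugin`, the marker-file naming, the `failed_ranks` helper, the `has_shared_filesystem` argument, and the `run_with_exception_handling` function are assumptions, not the existing Lightning or Fabric API. The sketch only records failures on a shared filesystem; it does not terminate the other ranks.

```python
from pathlib import Path


class ProcessReconciliationPlugin:
    """Hypothetical plugin: ranks report failures via marker files in a shared directory."""

    def __init__(self, sync_dir, rank, world_size):
        self.sync_dir = Path(sync_dir)
        self.rank = rank
        self.world_size = world_size
        self.sync_dir.mkdir(parents=True, exist_ok=True)

    def reconciliate_processes(self, trace):
        # Write a marker so that the other ranks can see that this rank failed.
        (self.sync_dir / f"rank_{self.rank}.failed").write_text(trace)

    def failed_ranks(self):
        # Ranks that have written a failure marker so far.
        markers = self.sync_dir.glob("rank_*.failed")
        return sorted(int(p.stem.split("_")[1]) for p in markers)


class Strategy:
    def on_exception(self, exception):
        """Standardized hook for the exception handler. Does nothing by default."""


class DDPStrategy(Strategy):
    def __init__(self, rank, world_size, sync_dir, has_shared_filesystem=True):
        # The plugin is opt-in: only enabled when its assumptions hold.
        self.plugin = None
        if has_shared_filesystem:
            self.plugin = ProcessReconciliationPlugin(sync_dir, rank, world_size)

    def on_exception(self, exception):
        if self.plugin is not None:
            self.plugin.reconciliate_processes(repr(exception))


def run_with_exception_handling(strategy, fn):
    # The handler no longer needs to know which strategy it is dealing with.
    try:
        fn()
    except Exception as exception:
        strategy.on_exception(exception)
        raise
```

With this shape, the exception handler only ever calls `strategy.on_exception(...)`, and strategies that do not enable the plugin inherit the no-op hook.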

Additional context

Credit for the ideas goes to @carmocca.

cc @justusschock @awaelchli @carmocca
