DeepSpeed Integration #5954

Merged
merged 68 commits into from
Feb 17, 2021
Conversation

SeanNaren
Contributor

@SeanNaren SeanNaren commented Feb 13, 2021

What does this PR do?

Closes #817.

Allows users to enable the DeepSpeed training type plugin. Some constraints are placed on the user when training, as Lightning is built to be more research focused.

The API:

from pytorch_lightning import Trainer

model = MyModel()
trainer = Trainer(gpus=4, plugins='deepspeed', precision=16) # default enables ZeRO optimization/offload
trainer.fit(model)

Using config:

from pytorch_lightning import Trainer

model = MyModel()
trainer = Trainer(accelerator='deepspeed', gpus=4, deepspeed_config="/path/to/deepspeed_config.json", precision=16) # zero offload requires mixed precision
trainer.fit(model)

Or via config object:

from pytorch_lightning import Trainer

deepspeed_config = {
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-5,
            "betas": [0.998, 0.999],
            "eps": 1e-5,
            "weight_decay": 1e-9,
        },
    },
    'scheduler': {
        "type": "WarmupLR",
        "params": {
            "last_batch_iteration": -1,
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 100,
        }
    },
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "contiguous_gradients": True,
        "overlap_comm": True
    }
}

model = MyModel()
trainer = Trainer(accelerator='deepspeed', gpus=4, deepspeed_config=deepspeed_config, precision=16) # zero offload requires mixed precision
trainer.fit(model)

Limitations

  • The largest limitation is that currently we have to define the optimizer/scheduler within the config for the DeepSpeed engine to initialise; optimizers/schedulers created via configure_optimizers are therefore ignored. This needs to be made clear within the README. Update: we now support configure_optimizers with one optimizer/scheduler, as well as the DeepSpeed config options :)
  • A limitation of the current Lightning accelerator API means the precision plugin needs to contain logic for the loop even if precision is handled within the DeepSpeed plugin. This is hopefully temporary, until we decide where the precision logic should live.
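Since the update above says a single optimizer/scheduler from configure_optimizers is now supported, here is a minimal sketch of the expected return shape. The stand-in objects are hypothetical (a real module would return e.g. a torch.optim.Adam instance and a scheduler); only the single-element-list contract is illustrated.

```python
class MyModel:
    """Stand-in for a LightningModule: only the configure_optimizers
    return shape matters here, so no torch objects are created."""

    def configure_optimizers(self):
        optimizer = object()  # stand-in for e.g. torch.optim.Adam(self.parameters(), lr=3e-5)
        scheduler = object()  # stand-in for e.g. a WarmupLR-style scheduler
        # The DeepSpeed plugin supports one optimizer and at most one
        # scheduler, so both are returned as single-element lists.
        return [optimizer], [scheduler]

optimizers, schedulers = MyModel().configure_optimizers()
assert len(optimizers) == 1 and len(schedulers) == 1
```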

Performance

  • I have tested across large transformer models, similar to this and this, which showed really in-depth breakdowns of DeepSpeed, and have replicated similar results in a different training environment using the DeepSpeed plugin. Both also highlight the finicky parameter tuning needed to get optimal performance, so I'll be adding references to these great pieces of info and bringing some of that information into our docs to highlight this!

Still need to Address

  • Currently we do not re-initialize the DeepSpeed engine on save/load, so some of the engine's state variables are not saved and reloaded for training. As a result resume_from_checkpoint isn't supported yet, and a note has to be made in the docs for this.
  • Because we use the latest DeepSpeed release, AMD processors with 1-bit Adam will segfault, with a fix already being worked on here. This is one of the key pieces of the memory reduction, so I'll keep an eye on this and add info to the docs in the meantime. There are also some nice helper functions to reduce the logging which have been merged into DeepSpeed master but have not made it into a release yet!
  • Ensure the test function works; there are a few cases where it crashes (possibly due to allocating more memory).

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

@SeanNaren SeanNaren added the feature Is an improvement or enhancement label Feb 13, 2021
@SeanNaren SeanNaren added this to the 1.2 milestone Feb 13, 2021
@SeanNaren SeanNaren self-assigned this Feb 13, 2021
@SeanNaren
Contributor Author

The other thing I forgot to mention is that mpi4py is required until a new release of DeepSpeed is made; it may work out of the box for CI, I'm not sure.

@codecov

codecov bot commented Feb 13, 2021

Codecov Report

Merging #5954 (981e735) into master (e0bb33c) will decrease coverage by 0%.
The diff coverage is 97%.

@@          Coverage Diff           @@
##           master   #5954   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         160     160           
  Lines       11343   11371   +28     
======================================
+ Hits        10554   10557    +3     
- Misses        789     814   +25     

Contributor

@tchaton tchaton left a comment

Awesome addition !

"Within the DeepSpeed config, do not set gradient_accumulation_steps "
"as this will be set via accumulate_grad_batches=x argument passed via the Lightning Trainer."
)
self.config["train_micro_batch_size_per_gpu"] = self.lightning_module.train_dataloader().batch_size
Contributor

What happens if the model doesn't have a train_dataloader, as it will be attached by the datamodule?

Contributor Author

That's how I'm testing now, using a datamodule. I think internally the function also caches the train_dataloader, which means we don't create it twice.

Member

I don't think this works if not a batch_size but a batch_sampler was provided directly to the loader, right?

Contributor Author

I think this logic can actually be omitted, as it's only used for timing purposes, it seems.

Contributor Author

@SeanNaren SeanNaren Feb 16, 2021

I think this logic can actually be omitted, as it's only used for timing purposes, it seems.

This unfortunately cannot be omitted; there are some assertions internally that rely on this being set, even if it's just for throughput calculation.

I've added a comment to highlight that this default may be incorrect for certain uses that use a BatchSampler. I think for now this is acceptable considering that the DeepSpeed info messages that are printed are suppressed unless the user enables them.

To address this long term, we can make the change in the DeepSpeed repo to make this parameter optional for the DeepSpeedEngine

"Within the DeepSpeed config, do not set gradient_accumulation_steps "
"as this will be set via accumulate_grad_batches=x argument passed via the Lightning Trainer."
)
self.config["train_micro_batch_size_per_gpu"] = self.lightning_module.train_dataloader().batch_size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this works, if not batchsize but directly a batchsampler was provided to the loader, right?

def batch_to(data):
return data.half()

def _move_float_tensors_to_half(self, batch: Any):
Contributor

Could be a staticmethod.

Contributor Author

Agreed; not a huge issue, however, and I think this function will eventually be useful for other accelerators.
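The float-to-half conversion being discussed can be sketched as a type-dispatched recursive walk over a collection, in the spirit of Lightning's apply_to_collection utility. This is only an illustration: `apply_to_type` is a hypothetical helper, and plain Python floats stand in for torch tensors (the real code matches torch.Tensor and calls .half() on floating-point tensors).

```python
from typing import Any, Callable


def apply_to_type(data: Any, dtype: type, fn: Callable) -> Any:
    """Recursively apply fn to every element of the given type,
    leaving everything else untouched."""
    if isinstance(data, dtype):
        return fn(data)
    if isinstance(data, (list, tuple)):
        return type(data)(apply_to_type(d, dtype, fn) for d in data)
    if isinstance(data, dict):
        return {k: apply_to_type(v, dtype, fn) for k, v in data.items()}
    return data


# Floats stand in for float tensors; strings are passed through untouched.
halved = apply_to_type({"x": 2.0, "y": [4.0, "keep"]}, float, lambda v: v / 2)
print(halved)  # {'x': 1.0, 'y': [2.0, 'keep']}
```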

precision = self.lightning_module.trainer.accelerator_backend.precision
model = LightningDeepSpeedModule(pl_module=self.model, precision=precision)

if self.lightning_module.trainer.training:
Contributor

Smart !

"""
Test to ensure that the plugin can be passed via a string with an environment variable.
"""
config_path = os.path.join(tmpdir, 'temp.json')
Contributor

Smart !

"You have not specified an optimizer or scheduler within the DeepSpeed config."
"Using `configure_optimizers` to define optimizer and scheduler."
)
optimizer, lightning_scheduler, optimizer_frequencies = self._init_scheduler_optimizer()
Contributor

Love it !


def _initialize_deepspeed_train(self, model):
optimizer, lightning_scheduler, optimizer_frequencies = None, None, None
if "optimizer" not in self.config:
Contributor

@tchaton tchaton Feb 17, 2021

Could the user specify the scheduler and not the optimizer (we may choose the one from the config by default)?

Contributor Author

Yes they could!

I plan to do a few follow-up PRs to ease DeepSpeed integration in these cases; these are not super essential but very valid points :)

Contributor

@tchaton tchaton left a comment

Amazing work !

@mergify mergify bot removed the has conflicts label Feb 17, 2021
@SeanNaren SeanNaren enabled auto-merge (squash) February 17, 2021 18:49
Contributor

@awaelchli awaelchli left a comment

what a beast of a plugin!

@tchaton tchaton added the _Will label Feb 17, 2021
@Borda
Member

Borda commented Feb 17, 2021

@SeanNaren seems to be missing the changelog entry

reduce_bucket_size: int = 2e8,
zero_allow_untested_optimizer: bool = True,
config: Optional[Union[Path, str, dict]] = None,
logging_level: int = logging.WARN,
Member

why would you need a separate logging level? Shouldn't it default to the global level?

Contributor Author

There are a lot of messages output from DeepSpeed; this helps to suppress some of their logging messages, but the user can enable them should they wish!
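The suppression described here can be sketched with the standard logging module. The logger name "deepspeed" is an assumption for illustration; the plugin's logging_level argument plays the same role.

```python
import logging

# Assumption: DeepSpeed logs through a named logger, here "deepspeed".
# Raising its level hides INFO-level chatter while leaving the root
# logger (and the user's own loggers) untouched.
logging.getLogger("deepspeed").setLevel(logging.WARN)
```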

distributed_backend = "deepspeed"
DEEPSPEED_ENV_VAR = "PL_DEEPSPEED_CONFIG_PATH"

def __init__(
Member

are these defaults for most models, or just for very large ones?

Contributor Author

Great point, it's set for large models, but it's going to be slow without some tuning.

Comment on lines +161 to +167
if os.path.exists(config):
with open(config) as f:
config = json.load(f)
else:
raise MisconfigurationException(
f"You passed in a path to a DeepSpeed config but the path does not exist: {config}"
)
Member

Suggested change
if os.path.exists(config):
with open(config) as f:
config = json.load(f)
else:
raise MisconfigurationException(
f"You passed in a path to a DeepSpeed config but the path does not exist: {config}"
)
if not os.path.isfile(config):
raise MisconfigurationException(
f"You passed in a path to a DeepSpeed config but the path does not exist: {config}"
)
with open(config) as f:
config = json.load(f)
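The suggested guard-clause shape can be exercised end to end with a small sketch. Here `load_deepspeed_config` is a hypothetical helper, and FileNotFoundError stands in for Lightning's MisconfigurationException; a dict is passed through unchanged, anything else is treated as a path to a JSON file.

```python
import json
import os
import tempfile


def load_deepspeed_config(config):
    """Hypothetical helper in the guard-clause style suggested above:
    pass a dict through unchanged, otherwise treat the argument as a
    path to a JSON config file."""
    if isinstance(config, dict):
        return config
    if not os.path.isfile(config):
        raise FileNotFoundError(
            f"You passed in a path to a DeepSpeed config but the path does not exist: {config}"
        )
    with open(config) as f:
        return json.load(f)


# Round-trip a config through a temporary JSON file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"zero_optimization": {"stage": 2}}, f)
    path = f.name
loaded = load_deepspeed_config(path)
os.unlink(path)
print(loaded)  # {'zero_optimization': {'stage': 2}}
```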

optimizers, schedulers, optimizer_frequencies = self.lightning_module.trainer.init_optimizers(
self.lightning_module
)
if (len(optimizers) != 1) or len(schedulers) > 1:
Member

Suggested change
if (len(optimizers) != 1) or len(schedulers) > 1:
if len(optimizers) > 1 or len(schedulers) > 1:

# set optimizer for save/load, but deepspeed manages the specific optimizer logic
trainer = self.lightning_module.trainer
trainer.optimizers = [optimizer]
self.model = model
Member

what is the diff between self.model and self._model below?

Contributor Author

Good point, I think this is an artifact that should be fixed in the ddp.py file as well

Contributor

yep, early on in the refactor we didn't have a setter yet, so we referred to _model, and this seems to be a leftover :)

HorovodPlugin,
NativeMixedPrecisionPlugin,
Plugin,
Member

do we want to expose this one?

Labels
feature Is an improvement or enhancement
Development

Successfully merging this pull request may close these issues.

Add deepspeed support
8 participants