DeepSpeed ZeRO Docs update #6752
Conversation
Awesome!
Looks great!
Codecov Report
@@            Coverage Diff            @@
##           master   #6752     +/-   ##
=========================================
- Coverage      91%      85%      -7%
=========================================
  Files         192      192
  Lines       12174    12731     +557
=========================================
- Hits        11133    10808     -325
- Misses       1041     1923     +882
DeepSpeed ZeRO Stage 3 shards the optimizer states, gradients and the model parameters (also optionally activations). Sharding model parameters and activations comes with an increase in distributed communication, however allows you to scale your models massively from one GPU to multiple GPUs.
**The DeepSpeed team report the ability to fine-tune models with over 40B parameters on a single GPU and over 2 Trillion parameters on 512 GPUs.** For more information we suggest checking the `DeepSpeed ZeRO-3 Offload documentation <https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html>`__.

We've ran benchmarks and give a simple example of how all these features in Lightning, which you can see at `minGPT <https://github.com/SeanNaren/minGPT/tree/stage3>`_.
Suggested change:
- We've ran benchmarks and give a simple example of how all these features in Lightning, which you can see at `minGPT <https://github.com/SeanNaren/minGPT/tree/stage3>`_.
+ We've ran benchmarks and give a simple example of all these features in Lightning, which you can see at `minGPT <https://github.com/SeanNaren/minGPT/tree/stage3>`_.
of all these features
of how all these features work
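Below is a minimal sketch of how the Stage 3 setup described in the docs passage above is enabled through the Trainer; the ``DeepSpeedPlugin`` import path and its ``stage=3`` argument are assumed from the Lightning API of this era, and ``BoringModel`` is a throwaway module written only for this example:

    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader
    from pytorch_lightning.plugins import DeepSpeedPlugin  # import path assumed for this Lightning release


    class BoringModel(pl.LightningModule):
        """Throwaway module used only to show how the plugin is wired up."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    train_loader = DataLoader(torch.randn(64, 32), batch_size=2)

    trainer = pl.Trainer(
        gpus=4,
        precision=16,                      # ZeRO Stage 3 is normally paired with fp16
        plugins=DeepSpeedPlugin(stage=3),  # shard optimizer states, gradients and parameters
    )
    trainer.fit(BoringModel(), train_loader)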
.. note::
    Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same amount of GPUS to load.
    This additionally means for inference you must use the ``Trainer.test` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.
Suggested change:
- This additionally means for inference you must use the ``Trainer.test` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.
+ This additionally means for inference you must use the ``Trainer.test`` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.
unbalanced ticks
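To make the quoted note concrete, here is a short sketch of running inference through the Trainer rather than calling the model directly; ``EvalModel``, the loader and the ``DeepSpeedPlugin`` import path are all assumptions made only for the example:

    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader
    from pytorch_lightning.plugins import DeepSpeedPlugin  # import path assumed for this Lightning release


    class EvalModel(pl.LightningModule):
        """Throwaway module used only to illustrate Trainer-driven inference."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def test_step(self, batch, batch_idx):
            self.log("test_sum", self.layer(batch).sum())

        def predict_step(self, batch, batch_idx, dataloader_idx=None):
            return self.layer(batch)


    loader = DataLoader(torch.randn(64, 32), batch_size=2)
    trainer = pl.Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(stage=3))

    # Go through the Trainer instead of calling the model directly so the
    # distributed environment and the sharded state are set up consistently.
    trainer.test(EvalModel(), loader)
    predictions = trainer.predict(EvalModel(), loader)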
This reduces the time taken to initialize very large models, as well as ensure we do not run out of memory when instantiating larger models. For more information you can refer to the DeepSpeed docs for `Constructing Massive Models <https://deepspeed.readthedocs.io/en/latest/zero3.html>`_.

.. note::
    When using ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` for loading saved checkpoints may not work. If you've trained on one GPU, you can manually instantiate the model and call the hook,
Suggested change:
- When using ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` for loading saved checkpoints may not work. If you've trained on one GPU, you can manually instantiate the model and call the hook,
+ When using the ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` for loading saved checkpoints may not work. If you've trained on one GPU, you can manually instantiate the model and call the hook,
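A minimal sketch of the ``configure_sharded_model`` hook the quoted docs refer to; the module and layer sizes are hypothetical and only illustrate deferring construction until the sharded context is active:

    import torch
    import pytorch_lightning as pl


    class MassiveModel(pl.LightningModule):
        """Sketch: defer building large layers until the sharded context is active."""

        def configure_sharded_model(self):
            # Called by the Trainer once the DeepSpeed ZeRO Stage 3 context exists,
            # so these parameters can be created in partitioned form instead of
            # being materialised in full on a single device.
            self.block = torch.nn.Sequential(
                torch.nn.Linear(32, 32),
                torch.nn.ReLU(),
                torch.nn.Linear(32, 2),
            )

        def training_step(self, batch, batch_idx):
            return self.block(batch).sum()

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

Building the large layers inside the hook rather than in ``__init__`` is what keeps instantiation cheap and memory-safe, which is the point the quoted paragraph above makes.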
    Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same amount of GPUS to load.
    This additionally means for inference you must use the ``Trainer.test` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.

    This limitation is actively being worked on and will be resolved in the near future.
totally naive idea:
we save the files like this:
checkpoint_name_shard0.ckpt
checkpoint_name_shard1.ckpt
checkpoint_name_shard2.ckpt
...
if user loads a checkpoint that ends with _shard_x and we detect other shard checkpoints in the same folder, then we do a combined loading?
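Purely to illustrate the idea floated in this comment (nothing here is an existing Lightning or DeepSpeed API; the ``_shardN`` file naming and the merge logic are hypothetical), a combined-loading helper might look roughly like this:

    import glob
    import re

    import torch


    def load_combined_checkpoint(path):
        """Hypothetical sketch only: if ``path`` is one shard of a checkpoint,
        find its sibling shards and merge their state dicts."""
        match = re.match(r"(.*)_shard\d+\.ckpt$", path)
        if not match:
            return torch.load(path, map_location="cpu")

        merged_state = {}
        for shard_path in sorted(glob.glob(glob.escape(match.group(1)) + "_shard*.ckpt")):
            shard = torch.load(shard_path, map_location="cpu")
            # Assumes each shard carries a disjoint slice of the full state dict,
            # which is what non-elastic ZeRO Stage 3 checkpoints would contain.
            merged_state.update(shard["state_dict"])
        return {"state_dict": merged_state}

Whether shards can actually be merged this way depends on how DeepSpeed lays out its non-elastic checkpoints, so this only sketches the detection step of the proposal.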
DeepSpeed ZeRO Stage 3 Tips
"""""""""""""""""""""""""""

Here are some helpful information when setting up DeepSpeed ZeRO Stage 3 with Lightning.
is information plural here?
too slow ^^ got merged while I was reading.
What does this PR do?
Adds some more information about ZeRO Stage 3, but it is still missing a reasonable amount of info. Lots of additional stuff to come once we iron out some of the current kinks with the API.
Fixes #<issue_number>
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃