
Conversation

@SeanNaren SeanNaren commented Mar 30, 2021

What does this PR do?

Adds some more information about ZeRO Stage 3, but a reasonable amount of info is still missing. Lots of additional content to come once we iron out some of the current kinks with the API.

Fixes #<issue_number>

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren SeanNaren added the docs Documentation related label Mar 30, 2021
@SeanNaren SeanNaren requested review from carmocca, tchaton and a team March 30, 2021 18:14
@SeanNaren SeanNaren self-assigned this Mar 30, 2021

@kaushikb11 kaushikb11 left a comment


Awesome!

@Borda Borda added the ready PRs ready to be merged label Mar 30, 2021

@tchaton tchaton left a comment


Looks great !

@SeanNaren SeanNaren enabled auto-merge (squash) March 30, 2021 21:02
@codecov

codecov bot commented Mar 30, 2021

Codecov Report

Merging #6752 (20913f5) into master (583fcf2) will decrease coverage by 7%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #6752    +/-   ##
=======================================
- Coverage      91%     85%    -7%     
=======================================
  Files         192     192            
  Lines       12174   12731   +557     
=======================================
- Hits        11133   10808   -325     
- Misses       1041    1923   +882     

@SeanNaren SeanNaren merged commit f9bb7c6 into master Mar 30, 2021
@SeanNaren SeanNaren deleted the docs/zero branch March 30, 2021 21:52
DeepSpeed ZeRO Stage 3 shards the optimizer states, gradients and the model parameters (also optionally activations). Sharding model parameters and activations comes with an increase in distributed communication, but it allows you to scale your models massively from one GPU to multiple GPUs.
**The DeepSpeed team report the ability to fine-tune models with over 40B parameters on a single GPU and over 2 Trillion parameters on 512 GPUs.** For more information we suggest checking the `DeepSpeed ZeRO-3 Offload documentation <https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html>`__.
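Below is a minimal sketch of enabling this in Lightning, assuming the ``DeepSpeedPlugin`` with ``stage=3`` (exact argument names may differ between releases):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin

    # model is assumed to be a LightningModule; ZeRO Stage 3 shards its
    # parameters, gradients and optimizer states across the 4 GPUs below.
    trainer = Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(stage=3))
    trainer.fit(model)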

We've ran benchmarks and give a simple example of how all these features in Lightning, which you can see at `minGPT <https://github.com/SeanNaren/minGPT/tree/stage3>`_.


Suggested change
We've ran benchmarks and give a simple example of how all these features in Lightning, which you can see at `minGPT <https://github.com/SeanNaren/minGPT/tree/stage3>`_.
We've ran benchmarks and give a simple example of all these features in Lightning, which you can see at `minGPT <https://github.com/SeanNaren/minGPT/tree/stage3>`_.

of all these features
of how all these features work


.. note::
Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same amount of GPUS to load.
This additionally means for inference you must use the ``Trainer.test` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.

Suggested change
This additionally means for inference you must use the ``Trainer.test` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.
This additionally means for inference you must use the ``Trainer.test`` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.


unbalanced ticks
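For context, a minimal sketch of the inference path the note above describes, assuming the same plugin setup (``dm`` is a hypothetical ``LightningDataModule``):

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin

    # Use the Trainer entry points so the distributed environment is set up
    # before the sharded checkpoint is loaded on each process.
    trainer = Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(stage=3))
    trainer.test(model, datamodule=dm)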

This reduces the time taken to initialize very large models, as well as ensuring we do not run out of memory when instantiating larger models. For more information you can refer to the DeepSpeed docs for `Constructing Massive Models <https://deepspeed.readthedocs.io/en/latest/zero3.html>`_.
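A minimal sketch of the hook this refers to, assuming a toy model (the layer sizes are illustrative only):

.. code-block:: python

    import torch
    from pytorch_lightning import LightningModule

    class MyModel(LightningModule):
        def configure_sharded_model(self):
            # Create the large layers inside the hook so each process only
            # materializes its own partition of the weights.
            self.block = torch.nn.Sequential(
                torch.nn.Linear(32, 32),
                torch.nn.ReLU(),
                torch.nn.Linear(32, 2),
            )

        def forward(self, x):
            return self.block(x)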

.. note::
When using ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` for loading saved checkpoints may not work. If you've trained on one GPU, you can manually instantiate the model and call the hook,

Suggested change
When using ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` for loading saved checkpoints may not work. If you've trained on one GPU, you can manually instantiate the model and call the hook,
When using the ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` for loading saved checkpoints may not work. If you've trained on one GPU, you can manually instantiate the model and call the hook,
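As a rough sketch of the single-GPU workaround mentioned in the note, re-using the ``MyModel`` sketch above and assuming a standard Lightning checkpoint with a ``state_dict`` key (the path is illustrative):

    import torch

    model = MyModel()
    model.configure_sharded_model()  # run the hook manually so the layers exist
    checkpoint = torch.load("my_checkpoint.ckpt", map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])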

Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same amount of GPUS to load.
This additionally means for inference you must use the ``Trainer.test` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.

This limitation is actively being worked on and will be resolved in the near future.

totally naive idea:

we save the files like this:

checkpoint_name_shard0.ckpt
checkpoint_name_shard1.ckpt
checkpoint_name_shard2.ckpt
...

if the user loads a checkpoint that ends with _shard_x and we detect other shard checkpoints in the same folder, then we do a combined loading?
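A rough sketch of what that detection could look like (purely illustrative, not an existing API):

    import glob
    import re
    import torch

    def load_combined(path):
        # If the path looks like one shard, gather the sibling shards from the
        # same folder and merge their state dicts into a single dict.
        match = re.match(r"(.*)_shard\d+\.ckpt$", path)
        if match is None:
            return torch.load(path, map_location="cpu")
        merged = {}
        for shard_path in sorted(glob.glob(f"{match.group(1)}_shard*.ckpt")):
            shard = torch.load(shard_path, map_location="cpu")
            merged.update(shard["state_dict"])
        return merged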

DeepSpeed ZeRO Stage 3 Tips
"""""""""""""""""""""""""""

Here are some helpful information when setting up DeepSpeed ZeRO Stage 3 with Lightning.

is information plural here?

@awaelchli

too slow ^^ got merged while I was reading.
cheers, nice docs!

@kaushikb11 kaushikb11 added this to the 1.3 milestone Apr 5, 2021
@carmocca carmocca mentioned this pull request Apr 7, 2021