DeepSpeed Integration #5954
Conversation
The other thing I forgot to mention is the requirement for
Codecov Report
@@            Coverage Diff            @@
##            master    #5954   +/-   ##
=========================================
- Coverage       93%      93%      -0%
=========================================
  Files          160      160
  Lines        11343    11371      +28
=========================================
+ Hits         10554    10557       +3
- Misses         789      814      +25
Awesome addition!
"Within the DeepSpeed config, do not set gradient_accumulation_steps " | ||
"as this will be set via accumulate_grad_batches=x argument passed via the Lightning Trainer." | ||
) | ||
self.config["train_micro_batch_size_per_gpu"] = self.lightning_module.train_dataloader().batch_size |
What happens if the model doesn't have a train_dataloader, as it will be attached by the datamodule?
That's how I'm testing now, using a datamodule. I think internally the function also caches the train_dataloader, which means we don't create it twice.
I don't think this works if a batch_sampler was provided to the loader directly instead of a batch_size, right?
I think this logic can actually be omitted, as it seems it's only used for timer purposes.
> I think this logic can actually be omitted, as it seems it's only used for timer purposes.
This unfortunately cannot be omitted, there are some assertions internally that rely on this being set, even if it's just for throughput calculation.
I've added a comment to highlight that this default may be incorrect for certain use cases that use a BatchSampler. I think this is acceptable for now, considering that the DeepSpeed info messages that are printed are suppressed unless the user enables them.
To address this long term, we can make the change in the DeepSpeed repo to make this parameter optional for the DeepSpeedEngine.
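To make the BatchSampler caveat concrete, here is a minimal sketch (assuming a recent PyTorch version, where DataLoader.batch_size is None when a batch_sampler is supplied) of why the inferred micro-batch size can be wrong:

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

dataset = TensorDataset(torch.randn(8, 4))

# Plain loader: batch_size is recorded on the DataLoader and can be read back.
plain_loader = DataLoader(dataset, batch_size=4)
print(plain_loader.batch_size)  # 4

# Loader built from a BatchSampler: DataLoader.batch_size is None,
# so `train_micro_batch_size_per_gpu` cannot be inferred from it.
sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
sampled_loader = DataLoader(dataset, batch_sampler=sampler)
print(sampled_loader.batch_size)  # None
```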
"Within the DeepSpeed config, do not set gradient_accumulation_steps " | ||
"as this will be set via accumulate_grad_batches=x argument passed via the Lightning Trainer." | ||
) | ||
self.config["train_micro_batch_size_per_gpu"] = self.lightning_module.train_dataloader().batch_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this works, if not batchsize but directly a batchsampler was provided to the loader, right?
def batch_to(data):
    return data.half()

def _move_float_tensors_to_half(self, batch: Any):
Could be a staticmethod.
Agreed. It's not a huge issue, but I think this function will eventually be useful for other accelerators.
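For context, a self-contained sketch of what this half-precision conversion might look like, using Lightning's apply_to_collection utility (the module wiring and function names here are illustrative, not the PR's final code):

```python
from typing import Any

import torch
from pytorch_lightning.utilities.apply_func import apply_to_collection


def batch_to(data: torch.Tensor) -> torch.Tensor:
    # Cast a single float tensor to half precision.
    return data.half()


def move_float_tensors_to_half(batch: Any) -> Any:
    # Recurse through nested dicts/lists/tuples and cast only CPU float32
    # tensors (the CUDA variant could be matched similarly), leaving
    # integer/bool tensors and non-tensor items untouched.
    return apply_to_collection(batch, dtype=torch.FloatTensor, function=batch_to)
```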
precision = self.lightning_module.trainer.accelerator_backend.precision
model = LightningDeepSpeedModule(pl_module=self.model, precision=precision)

if self.lightning_module.trainer.training:
Smart!
""" | ||
Test to ensure that the plugin can be passed via a string with an environment variable. | ||
""" | ||
config_path = os.path.join(tmpdir, 'temp.json') |
Smart!
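A rough sketch of what that test exercises, assuming the plugin resolves the "deepspeed" string by reading the config path from the PL_DEEPSPEED_CONFIG_PATH environment variable introduced in this PR (the Trainer arguments and config contents are illustrative):

```python
import json
import os
import tempfile

from pytorch_lightning import Trainer

# Write a minimal DeepSpeed config to disk (contents illustrative).
tmpdir = tempfile.mkdtemp()
config_path = os.path.join(tmpdir, "temp.json")
with open(config_path, "w") as f:
    json.dump({"zero_optimization": {"stage": 2}}, f)

# Point the plugin at the config via the environment variable,
# then select the plugin by its string name.
os.environ["PL_DEEPSPEED_CONFIG_PATH"] = config_path
trainer = Trainer(gpus=1, precision=16, plugins="deepspeed", fast_dev_run=True)
```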
"You have not specified an optimizer or scheduler within the DeepSpeed config." | ||
"Using `configure_optimizers` to define optimizer and scheduler." | ||
) | ||
optimizer, lightning_scheduler, optimizer_frequencies = self._init_scheduler_optimizer() |
Love it!

def _initialize_deepspeed_train(self, model):
    optimizer, lightning_scheduler, optimizer_frequencies = None, None, None
    if "optimizer" not in self.config:
Could the user specify a scheduler but not the optimizer (we may choose the one from the config by default)?
Yes, they could!
I plan to do a few follow-up PRs to ease DeepSpeed integration in these cases; these are not super essential but very valid points :)
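For reference, this is roughly what "specifying the optimizer/scheduler within the DeepSpeed config" looks like (keys follow DeepSpeed's documented config schema; the values are illustrative). Omitting the "optimizer" block is what triggers the configure_optimizers fallback in the hunk above:

```python
# A DeepSpeed config that defines both the optimizer and the scheduler, so
# `configure_optimizers` is not consulted for them.
deepspeed_config = {
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 3e-5},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0, "warmup_max_lr": 3e-5, "warmup_num_steps": 100},
    },
}
```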
Amazing work!
What a beast of a plugin!
Co-authored-by: Adrian Wälchli <[email protected]>
@SeanNaren seems to be missing a changelog entry
reduce_bucket_size: int = 2e8,
zero_allow_untested_optimizer: bool = True,
config: Optional[Union[Path, str, dict]] = None,
logging_level: int = logging.WARN,
Why would you need a separate logging level? Shouldn't it default to the global level?
DeepSpeed prints a lot of messages; this helps to suppress some of its logging output, but the user can enable it should they wish!
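For illustration, a hedged sketch of how a dedicated logging_level might be applied (this assumes DeepSpeed exposes its logger as deepspeed.utils.logger; the exact call site in this PR may differ):

```python
import logging

from deepspeed.utils import logger as deepspeed_logger

# Quieten DeepSpeed's own logger without touching the global/root logging
# level, so the rest of the application keeps whatever level the user set.
deepspeed_logger.setLevel(logging.WARN)
```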
distributed_backend = "deepspeed"
DEEPSPEED_ENV_VAR = "PL_DEEPSPEED_CONFIG_PATH"

def __init__(
Are these defaults for most models, or just for very large ones?
Great point, it's set for large models, but it's going to be slow without some tuning.
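To make "some tuning" concrete, a hedged example of shrinking the bucket size for a smaller model (reduce_bucket_size is the parameter shown in this diff; the value and import path below are illustrative, not a recommendation):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

# Smaller buckets lower peak memory for modest models, at the cost of more
# (smaller) communication calls; very large models generally want the defaults.
trainer = Trainer(
    gpus=2,
    precision=16,
    plugins=DeepSpeedPlugin(reduce_bucket_size=5e7),
)
```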
if os.path.exists(config):
    with open(config) as f:
        config = json.load(f)
else:
    raise MisconfigurationException(
        f"You passed in a path to a DeepSpeed config but the path does not exist: {config}"
    )
Suggested change:
if not os.path.isfile(config):
    raise MisconfigurationException(
        f"You passed in a path to a DeepSpeed config but the path does not exist: {config}"
    )
with open(config) as f:
    config = json.load(f)
optimizers, schedulers, optimizer_frequencies = self.lightning_module.trainer.init_optimizers(
    self.lightning_module
)
if (len(optimizers) != 1) or len(schedulers) > 1:
Suggested change:
if len(optimizers) > 1 or len(schedulers) > 1:
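Either way, the supported shape from configure_optimizers is a single optimizer and at most one scheduler; roughly (a sketch, not taken from this PR):

```python
import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def configure_optimizers(self):
        # One optimizer and at most one scheduler: the shape this check accepts.
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [scheduler]
```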
# set optimizer for save/load, but deepspeed manages the specific optimizer logic
trainer = self.lightning_module.trainer
trainer.optimizers = [optimizer]
self.model = model
What is the difference between self.model and self._model below?
Good point, I think this is an artifact that should be fixed in the ddp.py file as well.
Yep, early on in the refactor we didn't have a setter yet, so we referred to _model; this seems to be a leftover :)
HorovodPlugin,
NativeMixedPrecisionPlugin,
Plugin,
Do we want to expose this one?
What does this PR do?
Closes #817.
Allows users to enable the DeepSpeed training type plugin. Requires some user constraints when training, as this library is built to be more research-focused.
The API: the plugin can be selected by its string name, configured via a DeepSpeed config file, or via a config object; a rough sketch follows.
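The snippet below is illustrative: it assumes DeepSpeedPlugin is importable from pytorch_lightning.plugins and that its config argument accepts either a path or a dict, per the signature shown later in this thread; the Trainer arguments and config values are placeholders.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

# 1. Select the plugin by name and rely on its defaults (or the env-var config):
trainer = Trainer(gpus=4, precision=16, plugins="deepspeed")

# 2. Point the plugin at a DeepSpeed JSON config on disk:
trainer = Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(config="/path/to/deepspeed_config.json"))

# 3. Build the config as a plain dict in code:
ds_config = {"zero_optimization": {"stage": 2}}
trainer = Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(config=ds_config))
```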
Limitations
The largest limitation is that currently we have to define the optimizer/scheduler within the configs for the DeepSpeed engine to initialise, hence initialisation of optimizers/schedulers via configure_optimizers is ignored. This needs to be made clear within the README. (Update: we now support configure_optimizers with 1 optimizer/scheduler, and DeepSpeed config options :))
Performance
Still need to address
save/load. This means some of the state variables are not saved/reloaded for training, so currently resume_from_checkpoint isn't supported, and a note has to be made in the docs for this.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list: