Description
Describe the bug
I am trying to fine-tune SDXL, but the training script crashes when saving the model at a checkpoint. Training itself runs fine up to that point.
Reproduction
Here are my accelerate config choices (a sketch of the resulting YAML is below the list):
- This machine
- No distributed training
- No
- No
- yes (to use deepspeed)
- no (don't specify a json)
- 2 (deepspeed's zero optimization stage 2)
- cpu (to offload optimizer states on the cpu)
- none (don't offload parameters)
- 4
- no
- no
- 1
- fp16
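For reference, those answers should produce a DeepSpeed-enabled accelerate config roughly like the sketch below (my reconstruction of ~/.cache/huggingface/accelerate/default_config.yaml; fields not covered by the prompts are left at their defaults):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                   # ZeRO optimization stage 2
  offload_optimizer_device: cpu   # optimizer states offloaded to CPU
  offload_param_device: none      # parameters kept on GPU
  gradient_accumulation_steps: 4
  zero3_init_flag: false
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false
```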
Then I run the following, taken from the example in examples/text_to_image/README_sdxl.md:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
Here I only lowered --checkpointing_steps so the error shows up sooner:
accelerate launch train_text_to_image_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path=$VAE_NAME \
--dataset_name=$DATASET_NAME \
--enable_xformers_memory_efficient_attention \
--resolution=512 --center_crop --random_flip \
--proportion_empty_prompts=0.2 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=10000 \
--use_8bit_adam \
--learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
--mixed_precision="fp16" \
--validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
--checkpointing_steps=5 \
--output_dir="sdxl-pokemon-model"
Logs
... (training starts) ...
Steps: 0%| | 4/10000 [00:36<21:14:27, 7.65s/it, lr=1e-6, step_loss=0.0118][2024-03-14 02:46:19,909] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
Steps: 0%| | 5/10000 [00:38<21:02:03, 7.58s/it, lr=1e-6, step_loss=0.0118]03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving current state to sdxl-pokemon-model/checkpoint-5
03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2024-03-14 02:46:19,913] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1877: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2024-03-14 02:46:19,937] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt
[2024-03-14 02:46:19,937] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt...
[2024-03-14 02:46:36,762] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt.
[2024-03-14 02:46:36,766] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-03-14 02:47:03,094] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-03-14 02:47:03,095] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-03-14 02:47:03,095] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
03/14/2024 02:47:03 - INFO - accelerate.accelerator - DeepSpeed Model and Optimizer saved to output dir sdxl-pokemon-model/checkpoint-5/pytorch_model
Configuration saved in sdxl-pokemon-model/checkpoint-5/unet/config.json
Traceback (most recent call last):
File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1312, in <module>
main(args)
File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1169, in main
accelerator.save_state(save_path)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2706, in save_state
hook(self._models, weights, output_dir)
File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 731, in save_model_hook
model.save_pretrained(os.path.join(output_dir, "unet"))
File "/root/diffusers/src/diffusers/models/modeling_utils.py", line 369, in save_pretrained
safetensors.torch.save_file(
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 232, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 394, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'down_blocks.2.attentions.0.transformer_blocks.9.norm1.weight', 'up_blocks.0.attentions.0.transformer_blocks.5.attn2.to_out.0.bias', 'up_blocks.1.attentions.0.transformer_blocks.1.attn2.to_v.weight',
...... (lots of layers) .....
'up_blocks.0.attentions.2.transformer_blocks.8.norm1.bias', 'up_blocks.0.attentions.0.transformer_blocks.1.attn2.to_out.0.bias'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
[2024-03-14 02:47:08,088] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 10510) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1002, in launch_command
deepspeed_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in deepspeed_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_text_to_image_sdxl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-14_02:47:08
host : 4e28de93c858
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10510)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
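For what it's worth, the failure comes from safetensors refusing to serialize a state dict in which several tensors share the same underlying storage. A standalone sketch (illustrative only, not taken from the training script) that triggers the same RuntimeError:

```python
import torch
import safetensors.torch

weight = torch.zeros(4)
alias = weight.view(4)  # shares memory with `weight`, like the duplicated UNet params above

try:
    # save_file() (called internally by save_pretrained, per the traceback) rejects shared tensors
    safetensors.torch.save_file({"a": weight, "b": alias}, "shared.safetensors")
except RuntimeError as err:
    print(err)  # "Some tensors share memory, this will lead to duplicate memory on disk ..."
```

The `save_model` suggested by the error message is `safetensors.torch.save_model`, which is meant to handle shared tensors by deduplicating them before writing.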
System Info
- diffusers version: 0.27.0.dev0
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- Huggingface_hub version: 0.21.4
- Transformers version: 4.36.2
- Accelerate version: 0.25.0
- xFormers version: 0.0.24
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
I have an RTX 4090.
Who can help?