
train_text_to_image_sdxl.py Can't save model at checkpoint #7311

@clement-swk

Describe the bug

I am trying to finetune SDXL, but the training script crashes when saving the model at a checkpoint. Training itself runs fine up to that point.

Reproduction

Here are my accelerate config choices:

  • This machine
  • No distributed training
  • No
  • No
  • yes (to use deepspeed)
  • no (don't specify a json)
  • 2 (deepspeed's zero optimization stage 2)
  • cpu (to offload optimizer states on the cpu)
  • none (don't offload parameters)
  • 4
  • no
  • no
  • 1
  • fp16 (mixed precision)

Then I run the following, taken from the example in examples/text_to_image/README_sdxl.md:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

Here I only lowered checkpointing_steps so that the error happens sooner:

accelerate launch train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=10000 \
  --use_8bit_adam \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=5 \
  --output_dir="sdxl-pokemon-model"

Logs

... (training starts) ...
Steps:   0%|          | 4/10000 [00:36<21:14:27,  7.65s/it, lr=1e-6, step_loss=0.0118][2024-03-14 02:46:19,909] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728

Steps:   0%|          | 5/10000 [00:38<21:02:03,  7.58s/it, lr=1e-6, step_loss=0.0118]03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving current state to sdxl-pokemon-model/checkpoint-5
03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2024-03-14 02:46:19,913] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1877: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2024-03-14 02:46:19,937] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt
[2024-03-14 02:46:19,937] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt...
[2024-03-14 02:46:36,762] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt.
[2024-03-14 02:46:36,766] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-03-14 02:47:03,094] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-03-14 02:47:03,095] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-03-14 02:47:03,095] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
03/14/2024 02:47:03 - INFO - accelerate.accelerator - DeepSpeed Model and Optimizer saved to output dir sdxl-pokemon-model/checkpoint-5/pytorch_model
Configuration saved in sdxl-pokemon-model/checkpoint-5/unet/config.json
Traceback (most recent call last):
  File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1312, in <module>
    main(args)
  File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1169, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2706, in save_state
    hook(self._models, weights, output_dir)
  File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 731, in save_model_hook
    model.save_pretrained(os.path.join(output_dir, "unet"))
  File "/root/diffusers/src/diffusers/models/modeling_utils.py", line 369, in save_pretrained
    safetensors.torch.save_file(
  File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 232, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 394, in _flatten
    raise RuntimeError(
RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'down_blocks.2.attentions.0.transformer_blocks.9.norm1.weight', 'up_blocks.0.attentions.0.transformer_blocks.5.attn2.to_out.0.bias', 'up_blocks.1.attentions.0.transformer_blocks.1.attn2.to_v.weight', 
...... (lots of layers) .....
'up_blocks.0.attentions.2.transformer_blocks.8.norm1.bias', 'up_blocks.0.attentions.0.transformer_blocks.1.attn2.to_out.0.bias'}].
            A potential way to correctly save your model is to use `save_model`.
            More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
            
[2024-03-14 02:47:08,088] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 10510) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1002, in launch_command
    deepspeed_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in deepspeed_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_text_to_image_sdxl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-14_02:47:08
  host      : 4e28de93c858
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 10510)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
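
The root cause seems to be safetensors itself: it refuses to serialize a state dict in which several keys alias the same underlying storage, and that is apparently what the ZeRO-wrapped UNet's state_dict() hands to save_pretrained. A minimal illustration of that failure mode (my assumption about the mechanism, not the exact SDXL state dict):

# Minimal sketch: two state-dict keys pointing at one storage trigger the
# same RuntimeError as in the logs above.
import torch
import safetensors.torch

w = torch.zeros(4)
state = {"layer1.weight": w, "layer2.weight": w}  # two keys, one storage

try:
    safetensors.torch.save_file(state, "shared.safetensors")
except RuntimeError as e:
    print(e)  # "Some tensors share memory, this will lead to duplicate memory on disk..."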

System Info

  • diffusers version: 0.27.0.dev0
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Huggingface_hub version: 0.21.4
  • Transformers version: 4.36.2
  • Accelerate version: 0.25.0
  • xFormers version: 0.0.24
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

I have an RTX 4090.
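
A possible workaround (untested sketch, under my assumptions) is to make the save hook in train_text_to_image_sdxl.py fall back to plain torch serialization when DeepSpeed is active. I'm assuming accelerator and os are in scope inside main() as in the original script, and that save_pretrained's safe_serialization flag behaves as documented; since accelerate already checkpoints the DeepSpeed engine state above, skipping the extra save entirely may be the cleaner fix:

from accelerate.utils import DistributedType

def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        for model in models:
            # safetensors chokes on the ZeRO-wrapped state dict, so fall
            # back to torch .bin serialization under DeepSpeed (assumption).
            use_safetensors = accelerator.distributed_type != DistributedType.DEEPSPEED
            model.save_pretrained(
                os.path.join(output_dir, "unet"),
                safe_serialization=use_safetensors,
            )
            # pop the weights so accelerate does not save them a second time
            if weights:
                weights.pop()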

Who can help?

@sayakpaul


Labels

bug (Something isn't working), stale (Issues that haven't received updates)
