Accelerate error when training with train_dreambooth_lora_sdxl_advanced.py #7728

Closed · nayan-dhabarde opened this issue Apr 20, 2024 · 15 comments · Fixed by #10014
Labels: bug (Something isn't working)

Comments

nayan-dhabarde commented Apr 20, 2024

Describe the bug

Encountered this error, with zero useful information in the output, when using train_dreambooth_lora_sdxl_advanced.py:

Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sdxl_advanced.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=training_images', '--instance_prompt=photo of ohwx man', '--class_prompt=photo of man', '--class_data_dir=man_dataset', '--output_dir=result', '--mixed_precision=fp16', '--resolution=1024', '--num_train_epochs=10', '--with_prior_preservation', '--prior_loss_weight=1.0', '--train_batch_size=1', '--repeats=20', '--gradient_accumulation_steps=1', '--train_text_encoder', '--gradient_checkpointing', '--learning_rate=1e-4', '--text_encoder_lr=5e-5', '--optimizer=adamW', '--num_class_images=3000', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--rank=128', '--seed=0']' died with <Signals.SIGKILL: 9>.

Reproduction

1. Clone https://github.com/huggingface/diffusers.git
2. cd diffusers -> pip install .
3. cd examples/advanced_diffusion_training
4. pip install -r requirements.txt
5. accelerate config default
6. Run the training script using accelerate:

accelerate launch train_dreambooth_lora_sdxl_advanced.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--instance_data_dir="training_images" \
--instance_prompt="photo of ohwx man" \
--class_prompt="photo of man" \
--class_data_dir="man_dataset" \
--output_dir="result" \
--mixed_precision="fp16" \
--resolution=1024 \
--num_train_epochs=10 \
--with_prior_preservation --prior_loss_weight=1.0 \
--train_batch_size=1 \
--repeats=20 \
--gradient_accumulation_steps=1 \
--train_text_encoder \
--gradient_checkpointing \
--learning_rate=1e-4 \
--text_encoder_lr=5e-5 \
--optimizer="adamW" \
--num_class_images=3000 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--rank=128 \
--seed="0"

Logs

Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1075, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 681, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sdxl_advanced.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--instance_data_dir=training_images', '--instance_prompt=photo of ohwx man', '--class_prompt=photo of man', '--class_data_dir=man_dataset', '--output_dir=result', '--mixed_precision=fp16', '--resolution=1024', '--num_train_epochs=10', '--with_prior_preservation', '--prior_loss_weight=1.0', '--train_batch_size=1', '--repeats=20', '--gradient_accumulation_steps=1', '--train_text_encoder', '--gradient_checkpointing', '--learning_rate=1e-4', '--text_encoder_lr=5e-5', '--optimizer=adamW', '--num_class_images=3000', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--rank=128', '--seed=0']' died with <Signals.SIGKILL: 9>.

System Info

accelerate: 0.29.3
OS: Ubuntu 22.04
python version: 3.10.12
torch version: 2.1.0+cu118
numpy version: 1.24.1
GPU: A6000

accelerate configuration is default

Who can help?

@sayakpaul

nayan-dhabarde added the bug (Something isn't working) label on Apr 20, 2024
@sayakpaul (Member)

Cc: @linoytsaban

tolgacangoz (Contributor) commented Apr 20, 2024

Could system RAM (not VRAM) be hitting OOM? Can you keep track of how RAM usage increases while you run the command?
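
For reference, a minimal way to track RAM while the command runs (a sketch assuming psutil is installed; run it in a second terminal alongside the training command):

import time
import psutil  # assumed installed: pip install psutil

# Print used/available system memory every 5 seconds until interrupted.
while True:
    mem = psutil.virtual_memory()
    print(f"used={mem.used / 1e9:.1f} GB, available={mem.available / 1e9:.1f} GB ({mem.percent:.0f}% used)")
    time.sleep(5)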

nayan-dhabarde (Author)

Already using 50 GB of RAM.

[screenshot: accelerate config]

This is my config.

tolgacangoz (Contributor) commented Apr 21, 2024

There is a SIGKILL. Could you examine the kernel's log:

dmesg --ctime | grep --ignore-case --before-context 1 "killed"

@linoytsaban (Collaborator)

In addition to what @StandardAI asked: did you try other configs, and were there any in which it worked OK? Is there anything else in the logs before the error?

kadirnar (Contributor) commented May 20, 2024

I tested it too. I get this error. @linoytsaban @StandardAI

[screenshot of the error]
- Platform: Ubuntu 22.04.3 LTS - Linux-5.4.0-169-generic-x86_64-with-glibc2.35
- Running on a notebook?: No
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.23.0
- Transformers version: 4.41.0
- Accelerate version: 0.30.1
- PEFT version: 0.11.2.dev0
- Bitsandbytes version: 0.43.1
- Safetensors version: 0.4.3
- xFormers version: 0.0.24
- Accelerator: NVIDIA GeForce RTX 3090, 24576 MiB VRAM
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Code:

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="SG161222/RealVisXL_V4.0"  \
  --instance_data_dir="train/image" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --output_dir="lora-trained-xl" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of try_on a model wearing" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="a photo of try_on a model wearing" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

I added these parameters and tested again. Error persists.

  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --use_8bit_adam \
  --mixed_precision="fp16" \

Code:

dmesg --ctime | grep --ignore-case --before-context 1 "killed"

Output: dmesg: read kernel buffer failed: Operation not permitted

tolgacangoz (Contributor) commented May 21, 2024

Isn't it possible to run the dmesg command with sudo in that environment?

@linoytsaban (Collaborator)

Hey @nayan-dhabarde @kadirnar, I tried your params with my data and couldn't reproduce the error:

!accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --pretrained_model_name_or_path="SG161222/RealVisXL_V4.0"  \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="linoyts/Tuxemon" \
  --output_dir="test" \
  --mixed_precision="fp16" \
  --instance_prompt="a cartoon of TOK tuxemon monster" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="a cartoon of TOK pink turtle tuxemon monster" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

and

!accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0"  \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="linoyts/Tuxemon" \
  --output_dir="test" \
  --mixed_precision="fp16" \
  --instance_prompt="a cartoon of TOK tuxemon monster" \
  --resolution=1024 \
  --num_train_epochs=10 \
  --train_batch_size=1 \
  --repeats=20 \
  --gradient_accumulation_steps=1 \
  --train_text_encoder \
  --gradient_checkpointing \
  --learning_rate=1e-4 \
  --text_encoder_lr=5e-5 \
  --optimizer="adamW" \
  --num_class_images=3000 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --rank=128 \
  --seed="0"\
  --push_to_hub

Does it fail for you only with these configs and work with others?

@kadirnar (Contributor)

Hi @linoytsaban,
I am training an SD3 LoRA. There are 11,000 images and I am getting this error, but it works when given a smaller dataset, or when I reduce the image-size (resolution) parameter.

Code:

accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers"  \
  --instance_data_dir="image" \
  --output_dir="fb-outerwear" \
  --mixed_precision="fp16" \
  --instance_prompt="This photo is a outerwear" \
  --resolution=256 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=4e-6 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --validation_prompt="This photo is a outerwear" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

Error:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.30s/it]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_dreambooth_lora_sd3.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-3-medium-diffusers', '--instance_data_dir=image', '--output_dir=fb-outerwear', '--mixed_precision=fp16', '--instance_prompt=This photo is the Fenerbahce outerwear', '--resolution=1024', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--learning_rate=4e-6', '--report_to=wandb', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=1000', '--validation_prompt=This photo is the outerwear', '--validation_epochs=25', '--seed=0', '--push_to_hub']' died with <Signals.SIGKILL: 9>.

GPU: Nvidia A6000 48GB VRAM
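
(Side note: a back-of-envelope sketch of why a large dataset at high resolution can exhaust host RAM and trigger the kernel OOM killer. The assumption that all decoded images are held in memory at once is for illustration only, not taken from the training script.)

# Rough host-RAM footprint if all 11,000 decoded RGB images sat in memory at once.
num_images = 11_000
bytes_per_pixel = 3  # uint8 RGB, an illustrative assumption
print(f"{num_images * 1024**2 * bytes_per_pixel / 1e9:.0f} GB at 1024x1024")  # ~35 GB
print(f"{num_images * 256**2 * bytes_per_pixel / 1e9:.1f} GB at 256x256")     # ~2.2 GB

That would be consistent with the SIGKILL disappearing when the dataset or the resolution is reduced.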

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label on Sep 14, 2024
a-r-r-o-w removed the stale label on Nov 20, 2024
@a-r-r-o-w (Member)

Gentle ping to keep the activity going in case this error still persists

SkyCol (Contributor) commented Nov 22, 2024

Hi @kadirnar ,
I have the same error. Have you solved it yet?

@linoytsaban (Collaborator)

As I'm not able to reproduce the error and it seems related to accelerate, maybe @muellerzr could have insight into what the problem might be?

SkyCol (Contributor) commented Nov 25, 2024

@linoytsaban Thank you for your reply! It is caused by wandb: if I add wandb.init(project="dreambooth-lora-sd3", config=vars(args)) at the point where the script checks for wandb, it works. So maybe accelerate can't initialize wandb for some reason. Interestingly, this issue does not exist in the SDXL training script.
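
For anyone hitting this, a sketch of the workaround described above (the exact placement inside train_dreambooth_lora_sd3.py is an assumption; the idea is to start the wandb run yourself right after the script's wandb availability check, where args is the script's parsed CLI arguments):

from diffusers.utils import is_wandb_available

if args.report_to == "wandb":
    if not is_wandb_available():
        raise ImportError("Install wandb to use it for logging during training.")
    import wandb

    # Explicitly start the run before training begins, as described in the comment above.
    wandb.init(project="dreambooth-lora-sd3", config=vars(args))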

SkyCol (Contributor) commented Nov 25, 2024

Yes, calling ‘wandb login KEY’ before running the code also helps.
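
Equivalently (a sketch; the key string is a placeholder), the login can be done from Python before training starts:

import wandb

wandb.login(key="YOUR_WANDB_API_KEY")  # same effect as running: wandb login KEY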
