
CUDA out of memory and invalid value encountered in cast with train_text_to_image_lora_sdxl.py #4736

@mnslarcher

Description


Describe the bug

I encountered two distinct issues while running the lambdalabs/pokemon-blip-captions example of train_text_to_image_lora_sdxl.py on an RTX 4090 with bf16 mixed precision.

Problem 1: RuntimeWarning during image processing

During the validation runs, diffusers/image_processor.py emits the following warning, which on recent NumPy versions means the array being cast contains NaN or infinite values, i.e. the decoded validation images are likely invalid:

RuntimeWarning: invalid value encountered in cast
images = (images * 255).round().astype("uint8")
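
For reference, a minimal, standalone reproduction of just the warning (not of the training issue), assuming NumPy >= 1.24, looks like this:

import numpy as np

# Casting NaN/inf floats to uint8 triggers the same warning that
# diffusers/image_processor.py emits, which is why I suspect the decoded
# validation images contain invalid values.
images = np.array([[0.5, np.nan, np.inf]], dtype=np.float32)
images = (images * 255).round().astype("uint8")  # RuntimeWarning: invalid value encountered in cast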

Problem 2: CUDA out-of-memory error after training

Although GPU memory usage stays at roughly 67% throughout training, I hit a CUDA out-of-memory error during the final inference step, right after training finishes:

[W&B chart from the run, 8/23/2023, 12:23:59 PM]

The error message is as follows:

hidden_states = hidden_states.to(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.64 GiB total capacity; 20.89 GiB already allocated; 497.75 MiB free; 22.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
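
As a possible mitigation (it only addresses allocator fragmentation, not whatever is actually holding the memory), the error message's own suggestion can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation, for example at the very top of the training script; the 512 MB value below is just an illustrative guess:

import os

# Allocator tweak suggested by the error message itself; must run before the
# first CUDA allocation. The value here is only an example, not a recommendation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

Equivalently, the variable can be exported in the shell before running accelerate launch.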

Hypothesis:

I suspect that the training-time models are not fully released from GPU memory before the final test-inference step, so building the inference pipeline pushes usage over the limit. Could that be it?

I intend to investigate this further on my own and will post updates here. If anyone else finds a solution before I do, please share it here as well.
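
To make the hypothesis concrete, here is a minimal sketch of the kind of change I have in mind, assuming the script keeps the training-time models in locals named unet, text_encoder_one and text_encoder_two and builds a fresh StableDiffusionXLPipeline for the final inference (those names and the placement are my assumptions, not a verified fix):

import gc

import torch
from diffusers import StableDiffusionXLPipeline

# Hypothetical placement: just before the final-inference pipeline is built.
# unet / text_encoder_one / text_encoder_two are assumed to be the script's
# locals holding the (no longer needed) training-time models.
del unet, text_encoder_one, text_encoder_two
gc.collect()
torch.cuda.empty_cache()

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
)
pipeline.load_lora_weights("sd-pokemon-model-lora-sdxl-txt")  # trained LoRA output dir
pipeline.enable_vae_slicing()  # decode the 1024x1024 validation images in slices
pipeline.to("cuda")

enable_vae_slicing() (or enable_vae_tiling()) only shrinks the VAE decode footprint at 1024x1024; it would not explain the cast warning above, but it might be enough to avoid the OOM even if the training copies turn out not to be the culprit.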

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --caption_column="text" \
  --resolution=1024 \
  --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=500 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --dataloader_num_workers=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora-sdxl-txt" \
  --train_text_encoder \
  --validation_prompt="cute dragon creature" \
  --report_to="wandb" \
  --mixed_precision="bf16" \
  --rank=4

Logs

08/23/2023 11:18:30 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
{'attention_type'} was not found in config. Values will be initialized to default values.
wandb: Currently logged in as: mnslarcher. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.8
wandb: Run data is saved locally in /home/mnslarcher/ai/hands/wandb/run-20230823_111845-ngknp8t5
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run bumbling-brook-7
wandb: ⭐️ View project at https://wandb.ai/mnslarcher/text2image-fine-tune
wandb: 🚀 View run at https://wandb.ai/mnslarcher/text2image-fine-tune/runs/ngknp8t5
08/23/2023 11:18:49 - INFO - __main__ - ***** Running training *****
08/23/2023 11:18:49 - INFO - __main__ -   Num examples = 833
08/23/2023 11:18:49 - INFO - __main__ -   Num Epochs = 2
08/23/2023 11:18:49 - INFO - __main__ -   Instantaneous batch size per device = 1
08/23/2023 11:18:49 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
08/23/2023 11:18:49 - INFO - __main__ -   Gradient Accumulation steps = 1
08/23/2023 11:18:49 - INFO - __main__ -   Total optimization steps = 1666
Steps:  30%|████████████████████████████████████████▌                                                                                              | 500/1666 [08:05<19:20,  1.00it/s, lr=0.0001, step_loss=0.0274]08/23/2023 11:26:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-500
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/pytorch_lora_weights.safetensors
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/optimizer.bin
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/scheduler.bin
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/random_states_0.pkl
08/23/2023 11:26:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-500
Steps:  50%|████████████████████████████████████████████████████████████████████                                                                    | 833/1666 [13:28<13:20,  1.04it/s, lr=0.0001, step_loss=0.134]08/23/2023 11:32:18 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: cute dragon creature.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 62.86it/s]
/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
Steps:  60%|████████████████████████████████████████████████████████████████████████████████▍                                                     | 1000/1666 [17:06<10:46,  1.03it/s, lr=0.0001, step_loss=0.0518]08/23/2023 11:35:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1000
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/pytorch_lora_weights.safetensors
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/optimizer.bin
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/scheduler.bin
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/random_states_0.pkl
08/23/2023 11:35:56 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1000
Steps:  90%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋             | 1500/1666 [25:13<02:39,  1.04it/s, lr=0.0001, step_loss=0.0561]08/23/2023 11:44:02 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1500
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/pytorch_lora_weights.safetensors
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/optimizer.bin
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/scheduler.bin
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/random_states_0.pkl
08/23/2023 11:44:03 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1500
Steps: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1666/1666 [27:55<00:00,  1.04it/s, lr=0.0001, step_loss=0.0427]08/23/2023 11:46:45 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: cute dragon creature.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 63.49it/s]
/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
Model weights saved in sd-pokemon-model-lora-sdxl-txt/pytorch_lora_weights.safetensors
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
{'attention_type'} was not found in config. Values will be initialized to default values.
Loaded unet as UNet2DConditionModel from `unet` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00,  1.71it/s]
Loading unet.
Loading text_encoder.
Loading text_encoder_2.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00,  4.31it/s]
Traceback (most recent call last):
  File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1505, in <module>
    main(args)
  File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1458, in main
    images = [
  File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1459, in <listcomp>
    pipeline(
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py", line 845, in __call__
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 270, in decode
    decoded = self._decode(z).sample
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 257, in _decode
    dec = self.decoder(z)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/vae.py", line 271, in forward
    sample = up_block(sample, latent_embeds)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 2334, in forward
    hidden_states = upsampler(hidden_states)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/resnet.py", line 164, in forward
    hidden_states = hidden_states.to(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.64 GiB total capacity; 20.89 GiB already allocated; 497.75 MiB free; 22.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: | 0.042 MB of 0.042 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: train_loss ▂▆▂▁▄▃▅▄▂▃▁▂▁▁▄▂▁▄▁▃▅▁▂▆▁▁▅▄▃▁▄▆▄█▅▁▇▂▅▁
wandb: 
wandb: Run summary:
wandb: train_loss 0.04268
wandb: 
wandb: 🚀 View run bumbling-brook-7 at: https://wandb.ai/mnslarcher/text2image-fine-tune/runs/ngknp8t5
wandb: Synced 6 W&B file(s), 2 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230823_111845-ngknp8t5/logs
Traceback (most recent call last):
  File "/home/mnslarcher/anaconda3/envs/hands/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mnslarcher/anaconda3/envs/hands/bin/python', 'train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--dataset_name=lambdalabs/pokemon-blip-captions', '--caption_column=text', '--resolution=1024', '--random_flip', '--train_batch_size=1', '--num_train_epochs=2', '--gradient_accumulation_steps=1', '--checkpointing_steps=500', '--learning_rate=1e-04', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--dataloader_num_workers=0', '--seed=42', '--output_dir=sd-pokemon-model-lora-sdxl-txt', '--train_text_encoder', '--validation_prompt=cute dragon creature', '--report_to=wandb', '--mixed_precision=bf16', '--rank=4']' returned non-zero exit status 1.

System Info

OS Name: Ubuntu 22.04.3 LTS
GPU: NVIDIA GeForce RTX 4090

diffusers-cli env:

  • diffusers version: 0.21.0.dev0
  • Platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Huggingface_hub version: 0.16.4
  • Transformers version: 4.31.0
  • Accelerate version: 0.21.0
  • xFormers version: not installed
  • Using GPU in script?: YES
  • Using distributed or parallel set-up in script?: NO

environment.yml (conda):

name: myenv
channels:
  - defaults
dependencies:
  - nb_conda_kernels
  - ipykernel
  - jupyter
  - pip
  - python=3.10
  - pip:
    - accelerate==0.21.0
    - "black[jupyter]==23.7.0"
    - datasets==2.14.4
    - git+https://github.com/huggingface/diffusers
    - ftfy==6.1.1
    - gradio==3.40.1
    - isort==5.12.0
    - Jinja2==3.1.2
    - tensorboard==2.14.0
    - torch==2.0.1
    - torchvision==0.15.2
    - transformers==4.31.0
    - wandb==0.15.8

Who can help?

@sayakpaul
